Day 078 — Project: Crawler Design & Fetch
Month 3 · Week 4
🎯 Learning Objective
Kick off the Week 4 capstone — a concurrent web crawler — by designing the dataflow on paper first, then writing a single correct fetch step (HTTP GET with a context, parse links) before adding any concurrency.
📚 Topics
- Crawler architecture: seeds → fetch → extract links → enqueue unseen → repeat
- `http.Client` with `context`, timeouts, `resp.Body.Close()`
- Separating the policy (what to crawl) from the mechanism (how to run it)
📖 Reading / Sources
- The Go Programming Language §8.6 (concurrent web crawler)
- `net/http` docs: `Client`, `NewRequestWithContext`
- `context` docs: `WithTimeout`
- Go blog: Pipelines and cancellation
📝 Notes
- Design before goroutines: the crawler is a graph traversal — seeds in, fetch a node, discover edges (links), enqueue the unseen ones, stop when nothing is outstanding. Get this right single-threaded, then parallelise the fetches → [[concurrency-design]].
- Keep `fetch` a pure-ish function, `func(ctx, url) ([]string, error)`: input URL, output links + error. No shared state inside it, so it is trivially safe to run on N goroutines later → [[pure-functions]].
- Always build requests with `http.NewRequestWithContext(ctx, …)` so a cancelled/timed-out context aborts the in-flight transfer. A bare `http.Get` goes through `http.DefaultClient`, which has no timeout, so it can hang on a slow server indefinitely → [[context-first-param]].
- Always `defer resp.Body.Close()` and drain the body. An unclosed body leaks its connection, and an undrained one may stop the `http.Transport` pool from reusing the keep-alive connection. Even on a non-2xx status you must close it → [[resource-cleanup]].
- Three concerns to keep separate so each is testable: (1) fetch (I/O), (2) extract links (pure parsing), (3) scheduling/dedup (concurrency). Today is (1)+(2).
- For the runnable, race-free example I model the "web" as an in-memory link graph and a `fetch` that just looks up the node: identical structure, no network, deterministic. Swap in real HTTP and nothing about the concurrency changes → [[seams-for-testing]].
💻 Code Examples
A single, cancellable fetch step (real net/http; the graph version lives in the example):
```go
// Needs imports: context, fmt, io, net/http.
func fetch(ctx context.Context, url string) ([]string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, fmt.Errorf("build request %s: %w", url, err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("get %s: %w", url, err)
	}
	defer resp.Body.Close() // always, on every path
	if resp.StatusCode != http.StatusOK {
		io.Copy(io.Discard, resp.Body) // drain so the conn can be reused
		return nil, fmt.Errorf("get %s: status %d", url, resp.StatusCode)
	}
	return extractLinks(resp.Body) // pure parser, unit-tested separately
}
```
Full concurrent crawler (stdlib, in-memory graph):
`examples/month-03/crawler/` · Run: `go run ./examples/month-03/crawler`
🏋️ Exercises / Practice
| Exercise | Status | Link |
|---|---|---|
| Bounded concurrent crawler over a link graph (sorted, deterministic) | ✅ | exercises/month-03/week-4/crawl/ |
🐛 Mistakes Made
- First draft used `http.Get(url)` directly, which left no way to cancel a slow request. Switched to `NewRequestWithContext`.
- Forgot to drain the body on the error path; left it as `defer Close()` only, then added `io.Copy(io.Discard, …)` so the keep-alive connection is reusable.
❓ Open Questions
- How polite should the crawler be: respect `robots.txt`, per-host limits? (Yes for a real crawler; out of scope for the stdlib capstone, noted for later.)
🧠 Active Recall (answer without looking)
- Q: Why build requests with `http.NewRequestWithContext` instead of `http.Get`?

  A: The context lets a timeout or cancellation abort the in-flight request and free the connection. `http.Get` has no context, so a slow/hung server can block the goroutine indefinitely.
- Q: Why keep `fetch` free of shared state and dedup logic?

  A: So it is a pure-ish function safe to run on many goroutines without locks, and so the I/O, parsing, and scheduling concerns can each be tested in isolation.
🪶 Feynman Reflection
A crawler is just breadth-first search over a graph you discover as you go: start at the seeds, "expand" a node by downloading it and reading its links, and keep expanding nodes you haven't seen. Today I built one clean "expand a node" step — download with a leash (context), close the body, return the links — and deliberately left the parallel scheduling for tomorrow.
🕳️ Knowledge Gaps
- Real HTML link extraction (`golang.org/x/net/html`): outside the standard library (a `golang.org/x` module), so the stdlib example fakes it with a graph. Revisit when the project goes real.
✅ Summary
Designed the crawler as a graph traversal with three separable concerns and wrote a correct, cancellable single-page fetch. Concurrency starts tomorrow with a worker pool and dedup.
⏭️ Next Steps / Prep for Tomorrow
- Day 079: bound the fetches with a worker pool and add a goroutine-safe visited set.
| Time spent | Difficulty | Confidence |
|---|---|---|
| 90 min | 🟦🟦⬜⬜⬜ | 🟦🟦🟦⬜⬜ |
Suggested commit: feat(examples): concurrent crawler design & fetch (day 078)