
Day 078 — Project: Crawler Design & Fetch

Month 3 · Week 4

🎯 Learning Objective

Kick off the Week 4 capstone — a concurrent web crawler — by designing the dataflow on paper first, then writing a single correct fetch step (HTTP GET with a context, parse links) before adding any concurrency.

📚 Topics

  • Crawler architecture: seeds → fetch → extract links → enqueue unseen → repeat
  • http.Client with context, timeouts, resp.Body.Close()
  • Separating the policy (what to crawl) from the mechanism (how to run it)

📖 Reading / Sources

📝 Notes

  • Design before goroutines: the crawler is a graph traversal — seeds in, fetch a node, discover edges (links), enqueue the unseen ones, stop when nothing is outstanding. Get this right single-threaded, then parallelise the fetches → [[concurrency-design]].
  • Keep fetch a pure-ish function func(ctx, url) ([]string, error) — input URL, output links + error. No shared state inside it, so it is trivially safe to run on N goroutines later → [[pure-functions]].
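  • Always build requests with http.NewRequestWithContext(ctx, …) so a cancelled or timed-out context aborts the in-flight transfer. A bare http.Get goes through http.DefaultClient, which sets no timeout, so a stalled server can block the caller indefinitely → [[context-first-param]].
  • Always defer resp.Body.Close(), and drain the body before closing when you want the connection back: the http.Transport can only reuse a keep-alive connection whose body was read to EOF and closed. Unclosed bodies leak connections and their file descriptors. A non-2xx response still carries a body that must be closed → [[resource-cleanup]].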
  • Three concerns to keep separate so each is testable: (1) fetch (I/O), (2) extract links (pure parsing), (3) scheduling/dedup (concurrency). Today is (1)+(2).
  • For the runnable, race-free example I model the "web" as an in-memory link graph and a fetch that just looks up the node — identical structure, no network, deterministic. Swap in real HTTP and nothing about the concurrency changes → [[seams-for-testing]].
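
A minimal single-threaded sketch of the traversal described above, assuming an in-memory graph stands in for the web. The names fetchFunc, graph, crawl, and crawlDemo are hypothetical and are not taken from examples/month-03/crawler/; the point is only that crawl depends on a fetch function value, so a real HTTP fetch could be swapped in at that seam.

```go
package main

import (
	"context"
	"fmt"
	"sort"
)

// fetchFunc is the seam: anything that maps a URL to its outgoing links.
type fetchFunc func(ctx context.Context, url string) ([]string, error)

// graph models the "web" as an in-memory link graph (illustrative data only).
type graph map[string][]string

// fetch looks the node up instead of doing network I/O.
func (g graph) fetch(ctx context.Context, url string) ([]string, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	links, ok := g[url]
	if !ok {
		return nil, fmt.Errorf("get %s: not found", url)
	}
	return links, nil
}

// crawl visits every URL reachable from seeds, one fetch at a time.
// It stops when the queue is empty, i.e. nothing is outstanding.
func crawl(ctx context.Context, fetch fetchFunc, seeds ...string) []string {
	seen := make(map[string]bool)
	var queue, visited []string
	enqueue := func(u string) {
		if !seen[u] {
			seen[u] = true
			queue = append(queue, u)
		}
	}
	for _, s := range seeds {
		enqueue(s)
	}
	for len(queue) > 0 {
		url := queue[0]
		queue = queue[1:]
		links, err := fetch(ctx, url)
		if err != nil {
			continue // sketch: a failed fetch is skipped, not retried
		}
		visited = append(visited, url)
		for _, l := range links {
			enqueue(l)
		}
	}
	sort.Strings(visited) // sorted so the output is deterministic
	return visited
}

// crawlDemo runs the crawl over a tiny graph with a cycle and a dead link.
func crawlDemo() []string {
	web := graph{
		"a": {"b", "c"},
		"b": {"c", "d"},
		"c": {"a"},       // cycle back to the seed
		"d": {"missing"}, // dead link: fetch fails and is skipped
	}
	return crawl(context.Background(), web.fetch, "a")
}

func main() {
	fmt.Println(crawlDemo())
}
```

The seen set is updated at enqueue time rather than at fetch time, so a URL linked from several pages enters the queue only once.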

💻 Code Examples

A single, cancellable fetch step (real net/http; the graph version lives in the example):

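import (
    "context"
    "fmt"
    "io"
    "net/http"
)

// fetch GETs url and returns the links found in the response body.
// The caller bounds the request through ctx; http.DefaultClient sets no timeout of its own.
func fetch(ctx context.Context, url string) ([]string, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, fmt.Errorf("build request %s: %w", url, err)
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, fmt.Errorf("get %s: %w", url, err)
    }
    defer resp.Body.Close() // always, on every path
    if resp.StatusCode != http.StatusOK {
        // Drain so the conn can be reused, but cap it so a huge error page is not downloaded in full.
        _, _ = io.Copy(io.Discard, io.LimitReader(resp.Body, 64<<10))
        return nil, fmt.Errorf("get %s: status %d", url, resp.StatusCode)
    }
    return extractLinks(resp.Body) // pure parser (defined elsewhere), unit-tested separately
}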
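
The fetch above calls extractLinks without showing it. The following is a deliberately naive, stdlib-only stand-in, assuming the signature extractLinks(io.Reader) ([]string, error) implied by the call site. The regular expression, the 1 MiB read cap, and the demoLinks helper are my own assumptions for illustration; a regexp is not a real HTML parser and will misread comments, scripts, and unquoted attributes.

```go
package main

import (
	"fmt"
	"io"
	"regexp"
	"strings"
)

// hrefPattern matches href="..." or href='...' case-insensitively.
// Naive on purpose: it knows nothing about HTML structure.
var hrefPattern = regexp.MustCompile(`(?i)href\s*=\s*["']([^"']+)["']`)

// extractLinks is pure parsing: bytes in, link strings out, no network and no shared state.
func extractLinks(r io.Reader) ([]string, error) {
	// Cap the read at 1 MiB (an arbitrary assumption) so one huge page cannot exhaust memory.
	body, err := io.ReadAll(io.LimitReader(r, 1<<20))
	if err != nil {
		return nil, fmt.Errorf("read body: %w", err)
	}
	var links []string
	for _, m := range hrefPattern.FindAllSubmatch(body, -1) {
		links = append(links, string(m[1])) // m[1] is the captured attribute value
	}
	return links, nil
}

// demoLinks feeds a small hand-written page through the parser.
func demoLinks() []string {
	page := `<a href="/about">About</a> <A HREF='https://example.com/x'>X</A> <p>no link</p>`
	links, err := extractLinks(strings.NewReader(page))
	if err != nil {
		return nil
	}
	return links
}

func main() {
	for _, l := range demoLinks() {
		fmt.Println(l)
	}
}
```

Because the function takes an io.Reader, a unit test can pass strings.NewReader and never touch the network. Relative links such as /about are returned as written; resolving them against the page URL is left out of this sketch.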

Full concurrent crawler (stdlib, in-memory graph): examples/month-03/crawler/ · Run: go run ./examples/month-03/crawler

🏋️ Exercises / Practice

| Exercise | Status | Link |
|---|---|---|
| Bounded concurrent crawler over a link graph (sorted, deterministic) | | exercises/month-03/week-4/crawl/ |

🐛 Mistakes Made

  • First draft used http.Get(url) directly — no way to cancel a slow request. Switched to NewRequestWithContext.
  • Forgot to drain the body on the error path; left it as defer Close() only, then added io.Copy(io.Discard, …) so the keep-alive connection is reusable.
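
A small sketch of the first mistake and its fix, assuming a local httptest server that stalls on purpose. The get and slowServerTimesOut helpers and the 50 ms and 2 s durations are hypothetical, chosen only to make the effect visible: the request is bound to a context with a deadline, so the call should return with a deadline error instead of waiting for the server.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// get issues a GET bound to ctx, drains and closes the body, and reports only the error.
func get(ctx context.Context, url string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// slowServerTimesOut reports whether a request to a stalling server
// was aborted by the context deadline.
func slowServerTimesOut() bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(2 * time.Second): // simulated slow server
		case <-r.Context().Done(): // client gave up, stop early
		}
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	err := get(ctx, srv.URL)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("timed out:", slowServerTimesOut())
}
```

With a plain http.Get against the same server there is no deadline to enforce, so the call would wait for the full handler delay.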

❓ Open Questions

  • How polite should the crawler be — respect robots.txt, per-host limits? (Yes for a real crawler; out of scope for the stdlib capstone, noted for later.)

🧠 Active Recall (answer without looking)

  1. Q: Why build requests with http.NewRequestWithContext instead of http.Get?

    A: The context lets a timeout or cancellation abort the in-flight request and free the connection. http.Get has no context, so a slow/hung server can block the goroutine indefinitely.

  2. Q: Why keep fetch free of shared state and dedup logic?

    A: So it is a pure-ish function safe to run on many goroutines without locks, and so the I/O, parsing, and scheduling concerns can each be tested in isolation.
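
To make the second answer concrete, here is a rough sketch of calling a state-free fetch from several goroutines with no locks. It is not the worker pool or dedup planned for Day 079: it starts one goroutine per URL, and the web map, fetchAll, and linkCounts are hypothetical names over made-up data. It is safe only because fetch reads a map nobody writes to and each goroutine writes to its own slice slot.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// web is a read-only in-memory link graph standing in for real pages.
var web = map[string][]string{
	"a": {"b", "c"},
	"b": {"c"},
	"c": {},
}

// fetch touches only its arguments and the read-only map, so it needs no locking.
func fetch(ctx context.Context, url string) ([]string, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	links, ok := web[url]
	if !ok {
		return nil, fmt.Errorf("get %s: not found", url)
	}
	return links, nil
}

// fetchAll runs one goroutine per URL. Goroutine i writes only results[i],
// so no two goroutines ever share a memory location.
func fetchAll(ctx context.Context, urls []string) [][]string {
	results := make([][]string, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			links, err := fetch(ctx, u)
			if err != nil {
				return // sketch: leave the slot nil on failure
			}
			results[i] = links
		}(i, u)
	}
	wg.Wait()
	return results
}

// linkCounts returns the number of links found per URL, in input order.
func linkCounts() []int {
	res := fetchAll(context.Background(), []string{"a", "b", "c"})
	counts := make([]int, len(res))
	for i, r := range res {
		counts[i] = len(r)
	}
	return counts
}

func main() {
	fmt.Println(linkCounts())
}
```

Running it with go run -race is a quick way to check the no-shared-writes claim.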

🪶 Feynman Reflection

A crawler is just breadth-first search over a graph you discover as you go: start at the seeds, "expand" a node by downloading it and reading its links, and keep expanding nodes you haven't seen. Today I built one clean "expand a node" step — download with a leash (context), close the body, return the links — and deliberately left the parallel scheduling for tomorrow.

🕳️ Knowledge Gaps

  • Real HTML link extraction (golang.org/x/net/html): not part of the standard library (it is a separately versioned golang.org/x module), so the stdlib example fakes it with a graph. Revisit when the project goes real.

✅ Summary

Designed the crawler as a graph traversal with three separable concerns and wrote a correct, cancellable single-page fetch. Concurrency starts tomorrow with a worker pool and dedup.

⏭️ Next Steps / Prep for Tomorrow

  • Day 079: bound the fetches with a worker pool and add a goroutine-safe visited set.

| Time spent | Difficulty | Confidence |
|---|---|---|
| 90 min | 🟦🟦⬜⬜⬜ | 🟦🟦🟦⬜⬜ |

Suggested commit: feat(examples): concurrent crawler design & fetch (day 078)