Skip to content

Day 079 — Project: Worker Pool & Dedup

Month 3 · Week 4 · ⬅ Day 078 · Day 080 ➡ · Journal index

🎯 Learning Objective

Turn yesterday's single fetch into a bounded concurrent crawler: a fixed pool of fetch workers plus a deduplication strategy so each URL is crawled exactly once, with no data race and clean termination.

📚 Topics

  • Worker pool over a unseenLinks channel; bounding fetch concurrency
  • Dedup: single-owner seen map vs. mutex-guarded Set
  • Termination via an outstanding-work counter (n)

📖 Reading / Sources

📝 Notes

  • Bound the fetches. An unbounded go fetch(url) per link would spawn millions of goroutines and exhaust sockets/file handles. A fixed pool of workers ranging over an unseenLinks channel caps concurrency at N → [[worker-pool]] · [[bounded-concurrency]].
  • Dedup, two idiomatic ways:
  • Single owner (my choice for the crawler): one goroutine owns the seen map[string]bool; workers send discovered links back to it over a channel. No mutex because only one goroutine touches the map → [[single-owner]] · [[share-memory-by-communicating]].
  • Shared Set (the exercise): a sync.Mutex-guarded set with MarkSeen(url) bool doing check-and-insert atomically. Use this when many goroutines need direct access → [[mutex]].
  • Termination is the hard part. The owner keeps a counter n of link-lists sent to worklist but not yet processed: each received list does n--; each newly-seen link dispatched does n++. When n == 0 there is no outstanding work, so close unseenLinks and the workers' range loops end → [[graceful-termination]].
  • A worker must never block the owner: after fetching, it ships results back with go func(){ worklist <- found }() so the worker is free to take the next link and the owner is free to keep receiving → [[channel-deadlock]].
  • The check-and-insert in a shared set must be one critical section. Two separate calls (Has then Add) race: two goroutines both see "absent" and both crawl the URL → [[check-then-act]].
  • Map writes from multiple goroutines without synchronisation are a data race and can corrupt the runtime's map (fatal "concurrent map writes"), not just give a wrong answer → [[data-race]].

💻 Code Examples

The owner loop that does dedup + termination with no mutex (from the example):

go func() { worklist <- seeds }()
n := 1 // one outstanding list: the seeds
seen := make(map[string]bool)
for ; n > 0; n-- {
    for _, link := range <-worklist {
        if seen[link] {
            continue
        }
        seen[link] = true
        n++                 // one more list will come back for this link
        unseenLinks <- link // hand to a free worker; workers fetch in parallel
    }
}
close(unseenLinks) // no outstanding work -> let workers' range loops finish

Full code: examples/month-03/crawler/ · Run with race: go run -race ./examples/month-03/crawler

🏋️ Exercises / Practice

Exercise Status Link
Goroutine-safe visited set (MarkSeen exactly-one-winner) exercises/month-03/week-4/visited/
Bounded concurrent crawler (each page fetched once) exercises/month-03/week-4/crawl/

🐛 Mistakes Made

  • Had a worker do worklist <- found directly (not in a goroutine) → deadlock: the owner was blocked sending the next link on unseenLinks while the worker was blocked sending results on worklist. Wrapping the send in go func(){…}() broke the cycle.
  • In the shared Set I first wrote if !s.Has(u) { s.Add(u) } — two lock acquisitions, racy. Folded it into one locked MarkSeen.

❓ Open Questions

  • For a huge crawl, is the single-owner map a bottleneck? (It can be; shard the dedup set or use sync.Map if the owner becomes hot. Measure first — Day 083.)

🧠 Active Recall (answer without looking)

  1. Q: How does the single-owner design avoid a mutex on seen?

    A Exactly one goroutine ever reads or writes the map; workers communicate discovered links back to it over a channel. No sharing means no lock needed — share memory by communicating.

  2. Q: Why must MarkSeen do the check and the insert under one lock?

    A Otherwise two goroutines can both observe the URL as absent, then both insert and crawl it. A single critical section makes "is it new?" and "record it" atomic, guaranteeing exactly one winner.

🪶 Feynman Reflection

The pool is a turnstile: no matter how many links pour in, only N fetchers are ever past the gate. Dedup is the bouncer with a guest list — a URL gets in once. I picked the design where one person is the guest list (single owner over channels) so there is no clipboard to fight over (no lock), and I track outstanding work with a counter so the party ends exactly when the last guest has been processed.

🕳️ Knowledge Gaps

  • When to prefer sync.Map over a mutex+map for the shared dedup variant — revisit with profiling.

✅ Summary

The crawler now fetches concurrently but bounded by a worker pool, dedups every URL (single-owner map, or a sync.Mutex Set in the exercise), and terminates cleanly via an outstanding-work counter. Verified -race-clean.

⏭️ Next Steps / Prep for Tomorrow

  • Day 080: add context cancellation and graceful shutdown so the crawl can be stopped on a signal or deadline.

Time spent Difficulty Confidence
90 min 🟦🟦⬜⬜⬜ 🟦🟦🟦⬜⬜

Suggested commit: feat(examples): crawler worker pool & dedup (day 079)