Day 079 — Project: Worker Pool & Dedup¶
Month 3 · Week 4 · ⬅ Day 078 · Day 080 ➡ · Journal index
🎯 Learning Objective¶
Turn yesterday's single fetch into a bounded concurrent crawler: a fixed pool of fetch workers plus a deduplication strategy so each URL is crawled exactly once, with no data race and clean termination.
📚 Topics¶
- Worker pool over a
unseenLinkschannel; bounding fetch concurrency - Dedup: single-owner
seenmap vs. mutex-guardedSet - Termination via an outstanding-work counter (
n)
📖 Reading / Sources¶
- The Go Programming Language §8.6 (crawler with bounded parallelism)
- Go blog — Pipelines and cancellation
-
sync.Mutexdocs
📝 Notes¶
- Bound the fetches. An unbounded
go fetch(url)per link would spawn millions of goroutines and exhaust sockets/file handles. A fixed pool of workers ranging over anunseenLinkschannel caps concurrency at N → [[worker-pool]] · [[bounded-concurrency]]. - Dedup, two idiomatic ways:
- Single owner (my choice for the crawler): one goroutine owns the
seen map[string]bool; workers send discovered links back to it over a channel. No mutex because only one goroutine touches the map → [[single-owner]] · [[share-memory-by-communicating]]. - Shared
Set(the exercise): async.Mutex-guarded set withMarkSeen(url) booldoing check-and-insert atomically. Use this when many goroutines need direct access → [[mutex]]. - Termination is the hard part. The owner keeps a counter
nof link-lists sent toworklistbut not yet processed: each received list doesn--; each newly-seen link dispatched doesn++. Whenn == 0there is no outstanding work, so closeunseenLinksand the workers'rangeloops end → [[graceful-termination]]. - A worker must never block the owner: after fetching, it ships results back with
go func(){ worklist <- found }()so the worker is free to take the next link and the owner is free to keep receiving → [[channel-deadlock]]. - The check-and-insert in a shared set must be one critical section. Two separate calls (
HasthenAdd) race: two goroutines both see "absent" and both crawl the URL → [[check-then-act]]. - Map writes from multiple goroutines without synchronisation are a data race and can corrupt the runtime's map (fatal "concurrent map writes"), not just give a wrong answer → [[data-race]].
💻 Code Examples¶
The owner loop that does dedup + termination with no mutex (from the example):
go func() { worklist <- seeds }()
n := 1 // one outstanding list: the seeds
seen := make(map[string]bool)
for ; n > 0; n-- {
for _, link := range <-worklist {
if seen[link] {
continue
}
seen[link] = true
n++ // one more list will come back for this link
unseenLinks <- link // hand to a free worker; workers fetch in parallel
}
}
close(unseenLinks) // no outstanding work -> let workers' range loops finish
Full code:
examples/month-03/crawler/· Run with race:go run -race ./examples/month-03/crawler
🏋️ Exercises / Practice¶
| Exercise | Status | Link |
|---|---|---|
Goroutine-safe visited set (MarkSeen exactly-one-winner) |
✅ | exercises/month-03/week-4/visited/ |
| Bounded concurrent crawler (each page fetched once) | ✅ | exercises/month-03/week-4/crawl/ |
🐛 Mistakes Made¶
- Had a worker do
worklist <- founddirectly (not in a goroutine) → deadlock: the owner was blocked sending the next link onunseenLinkswhile the worker was blocked sending results onworklist. Wrapping the send ingo func(){…}()broke the cycle. - In the shared
SetI first wroteif !s.Has(u) { s.Add(u) }— two lock acquisitions, racy. Folded it into one lockedMarkSeen.
❓ Open Questions¶
- For a huge crawl, is the single-owner map a bottleneck? (It can be; shard the dedup set or use
sync.Mapif the owner becomes hot. Measure first — Day 083.)
🧠 Active Recall (answer without looking)¶
-
Q: How does the single-owner design avoid a mutex on
seen?
A
Exactly one goroutine ever reads or writes the map; workers communicate discovered links back to it over a channel. No sharing means no lock needed — share memory by communicating. -
Q: Why must
MarkSeendo the check and the insert under one lock?
A
Otherwise two goroutines can both observe the URL as absent, then both insert and crawl it. A single critical section makes "is it new?" and "record it" atomic, guaranteeing exactly one winner.
🪶 Feynman Reflection¶
The pool is a turnstile: no matter how many links pour in, only N fetchers are ever past the gate. Dedup is the bouncer with a guest list — a URL gets in once. I picked the design where one person is the guest list (single owner over channels) so there is no clipboard to fight over (no lock), and I track outstanding work with a counter so the party ends exactly when the last guest has been processed.
🕳️ Knowledge Gaps¶
- When to prefer
sync.Mapover a mutex+map for the shared dedup variant — revisit with profiling.
✅ Summary¶
The crawler now fetches concurrently but bounded by a worker pool, dedups every URL (single-owner map, or a sync.Mutex Set in the exercise), and terminates cleanly via an outstanding-work counter. Verified -race-clean.
⏭️ Next Steps / Prep for Tomorrow¶
- Day 080: add
contextcancellation and graceful shutdown so the crawl can be stopped on a signal or deadline.
| Time spent | Difficulty | Confidence |
|---|---|---|
| 90 min | 🟦🟦⬜⬜⬜ | 🟦🟦🟦⬜⬜ |
Suggested commit: feat(examples): crawler worker pool & dedup (day 079)