04 — Concurrent Web Crawler
Overview
A concurrent web crawler that, given one or more seed URLs, fetches pages, extracts links, and recursively follows them up to a configurable depth while staying within a single domain (or an allow-list of hosts). The crawler emits a sitemap of discovered URLs and the links found on each page.
The point of this project is not to build a Googlebot competitor — it is to
build a textbook example of a bounded, leak-free, cancellable concurrency
pipeline in idiomatic Go. You will use a fixed-size worker pool, channels for
the URL frontier and results, context.Context for cancellation and timeouts, a
per-host rate limiter, and proper graceful shutdown on SIGINT. Every goroutine
you start must have a clearly defined exit condition so that go test -race and
go.uber.org/goleak both come back clean.
Level: Intermediate. Expect to spend most of your time thinking about lifetimes — of goroutines, of channels, and of the crawl itself — rather than about HTML parsing.
Learning Objectives
By the end of this project you should be able to:
- Design a fan-out / fan-in pipeline using channels and a bounded worker pool.
- Choose a sensible degree of parallelism and bound it deliberately instead of spawning one goroutine per URL.
- Propagate cancellation and deadlines with context.Context so that a single cancel signal tears down the entire pipeline.
- Implement graceful shutdown on os.Interrupt/SIGINT without dropping in-flight work or leaking goroutines.
- Use golang.org/x/sync/errgroup and understand when it is preferable to a raw sync.WaitGroup.
- Apply client-side rate limiting per host with golang.org/x/time/rate.
- Deduplicate work safely under concurrency (mutex-guarded map vs. owner goroutine) and reason about the trade-offs.
- Detect and eliminate goroutine leaks and data races with tooling.
- Parse HTML robustly with golang.org/x/net/html and resolve relative links.
- Respect robots.txt and a same-domain policy as a matter of crawler etiquette.
Requirements
Functional
- Seed URL(s): Accept one or more seed URLs from a flag (repeatable -url) or a seeds file. Validate and normalize them before crawling.
- Follow links to max depth: Parse each fetched page, extract <a href> links, resolve them against the page's base URL, and enqueue them with depth+1. Do not enqueue a link whose depth would exceed maxDepth.
- Same-domain restriction: By default, only follow links whose host matches the seed host (configurable to an allow-list of hosts, or "any host"). Reject off-domain links and non-HTTP(S) links such as mailto: and javascript:.
- Deduplication: Never fetch the same normalized URL twice within a run.
- Output sitemap / links: Produce a machine-readable report mapping each crawled URL to its status code, discovered links, depth, and any error.
- robots.txt: Fetch and honor the host's /robots.txt (Disallow rules and Crawl-delay) for the configured user-agent before crawling a path.
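The link-handling requirements above (resolve against the page URL, reject off-domain and non-HTTP(S) links, normalize before deduplication) can be sketched with net/url alone. The function names resolveLink and follow are hypothetical, and the normalization rules chosen here (lower-case host, dropped fragment, "/" for an empty path) are assumptions of this sketch rather than a prescribed policy.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveLink resolves href against the page URL and normalizes the result.
// The second return value is false for links the crawler should skip.
// The normalization rules below are assumptions of this sketch.
func resolveLink(page *url.URL, href string, sameHost bool) (string, bool) {
	ref, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", false
	}
	u := page.ResolveReference(ref)
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", false // mailto:, javascript:, ftp:, ...
	}
	u.Host = strings.ToLower(u.Host)
	if sameHost && u.Host != strings.ToLower(page.Host) {
		return "", false // off-domain
	}
	u.Fragment = "" // "#section" points into the same page
	if u.Path == "" {
		u.Path = "/" // treat "https://host" and "https://host/" as one URL
	}
	return u.String(), true
}

// follow is a convenience wrapper for the demo; it returns "" when the link
// is rejected or the page URL does not parse.
func follow(pageURL, href string) string {
	page, err := url.Parse(pageURL)
	if err != nil {
		return ""
	}
	link, ok := resolveLink(page, href, true)
	if !ok {
		return ""
	}
	return link
}

func main() {
	page := "https://example.com/blog/post"
	for _, href := range []string{"../about#team", "/", "mailto:hi@example.com", "https://other.org/x"} {
		fmt.Printf("%q -> %q\n", href, follow(page, href))
	}
}
```

Normalizing before the visited-set lookup is what makes "never fetch the same normalized URL twice" meaningful: two spellings of one page collapse to a single key.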
Non-Functional
- Configurable concurrency: Number of worker goroutines is set by -workers (default 8). The pool size is fixed for the lifetime of the crawl.
- Rate limit per host: Cap requests-per-second per host via -rate, enforced with a rate.Limiter so a single target is never hammered.
- Bounded memory: The frontier and visited set must not grow without bound in pathological cases. The channel buffers and the dedup set are the only structures that grow with the crawl, and their size is bounded by the (finite) number of same-domain URLs reachable within the depth limit.
- No goroutine leaks: Every goroutine exits when the crawl finishes or is cancelled. Verified with goleak in tests.
- Graceful shutdown: A first SIGINT cancels the context, lets in-flight fetches finish (or time out), flushes collected results, and exits cleanly.
- Deterministic-ish output: Because crawl order is concurrent, the final report is sorted (by URL) so that runs over a static site graph produce byte-stable output suitable for golden-file tests.
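To illustrate the deterministic-output requirement, here is a minimal sketch of a report writer that sorts by URL before encoding, so the arrival order of results does not affect the bytes. The helper names marshalSorted and stableAcrossOrders are hypothetical; the Result fields follow the struct defined in the Data Model section.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Result mirrors one page entry of the report.
type Result struct {
	URL        string   `json:"url"`
	StatusCode int      `json:"status_code"`
	Links      []string `json:"links"`
	Depth      int      `json:"depth"`
	Error      string   `json:"error,omitempty"`
}

// marshalSorted renders results ordered by URL and leaves the caller's slice
// untouched. URLs are unique after deduplication, so the order is total.
func marshalSorted(results []Result) ([]byte, error) {
	sorted := make([]Result, len(results))
	copy(sorted, results)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].URL < sorted[j].URL })
	return json.MarshalIndent(sorted, "", "  ")
}

// stableAcrossOrders reports whether two arrival orders of the same pages
// serialize to identical bytes.
func stableAcrossOrders() bool {
	a := Result{URL: "https://example.com/", StatusCode: 200, Links: []string{"https://example.com/about"}}
	b := Result{URL: "https://example.com/about", StatusCode: 200, Depth: 1, Links: []string{}}
	first, err1 := marshalSorted([]Result{a, b})
	second, err2 := marshalSorted([]Result{b, a})
	return err1 == nil && err2 == nil && string(first) == string(second)
}

func main() {
	fmt.Println("byte-stable:", stableAcrossOrders())
}
```

A golden-file test can then compare the encoded bytes directly instead of comparing structures field by field.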
Architecture
The crawler is a classic fan-out / fan-in pipeline. A single frontier
channel feeds N identical workers (fan-out); every worker writes to one shared
results channel that a single collector drains (fan-in). A context.Context
threads through every stage; cancelling it (timeout, error, or SIGINT) unblocks
every select and drains the pipeline.
A subtle point: the number of "outstanding" URLs (enqueued but not yet
processed) is tracked with a counter (a sync.WaitGroup or an atomic integer).
When it reaches zero, the producer side closes the frontier, which lets workers
return, which lets the collector finish. This is what makes the crawl terminate
rather than leave every worker blocked forever on an empty frontier channel.
flowchart TD
SEED[Seed URLs] --> FRONTIER
subgraph PIPE[Crawl Pipeline]
FRONTIER([URL frontier channel]) -->|fan-out| W1[Worker 1]
FRONTIER -->|fan-out| W2[Worker 2]
FRONTIER -->|fan-out| WN[Worker N]
W1 --> FETCH
W2 --> FETCH
WN --> FETCH
FETCH[Fetcher: rate-limited http.Client] --> PARSE[Parser: extract links]
PARSE --> DEDUP{Visited set?}
DEDUP -->|new| FRONTIER
DEDUP -->|seen / off-domain / too deep| DROP[Discard]
W1 -->|fan-in| RESULTS
W2 -->|fan-in| RESULTS
WN -->|fan-in| RESULTS
end
RESULTS([Results channel]) --> COLLECTOR[Collector: build sitemap]
COLLECTOR --> OUT[(JSON / CSV sitemap)]
CTX[[context.Context cancel / timeout / SIGINT]] -.cancels.-> W1
CTX -.cancels.-> W2
CTX -.cancels.-> WN
CTX -.cancels.-> FETCH
CTX -.cancels.-> COLLECTOR
Fan-out: one frontier channel is read by N worker goroutines. Go's runtime load-balances receives across them, so faster workers naturally pick up more work. Parallelism is bounded exactly by N — no matter how many links a page has.
Fan-in: all workers send Result values into one results channel. A single
collector goroutine owns the sitemap map, so the map needs no lock — it is only
ever touched by its owner (the "share memory by communicating" idiom).
Cancellation: the rate limiter's Wait(ctx), the HTTP request
(http.NewRequestWithContext), and every channel select all observe the same
ctx. One cancel() call therefore unblocks the whole graph.
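The pieces described in this section can be put together in a small, network-free sketch. It assumes an in-memory link graph in place of real fetching and parsing, a mutex-guarded visited map, and a frontier buffer large enough that sends do not stall; a real crawler would need a strategy for a full frontier. The names crawl, item, and demo are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
	"time"
)

type item struct {
	URL   string
	Depth int
}

// crawl walks graph (a stand-in for fetch + parse) from seed and returns the
// visited items sorted by URL.
func crawl(ctx context.Context, graph map[string][]string, seed string, workers, maxDepth int) []item {
	frontier := make(chan item, 1024) // assumption: big enough that sends do not stall
	results := make(chan item)
	var pending sync.WaitGroup // items enqueued but not yet fully processed

	var mu sync.Mutex
	seen := map[string]bool{seed: true}

	enqueue := func(it item) {
		pending.Add(1) // children are counted while their parent is still pending
		select {
		case frontier <- it:
		case <-ctx.Done():
			pending.Done() // dropped: the crawl is shutting down
		}
	}

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for it := range frontier { // exits once the frontier is closed
				if ctx.Err() == nil {
					select {
					case results <- it:
					case <-ctx.Done():
					}
					if it.Depth < maxDepth {
						for _, link := range graph[it.URL] {
							mu.Lock()
							fresh := !seen[link]
							seen[link] = true
							mu.Unlock()
							if fresh {
								enqueue(item{URL: link, Depth: it.Depth + 1})
							}
						}
					}
				}
				pending.Done() // only after this item's children were counted
			}
		}()
	}

	enqueue(item{URL: seed, Depth: 0})
	go func() { pending.Wait(); close(frontier) }() // nothing outstanding: release the workers
	go func() { wg.Wait(); close(results) }()       // all workers gone: release the collector

	var out []item // touched only by this goroutine, so no lock is needed
	for it := range results {
		out = append(out, it)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].URL < out[j].URL })
	return out
}

func demo() []item {
	graph := map[string][]string{
		"a": {"b", "c"},
		"b": {"a", "d"}, // cycle back to the seed
		"c": {"d"},
		"d": {"e"}, // "e" would be depth 3 and is never enqueued
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return crawl(ctx, graph, "a", 4, 2)
}

func main() {
	fmt.Println(demo()) // prints [{a 0} {b 1} {c 1} {d 2}]
}
```

The ordering that matters is in the worker loop: a child is added to the counter before its parent is marked done, so the counter cannot touch zero while work may still be enqueued, and the frontier is never closed under a sender. One caveat of this design is that a URL reachable at several depths is recorded at the depth of whichever path reaches it first.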
Suggested Project Layout
Following golang-standards/project-layout:
04-concurrent-web-crawler/
├── cmd/
│   └── crawler/
│       └── main.go          # flag parsing, signal handling, wires it together
├── internal/
│   ├── crawler/
│   │   ├── crawler.go       # Crawler type, Run(ctx), pipeline orchestration
│   │   ├── worker.go        # worker goroutine loop (fetch -> parse -> enqueue)
│   │   └── frontier.go      # URL frontier channel + visited dedup set
│   ├── fetch/
│   │   ├── client.go        # rate-limited http.Client wrapper (Fetcher impl)
│   │   └── robots.go        # robots.txt fetch + matching
│   ├── parse/
│   │   └── links.go         # extract + resolve links via x/net/html (Parser)
│   └── config/
│       └── config.go        # Config struct, flag binding, validation/defaults
├── pkg/
│   └── urlset/
│       └── urlset.go        # reusable concurrent-safe visited set (optional)
├── testdata/
│   └── site/                # static HTML graph served by httptest in tests
├── go.mod
├── go.sum
├── Dockerfile
└── README.md
internal/ holds packages that should not be imported by other modules. Anything
genuinely reusable (e.g. a generic concurrent set) can graduate to pkg/.
Data Model / Database
This project is in-memory. There is no database in the core path; results may optionally be persisted to JSON or CSV at the end.
Visited set (deduplication). Two valid designs — pick one and justify it:
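// Option A: mutex-guarded map (simple, fine for this scale).
import "sync"

type VisitedSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// NewVisitedSet allocates the map; Add on a zero-value VisitedSet would
// panic because it writes to a nil map.
func NewVisitedSet() *VisitedSet {
	return &VisitedSet{seen: make(map[string]struct{})}
}

// Add records url and reports whether it was new (false if already seen).
func (v *VisitedSet) Add(url string) (added bool) {
	v.mu.Lock()
	defer v.mu.Unlock()
	if _, ok := v.seen[url]; ok {
		return false
	}
	v.seen[url] = struct{}{}
	return true
}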
Option B: a dedicated owner goroutine that owns the map and exposes add/check requests over a channel. No mutex; serialization is by goroutine ownership. Cleaner conceptually, slightly more plumbing. Discuss the trade-off in your README.
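For comparison with Option A, here is one possible shape for Option B. The type and function names (OwnedSet, addReq, firstAdds) are hypothetical, and the request/reply protocol is only one of several ways to express goroutine ownership.

```go
package main

import "fmt"

// addReq asks the owner goroutine to record url; the answer arrives on reply.
type addReq struct {
	url   string
	reply chan bool
}

// OwnedSet is a visited set whose map is touched by exactly one goroutine.
type OwnedSet struct {
	reqs chan addReq
	done chan struct{}
}

func NewOwnedSet() *OwnedSet {
	s := &OwnedSet{reqs: make(chan addReq), done: make(chan struct{})}
	go func() {
		defer close(s.done)
		seen := make(map[string]struct{}) // never escapes this goroutine
		for r := range s.reqs {
			_, dup := seen[r.url]
			seen[r.url] = struct{}{}
			r.reply <- !dup
		}
	}()
	return s
}

// Add reports whether url was new. It is safe to call from many goroutines.
func (s *OwnedSet) Add(url string) bool {
	reply := make(chan bool, 1) // buffered so the owner never blocks on a reply
	s.reqs <- addReq{url: url, reply: reply}
	return <-reply
}

// Close stops the owner goroutine and waits for it to exit.
// No Add call may be made during or after Close.
func (s *OwnedSet) Close() {
	close(s.reqs)
	<-s.done
}

// firstAdds has n goroutines race to add the same URL and counts the winners.
func firstAdds(n int) int {
	s := NewOwnedSet()
	defer s.Close()
	wins := make(chan bool, n)
	for i := 0; i < n; i++ {
		go func() { wins <- s.Add("https://example.com/") }()
	}
	count := 0
	for i := 0; i < n; i++ {
		if <-wins {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println("winners among 50 racing adds:", firstAdds(50))
}
```

The explicit Close is part of the cost of this design: the owner is one more goroutine whose exit has to be guaranteed, which is exactly the kind of lifetime goleak checks for.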
map[string]struct{} is used (not map[string]bool) because the value carries
no information: the empty struct occupies zero bytes.
Page / Result struct. What the collector accumulates:
type Result struct {
	URL        string   `json:"url"`
	StatusCode int      `json:"status_code"`
	Links      []string `json:"links"`
	Depth      int      `json:"depth"`
	Error      string   `json:"error,omitempty"`
}
Frontier queue semantics. The frontier is a buffered channel of work items,
each carrying a URL and its depth. An item is enqueued only after
VisitedSet.Add returns true, its depth is at most maxDepth, and its host
passes the same-domain check. A pending-work counter (sync.WaitGroup or an
atomic integer) is incremented on enqueue and decremented when an item is
fully processed, that is, after its own child links have been enqueued; when
it hits zero the frontier is closed and the pipeline winds down. Optionally
persist the final []Result to out.json / out.csv.
API Design
CLI surface
crawler [flags]
-url string Seed URL to crawl (repeatable). Required.
-depth int Maximum link depth from the seed (default 2).
-workers int Number of concurrent worker goroutines (default 8).
-rate float Max requests per second per host (default 5).
-timeout duration Overall crawl timeout, e.g. 30s, 2m (default 60s).
-out string Output file path; .json or .csv (default "-" = stdout).
-same-host Restrict crawl to the seed host (default true).
-ua string User-Agent string sent with each request.
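Example (illustrative, using only the flags listed above):
crawler -url https://example.com -depth 2 -workers 8 -rate 5 -timeout 60s -out sitemap.json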
Internal interfaces
Small interfaces make the pipeline testable (swap in fakes, no network):
// Fetcher retrieves a single URL, honoring ctx for cancellation/timeout.
type Fetcher interface {
	Fetch(ctx context.Context, url string) (status int, body io.ReadCloser, err error)
}

// Parser extracts absolute, resolved links from an HTML body.
type Parser interface {
	Links(base string, body io.Reader) ([]string, error)
}
The real Fetcher wraps *http.Client plus a per-host *rate.Limiter; the test
Fetcher serves from an in-memory map or an httptest.Server.
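As an illustration of the in-memory test double mentioned above, a map type can satisfy Fetcher directly. The names mapFetcher, statusOf, and failsWhenCancelled are hypothetical, and answering unknown URLs with 404 is an assumption of this sketch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Fetcher retrieves a single URL, honoring ctx for cancellation/timeout.
type Fetcher interface {
	Fetch(ctx context.Context, url string) (status int, body io.ReadCloser, err error)
}

// mapFetcher is a hypothetical test double: URL -> HTML body, no network.
type mapFetcher map[string]string

// Compile-time check that mapFetcher satisfies Fetcher.
var _ Fetcher = mapFetcher(nil)

func (m mapFetcher) Fetch(ctx context.Context, url string) (int, io.ReadCloser, error) {
	if err := ctx.Err(); err != nil {
		return 0, nil, err // behave like a real client on a cancelled context
	}
	page, ok := m[url]
	if !ok {
		return http.StatusNotFound, io.NopCloser(strings.NewReader("")), nil
	}
	return http.StatusOK, io.NopCloser(strings.NewReader(page)), nil
}

// statusOf fetches url with a background context and returns the status code,
// or 0 if the fetch failed.
func statusOf(f Fetcher, url string) int {
	status, body, err := f.Fetch(context.Background(), url)
	if err != nil {
		return 0
	}
	defer body.Close()
	return status
}

// failsWhenCancelled reports whether f refuses to fetch on a cancelled context.
func failsWhenCancelled(f Fetcher, url string) bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, _, err := f.Fetch(ctx, url)
	return errors.Is(err, context.Canceled)
}

func main() {
	site := mapFetcher{"https://example.com/": `<a href="/about">about</a>`}
	fmt.Println(statusOf(site, "https://example.com/"), statusOf(site, "https://example.com/missing"))
	fmt.Println("honors cancellation:", failsWhenCancelled(site, "https://example.com/"))
}
```

Because the pipeline only sees the interface, the same worker code runs against this fake in unit tests and against the rate-limited HTTP client in production.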
Optional HTTP endpoint
A thin server can trigger crawls on demand:
POST /crawl {"url":"https://example.com","depth":2} -> 202 Accepted + job id
GET /crawl/{id} -> status + sitemap JSON
Example output (JSON sitemap)
{
  "seed": "https://example.com",
  "started_at": "2026-06-26T10:00:00Z",
  "pages": [
    {
      "url": "https://example.com/",
      "status_code": 200,
      "depth": 0,
      "links": ["https://example.com/about", "https://example.com/blog"]
    },
    {
      "url": "https://example.com/about",
      "status_code": 200,
      "depth": 1,
      "links": ["https://example.com/"]
    },
    {
      "url": "https://example.com/blog",
      "status_code": 404,
      "depth": 1,
      "links": [],
      "error": "unexpected status 404"
    }
  ]
}
Tech Stack
- Go (1.22+).
- net/http: HTTP client with a tuned Transport (MaxIdleConnsPerHost, timeouts) and http.NewRequestWithContext.
- golang.org/x/net/html: streaming HTML tokenizer/parser for link extraction (no regex on HTML).
- golang.org/x/time/rate: token-bucket rate limiter, one per host.
- golang.org/x/sync/errgroup: bounded worker group with first-error propagation and context cancellation.
- context: cancellation, deadlines, and context.WithTimeout.
- os/signal + signal.NotifyContext: turn SIGINT/SIGTERM into a cancelled context for graceful shutdown.
- net/url: parsing, normalization, and relative-link resolution.
- Testing: net/http/httptest, the race detector, and go.uber.org/goleak.
Implementation Milestones
- Single-threaded crawl. One goroutine: fetch seed, parse links, recurse with depth limiting and a visited map. Get correctness first.
- Extract interfaces. Define Fetcher and Parser; move HTML parsing to internal/parse, fetching to internal/fetch.
- Add the worker pool. Introduce the frontier channel and N workers (fan-out), a results channel and collector (fan-in). Make termination work with a pending-work counter.
- Context cancellation. Thread ctx into fetch, rate-limit wait, and every channel select. Add an overall timeout.
- Rate limiting. Add a per-host rate.Limiter map (mutex-guarded) and call limiter.Wait(ctx) before each request.
- Graceful shutdown. Use signal.NotifyContext; on SIGINT, cancel, drain in-flight work, flush results, exit non-zero only on hard failure.
- robots.txt + same-host. Fetch and cache robots rules per host; enforce the same-host policy.
- Output formats. Sorted JSON and CSV writers; -out flag.
- Harden. go test -race, goleak in TestMain, golden-file tests.
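The graceful-shutdown milestone can be sketched with the standard library alone. Here run is a hypothetical stand-in for the crawler's Run(ctx), and the exit-code policy (a deliberate interrupt counts as success, a timeout as failure) is an assumption of this sketch, not something the requirements fix.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// run stands in for the crawler's Run(ctx): it "works" for the given duration
// unless ctx is cancelled first.
func run(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// exitCode maps the outcome to a process exit status. Treating a deliberate
// interrupt as success and everything else as failure is an assumption here.
func exitCode(err error) int {
	if err == nil || errors.Is(err, context.Canceled) {
		return 0
	}
	return 1
}

// runCancelled simulates SIGINT arriving mid-crawl by cancelling by hand.
func runCancelled() error {
	ctx, cancel := context.WithCancel(context.Background())
	time.AfterFunc(10*time.Millisecond, cancel)
	return run(ctx, time.Hour)
}

func realMain() int {
	// The first SIGINT/SIGTERM cancels ctx; stop restores default handling,
	// so a second signal terminates the process immediately.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second) // overall crawl timeout
	defer cancel()

	err := run(ctx, 100*time.Millisecond)
	fmt.Println("crawl finished, err =", err)
	return exitCode(err)
}

// main delegates to realMain so that deferred cleanup runs before os.Exit.
func main() { os.Exit(realMain()) }
```

Checking errors.Is(err, context.Canceled) is one way to tell "cancelled on purpose" apart from a real failure, which is the distinction the Lessons-Learned prompts ask about.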
Testing Strategy
- Fake site graph with httptest.NewServer. Serve a small, deterministic set of interlinked HTML pages (cycles, dead ends, 404s, off-domain links) from testdata/. The crawler hits server.URL, so no real network is touched.
- Race detector. Run the whole suite under go test -race ./... in CI. Any unsynchronized access to the visited set or shared state must fail the build.
- Dedup and depth tests. Assert that a page reachable by multiple paths is fetched exactly once, and that no page beyond maxDepth appears in the output.
- Deterministic worker-pool tests. With a fixed fake graph and sorted output, the sitemap is byte-stable; compare against a golden file. Vary -workers (1, 4, 16) and assert identical results to prove pool-size independence.
- Goroutine leak detection. Add go.uber.org/goleak via goleak.VerifyTestMain(m) in TestMain, and defer goleak.VerifyNone(t) in pipeline tests, to guarantee no worker/collector goroutine outlives the crawl.
- Context cancellation tests. Start a crawl against a slow/blocking handler, cancel the context (or hit the timeout), and assert Run returns promptly with context.Canceled/DeadlineExceeded and that all goroutines have exited.
- Rate-limit tests. Point all workers at one host with a low -rate and assert the observed request timestamps respect the configured QPS.
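A minimal sketch of the fake-site and cancellation ideas above, using net/http/httptest on the loopback interface. The routes, the helper names (newFakeSite, fetchStatus, checkSite), and the 50ms timeout are all assumptions chosen for illustration; a real suite would serve pages from testdata/ and drive the crawler rather than a single GET.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchStatus performs one GET bound to ctx and returns the status code.
func fetchStatus(ctx context.Context, client *http.Client, url string) (int, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// newFakeSite serves a tiny site: "/" and "/about" link to each other, any
// other path is a 404, and "/slow" blocks until the client gives up.
func newFakeSite() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, `<a href="/about">about</a>`)
	})
	mux.HandleFunc("/about", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<a href="/">home</a>`)
	})
	mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done(): // client cancelled or disconnected
		case <-time.After(5 * time.Second): // safety net so the handler always returns
		}
	})
	return httptest.NewServer(mux)
}

// checkSite returns the status of "/", the status of a missing page, and
// whether a request to "/slow" was cut off by its context deadline.
func checkSite() (root, missing int, timedOut bool) {
	srv := newFakeSite()
	defer srv.Close()

	ctx := context.Background()
	root, _ = fetchStatus(ctx, srv.Client(), srv.URL+"/")
	missing, _ = fetchStatus(ctx, srv.Client(), srv.URL+"/nope")

	slowCtx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()
	_, err := fetchStatus(slowCtx, srv.Client(), srv.URL+"/slow")
	return root, missing, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	root, missing, timedOut := checkSite()
	fmt.Println(root, missing, timedOut)
}
```

In a real test the same assertions would sit in a Test function, with goleak verifying afterwards that no goroutine survived the cancelled request.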
Deployment
- Single static binary. CGO_ENABLED=0 go build -o crawler ./cmd/crawler produces a self-contained binary with no runtime dependencies.
- Dockerfile (multi-stage, distroless/scratch):
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /crawler ./cmd/crawler
FROM gcr.io/distroless/static:nonroot
COPY --from=build /crawler /crawler
ENTRYPOINT ["/crawler"]
- CLI / cron job. Run on demand or schedule periodic sitemap regeneration via cron / a Kubernetes CronJob, writing output to a mounted volume or object store.
- Resource limits. Set container CPU/memory limits; the bounded worker pool and per-host rate limit keep the process well-behaved. Tune -workers to the CPU allocation and -rate to be a polite citizen of target sites.
Documentation Deliverables
- README.md — what it does, install/build, full flag reference, example invocations, and sample output.
- Concurrency model write-up — an explanation of the architecture diagram: how fan-out/fan-in works here, how the pipeline terminates (pending-work counter closing the frontier), how cancellation propagates, and why the visited set is safe. This is the centerpiece doc for an Intermediate concurrency project.
- godoc: doc comments on every exported type and function (Crawler, Fetcher, Parser, Config, Result), with a runnable Example for the crawler package.
Stretch Goals / Future Improvements
- Distributed crawling. Shard the frontier across multiple nodes; coordinate the visited set via a shared store.
- Persistent frontier. Back the queue and visited set with Redis or BoltDB so very large crawls survive restarts and exceed RAM.
- Polite crawling / backoff. Honor Crawl-delay, add exponential backoff and retry with jitter on 429/5xx, and adaptive per-host concurrency.
- JS rendering. Drive a headless browser (chromedp) for client-rendered pages.
- Priority queue. Replace the FIFO frontier with a priority frontier (by depth, score, or freshness) using container/heap.
- Resumable crawls. Checkpoint the frontier + visited set so an interrupted crawl can continue where it left off.
- Sitemap diffing. Compare runs to surface added/removed/changed URLs.
Lessons-Learned Prompts
- How did you bound parallelism? Why a fixed-size worker pool rather than go fetch(url) per link, and what would have gone wrong with the naive approach?
- How does your pipeline terminate? Walk through the moment the last item is processed: who closes the frontier channel, and how does that unblock every downstream goroutine without a deadlock or a "send on closed channel" panic?
- How did you prevent goroutine leaks? Which goroutine could have been left blocked on a channel send/receive forever, and how did context and channel closing eliminate that? How did goleak change your design?
- How did you ensure a clean shutdown on SIGINT? What happens to in-flight requests, and how do you distinguish "cancelled on purpose" from a real error?
- Where did the race detector catch you, and what was the underlying shared state? Did you fix it with a mutex or by changing ownership (a single goroutine owning the data)? Why?
- Why errgroup over a raw sync.WaitGroup here (or vice versa)? What did first-error-cancels-the-group buy you, and where did you still need a plain WaitGroup?
Portfolio & Resume
Resume Bullets
- Built a concurrent web crawler in Go using a bounded worker-pool (fan-out/fan-in) pipeline over channels, achieving controlled parallelism with zero goroutine leaks (verified via go.uber.org/goleak and the race detector in CI).
- Implemented context-based cancellation and graceful SIGINT shutdown plus per-host token-bucket rate limiting (golang.org/x/time/rate), so the crawler tears down cleanly on timeout/interrupt and never overloads a target.
- Designed deterministic, golden-file-testable output and a fake-site test harness (httptest), keeping the crawler correct and reproducible across worker-pool sizes.
Interview Talking Points
- Worker pool & fan-out/fan-in: why a fixed N of goroutines reading one frontier channel beats unbounded goroutine spawning, and how fan-in into a single collector lets the sitemap map stay lock-free by ownership.
- Pipeline termination: the pending-work counter that closes the frontier and cascades shutdown — the part people get wrong and deadlock on.
- Context propagation: one ctx flowing through rate.Limiter.Wait, the HTTP request, and every select, so a single cancel unwinds everything; signal.NotifyContext for shutdown.
- Race detector & goleak: how -race and goleak.VerifyTestMain turned "probably correct" concurrency into concurrency backed by repeatable evidence, and the bugs they caught.
- errgroup vs raw WaitGroup: first-error-cancellation and a tidy Wait() error from errgroup, versus the manual coordination of a bare WaitGroup, and when each is the right tool.