Concurrent Web Crawler

A bounded, cancellable, fan-out/fan-in web crawler in pure Go — the Month 3 (concurrency) capstone.

Overview

Given one or more seed URLs, the crawler fetches pages, extracts links, and follows them up to a configurable depth while staying on the seed host(s). It demonstrates the core Go concurrency patterns end-to-end with zero third-party dependencies — so it builds and is race-tested in the repo's root CI. See the full brief in SPEC.md.
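
The depth and scope rule described above can be sketched as a single predicate. This is a minimal illustration, not the project's actual code: the name shouldFollow, its parameters, and the choices that seeds sit at depth 0 and that the maximum depth is inclusive are all assumptions.

```go
package main

import (
	"net/url"
	"strings"
)

// shouldFollow reports whether a discovered link is worth enqueueing.
// Assumptions (not from the project): seeds are depth 0, maxDepth is
// inclusive, and seedHosts holds lower-cased hostnames without ports.
func shouldFollow(link string, depth, maxDepth int, seedHosts map[string]bool) bool {
	if depth > maxDepth {
		return false
	}
	u, err := url.Parse(link)
	if err != nil {
		return false
	}
	// Skip mailto:, javascript:, and other non-web schemes.
	if u.Scheme != "http" && u.Scheme != "https" {
		return false
	}
	// Hostname strips any port; lower-casing makes the comparison case-insensitive.
	return seedHosts[strings.ToLower(u.Hostname())]
}
```

A caller would build seedHosts once from the seed URLs and apply the predicate to every extracted link before it reaches the frontier.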

Highlights

Worker pool — a fixed set of goroutines bounds parallelism.
Fan-out / fan-in — URLs dispatched to workers; results merged on one channel.
context everywhere — a single context threads cancellation + timeout through every stage; SIGINT or a deadline tears the whole pipeline down cleanly (graceful shutdown).
Per-host rate limiting — a token-bucket limiter keeps the crawler polite.
Dedup — a concurrency-safe visited-set (pkg/urlset) prevents re-fetching.
robots.txt awareness and JSON/CSV reporting.
Race-clean — verified with go test -race, including a goroutine-leak check.
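
To make the per-host rate limiting highlight concrete, here is one way a stdlib-only token bucket could look. It is a sketch under stated assumptions: the type and method names (TokenBucket, HostLimiter, Allow, Wait) are hypothetical and may not match internal/ratelimit, and the rate is assumed to be strictly positive.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// TokenBucket holds up to burst tokens and refills at rate tokens per second.
// Hypothetical type; rate must be > 0.
type TokenBucket struct {
	mu     sync.Mutex
	rate   float64
	burst  float64
	tokens float64
	last   time.Time
}

func NewTokenBucket(rate float64, burst int) *TokenBucket {
	return &TokenBucket{rate: rate, burst: float64(burst), tokens: float64(burst), last: time.Now()}
}

// take refills the bucket up to now, then either consumes one token (ok=true)
// or reports how long until the next token is due (ok=false).
func (b *TokenBucket) take(now time.Time) (wait time.Duration, ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if elapsed := now.Sub(b.last); elapsed > 0 {
		b.tokens += elapsed.Seconds() * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens >= 1 {
		b.tokens--
		return 0, true
	}
	return time.Duration((1 - b.tokens) / b.rate * float64(time.Second)), false
}

// Wait blocks until a token is consumed or ctx is done.
func (b *TokenBucket) Wait(ctx context.Context) error {
	for {
		wait, ok := b.take(time.Now())
		if ok {
			return nil
		}
		timer := time.NewTimer(wait)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
	}
}

// HostLimiter keeps one bucket per host so a slow host never throttles another.
type HostLimiter struct {
	mu      sync.Mutex
	rate    float64
	burst   int
	buckets map[string]*TokenBucket
}

func NewHostLimiter(rate float64, burst int) *HostLimiter {
	return &HostLimiter{rate: rate, burst: burst, buckets: make(map[string]*TokenBucket)}
}

// bucket returns the bucket for host, creating it on first use.
func (h *HostLimiter) bucket(host string) *TokenBucket {
	h.mu.Lock()
	defer h.mu.Unlock()
	b, ok := h.buckets[host]
	if !ok {
		b = NewTokenBucket(h.rate, h.burst)
		h.buckets[host] = b
	}
	return b
}

// Allow is the non-blocking variant: it consumes a token only if one is ready.
func (h *HostLimiter) Allow(host string) bool {
	_, ok := h.bucket(host).take(time.Now())
	return ok
}

// Wait blocks until host has a token available or ctx is done.
func (h *HostLimiter) Wait(ctx context.Context, host string) error {
	return h.bucket(host).Wait(ctx)
}
```

A worker would call Wait(ctx, host) just before each request, so cancelling the context also releases any worker that is parked waiting for a token.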

Architecture

flowchart LR
    seeds[Seeds] --> Q{{frontier chan}}
    Q --> W1[worker] & W2[worker] & W3[worker]
    W1 & W2 & W3 --> F[fetch + rate limit]
    F --> P[parse links]
    P --> V[visited set]
    V -->|new, in-scope, depth ok| Q
    W1 & W2 & W3 --> R{{results chan}}
    R --> O[(JSON/CSV report)]
    ctx[(context: cancel + timeout)] -.-> W1 & W2 & W3 & F
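
The worker and results portion of the diagram can be sketched as a small fan-out/fan-in pool. This is an illustration only: runPool, fetchAll, Result, FetchFunc, and fakeFetch are hypothetical names, the fetch stage is stubbed, and the feedback edge from the visited set into the frontier is deliberately left out to keep the sketch short.

```go
package main

import (
	"context"
	"sync"
)

// Result is a hypothetical per-URL outcome.
type Result struct {
	URL string
	Err error
}

// FetchFunc stands in for the real fetch + parse stage.
type FetchFunc func(ctx context.Context, url string) error

// runPool fans jobs out to n workers and fans their results back in on one
// channel. The channel is closed once every worker has exited.
func runPool(ctx context.Context, n int, jobs <-chan string, fetch FetchFunc) <-chan Result {
	results := make(chan Result)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				case u, ok := <-jobs:
					if !ok {
						return // frontier closed: no more work
					}
					r := Result{URL: u, Err: fetch(ctx, u)}
					// Never block forever on send: cancellation must win.
					select {
					case results <- r:
					case <-ctx.Done():
						return
					}
				}
			}
		}()
	}
	go func() {
		wg.Wait()
		close(results)
	}()
	return results
}

// fakeFetch is a stub that only honors cancellation.
func fakeFetch(ctx context.Context, url string) error { return ctx.Err() }

// fetchAll is a usage example: it feeds urls into the pool and drains results.
func fetchAll(urls []string, workers int, fetch FetchFunc) []Result {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	jobs := make(chan string)
	go func() {
		defer close(jobs)
		for _, u := range urls {
			select {
			case jobs <- u:
			case <-ctx.Done():
				return
			}
		}
	}()
	var out []Result
	for r := range runPool(ctx, workers, jobs, fetch) {
		out = append(out, r)
	}
	return out
}
```

Every send and receive sits in a select with ctx.Done(), which is what gives each goroutine a guaranteed exit path when the context is cancelled.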

Run

# from the repo root
go run ./projects/04-concurrent-web-crawler/cmd/crawler \
    -url=https://example.com -depth=3 -workers=16 -rate=10
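
For orientation, the flags in the command above could be parsed along these lines. Only the flag names come from the command; the Config struct, the default values, the validation rules, and the reading of -rate as requests per second per host are assumptions made for this sketch.

```go
package main

import (
	"errors"
	"flag"
	"io"
)

// Config is a hypothetical holder for the command-line options.
type Config struct {
	URL     string
	Depth   int
	Workers int
	Rate    float64
}

// parseFlags parses args (without the program name) into a Config.
// Defaults here are illustrative, not the project's real defaults.
func parseFlags(args []string) (Config, error) {
	var cfg Config
	fs := flag.NewFlagSet("crawler", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet; a real CLI would print usage
	fs.StringVar(&cfg.URL, "url", "", "seed URL to start crawling from")
	fs.IntVar(&cfg.Depth, "depth", 2, "maximum link depth to follow")
	fs.IntVar(&cfg.Workers, "workers", 8, "number of worker goroutines")
	fs.Float64Var(&cfg.Rate, "rate", 5, "requests per second per host (assumed meaning)")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	if cfg.URL == "" {
		return Config{}, errors.New("missing -url")
	}
	if cfg.Depth < 0 || cfg.Workers < 1 || cfg.Rate <= 0 {
		return Config{}, errors.New("depth must be >= 0, workers >= 1, rate > 0")
	}
	return cfg, nil
}
```

Using a dedicated FlagSet rather than the global flag package keeps parsing testable, since a test can pass any argument slice.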

Test

go test -race ./projects/04-concurrent-web-crawler/...

Project Layout

cmd/crawler/        entrypoint + flag parsing + signal handling
internal/crawler/   orchestration: worker pool, fan-in/out, depth/scope
internal/fetch/     http.Client wrapper + per-host rate limiter
internal/parse/     HTML link extraction
internal/ratelimit/ token-bucket limiter (stdlib only)
internal/robots/    robots.txt parsing
internal/output/    JSON / CSV report writers
internal/config/    configuration + defaults
pkg/urlset/         concurrency-safe visited set
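
As an illustration of what a concurrency-safe visited set such as pkg/urlset might provide, the sketch below uses a mutex-guarded map. The names Set, NewSet, Add, and Len are assumptions, as is the choice of a mutex over sync.Map.

```go
package main

import "sync"

// Set is a hypothetical concurrency-safe set of URL strings.
type Set struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewSet() *Set {
	return &Set{seen: make(map[string]struct{})}
}

// Add records u and reports whether it was new. The check and the insert
// happen under one lock, so exactly one caller wins for any given URL.
func (s *Set) Add(u string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[u]; ok {
		return false
	}
	s.seen[u] = struct{}{}
	return true
}

// Len returns the number of distinct URLs recorded so far.
func (s *Set) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.seen)
}

// raceToAdd starts n goroutines that all try to add the same URL and
// returns how many of them saw it as new.
func raceToAdd(s *Set, u string, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if s.Add(u) {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return wins
}
```

Combining the membership check and the insert in one Add call matters: separate Contains and Insert methods would let two workers both decide that the same URL is new.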

Testing Strategy

Table-driven unit tests per package; an httptest-backed integration test that asserts depth, scope, dedup, and 404 handling; and an explicit goroutine-leak assertion. The whole suite runs under go test -race.
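
One stdlib-only way to write a goroutine-leak assertion is to compare runtime.NumGoroutine before and after the code under test, polling briefly so that goroutines still winding down are not counted. The project's actual check may work differently; the helper names and the settle timeout below are assumptions.

```go
package main

import (
	"context"
	"runtime"
	"time"
)

// settleTimeout is how long exiting goroutines get to finish (assumed value).
const settleTimeout = 500 * time.Millisecond

// leakedGoroutines runs fn and returns how many extra goroutines are still
// alive once the count has settled or settleTimeout has passed.
func leakedGoroutines(fn func()) int {
	before := runtime.NumGoroutine()
	fn()
	deadline := time.Now().Add(settleTimeout)
	for {
		extra := runtime.NumGoroutine() - before
		if extra <= 0 {
			return 0
		}
		if time.Now().After(deadline) {
			return extra
		}
		time.Sleep(10 * time.Millisecond)
	}
}

// cleanWorker starts a goroutine and guarantees its exit via cancellation.
func cleanWorker() {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-ctx.Done()
	}()
	cancel()
	<-done
}

// leakyWorker starts a goroutine that blocks forever on a channel nobody
// closes. It exists only to show what the check catches.
func leakyWorker() {
	block := make(chan struct{})
	go func() { <-block }()
}
```

Inside a real test the return value would feed t.Fatalf, for example failing when leakedGoroutines reports anything above zero after the crawler's context has been cancelled.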

Lessons Learned

Every goroutine needs a guaranteed exit path — the leak test exists to prove it.
Idle keep-alive connections hold goroutines; closing them is part of clean shutdown.
A single context is the simplest cancellation backbone for a multi-stage pipeline.
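
The shutdown lessons above can be sketched with the standard library alone: signal.NotifyContext turns SIGINT into cancellation, context.WithTimeout adds the deadline, and http.Client.CloseIdleConnections releases keep-alive connections along with the goroutines that serve them. The function names and the way the pieces are wired together are assumptions, not the project's code.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"time"
)

// newRootContext returns a context that is cancelled on SIGINT or after
// timeout, whichever comes first. The returned func releases both.
func newRootContext(timeout time.Duration) (context.Context, context.CancelFunc) {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	ctx, cancel := context.WithTimeout(ctx, timeout)
	return ctx, func() {
		cancel()
		stop()
	}
}

// runWithShutdown runs stage under the root context and always closes idle
// keep-alive connections afterwards.
func runWithShutdown(timeout time.Duration, client *http.Client, stage func(context.Context) error) error {
	ctx, cancel := newRootContext(timeout)
	defer cancel()
	defer client.CloseIdleConnections()
	return stage(ctx)
}

// waitForCancel is a stand-in stage that blocks until the context ends.
func waitForCancel(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// exampleDeadline shows the deadline path: the stage is torn down after
// 50ms and the context's error text is returned.
func exampleDeadline() string {
	err := runWithShutdown(50*time.Millisecond, &http.Client{}, waitForCancel)
	return err.Error()
}
```

Because every stage receives the same ctx, one cancellation source is enough; callers can tell a deadline from an interrupt by comparing the returned error with context.DeadlineExceeded and context.Canceled.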

Future Improvements

Persistent frontier (resume across runs); per-host politeness that honors the Crawl-delay directive in robots.txt.
Pluggable storage sink; sitemap output; metrics.

🎒 Portfolio

Résumé bullets:

"Built a bounded concurrent web crawler in Go (worker pool + fan-in/out + context cancellation + per-host rate limiting) that is verified goroutine-leak-free and race-clean (go test -race)."

Interview talking points: worker-pool vs unbounded goroutines; how one context cancels a whole pipeline; detecting/avoiding goroutine leaks; backpressure via channel capacity.
