Concurrent Web Crawler
A bounded, cancellable, fan-out/fan-in web crawler in pure Go — the Month 3 (concurrency) capstone.
Overview
Given one or more seed URLs, the crawler fetches pages, extracts links, and follows them up to a configurable depth while staying on the seed host(s). It demonstrates the core Go concurrency patterns end-to-end with zero third-party dependencies — so it builds and is race-tested in the repo's root CI. See the full brief in SPEC.md.
Highlights
- Worker pool: a fixed set of goroutines bounds parallelism.
- Fan-out / fan-in: URLs dispatched to workers; results merged on one channel.
- context everywhere: a single context threads cancellation + timeout through every stage; SIGINT or a deadline tears the whole pipeline down cleanly (graceful shutdown).
- Per-host rate limiting: a token-bucket limiter keeps the crawler polite.
- Dedup: a concurrency-safe visited set (pkg/urlset) prevents re-fetching.
- robots.txt awareness and JSON/CSV reporting.
- Race-clean: verified with go test -race, including a goroutine-leak check.
Architecture
flowchart LR
seeds[Seeds] --> Q{{frontier chan}}
Q --> W1[worker] & W2[worker] & W3[worker]
W1 & W2 & W3 --> F[fetch + rate limit]
F --> P[parse links]
P --> V[visited set]
V -->|new, in-scope, depth ok| Q
W1 & W2 & W3 --> R{{results chan}}
R --> O[(JSON/CSV report)]
ctx[(context: cancel + timeout)] -.-> W1 & W2 & W3 & F
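The "new, in-scope, depth ok" edge in the diagram is the gate that decides whether a parsed link goes back onto the frontier. A minimal sketch of such a gate is shown below; visitedSet is a hypothetical stand-in for pkg/urlset, and shouldEnqueue, seedHosts, and the fragment-stripping rule are assumptions made for the example rather than the project's actual behavior.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// visitedSet is a hypothetical stand-in for pkg/urlset: a mutex-guarded set.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newVisitedSet() *visitedSet {
	return &visitedSet{seen: make(map[string]struct{})}
}

// Add reports whether u was newly added (true) or already present (false).
// Check and insert happen under one lock, so two workers cannot both claim a URL.
func (s *visitedSet) Add(u string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[u]; ok {
		return false
	}
	s.seen[u] = struct{}{}
	return true
}

// shouldEnqueue applies the three checks in order: depth ok, in scope
// (host matches a seed host), and new (not yet in the visited set).
func shouldEnqueue(visited *visitedSet, seedHosts map[string]bool, link string, depth, maxDepth int) bool {
	if depth > maxDepth {
		return false
	}
	u, err := url.Parse(link)
	if err != nil || !seedHosts[u.Hostname()] {
		return false
	}
	u.Fragment = "" // assumption: /page#a and /page#b count as the same page
	return visited.Add(u.String())
}

func main() {
	visited := newVisitedSet()
	hosts := map[string]bool{"example.com": true}

	for _, link := range []string{
		"https://example.com/a",
		"https://example.com/a#section",
		"https://other.test/",
	} {
		fmt.Println(link, shouldEnqueue(visited, hosts, link, 1, 3))
	}
}
```

Doing the visited check last means out-of-scope or too-deep links never occupy space in the set.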
Run
# from the repo root
go run ./projects/04-concurrent-web-crawler/cmd/crawler \
-url=https://example.com -depth=3 -workers=16 -rate=10
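As a rough sketch of how an entrypoint might turn those flags into a config and a cancellable context: the flag names -url, -depth, -workers, and -rate come from the command above, while the config struct, the defaults, the float type for -rate, and the -timeout flag are assumptions for illustration only.

```go
package main

import (
	"context"
	"errors"
	"flag"
	"fmt"
	"os"
	"os/signal"
	"time"
)

// config is a hypothetical container for the parsed flags.
type config struct {
	URL     string
	Depth   int
	Workers int
	Rate    float64
	Timeout time.Duration
}

// parseFlags parses args into a config. All defaults are assumed values.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("crawler", flag.ContinueOnError)
	fs.StringVar(&c.URL, "url", "https://example.com", "seed URL")
	fs.IntVar(&c.Depth, "depth", 2, "maximum link depth")
	fs.IntVar(&c.Workers, "workers", 8, "number of worker goroutines")
	fs.Float64Var(&c.Rate, "rate", 5, "requests per second per host")
	// -timeout is a hypothetical flag, not taken from the command above.
	fs.DurationVar(&c.Timeout, "timeout", 30*time.Second, "overall deadline")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.Workers < 1 {
		return c, errors.New("-workers must be at least 1")
	}
	return c, nil
}

func main() {
	cfg, err := parseFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}

	// SIGINT cancels ctx; the timeout layers a deadline on the same context.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()
	ctx, cancel := context.WithTimeout(ctx, cfg.Timeout)
	defer cancel()

	// A real entrypoint would hand ctx and cfg to the crawler here.
	deadline, _ := ctx.Deadline()
	fmt.Printf("seed=%s depth=%d workers=%d rate=%.0f/s deadline=%s\n",
		cfg.URL, cfg.Depth, cfg.Workers, cfg.Rate, deadline.Format(time.RFC3339))
}
```

Parsing into a dedicated FlagSet instead of the global one keeps parseFlags testable with an explicit argument slice.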
Test
Project Layout
cmd/crawler/ entrypoint + flag parsing + signal handling
internal/crawler/ orchestration: worker pool, fan-in/out, depth/scope
internal/fetch/ http.Client wrapper + per-host rate limiter
internal/parse/ HTML link extraction
internal/ratelimit/ token-bucket limiter (stdlib only)
internal/robots/ robots.txt parsing
internal/output/ JSON / CSV report writers
internal/config/ configuration + defaults
pkg/urlset/ concurrency-safe visited set
Testing Strategy
Table-driven unit tests per package, an httptest-backed integration test that asserts depth/scope/dedup and 404 handling, and an explicit goroutine-leak assertion (go test -race).
Lessons Learned
- Every goroutine needs a guaranteed exit path — the leak test exists to prove it.
- Idle keep-alive connections hold goroutines; closing them is part of clean shutdown.
- A single context is the simplest cancellation backbone for a multi-stage pipeline.
Future Improvements
- Persistent frontier (resume across runs); per-host politeness that honors the Crawl-delay directive from robots.txt.
- Pluggable storage sink; sitemap output; metrics.
🎒 Portfolio
Résumé bullets:
- "Built a bounded concurrent web crawler in Go (worker pool + fan-in/out + context cancellation + per-host rate limiting) that is verified goroutine-leak-free and race-clean (go test -race)."
Interview talking points: worker-pool vs unbounded goroutines; how one context cancels a whole pipeline; detecting/avoiding goroutine leaks; backpressure via channel capacity.