Concurrent Web Crawler

A bounded, cancellable, fan-out/fan-in web crawler in pure Go — the Month 3 (concurrency) capstone.

Overview

Given one or more seed URLs, the crawler fetches pages, extracts links, and follows them up to a configurable depth while staying on the seed host(s). It demonstrates the core Go concurrency patterns end to end with zero third-party dependencies, so it builds and runs under the race detector in the repo's root CI. See the full brief in SPEC.md.

Highlights

  • Worker pool — a fixed set of goroutines bounds parallelism.
  • Fan-out / fan-in — URLs dispatched to workers; results merged on one channel.
  • context everywhere — a single context threads cancellation + timeout through every stage; SIGINT or a deadline tears the whole pipeline down cleanly (graceful shutdown).
  • Per-host rate limiting — a token-bucket limiter keeps the crawler polite.
  • Dedup — a concurrency-safe visited-set (pkg/urlset) prevents re-fetching.
  • robots.txt awareness and JSON/CSV reporting.
  • Race-clean — verified with go test -race, including a goroutine-leak check.

Architecture

flowchart LR
    seeds[Seeds] --> Q{{frontier chan}}
    Q --> W1[worker] & W2[worker] & W3[worker]
    W1 & W2 & W3 --> F[fetch + rate limit]
    F --> P[parse links]
    P --> V[visited set]
    V -->|new, in-scope, depth ok| Q
    W1 & W2 & W3 --> R{{results chan}}
    R --> O[(JSON/CSV report)]
    ctx[(context: cancel + timeout)] -.-> W1 & W2 & W3 & F

Run

# from the repo root
go run ./projects/04-concurrent-web-crawler/cmd/crawler \
    -url=https://example.com -depth=3 -workers=16 -rate=10

Test

go test -race ./projects/04-concurrent-web-crawler/...

Project Layout

cmd/crawler/        entrypoint + flag parsing + signal handling
internal/crawler/   orchestration: worker pool, fan-in/out, depth/scope
internal/fetch/     http.Client wrapper + per-host rate limiter
internal/parse/     HTML link extraction
internal/ratelimit/ token-bucket limiter (stdlib only)
internal/robots/    robots.txt parsing
internal/output/    JSON / CSV report writers
internal/config/    configuration + defaults
pkg/urlset/         concurrency-safe visited set
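
As an illustration of what a concurrency-safe visited set such as pkg/urlset might look like, here is a minimal mutex-guarded sketch. The `Set`, `New`, `Add`, and `Len` names are assumptions made for this example; the package's real API is not shown in this README.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Set is a hypothetical visited set guarded by a mutex.
type Set struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func New() *Set { return &Set{seen: make(map[string]struct{})} }

// Add records url and reports whether it was new. The check and the insert
// happen under one lock, so two workers can never both claim the same URL.
func (s *Set) Add(url string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[url]; ok {
		return false
	}
	s.seen[url] = struct{}{}
	return true
}

// Len returns the number of distinct URLs recorded so far.
func (s *Set) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.seen)
}

// countNew has workers goroutines race to add the same urls and returns
// how many Add calls reported a new URL.
func countNew(s *Set, workers int, urls []string) int64 {
	var fresh int64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for _, u := range urls {
				if s.Add(u) {
					atomic.AddInt64(&fresh, 1)
				}
			}
		}()
	}
	wg.Wait()
	return fresh
}

func main() {
	urls := []string{"https://example.com/", "https://example.com/a", "https://example.com/b"}
	s := New()
	fmt.Println(countNew(s, 8, urls), s.Len()) // prints "3 3"
}
```

Returning a boolean from `Add` matters: a separate "contains, then insert" pair would leave a window in which two workers both see a URL as unvisited and fetch it twice.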

Testing Strategy

Table-driven unit tests per package, an httptest-backed integration test that asserts depth/scope/dedup and 404 handling, and an explicit goroutine-leak assertion. The whole suite runs under go test -race.

Lessons Learned

  • Every goroutine needs a guaranteed exit path — the leak test exists to prove it.
  • Idle keep-alive connections hold goroutines; closing them is part of clean shutdown.
  • A single context is the simplest cancellation backbone for a multi-stage pipeline.

Future Improvements

  • Persistent frontier (resume across runs); per-host politeness that honors Crawl-delay from robots.txt.
  • Pluggable storage sink; sitemap output; metrics.

🎒 Portfolio

Résumé bullets:

  • "Built a bounded concurrent web crawler in Go (worker pool + fan-in/out + context cancellation + per-host rate limiting) that is verified goroutine-leak-free and race-clean (go test -race)."

Interview talking points: worker-pool vs unbounded goroutines; how one context cancels a whole pipeline; detecting/avoiding goroutine leaks; backpressure via channel capacity.

