04 — Concurrent Web Crawler

Overview

A concurrent web crawler that, given one or more seed URLs, fetches pages, extracts links, and recursively follows them up to a configurable depth while staying within a single domain (or an allow-list of hosts). The crawler emits a sitemap of discovered URLs and the links found on each page.

The point of this project is not to build a Googlebot competitor — it is to build a textbook example of a bounded, leak-free, cancellable concurrency pipeline in idiomatic Go. You will use a fixed-size worker pool, channels for the URL frontier and results, context.Context for cancellation and timeouts, a per-host rate limiter, and proper graceful shutdown on SIGINT. Every goroutine you start must have a clearly defined exit condition so that go test -race and go.uber.org/goleak both come back clean.

Level: Intermediate. Expect to spend most of your time thinking about lifetimes — of goroutines, of channels, and of the crawl itself — rather than about HTML parsing.

Learning Objectives

By the end of this project you should be able to:

  • Design a fan-out / fan-in pipeline using channels and a bounded worker pool.
  • Choose a sensible degree of parallelism and bound it deliberately instead of spawning one goroutine per URL.
  • Propagate cancellation and deadlines with context.Context so that a single cancel signal tears down the entire pipeline.
  • Implement graceful shutdown on os.Interrupt / SIGINT without dropping in-flight work or leaking goroutines.
  • Use golang.org/x/sync/errgroup and understand when it is preferable to a raw sync.WaitGroup.
  • Apply client-side rate limiting per host with golang.org/x/time/rate.
  • Deduplicate work safely under concurrency (mutex-guarded map vs. owner goroutine) and reason about the trade-offs.
  • Detect and eliminate goroutine leaks and data races with tooling.
  • Parse HTML robustly with golang.org/x/net/html and resolve relative links.
  • Respect robots.txt and a same-domain policy as a matter of crawler etiquette.

Requirements

Functional

  • Seed URL(s): Accept one or more seed URLs from a flag (repeatable -url) or a seeds file. Validate and normalize them before crawling.
  • Follow links to max depth: Parse each fetched page, extract <a href> links, resolve them against the page's base URL, and enqueue them with depth+1. Stop descending once depth > maxDepth.
  • Same-domain restriction: By default, only follow links whose host matches the seed host (configurable to an allow-list of hosts, or "any host"). Reject off-domain, non-HTTP(S), and mailto:/javascript: links.
  • Deduplication: Never fetch the same normalized URL twice within a run.
  • Output sitemap / links: Produce a machine-readable report mapping each crawled URL to its status code, discovered links, depth, and any error.
  • robots.txt: Fetch and honor the host's /robots.txt (Disallow rules and Crawl-delay) for the configured user-agent before crawling a path.
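The normalization and same-domain rules above can be sketched with net/url alone. This is a minimal illustration, not a complete canonicalizer: the names normalize and allowed are hypothetical, and the chosen rules (lowercase host, drop the fragment, treat an empty path as "/") are assumptions about what "normalized" should mean for this crawler.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalize resolves href against base and returns a canonical absolute URL.
// ok is false for unparsable input and for non-HTTP(S) schemes such as
// mailto: or javascript:.
func normalize(base, href string) (canonical string, ok bool) {
	b, err := url.Parse(base)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", false
	}
	u := b.ResolveReference(ref)
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", false
	}
	u.Host = strings.ToLower(u.Host)
	u.Fragment, u.RawFragment = "", "" // fragments never reach the server
	if u.Path == "" {
		u.Path = "/"
	}
	return u.String(), true
}

// allowed reports whether rawURL's host is in the allow-list.
func allowed(rawURL string, hosts map[string]bool) bool {
	u, err := url.Parse(rawURL)
	return err == nil && hosts[u.Host]
}

func main() {
	base := "https://Example.com/blog/post"
	hosts := map[string]bool{"example.com": true}
	for _, href := range []string{"../about#team", "/x?y=1", "mailto:a@b.c", "https://other.org/"} {
		if u, ok := normalize(base, href); ok {
			fmt.Println(u, "allowed:", allowed(u, hosts))
		} else {
			fmt.Println(href, "rejected")
		}
	}
}
```

Running the normalized string through the visited set, rather than the raw href, is what makes "/about" and "../about#team" count as the same page.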

Non-Functional

  • Configurable concurrency: Number of worker goroutines is set by -workers (default e.g. 8). The pool size is fixed for the lifetime of the crawl.
  • Rate limit per host: Cap requests-per-second per host via -rate, enforced with a rate.Limiter so a single target is never hammered.
  • Bounded memory: The frontier and visited set must not grow without bound in pathological cases. Channel buffers are fixed-size; the dedup set is the only structure that grows with the crawl, and its size is capped by the finite number of same-domain URLs reachable within the depth limit.
  • No goroutine leaks: Every goroutine exits when the crawl finishes or is cancelled. Verified with goleak in tests.
  • Graceful shutdown: A first SIGINT cancels the context, lets in-flight fetches finish (or time out), flushes collected results, and exits cleanly.
  • Deterministic-ish output: Because crawl order is concurrent, the final report is sorted (by URL) so that runs over a static site graph produce byte-stable output suitable for golden-file tests.
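One way to arrange the per-host rate limit is a mutex-guarded map that lazily creates one limiter per host. The sketch below is an assumption-laden illustration: Waiter, hostLimiters, and tickLimiter are hypothetical names, and tickLimiter is a standard-library stand-in used only to keep the example free of external modules. In the real crawler the factory would presumably return rate.NewLimiter(rate.Limit(rps), 1) from golang.org/x/time/rate, whose Wait method has the same shape as the Waiter interface here.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Waiter is the single method the crawler needs from a limiter.
type Waiter interface {
	Wait(ctx context.Context) error
}

// hostLimiters hands out one limiter per host, creating it on first use.
type hostLimiters struct {
	mu       sync.Mutex
	byHost   map[string]Waiter
	newLimit func() Waiter
}

func newHostLimiters(factory func() Waiter) *hostLimiters {
	return &hostLimiters{byHost: make(map[string]Waiter), newLimit: factory}
}

func (h *hostLimiters) get(host string) Waiter {
	h.mu.Lock()
	defer h.mu.Unlock()
	l, ok := h.byHost[host]
	if !ok {
		l = h.newLimit()
		h.byHost[host] = l
	}
	return l
}

// tickLimiter is a stand-in limiter: one request per tick, no bursts.
// The ticker is never stopped, which is acceptable only in a short demo.
type tickLimiter struct{ tick *time.Ticker }

func (t *tickLimiter) Wait(ctx context.Context) error {
	select {
	case <-t.tick.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	limiters := newHostLimiters(func() Waiter {
		return &tickLimiter{tick: time.NewTicker(200 * time.Millisecond)} // about 5 req/s
	})
	start := time.Now()
	for i := 1; i <= 3; i++ {
		if err := limiters.get("example.com").Wait(context.Background()); err != nil {
			fmt.Println("cancelled:", err)
			return
		}
		fmt.Printf("request %d after %v\n", i, time.Since(start).Round(100*time.Millisecond))
	}
}
```

Because every worker asks the same registry, N workers hitting one host share a single budget, while requests to different hosts do not slow each other down.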

Architecture

The crawler is a classic fan-out / fan-in pipeline. A single frontier channel feeds N identical workers (fan-out); every worker writes to one shared results channel that a single collector drains (fan-in). A context.Context threads through every stage; cancelling it (timeout, error, or SIGINT) unblocks every select and drains the pipeline.

A subtle point: the number of outstanding URLs (enqueued but not yet fully processed) is tracked with a counter (a sync.WaitGroup or an atomic integer). When it reaches zero, the producer side closes the frontier, which lets workers return, which lets the collector finish. This is what makes the crawl terminate instead of leaving workers blocked forever on a frontier that never closes.

flowchart TD
    SEED[Seed URLs] --> FRONTIER

    subgraph PIPE[Crawl Pipeline]
        FRONTIER([URL frontier channel]) -->|fan-out| W1[Worker 1]
        FRONTIER -->|fan-out| W2[Worker 2]
        FRONTIER -->|fan-out| WN[Worker N]

        W1 --> FETCH
        W2 --> FETCH
        WN --> FETCH

        FETCH[Fetcher: rate-limited http.Client] --> PARSE[Parser: extract links]
        PARSE --> DEDUP{Visited set?}
        DEDUP -->|new| FRONTIER
        DEDUP -->|seen / off-domain / too deep| DROP[Discard]

        W1 -->|fan-in| RESULTS
        W2 -->|fan-in| RESULTS
        WN -->|fan-in| RESULTS
    end

    RESULTS([Results channel]) --> COLLECTOR[Collector: build sitemap]
    COLLECTOR --> OUT[(JSON / CSV sitemap)]

    CTX[[context.Context cancel / timeout / SIGINT]] -.cancels.-> W1
    CTX -.cancels.-> W2
    CTX -.cancels.-> WN
    CTX -.cancels.-> FETCH
    CTX -.cancels.-> COLLECTOR

Fan-out: one frontier channel is read by N worker goroutines. Each item goes to whichever worker is ready to receive, so faster workers naturally pick up more work. At most N fetches are in flight at any time, no matter how many links a page has.

Fan-in: all workers send Result values into one results channel. A single collector goroutine owns the sitemap map, so the map needs no lock — it is only ever touched by its owner (the "share memory by communicating" idiom).

Cancellation: the rate limiter's Wait(ctx), the HTTP request (http.NewRequestWithContext), and every channel select all observe the same ctx. One cancel() call therefore unblocks the whole graph.
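The fan-out, fan-in, termination, and cancellation behaviour described above can be seen together in a small runnable sketch. It makes several simplifying assumptions that are not part of the specification: an in-memory graph stands in for fetch and parse, and a single coordinator goroutine plays both frontier owner and collector, so the visited set and the pending counter need no lock. All names (crawl, crawlSample, item, result) are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
	"time"
)

type item struct {
	url   string
	depth int
}
type result struct {
	item  item
	links []string
}

// crawl walks graph (a stand-in for fetch + parse) from seed using a fixed
// number of workers and returns the crawled URLs in sorted order.
func crawl(ctx context.Context, graph map[string][]string, seed string, maxDepth, workers int) []string {
	frontier := make(chan item)  // coordinator -> workers (fan-out)
	results := make(chan result) // workers -> coordinator (fan-in)

	var wg sync.WaitGroup
	defer wg.Wait()       // runs second: wait until every worker has returned
	defer close(frontier) // runs first: ends each worker's range loop

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for it := range frontier {
				select {
				case results <- result{it, graph[it.url]}:
				case <-ctx.Done():
					return
				}
			}
		}()
	}

	visited := map[string]bool{seed: true}
	queue := []item{{seed, 0}}
	pending := 1 // items queued or in flight
	var crawled []string
	for pending > 0 {
		var send chan item // stays nil while the queue is empty, disabling the send case
		var next item
		if len(queue) > 0 {
			send, next = frontier, queue[0]
		}
		select {
		case send <- next:
			queue = queue[1:]
		case r := <-results:
			pending--
			crawled = append(crawled, r.item.url)
			for _, link := range r.links {
				if r.item.depth < maxDepth && !visited[link] {
					visited[link] = true
					pending++
					queue = append(queue, item{link, r.item.depth + 1})
				}
			}
		case <-ctx.Done():
			pending = 0 // abandon remaining work; the deferred calls clean up
		}
	}
	sort.Strings(crawled)
	return crawled
}

func crawlSample(workers int) []string {
	graph := map[string][]string{
		"a": {"b", "c"}, "b": {"a", "d"}, "c": {"d"}, "d": {"e"}, "e": nil,
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return crawl(ctx, graph, "a", 2, workers)
}

func main() {
	for _, n := range []int{1, 4, 16} {
		fmt.Println(n, "workers:", crawlSample(n))
	}
}
```

Workers never send to the frontier themselves, so a full frontier cannot deadlock the pool; the cost is an in-memory queue owned by the coordinator. In this sample graph every path to a page has the same length, which is why the result does not depend on the pool size. On a graph where paths differ in length, the depth at which a page is first seen can vary with scheduling.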

Suggested Project Layout

Following golang-standards/project-layout:

04-concurrent-web-crawler/
├── cmd/
│   └── crawler/
│       └── main.go            # flag parsing, signal handling, wires it together
├── internal/
│   ├── crawler/
│   │   ├── crawler.go         # Crawler type, Run(ctx), pipeline orchestration
│   │   ├── worker.go          # worker goroutine loop (fetch -> parse -> enqueue)
│   │   └── frontier.go        # URL frontier channel + visited dedup set
│   ├── fetch/
│   │   ├── client.go          # rate-limited http.Client wrapper (Fetcher impl)
│   │   └── robots.go          # robots.txt fetch + matching
│   ├── parse/
│   │   └── links.go           # extract + resolve links via x/net/html (Parser)
│   └── config/
│       └── config.go          # Config struct, flag binding, validation/defaults
├── pkg/
│   └── urlset/
│       └── urlset.go          # reusable concurrent-safe visited set (optional)
├── testdata/
│   └── site/                  # static HTML graph served by httptest in tests
├── go.mod
├── go.sum
├── Dockerfile
└── README.md

internal/ holds packages that other modules must not import; the Go toolchain enforces this. Anything genuinely reusable (e.g. a generic concurrent set) can graduate to pkg/.

Data Model / Database

This project is in-memory. There is no database in the core path; results may optionally be persisted to JSON or CSV at the end.

Visited set (deduplication). Two valid designs — pick one and justify it:

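// Option A: mutex-guarded map (simple, fine for this scale).
type VisitedSet struct {
    mu   sync.Mutex
    seen map[string]struct{}
}

// NewVisitedSet returns an empty set. The zero value is not usable because
// Add would write to a nil map and panic.
func NewVisitedSet() *VisitedSet {
    return &VisitedSet{seen: make(map[string]struct{})}
}

// Add records url and reports whether it was new. The check and the insert
// happen under one lock, so exactly one concurrent caller gets true per URL.
func (v *VisitedSet) Add(url string) (added bool) {
    v.mu.Lock()
    defer v.mu.Unlock()
    if _, ok := v.seen[url]; ok {
        return false
    }
    v.seen[url] = struct{}{}
    return true
}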

Option B: a dedicated owner goroutine that owns the map and exposes add/check requests over a channel. No mutex; serialization is by goroutine ownership. Cleaner conceptually, slightly more plumbing. Discuss the trade-off in your README.

map[string]struct{} is used (not map[string]bool) because the value carries no information: struct{} occupies zero bytes.
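For comparison with Option A, here is a minimal sketch of Option B, assuming a request/reply protocol over channels. OwnedSet, addReq, and the Close method are hypothetical names chosen for this illustration.

```go
package main

import "fmt"

// addReq asks the owner goroutine to record url and reply whether it was new.
type addReq struct {
	url   string
	reply chan bool
}

// OwnedSet is a visited set whose map is touched by exactly one goroutine.
type OwnedSet struct {
	reqs chan addReq
	done chan struct{}
}

func NewOwnedSet() *OwnedSet {
	s := &OwnedSet{reqs: make(chan addReq), done: make(chan struct{})}
	go func() {
		seen := make(map[string]struct{}) // owned by this goroutine only
		for {
			select {
			case r := <-s.reqs:
				_, dup := seen[r.url]
				seen[r.url] = struct{}{}
				r.reply <- !dup // reply is buffered, so this never blocks
			case <-s.done:
				return
			}
		}
	}()
	return s
}

// Add reports whether url was new. After Close it returns false.
func (s *OwnedSet) Add(url string) bool {
	reply := make(chan bool, 1)
	select {
	case s.reqs <- addReq{url, reply}:
		return <-reply
	case <-s.done:
		return false
	}
}

// Close stops the owner goroutine. It must be called exactly once.
func (s *OwnedSet) Close() { close(s.done) }

func main() {
	s := NewOwnedSet()
	defer s.Close()
	fmt.Println(s.Add("https://example.com/"))      // true: first sighting
	fmt.Println(s.Add("https://example.com/"))      // false: duplicate
	fmt.Println(s.Add("https://example.com/about")) // true
}
```

The extra plumbing buys an explicit lifetime: the owner is one more goroutine that needs an exit condition, which is why Close exists. Each Add also costs two channel operations where Option A takes one lock.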

Page / Result struct. What the collector accumulates:

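// Fields are declared in the order they appear in the JSON output;
// encoding/json emits struct fields in declaration order.
type Result struct {
    URL        string   `json:"url"`
    StatusCode int      `json:"status_code"`
    Depth      int      `json:"depth"`
    Links      []string `json:"links"`
    Error      string   `json:"error,omitempty"`
}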

Frontier queue semantics. The frontier is a buffered channel of work items:

type Item struct {
    URL   string
    Depth int
}

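An item is enqueued only if Depth <= maxDepth, the host passes the same-domain check, and VisitedSet.Add returns true. Run the depth and host checks first so that rejected URLs never enter the visited set. A pending-work counter (sync.WaitGroup or an atomic integer) is incremented before each enqueue and decremented only after an item is fully processed, including the enqueue of its children; otherwise the counter can reach zero while work remains. When it reaches zero the frontier is closed and the pipeline winds down.

Optionally, persist the final []Result to out.json / out.csv.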

API Design

CLI surface

crawler [flags]

  -url string        Seed URL to crawl (repeatable). Required.
  -depth int         Maximum link depth from the seed (default 2).
  -workers int       Number of concurrent worker goroutines (default 8).
  -rate float        Max requests per second per host (default 5).
  -timeout duration  Overall crawl timeout, e.g. 30s, 2m (default 60s).
  -out string        Output file path; .json or .csv (default "-" = stdout).
  -same-host         Restrict crawl to the seed host (default true).
  -ua string         User-Agent string sent with each request.

Example:

crawler -url=https://example.com -depth=3 -workers=16 -rate=10 \
        -timeout=2m -out=sitemap.json

Internal interfaces

Small interfaces make the pipeline testable (swap in fakes, no network):

// Fetcher retrieves a single URL, honoring ctx for cancellation/timeout.
// When err is nil the caller must close body.
type Fetcher interface {
    Fetch(ctx context.Context, url string) (status int, body io.ReadCloser, err error)
}

// Parser extracts absolute, resolved links from an HTML body.
type Parser interface {
    Links(base string, body io.Reader) ([]string, error)
}

The real Fetcher wraps *http.Client plus a per-host *rate.Limiter; the test Fetcher serves from an in-memory map or an httptest.Server.
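A test Fetcher backed by an in-memory map might look like the sketch below. The Fetcher interface is repeated from above so the example compiles on its own; mapFetcher and fetchString are hypothetical names, and returning 404 with an empty body for unknown URLs is an assumption about how the fake should behave.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Fetcher is copied from the interface above.
type Fetcher interface {
	Fetch(ctx context.Context, url string) (status int, body io.ReadCloser, err error)
}

// mapFetcher serves canned pages from memory; unknown URLs yield a 404.
type mapFetcher map[string]string

func (m mapFetcher) Fetch(ctx context.Context, url string) (int, io.ReadCloser, error) {
	if err := ctx.Err(); err != nil {
		return 0, nil, err // behave like a real client on a cancelled context
	}
	page, ok := m[url]
	if !ok {
		return http.StatusNotFound, io.NopCloser(strings.NewReader("")), nil
	}
	return http.StatusOK, io.NopCloser(strings.NewReader(page)), nil
}

var _ Fetcher = mapFetcher(nil) // compile-time check that the fake fits

// fetchString is a demo helper: it fetches with a background context,
// reads the whole body, and closes it.
func fetchString(f Fetcher, url string) (int, string, error) {
	status, body, err := f.Fetch(context.Background(), url)
	if err != nil {
		return 0, "", err
	}
	defer body.Close()
	b, err := io.ReadAll(body)
	return status, string(b), err
}

func main() {
	f := mapFetcher{"https://example.com/": `<a href="/about">About</a>`}
	for _, u := range []string{"https://example.com/", "https://example.com/missing"} {
		status, body, err := fetchString(f, u)
		fmt.Println(u, status, len(body), err)
	}
}
```

Because the crawler depends only on the interface, the same pipeline code runs against this fake in unit tests and against the rate-limited HTTP client in production.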

Optional HTTP endpoint

A thin server can trigger crawls on demand:

POST /crawl   {"url":"https://example.com","depth":2}  -> 202 Accepted + job id
GET  /crawl/{id}                                        -> status + sitemap JSON

Example output (JSON sitemap)

{
  "seed": "https://example.com",
  "started_at": "2026-06-26T10:00:00Z",
  "pages": [
    {
      "url": "https://example.com/",
      "status_code": 200,
      "depth": 0,
      "links": ["https://example.com/about", "https://example.com/blog"]
    },
    {
      "url": "https://example.com/about",
      "status_code": 200,
      "depth": 1,
      "links": ["https://example.com/"]
    },
    {
      "url": "https://example.com/blog",
      "status_code": 404,
      "depth": 1,
      "links": [],
      "error": "unexpected status 404"
    }
  ]
}
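A report shaped like the sample above could be produced as follows. This is a sketch under stated assumptions: the Report wrapper type and marshalReport are hypothetical, the Result fields are ordered to match the sample output, and links are sorted as well as pages so that the bytes do not depend on crawl order.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

type Result struct {
	URL        string   `json:"url"`
	StatusCode int      `json:"status_code"`
	Depth      int      `json:"depth"`
	Links      []string `json:"links"`
	Error      string   `json:"error,omitempty"`
}

// Report is an assumed wrapper matching the top-level keys of the sample.
type Report struct {
	Seed      string    `json:"seed"`
	StartedAt time.Time `json:"started_at"`
	Pages     []Result  `json:"pages"`
}

// marshalReport returns indented JSON with pages sorted by URL and each
// page's links sorted, without modifying the caller's slices.
func marshalReport(r Report) ([]byte, error) {
	pages := make([]Result, len(r.Pages))
	copy(pages, r.Pages)
	for i := range pages {
		// Copy into a non-nil slice so an empty list encodes as [] and not null.
		links := append([]string{}, pages[i].Links...)
		sort.Strings(links)
		pages[i].Links = links
	}
	sort.Slice(pages, func(i, j int) bool { return pages[i].URL < pages[j].URL })
	r.Pages = pages
	return json.MarshalIndent(r, "", "  ")
}

func main() {
	report := Report{
		Seed:      "https://example.com",
		StartedAt: time.Date(2026, 6, 26, 10, 0, 0, 0, time.UTC),
		Pages: []Result{
			{URL: "https://example.com/blog", StatusCode: 404, Depth: 1, Error: "unexpected status 404"},
			{URL: "https://example.com/", StatusCode: 200, Depth: 0,
				Links: []string{"https://example.com/blog", "https://example.com/about"}},
		},
	}
	out, err := marshalReport(report)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

A wall-clock started_at changes on every run, so a golden-file test would need to inject a fixed time, as main does here, or exclude that field from the comparison.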

Tech Stack

  • Go (1.22+).
  • net/http — HTTP client with a tuned Transport (MaxIdleConnsPerHost, timeouts) and http.NewRequestWithContext.
  • golang.org/x/net/html — streaming HTML tokenizer/parser for link extraction (no regex on HTML).
  • golang.org/x/time/rate — token-bucket rate limiter, one per host.
  • golang.org/x/sync/errgroup — bounded worker group with first-error propagation and context cancellation.
  • context — cancellation, deadlines, and context.WithTimeout.
  • os/signal + signal.NotifyContext — turn SIGINT/SIGTERM into a cancelled context for graceful shutdown.
  • net/url — parsing, normalization, and relative-link resolution.
  • Testing: net/http/httptest, the race detector, and go.uber.org/goleak.

Implementation Milestones

  • Single-threaded crawl. One goroutine: fetch seed, parse links, recurse with depth limiting and a visited map. Get correctness first.
  • Extract interfaces. Define Fetcher and Parser; move HTML parsing to internal/parse, fetching to internal/fetch.
  • Add the worker pool. Introduce the frontier channel and N workers (fan-out), a results channel and collector (fan-in). Make termination work with a pending-work counter.
  • Context cancellation. Thread ctx into fetch, rate-limit wait, and every channel select. Add an overall -timeout.
  • Rate limiting. Add a per-host rate.Limiter map (guarded), call limiter.Wait(ctx) before each request.
  • Graceful shutdown. Use signal.NotifyContext; on SIGINT, cancel, drain in-flight work, flush results, exit non-zero only on hard failure.
  • robots.txt + same-host. Fetch and cache robots rules per host; enforce the same-host policy.
  • Output formats. Sorted JSON and CSV writers; -out flag.
  • Harden. go test -race, goleak in TestMain, golden-file tests.

Testing Strategy

  • Fake site graph with httptest.NewServer. Serve a small, deterministic set of interlinked HTML pages (cycles, dead ends, 404s, off-domain links) from testdata/. The crawler hits server.URL, so no real network is touched.
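  • Race detector. Run the whole suite under go test -race ./... in CI. The detector only reports races on code paths that actually execute, so the tests must exercise the visited set and other shared state concurrently; any reported race fails the build.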
  • Dedup and depth tests. Assert that a page reachable by multiple paths is fetched exactly once, and that no page beyond maxDepth appears in the output.
  • Deterministic worker-pool tests. With a fixed fake graph and sorted output, the sitemap is byte-stable; compare against a golden file. Vary -workers (1, 4, 16) and assert identical results to prove pool-size independence.
  • Goroutine leak detection. Add go.uber.org/goleak via goleak.VerifyTestMain(m) in TestMain, and defer goleak.VerifyNone(t) in pipeline tests, to guarantee no worker/collector goroutine outlives the crawl.
  • Context cancellation tests. Start a crawl against a slow/blocking handler, cancel the context (or hit the timeout), and assert Run returns promptly with context.Canceled/DeadlineExceeded and that all goroutines have exited.
  • Rate-limit tests. Point all workers at one host with a low -rate and assert the observed request timestamps respect the configured QPS.
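The context cancellation test described above can be prototyped against httptest with nothing but the standard library. This sketch is written as a small program so it runs on its own; in the project the same body would live in a _test.go function. The helper names (newSlowServer, fetchWithTimeout, isTimeout) are hypothetical, and the 2-second cap in the handler is an arbitrary safety net.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// newSlowServer returns a test server whose handler blocks until the client
// gives up (or 2s pass), simulating a hung upstream.
func newSlowServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done():
		case <-time.After(2 * time.Second):
		}
	}))
}

// fetchWithTimeout issues one GET whose lifetime is bounded by timeout.
func fetchWithTimeout(client *http.Client, url string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// isTimeout reports whether err was caused by the context deadline.
func isTimeout(err error) bool { return errors.Is(err, context.DeadlineExceeded) }

func main() {
	srv := newSlowServer()
	defer srv.Close()

	start := time.Now()
	err := fetchWithTimeout(srv.Client(), srv.URL, 50*time.Millisecond)
	fmt.Println("timed out:", isTimeout(err))
	fmt.Println("returned promptly:", time.Since(start) < time.Second)
}
```

The same server pattern extends to the other cases in this list: serve files from testdata/ for the golden-file test, or record request timestamps in the handler for the rate-limit test.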

Deployment

  • Single static binary. CGO_ENABLED=0 go build -o crawler ./cmd/crawler produces a self-contained binary with no runtime dependencies.
  • Dockerfile (multi-stage, distroless/scratch):
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /crawler ./cmd/crawler

FROM gcr.io/distroless/static:nonroot
COPY --from=build /crawler /crawler
ENTRYPOINT ["/crawler"]
  • CLI / cron job. Run on demand or schedule periodic sitemap regeneration via cron / a Kubernetes CronJob, writing output to a mounted volume or object store.
  • Resource limits. Set container CPU/memory limits; the bounded worker pool and per-host rate limit keep the process well-behaved. Tune -workers to the CPU allocation and -rate to be a polite citizen of target sites.

Documentation Deliverables

  • README.md — what it does, install/build, full flag reference, example invocations, and sample output.
  • Concurrency model write-up — an explanation of the architecture diagram: how fan-out/fan-in works here, how the pipeline terminates (pending-work counter closing the frontier), how cancellation propagates, and why the visited set is safe. This is the centerpiece doc for an Intermediate concurrency project.
  • godoc — doc comments on every exported type and function (Crawler, Fetcher, Parser, Config, Result), with a runnable Example for the crawler package.

Stretch Goals / Future Improvements

  • Distributed crawling. Shard the frontier across multiple nodes; coordinate the visited set via a shared store.
  • Persistent frontier. Back the queue and visited set with Redis or BoltDB so very large crawls survive restarts and exceed RAM.
  • Polite crawling / backoff. Honor Crawl-delay, add exponential backoff and retry with jitter on 429/5xx, and adaptive per-host concurrency.
  • JS rendering. Drive a headless browser (chromedp) for client-rendered pages.
  • Priority queue. Replace the FIFO frontier with a priority frontier (by depth, score, or freshness) using a container/heap.
  • Resumable crawls. Checkpoint the frontier + visited set so an interrupted crawl can continue where it left off.
  • Sitemap diffing. Compare runs to surface added/removed/changed URLs.
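As a starting point for the priority-queue stretch goal, a depth-ordered frontier built on container/heap might look like this. Item mirrors the frontier item defined earlier with exported fields; Frontier, Enqueue, and Dequeue are hypothetical names, and ordering by depth and then URL is just one possible priority.

```go
package main

import (
	"container/heap"
	"fmt"
)

type Item struct {
	URL   string
	Depth int
}

// itemHeap implements heap.Interface: shallowest first, ties broken by URL.
type itemHeap []Item

func (h itemHeap) Len() int      { return len(h) }
func (h itemHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h itemHeap) Less(i, j int) bool {
	if h[i].Depth != h[j].Depth {
		return h[i].Depth < h[j].Depth
	}
	return h[i].URL < h[j].URL
}
func (h *itemHeap) Push(x any) { *h = append(*h, x.(Item)) }
func (h *itemHeap) Pop() any {
	old := *h
	last := old[len(old)-1]
	*h = old[:len(old)-1]
	return last
}

// Frontier is a priority queue of crawl items. It is not safe for concurrent
// use by itself; guard it with a mutex or give it to one owner goroutine.
type Frontier struct{ h itemHeap }

func (f *Frontier) Enqueue(it Item) { heap.Push(&f.h, it) }

// Dequeue returns the highest-priority item, or false if the queue is empty.
func (f *Frontier) Dequeue() (Item, bool) {
	if len(f.h) == 0 {
		return Item{}, false
	}
	return heap.Pop(&f.h).(Item), true
}

func main() {
	f := &Frontier{}
	for _, it := range []Item{{"/c", 2}, {"/a", 1}, {"/", 0}, {"/d", 1}} {
		f.Enqueue(it)
	}
	for it, ok := f.Dequeue(); ok; it, ok = f.Dequeue() {
		fmt.Println(it.Depth, it.URL)
	}
}
```

Unlike a channel, a heap cannot be received from in a select, so the pool would need a goroutine that pops from the heap and feeds the workers' channel.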

Lessons-Learned Prompts

  1. How did you bound parallelism? Why a fixed-size worker pool rather than go fetch(url) per link, and what would have gone wrong with the naive approach?
  2. How does your pipeline terminate? Walk through the moment the last item is processed: who closes the frontier channel, and how does that unblock every downstream goroutine without a deadlock or a "send on closed channel" panic?
  3. How did you prevent goroutine leaks? Which goroutine could have been left blocked on a channel send/receive forever, and how did context and channel closing eliminate that? How did goleak change your design?
  4. How did you ensure a clean shutdown on SIGINT? What happens to in-flight requests, and how do you distinguish "cancelled on purpose" from a real error?
  5. Where did the race detector catch you, and what was the underlying shared state? Did you fix it with a mutex or by changing ownership (a single goroutine owning the data)? Why?
  6. Why errgroup over a raw sync.WaitGroup here (or vice versa)? What did first-error-cancels-the-group buy you, and where did you still need a plain WaitGroup?

Portfolio & Resume

Resume Bullets

  • Built a concurrent web crawler in Go using a bounded worker-pool (fan-out/fan-in) pipeline over channels, achieving controlled parallelism with zero goroutine leaks (verified via go.uber.org/goleak and the race detector in CI).
  • Implemented context-based cancellation and graceful SIGINT shutdown plus per-host token-bucket rate limiting (golang.org/x/time/rate), so the crawler tears down cleanly on timeout/interrupt and never overloads a target.
  • Designed deterministic, golden-file-testable output and a fake-site test harness (httptest), keeping the crawler correct and reproducible across worker-pool sizes.

Interview Talking Points

  • Worker pool & fan-out/fan-in: why a fixed N of goroutines reading one frontier channel beats unbounded goroutine spawning, and how fan-in into a single collector lets the sitemap map stay lock-free by ownership.
  • Pipeline termination: the pending-work counter that closes the frontier and cascades shutdown — the part people get wrong and deadlock on.
  • Context propagation: one ctx flowing through rate.Limiter.Wait, the HTTP request, and every select, so a single cancel unwinds everything; signal.NotifyContext for shutdown.
  • Race detector & goleak: how -race and goleak.VerifyTestMain moved the concurrency from "probably correct" to backed by evidence (neither tool is a proof), and the bugs they caught.
  • errgroup vs raw WaitGroup: first-error-cancellation and a tidy Wait() error from errgroup, versus the manual coordination of a bare WaitGroup, and when each is the right tool.