Day 138 — Tests & Metrics

Month 5 · Week 4

🎯 Learning Objective

Lock the service's behavior down with fast, table-driven and concurrency-safe tests, and expose operational visibility with Prometheus-style counters served over HTTP.

📚 Topics

  • Testing the queue/worker with fakes + bufconn for the gRPC edge
  • go test -race; deterministic concurrency tests
  • Counters/labels; the Prometheus text exposition format; the RED method

📖 Reading / Sources

📝 Notes

  • Test the core with fakes, the edge with bufconn. The use cases get unit tests using an in-memory Queue fake (no Redis); the gRPC handler gets a bufconn test exercising the real codec + interceptors (from Day 125) → [[fakes]] [[bufconn]].
  • Concurrency tests must be deterministic. Drive N goroutines, WaitGroup.Wait(), then assert an exact total. Run under go test -race so the race detector flags unsynchronized access — a mutex/atomic that's "probably fine" usually isn't → [[race-detector]] [[waitgroup]].
  • A counter is a monotonically increasing total; the scraper computes rates from successive scrapes. A gauge goes up and down (queue depth, in-flight workers); a histogram buckets observations (latency) → [[counter]] [[histogram]].
  • The same metric name with different label sets is a separate time series. Watch cardinality: never put unbounded values (user IDs, job IDs) in labels — it explodes memory → [[label-cardinality]].
  • Instrument with the RED method: Rate, Errors, Duration per RPC. A requests_total{method,code} counter and a latency histogram already answer most "is it healthy?" questions → [[red-method]].
  • Metrics are mutated from many goroutines, so the registry needs a mutex or atomics; Snapshot should return a copy so readers don't race future writes → [[mutex]] [[defensive-copy]].
  • Prometheus scrapes a plain-text HTTP endpoint: # HELP, # TYPE, then name{labels} value lines. Real services mount promhttp.Handler() at /metrics; the format itself is trivial to render by hand → [[exposition-format]].
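
To make the last note concrete, here is a minimal sketch of rendering one counter family by hand. The names `sample` and `renderCounter` are hypothetical, not taken from the examples directory, and `%q` is used as a shortcut for label-value escaping: it matches the format for ordinary printable values but is not a complete implementation of the escaping rules.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is one time series: a label set and its current counter value.
// (Hypothetical type, for illustration only.)
type sample struct {
	labels map[string]string
	value  uint64
}

// renderCounter writes one counter family in the text exposition format:
// a # HELP line, a # TYPE line, then one `name{labels} value` line per series.
func renderCounter(name, help string, samples []sample) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", name)
	for _, s := range samples {
		keys := make([]string, 0, len(s.labels))
		for k := range s.labels {
			keys = append(keys, k)
		}
		sort.Strings(keys) // stable output regardless of map iteration order
		pairs := make([]string, 0, len(keys))
		for _, k := range keys {
			pairs = append(pairs, fmt.Sprintf("%s=%q", k, s.labels[k]))
		}
		if len(pairs) == 0 {
			fmt.Fprintf(&b, "%s %d\n", name, s.value)
			continue
		}
		fmt.Fprintf(&b, "%s{%s} %d\n", name, strings.Join(pairs, ","), s.value)
	}
	return b.String()
}

func main() {
	fmt.Print(renderCounter("jobs_processed_total", "Jobs processed by status.", []sample{
		{labels: map[string]string{"status": "ok"}, value: 10000},
		{labels: map[string]string{"status": "failed"}, value: 3},
	}))
}
```

Serving this string from an HTTP handler is all a scraper needs; a real service would more likely rely on promhttp.Handler() as the note says.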

💻 Code Examples

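// Concurrency-safe counter test that must pass `go test -race`.
// (Full registry + HTTP exposition is runnable in examples/month-05/metrics.)
// Needs "sync" and "testing" from the standard library, plus the metrics
// package from examples/month-05/metrics (its import path depends on the module).
func TestConcurrentInc(t *testing.T) {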
    r := metrics.New()
    const goroutines, perG = 50, 200
    var wg sync.WaitGroup
    for g := 0; g < goroutines; g++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < perG; i++ {
                r.Inc("jobs_processed_total", map[string]string{"status": "ok"})
            }
        }()
    }
    wg.Wait()
    if got := r.Get("jobs_processed_total", map[string]string{"status": "ok"}); got != goroutines*perG {
        t.Fatalf("total = %d, want %d", got, goroutines*perG)
    }
}

Runnable counter registry + /metrics scrape over httptest: examples/month-05/metrics · Run: go run ./examples/month-05/metrics
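
For reference, a registry that the test above could exercise might look roughly like the sketch below. This is an assumption about the shape of the API (New, Inc, Get), not a copy of the code in examples/month-05/metrics; `seriesKey` and the key layout `name{k=v,...}` are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Registry is a hypothetical labeled-counter registry guarded by one mutex.
type Registry struct {
	mu     sync.Mutex
	counts map[string]uint64 // series key -> running total
}

// New returns an empty registry.
func New() *Registry { return &Registry{counts: make(map[string]uint64)} }

// seriesKey builds a stable key such as `name{a=1,b=2}` by sorting label keys,
// so the same label set always maps to the same series.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(name)
	b.WriteByte('{')
	for i, k := range keys {
		if i > 0 {
			b.WriteByte(',')
		}
		b.WriteString(k)
		b.WriteByte('=')
		b.WriteString(labels[k])
	}
	b.WriteByte('}')
	return b.String()
}

// Inc adds one to the series identified by name and labels.
func (r *Registry) Inc(name string, labels map[string]string) {
	key := seriesKey(name, labels) // build the key outside the critical section
	r.mu.Lock()
	r.counts[key]++
	r.mu.Unlock()
}

// Get returns the current total for one series (zero if it was never incremented).
func (r *Registry) Get(name string, labels map[string]string) uint64 {
	key := seriesKey(name, labels)
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.counts[key]
}

func main() {
	r := New()
	var wg sync.WaitGroup
	for g := 0; g < 50; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 200; i++ {
				r.Inc("jobs_processed_total", map[string]string{"status": "ok"})
			}
		}()
	}
	wg.Wait()
	fmt.Println(r.Get("jobs_processed_total", map[string]string{"status": "ok"})) // prints 10000
}
```

A single mutex keeps the sketch easy to reason about; per-series atomics would reduce contention at the cost of more bookkeeping.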

🏋️ Exercises / Practice

Exercise | Status | Link
Concurrency-safe labeled counter registry (race-clean) | | exercises/month-05/week-4/metrics
Retry budget + DLQ (deterministic table tests) | | exercises/month-05/week-4/deadletter

🐛 Mistakes Made

  • Built series keys straight from map iteration → labels in random order produced different keys for the same series. Sorted label keys before joining.
  • A concurrency test passed without -race but the registry mutated a shared map unguarded; -race exposed it. Added a mutex and re-ran with -race.
  • Snapshot returned the internal map; a caller mutated it and corrupted counts. Returned a copy.
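
The Snapshot fix from the last bullet can be sketched as follows. The `counters` type is a hypothetical stand-in for the real registry; the point is only that the map is copied while the lock is held, so a caller can never reach the internal state.

```go
package main

import (
	"fmt"
	"sync"
)

// counters is a hypothetical stand-in for the registry's internal state.
type counters struct {
	mu     sync.Mutex
	counts map[string]uint64
}

// Inc adds one to the given series key.
func (c *counters) Inc(key string) {
	c.mu.Lock()
	c.counts[key]++
	c.mu.Unlock()
}

// Snapshot returns a copy made under the lock. Returning c.counts directly
// would hand the caller a reference to the live map.
func (c *counters) Snapshot() map[string]uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]uint64, len(c.counts))
	for k, v := range c.counts {
		out[k] = v
	}
	return out
}

func main() {
	c := &counters{counts: make(map[string]uint64)}
	c.Inc("jobs_processed_total{status=ok}")

	snap := c.Snapshot()
	snap["jobs_processed_total{status=ok}"] = 999 // caller scribbles on its own copy

	fmt.Println(c.Snapshot()["jobs_processed_total{status=ok}"]) // prints 1
}
```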

❓ Open Questions

  • When is a histogram's default bucket layout wrong enough to warrant custom buckets (and how do I pick them)?
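
One possible way to start on this question is to replay real observations through a candidate bucket layout and see where they land. The sketch below assumes cumulative buckets (each bucket counts observations less than or equal to its upper bound, with a final +Inf bucket) and uses made-up job durations; `cumulativeCounts` is a hypothetical helper. The first layout is the Prometheus Go client's default, which targets sub-10-second latencies.

```go
package main

import "fmt"

// cumulativeCounts returns, for each upper bound, how many observations are
// <= that bound, followed by a final +Inf bucket holding the total count.
func cumulativeCounts(bounds, obs []float64) []uint64 {
	counts := make([]uint64, len(bounds)+1)
	for _, v := range obs {
		for i, b := range bounds {
			if v <= b {
				counts[i]++
			}
		}
		counts[len(bounds)]++ // +Inf bucket sees every observation
	}
	return counts
}

func main() {
	// Hypothetical job durations in seconds, far slower than a typical RPC.
	jobSeconds := []float64{12, 18, 25, 40, 95}

	defaults := []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
	fmt.Println(cumulativeCounts(defaults, jobSeconds)) // prints [0 0 0 0 0 0 0 0 0 0 0 5]

	custom := []float64{5, 15, 30, 60, 120} // assumed layout around the observed range
	fmt.Println(cumulativeCounts(custom, jobSeconds)) // prints [0 1 3 4 5 5]
}
```

When every observation falls into a single bucket, as with the default layout here, the histogram cannot distinguish a 12-second job from a 95-second one, which is a reasonable signal that custom bounds are worth trying.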

🧠 Active Recall (answer without looking)

  1. Q: Why must SeriesKey sort the label keys, and why is putting a job ID in a label dangerous?
A: Map iteration order is randomized, so unsorted keys would produce different strings for the same label set, splitting one series into many. A job ID is unbounded-cardinality: each unique value creates a new time series, exploding memory and scrape size.
  2. Q: What does go test -race add over a normal run, and why is a passing non-race run not enough?
A: The race detector instruments memory accesses and reports concurrent unsynchronized accesses to the same location when at least one of them is a write. Data races are timing-dependent, so a plain run can pass by luck while the code is still buggy; `-race` reports the race whenever the conflicting accesses actually execute during the run, even if the final count came out right. It cannot flag code paths the test never exercises.

🪶 Feynman Reflection

Tests are a tripwire: I wire N goroutines to hammer the code, wait for all of them, and check the count is exactly right — and I run it under -race so a hidden data race trips the wire instead of hiding until production. Metrics are the dashboard gauges: a counter is an odometer that only climbs, and the scraper reads it every few seconds to see how fast the numbers move (the rate). Labels split one gauge into many — useful, but only for low-cardinality dimensions.

🕳️ Knowledge Gaps

  • Wiring a metrics interceptor cleanly so every RPC updates RED metrics without per-handler boilerplate.
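
A rough sketch of the idea, under stated assumptions: the types `unaryHandler` and `unaryInfo` below are local stand-ins that mirror the shape of gRPC's unary server interceptor signature, not the real google.golang.org/grpc types, and `redMetrics`, `demo`, and the method name are hypothetical. Duration is kept as a running sum here for brevity, where the notes call for a histogram.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Local stand-ins mirroring the shape of gRPC's unary interceptor types.
// They are NOT the real grpc package types.
type unaryHandler func(ctx context.Context, req any) (any, error)
type unaryInfo struct{ FullMethod string }

// redMetrics records the three RED signals per method (hypothetical type).
type redMetrics struct {
	mu       sync.Mutex
	requests map[string]uint64        // method -> total calls (Rate)
	errors   map[string]uint64        // method -> failed calls (Errors)
	duration map[string]time.Duration // method -> summed latency (Duration)
}

func newRED() *redMetrics {
	return &redMetrics{
		requests: make(map[string]uint64),
		errors:   make(map[string]uint64),
		duration: make(map[string]time.Duration),
	}
}

// observe updates all three signals for one finished call.
func (m *redMetrics) observe(method string, err error, d time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.requests[method]++
	if err != nil {
		m.errors[method]++
	}
	m.duration[method] += d
}

// intercept wraps any handler, so individual handlers carry no metrics code.
func (m *redMetrics) intercept(ctx context.Context, req any, info *unaryInfo, next unaryHandler) (any, error) {
	start := time.Now()
	resp, err := next(ctx, req)
	m.observe(info.FullMethod, err, time.Since(start))
	return resp, err
}

// demo runs one successful and one failing call through the interceptor.
func demo() (requests, errs uint64) {
	m := newRED()
	info := &unaryInfo{FullMethod: "/jobs.Queue/Enqueue"} // hypothetical method name
	ok := func(ctx context.Context, req any) (any, error) { return "queued", nil }
	bad := func(ctx context.Context, req any) (any, error) { return nil, errors.New("queue full") }
	m.intercept(context.Background(), "job-1", info, ok)
	m.intercept(context.Background(), "job-2", info, bad)
	return m.requests[info.FullMethod], m.errors[info.FullMethod]
}

func main() {
	fmt.Println(demo()) // prints 2 1
}
```

With the real library, a function of this shape would be registered once on the server (for example through grpc.ChainUnaryInterceptor), and the method label stays low-cardinality because the set of RPC names is fixed.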

✅ Summary

I can test the core with fakes and the gRPC edge with bufconn, write deterministic concurrency tests that pass go test -race, and expose RED-style counters in the Prometheus exposition format with safe, low-cardinality, copy-on-snapshot semantics.

⏭️ Next Steps / Prep for Tomorrow

  • Day 139: write the docs — package docs/godoc, a project README, and Architecture Decision Records.

Time spent | Difficulty | Confidence
90 min | 🟦🟦⬜⬜⬜ | 🟦🟦🟦⬜⬜

Suggested commit: test(worker): race-clean tests + prometheus-style metrics (day 138)