Day 138: Tests & Metrics
Month 5 · Week 4
🎯 Learning Objective
Lock the service's behavior down with fast, table-driven, concurrency-safe tests, and add operational visibility with Prometheus-style counters served over HTTP.
📚 Topics
- Testing the queue/worker with fakes + `bufconn` for the gRPC edge
- `go test -race`; deterministic concurrency tests
- Counters/labels; the Prometheus text exposition format; the RED method
📖 Reading / Sources
- Go blog: Using subtests and table-driven tests
- `testing` package: `t.Parallel`, `t.Cleanup`
- Prometheus: metric types & exposition format
- `prometheus/client_golang` (API reference)
📝 Notes
- Test the core with fakes, the edge with `bufconn`. The use cases get unit tests using an in-memory `Queue` fake (no Redis); the gRPC handler gets a `bufconn` test exercising the real codec + interceptors (from Day 125) → [[fakes]] [[bufconn]].
- Concurrency tests must be deterministic. Drive N goroutines, `WaitGroup.Wait()`, then assert an exact total. Run under `go test -race` so the race detector flags unsynchronized access; shared state that looks "probably fine" without a mutex or atomic usually isn't → [[race-detector]] [[waitgroup]].
- A counter is a monotonically increasing total; Prometheus computes rates from successive scrapes. A gauge goes up and down (queue depth, in-flight workers); a histogram buckets observations (latency) → [[counter]] [[histogram]].
- Each distinct label set under the same metric name is a separate time series. Watch cardinality: never put unbounded values (user IDs, job IDs) in labels, because every new value creates another series and memory use explodes → [[label-cardinality]].
- Instrument with the RED method: Rate, Errors, Duration per RPC. A `requests_total{method,code}` counter and a latency histogram already answer most "is it healthy?" questions → [[red-method]].
- Metrics are mutated from many goroutines, so the registry needs a mutex or atomics; `Snapshot` should return a copy so readers don't race with later writes → [[mutex]] [[defensive-copy]].
- Prometheus scrapes a plain-text HTTP endpoint: `# HELP`, `# TYPE`, then `name{labels} value` lines. Real services mount `promhttp.Handler()` at `/metrics`; the format itself is trivial to render by hand → [[exposition-format]].
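The notes on label sets, locking, and snapshot copies can be sketched as a small registry. This is a minimal illustration, not the journal's actual `metrics` package: the `Registry` type, the `SeriesKey` layout, and the `Enqueue` method label are assumptions, and `%q` only approximates Prometheus label-value escaping for simple values.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Registry is a hypothetical labeled-counter store guarded by one mutex.
type Registry struct {
	mu     sync.Mutex
	counts map[string]uint64
}

// New returns an empty registry.
func New() *Registry { return &Registry{counts: make(map[string]uint64)} }

// SeriesKey renders name{k="v",...} with label keys sorted, so the same
// label set always maps to the same key regardless of map iteration order.
func SeriesKey(name string, labels map[string]string) string {
	if len(labels) == 0 {
		return name
	}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(parts, ",") + "}"
}

// Inc adds one to the series identified by name and labels.
func (r *Registry) Inc(name string, labels map[string]string) {
	key := SeriesKey(name, labels) // build the key outside the lock
	r.mu.Lock()
	r.counts[key]++
	r.mu.Unlock()
}

// Snapshot returns a copy, so callers cannot mutate the internal map.
func (r *Registry) Snapshot() map[string]uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make(map[string]uint64, len(r.counts))
	for k, v := range r.counts {
		out[k] = v
	}
	return out
}

func main() {
	r := New()
	// Same label set written in a different order: both land on one series.
	r.Inc("requests_total", map[string]string{"method": "Enqueue", "code": "OK"})
	r.Inc("requests_total", map[string]string{"code": "OK", "method": "Enqueue"})

	key := SeriesKey("requests_total", map[string]string{"method": "Enqueue", "code": "OK"})
	snap := r.Snapshot()
	snap[key] = 999 // mutates only the caller's copy

	fmt.Println(key, r.Snapshot()[key])
}
```

Because `Snapshot` copies under the lock, the write to `snap` leaves the registry's count at 2. A single mutex is the simplest choice here; per-series atomics would reduce contention at the cost of more code.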
💻 Code Examples
// Concurrency-safe counter test that must pass `go test -race`.
// (Full registry + HTTP exposition is runnable in examples/month-05/metrics.)
// The test file needs "sync", "testing", and the project's metrics package imported.
func TestConcurrentInc(t *testing.T) {
	r := metrics.New()
	const goroutines, perG = 50, 200
	labels := map[string]string{"status": "ok"} // only read by the goroutines

	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perG; i++ {
				r.Inc("jobs_processed_total", labels)
			}
		}()
	}
	wg.Wait() // every increment has finished before the assertion runs

	// The total must be exact: a lost update means the registry is racy.
	if got := r.Get("jobs_processed_total", labels); got != goroutines*perG {
		t.Fatalf("total = %d, want %d", got, goroutines*perG)
	}
}
Runnable counter registry + `/metrics` scrape over `httptest`: `examples/month-05/metrics` · Run: `go run ./examples/month-05/metrics`
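For orientation, the shape of a hand-rendered `/metrics` endpoint and an in-process scrape could look roughly like the sketch below. It is not the code in `examples/month-05/metrics`: `render`, `metricsHandler`, `scrape`, and the HELP text are hypothetical names and values, and the handler serves a fixed map instead of a live registry.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sort"
)

// render writes one counter family in the Prometheus text exposition format.
// series maps an already-formatted label block (e.g. `{status="ok"}`) to its value.
func render(w io.Writer, name, help string, series map[string]uint64) {
	fmt.Fprintf(w, "# HELP %s %s\n", name, help)
	fmt.Fprintf(w, "# TYPE %s counter\n", name)
	keys := make([]string, 0, len(series))
	for k := range series {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output makes the scrape easy to assert on
	for _, k := range keys {
		fmt.Fprintf(w, "%s%s %d\n", name, k, series[k])
	}
}

// metricsHandler serves a fixed set of series; a real handler would read a
// snapshot from the registry on every request.
func metricsHandler(series map[string]uint64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
		render(w, "jobs_processed_total", "Jobs processed, by status.", series)
	})
}

// scrape calls the handler in-process with an httptest recorder, so no
// network listener is needed.
func scrape(h http.Handler) (status int, body string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	h := metricsHandler(map[string]uint64{
		`{status="ok"}`:     3,
		`{status="failed"}`: 1,
	})
	status, body := scrape(h)
	fmt.Println(status)
	fmt.Print(body)
}
```

Sorting the series before writing keeps the output byte-for-byte stable, which lets a test compare the whole body against a literal string.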
🏋️ Exercises / Practice
| Exercise | Status | Link |
|---|---|---|
| Concurrency-safe labeled counter registry (race-clean) | ✅ | exercises/month-05/week-4/metrics |
| Retry budget + DLQ (deterministic table tests) | ✅ | exercises/month-05/week-4/deadletter |
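The "deterministic table tests" named in the retry-budget exercise follow the usual table-driven shape: a slice of named cases fed through one loop. The sketch below is illustrative only and is not the exercise's code; `decide`, `Action`, and the budget rule (retry while failed attempts are below the budget) are assumptions. It runs as a plain program; in a `_test.go` file the loop body would sit inside `t.Run(tc.name, ...)` and report with `t.Errorf`.

```go
package main

import (
	"fmt"
	"os"
)

// Action is what the worker does with a job after a failed attempt.
type Action string

const (
	Retry      Action = "retry"
	DeadLetter Action = "dead-letter"
)

// decide is a hypothetical retry-budget rule: retry while the number of
// failed attempts is below the budget, otherwise route the job to the DLQ.
func decide(failedAttempts, budget int) Action {
	if failedAttempts < budget {
		return Retry
	}
	return DeadLetter
}

// testCase is one row of the table: inputs plus the expected outcome.
type testCase struct {
	name           string
	failedAttempts int
	budget         int
	want           Action
}

var cases = []testCase{
	{name: "first failure retries", failedAttempts: 1, budget: 3, want: Retry},
	{name: "last attempt within budget", failedAttempts: 2, budget: 3, want: Retry},
	{name: "budget exhausted", failedAttempts: 3, budget: 3, want: DeadLetter},
	{name: "zero budget never retries", failedAttempts: 1, budget: 0, want: DeadLetter},
}

// failures runs every case and returns one message per mismatch.
func failures() []string {
	var out []string
	for _, tc := range cases {
		if got := decide(tc.failedAttempts, tc.budget); got != tc.want {
			out = append(out, fmt.Sprintf("%s: got %q, want %q", tc.name, got, tc.want))
		}
	}
	return out
}

func main() {
	if f := failures(); len(f) > 0 {
		for _, msg := range f {
			fmt.Println("FAIL", msg)
		}
		os.Exit(1)
	}
	fmt.Println("all cases pass")
}
```

No clocks, sleeps, or goroutines are involved, so every case gives the same result on every run. Adding an edge case costs one line in the table.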
🐛 Mistakes Made
- Built series keys straight from `map` iteration → labels in random order produced different keys for the same series. Sorted label keys before joining.
- A concurrency test passed without `-race` but the registry mutated a shared map unguarded; `-race` exposed it. Added a mutex and re-ran with `-race`.
- `Snapshot` returned the internal map; a caller mutated it and corrupted counts. Returned a copy.
❓ Open Questions
- When is a histogram's default bucket layout wrong enough to warrant custom buckets (and how do I pick them)?
🧠 Active Recall (answer without looking)
- Q: Why must `SeriesKey` sort the label keys, and why is putting a job ID in a label dangerous?
  A: Map iteration order is randomized, so unsorted keys would produce different strings for the same label set, splitting one series into many. A job ID is unbounded-cardinality: each unique value creates a new time series, exploding memory and scrape size.
- Q: What does `go test -race` add over a normal run, and why is a passing non-race run not enough?
  A: The race detector instruments memory accesses and reports concurrent unsynchronized read/write to the same location. Data races are timing-dependent, so a plain run can pass by luck while the code is still buggy; `-race` reports the race whenever the conflicting accesses both execute during the run, even if the result happened to come out right.
🪶 Feynman Reflection
Tests are a tripwire: I wire N goroutines to hammer the code, wait for all of them, and check the count is exactly right — and I run it under -race so a hidden data race trips the wire instead of hiding until production. Metrics are the dashboard gauges: a counter is an odometer that only climbs, and the scraper reads it every few seconds to see how fast the numbers move (the rate). Labels split one gauge into many — useful, but only for low-cardinality dimensions.
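The odometer picture can be made concrete with a little arithmetic: the rate is the increase between two readings divided by the time between them. The sketch below is a simplification under stated assumptions; `sample` and `perSecond` are hypothetical names, and Prometheus's own `rate()` works over a window of many samples with extrapolation. It does share the convention used here of treating a decrease as a counter reset.

```go
package main

import "fmt"

// sample is one scraped counter reading (hypothetical type).
type sample struct {
	unixSeconds float64 // when the scrape happened
	value       float64 // counter value at that moment
}

// perSecond returns the average per-second increase between two scrapes.
// A counter only climbs, so a lower current value is treated as a restart
// from zero and only the new value is counted as the increase.
func perSecond(prev, cur sample) float64 {
	dt := cur.unixSeconds - prev.unixSeconds
	if dt <= 0 {
		return 0 // no elapsed time, no meaningful rate
	}
	delta := cur.value - prev.value
	if delta < 0 {
		delta = cur.value // counter reset between scrapes
	}
	return delta / dt
}

func main() {
	prev := sample{unixSeconds: 0, value: 100}
	cur := sample{unixSeconds: 15, value: 250}
	// 150 more jobs over 15 seconds.
	fmt.Println(perSecond(prev, cur))
}
```

This is why a counter is exported as a raw total: the reader can derive a rate over any interval it likes, and a missed scrape loses resolution but not events.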
🕳️ Knowledge Gaps
- Wiring a metrics interceptor cleanly so every RPC updates RED metrics without per-handler boilerplate.
✅ Summary
I can test the core with fakes and the gRPC edge with `bufconn`, write deterministic concurrency tests that pass `go test -race`, and expose RED-style counters in the Prometheus exposition format with safe, low-cardinality, copy-on-snapshot semantics.
⏭️ Next Steps / Prep for Tomorrow
- Day 139: write the docs — package docs/godoc, a project README, and Architecture Decision Records.
| Time spent | Difficulty | Confidence |
|---|---|---|
| 90 min | 🟦🟦⬜⬜⬜ | 🟦🟦🟦⬜⬜ |
Suggested commit: test(worker): race-clean tests + prometheus-style metrics (day 138)