Day 144 — Health & Readiness Endpoints

Month 6 · Week 1

🎯 Learning Objective

Implement the three production health probes — liveness, readiness, startup — with correct semantics, dependency aggregation, and graceful traffic draining.

📚 Topics

/healthz vs /readyz vs startup; what each failure does
Dependency checks with per-probe timeouts; readiness flip on shutdown

📖 Reading / Sources

📝 Notes

Three probes, three consequences → [[health-checks]]:
Liveness (/healthz) — "is the process wedged?" Failure ⇒ the orchestrator restarts the pod. Keep it dumb: return 200 if the HTTP server can answer. Never check dependencies here — a slow DB would trigger a restart loop that fixes nothing.
Readiness (/readyz) — "should I get traffic right now?" This does check dependencies. Failure ⇒ the pod is pulled from the load balancer with no restart — exactly right for a transient dependency blip or during shutdown.
Startup — guards slow-booting apps: liveness/readiness don't run until startup passes, so a long warmup isn't mistaken for a hang.
Each readiness check needs its own timeout via context.WithTimeout, so one hung dependency can't stall the whole probe → [[context]]. Run them, collect results, report the worst.
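Graceful shutdown order matters: on SIGTERM, first flip readiness to false (so the LB stops sending new traffic), wait long enough for the LB to observe the failing probe, then call srv.Shutdown(ctx) to finish in-flight requests. In the reverse order the LB keeps routing to a server that has already stopped listening, and those requests are dropped.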
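Drive shutdown from signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM). Shutdown closes the listeners and idle connections, then waits for active connections to go idle until the ctx deadline. If the deadline passes, it returns the context's error without closing the remaining connections, so call srv.Close() afterwards to force-close them.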
Return JSON with a per-component breakdown ({"status":"unavailable","components":{"postgres":"error: ..."}}) so dashboards and humans can see which dep failed. Status code is the contract: 200 ready, 503 not ready.
Don't make readiness too sensitive: a single optional cache being down shouldn't pull the whole service out of rotation. Distinguish hard deps (the DB: failure means 503) from soft ones (an optional cache: report it as degraded but stay ready).
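
The hard/soft distinction and the "report the worst" rule can be sketched roughly as below. This is a minimal illustration, not the code from the exercise: the `Dep` type, the `Evaluate` and `statusFor` helpers, and the choice to answer 200 with a "degraded" status for soft failures are all assumptions made for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// Dep is a hypothetical dependency descriptor. Hard marks deps whose
// failure should take the pod out of rotation.
type Dep struct {
	Name  string
	Hard  bool
	Check func(ctx context.Context) error
}

// Evaluate runs every check and returns the worst outcome:
// any hard failure => 503 "unavailable"; only soft failures => 200 "degraded".
func Evaluate(ctx context.Context, deps []Dep) (status string, code int, components map[string]string) {
	status, code = "ok", http.StatusOK
	components = make(map[string]string, len(deps))
	for _, d := range deps {
		err := d.Check(ctx)
		switch {
		case err == nil:
			components[d.Name] = "ok"
		case d.Hard:
			components[d.Name] = "error: " + err.Error()
			status, code = "unavailable", http.StatusServiceUnavailable
		default:
			components[d.Name] = "degraded: " + err.Error()
			if status == "ok" {
				status = "degraded" // never soften an existing "unavailable"
			}
		}
	}
	return status, code, components
}

// statusFor is a demo helper: postgres is a hard dep, cache is a soft dep,
// and each one passes or fails as requested.
func statusFor(postgresUp, cacheUp bool) (string, int) {
	mk := func(up bool) func(context.Context) error {
		return func(context.Context) error {
			if up {
				return nil
			}
			return errors.New("connection refused")
		}
	}
	s, c, _ := Evaluate(context.Background(), []Dep{
		{Name: "postgres", Hard: true, Check: mk(postgresUp)},
		{Name: "cache", Hard: false, Check: mk(cacheUp)},
	})
	return s, c
}

func main() {
	fmt.Println(statusFor(true, false))  // degraded 200
	fmt.Println(statusFor(false, false)) // unavailable 503
}
```

Keeping the status code at 200 for soft failures means the orchestrator leaves the pod in rotation, while the JSON body still tells a dashboard that something is off.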

💻 Code Examples

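// Readyz aggregates dependency checks, each with its own deadline, and
// reports a per-component breakdown as JSON.
// Imports: context, encoding/json, net/http, time.
func (c *Checker) Readyz(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    if c.shuttingDown.Load() { // flipped first on SIGTERM so the LB drains
        w.WriteHeader(http.StatusServiceUnavailable)
        _, _ = w.Write([]byte(`{"status":"unavailable"}`))
        return
    }
    status, code := "ok", http.StatusOK
    components := make(map[string]string, len(c.checks))
    for name, check := range c.checks {
        ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
        err := check(ctx)
        cancel() // release the timer now; a defer would pile up inside the loop
        if err != nil {
            // one hard dep down ⇒ NotReady (503), no restart
            status, code = "unavailable", http.StatusServiceUnavailable
            components[name] = "error: " + err.Error()
            continue
        }
        components[name] = "ok"
    }
    w.WriteHeader(code) // status code is the contract: 200 ready, 503 not ready
    _ = json.NewEncoder(w).Encode(map[string]any{"status": status, "components": components})
}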
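
The handler above runs its checks one after another, so the worst case is the sum of the per-check timeouts. One possible variation, sketched here under the assumption that checks are safe to call concurrently, runs each check in its own goroutine so the probe takes roughly as long as the slowest check. `RunAll`, `hung`, and `demo` are hypothetical names for this example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

type CheckFunc func(ctx context.Context) error

// RunAll runs every check in its own goroutine with its own deadline and
// returns a name -> error map (a nil error means the check passed).
// A check that ignores its ctx can still block RunAll.
func RunAll(parent context.Context, checks map[string]CheckFunc, perCheck time.Duration) map[string]error {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]error, len(checks))
	)
	for name, check := range checks {
		wg.Add(1)
		go func(name string, check CheckFunc) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(parent, perCheck)
			defer cancel()
			err := check(ctx)
			mu.Lock()
			results[name] = err
			mu.Unlock()
		}(name, check)
	}
	wg.Wait()
	return results
}

// hung simulates a dependency that never answers; it returns only when
// its context is cancelled or times out.
func hung(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// demo runs one healthy and three hung checks with a 100ms deadline each.
// It returns how many timed out and whether the run finished in under
// 250ms (run sequentially, the three timeouts alone would take 300ms).
func demo() (timedOut int, fast bool) {
	start := time.Now()
	res := RunAll(context.Background(), map[string]CheckFunc{
		"postgres": func(context.Context) error { return nil },
		"redis":    hung,
		"queue":    hung,
		"search":   hung,
	}, 100*time.Millisecond)
	for _, err := range res {
		if errors.Is(err, context.DeadlineExceeded) {
			timedOut++
		}
	}
	return timedOut, time.Since(start) < 250*time.Millisecond
}

func main() {
	fmt.Println(demo())
}
```

The trade-off is that every probe now hits all dependencies at once, which matters if probes are frequent.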

Full code: examples/month-06/healthcheck/main.go · Run: go run ./examples/month-06/healthcheck

🏋️ Exercises / Practice

Exercise	Status	Link
`health` — readiness aggregator (worst-wins)	✅	exercises/month-06/week-1/health

🐛 Mistakes Made

First put the DB check inside /healthz → a DB blip caused restart storms. Moved dep checks to /readyz.
Shut the server down before flipping readiness, so requests the LB was still routing to the pod were dropped. Fixed the order: flip readiness → drain → Shutdown.
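
The corrected order (flip readiness, drain, Shutdown) might look roughly like this. It is a sketch under assumptions, not the code from examples/month-06: the `app` type, `run`, `selfCheck`, the loopback address, and the drain and grace durations are all illustrative. The one-second parent timeout in `main` exists only so the demo exits without a signal.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// app holds the readiness flag shared by the probe handler and shutdown.
type app struct{ shuttingDown atomic.Bool }

func (a *app) readyz(w http.ResponseWriter, _ *http.Request) {
	if a.shuttingDown.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// run serves until ctx is done, then: flip readiness -> drain -> Shutdown.
func (a *app) run(ctx context.Context, addr string, drain, grace time.Duration) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", a.readyz)
	srv := &http.Server{Handler: mux}

	serveErr := make(chan error, 1)
	go func() { serveErr <- srv.Serve(ln) }()

	select {
	case err := <-serveErr:
		return err // server stopped on its own
	case <-ctx.Done(): // SIGTERM, Ctrl-C, or the demo timeout
	}

	a.shuttingDown.Store(true) // 1. /readyz now answers 503
	time.Sleep(drain)          // 2. give the LB time to stop routing here

	sctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	if err := srv.Shutdown(sctx); err != nil { // 3. finish in-flight requests
		_ = srv.Close() // grace expired: Shutdown does not force-close, Close does
		return err
	}
	if err := <-serveErr; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// selfCheck runs the sequence with a context that expires after 20ms
// (standing in for SIGTERM) and reports whether shutdown was clean.
func selfCheck() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	a := &app{}
	err := a.run(ctx, "127.0.0.1:0", 20*time.Millisecond, time.Second)
	return err == nil && a.shuttingDown.Load()
}

func main() {
	// A real service would pass context.Background() as the parent; the
	// timeout here only lets this demo exit on its own.
	parent, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	ctx, stop := signal.NotifyContext(parent, os.Interrupt, syscall.SIGTERM)
	defer stop()

	a := &app{}
	if err := a.run(ctx, "127.0.0.1:0", 200*time.Millisecond, 5*time.Second); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```

The drain delay is a judgment call: it should cover at least the time the load balancer needs to see a failed readiness probe and stop routing.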

❓ Open Questions

Should readiness checks cache their result briefly to avoid hammering the DB when probes are frequent? (Yes — cache for a few seconds.)
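
One way to cache a check result for a few seconds is a small wrapper around the check function. This is only a sketch: `Cached` and `countCalls` are hypothetical names, and the wrapper deliberately caches failures as well as successes.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

type CheckFunc func(ctx context.Context) error

// Cached wraps check so the underlying dependency is hit at most once per
// ttl; calls inside the window reuse the last result, including a failure.
// The mutex is held during the check, so concurrent probes wait for one
// in-flight check instead of each hitting the dependency.
func Cached(check CheckFunc, ttl time.Duration) CheckFunc {
	var (
		mu      sync.Mutex
		last    time.Time
		lastErr error
	)
	return func(ctx context.Context) error {
		mu.Lock()
		defer mu.Unlock()
		if !last.IsZero() && time.Since(last) < ttl {
			return lastErr
		}
		lastErr = check(ctx)
		last = time.Now()
		return lastErr
	}
}

// countCalls probes a cached check n times back to back and returns how
// often the underlying check actually ran.
func countCalls(n int, ttlMillis int) int {
	calls := 0
	check := Cached(func(context.Context) error {
		calls++
		return errors.New("db down")
	}, time.Duration(ttlMillis)*time.Millisecond)
	for i := 0; i < n; i++ {
		_ = check(context.Background())
	}
	return calls
}

func main() {
	fmt.Println(countCalls(10, 3000)) // 1: nine probes served from cache
	fmt.Println(countCalls(10, 0))    // 10: ttl of zero disables caching
}
```

The cost is staleness: with a TTL of a few seconds, readiness can lag a real outage or recovery by up to that long.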

🧠 Active Recall (answer without looking)

1. Q: Why must liveness not check the database?

A: A failing liveness probe causes a restart. A DB outage isn't fixed by restarting the app, so dep-checking liveness creates pointless restart storms. Liveness should only prove the process itself can serve.

2. Q: In what order do you flip readiness vs. call srv.Shutdown on SIGTERM, and why?

A: Flip readiness to false first so the load balancer stops sending new traffic, give it a moment to drain, then call Shutdown to finish in-flight requests. Reversing it drops requests that arrive during the gap.

🪶 Feynman Reflection

Liveness is "is the patient alive?" — if not, restart. Readiness is "is the patient ready to take visitors?" — if not, send visitors elsewhere but don't restart. Keeping those two questions separate is what prevents a sneeze (a brief DB blip) from triggering surgery (a restart loop).

🕳️ Knowledge Gaps

Probe tuning (initialDelaySeconds, failureThreshold) for slow-starting JVM-like services — N/A for Go but worth knowing.
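
As a rough rule of thumb for that tuning, the time a container gets before repeated probe failures trigger a restart is about initialDelaySeconds plus periodSeconds times failureThreshold. The sketch below only does that arithmetic. `ProbeConfig` and `BudgetSeconds` are hypothetical names, periodSeconds is a Kubernetes probe field not mentioned above, and the result is an approximation that ignores per-probe timeouts.

```go
package main

import "fmt"

// ProbeConfig mirrors three Kubernetes probe fields.
type ProbeConfig struct {
	InitialDelaySeconds int
	PeriodSeconds       int
	FailureThreshold    int
}

// BudgetSeconds approximates how long a container may keep failing the
// probe before the failure threshold is reached.
func (p ProbeConfig) BudgetSeconds() int {
	return p.InitialDelaySeconds + p.PeriodSeconds*p.FailureThreshold
}

func main() {
	slowBoot := ProbeConfig{InitialDelaySeconds: 0, PeriodSeconds: 10, FailureThreshold: 30}
	fmt.Println(slowBoot.BudgetSeconds()) // 300
}
```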

✅ Summary

I can implement liveness/readiness/startup with correct consequences, aggregate dependency checks with per-probe timeouts, and drain traffic gracefully on shutdown.

⏭️ Next Steps / Prep for Tomorrow

Day 145: thread a correlation ID through context and into every log line.

Time spent	Difficulty	Confidence
90 min	🟦🟦⬜⬜⬜	🟦🟦🟦🟦⬜

Suggested commit: feat(examples): liveness/readiness probes with graceful drain (day 144)