Day 144: Health & Readiness Endpoints
Month 6 · Week 1
🎯 Learning Objective
Implement the three production health probes — liveness, readiness, startup — with correct semantics, dependency aggregation, and graceful traffic draining.
📚 Topics
- `/healthz` vs `/readyz` vs startup; what each failure does
- Dependency checks with per-probe timeouts; readiness flip on shutdown
📖 Reading / Sources
- Kubernetes: Configure Liveness, Readiness and Startup Probes
- `net/http.Server.Shutdown`
- Google SRE: Handling overload / health checking
📝 Notes
- Three probes, three consequences → [[health-checks]]:
    - Liveness (`/healthz`): "is the process wedged?" Failure ⇒ the kubelet restarts the container. Keep it dumb: return 200 if the HTTP server can answer. Never check dependencies here; a slow DB would trigger a restart loop that fixes nothing.
    - Readiness (`/readyz`): "should I get traffic right now?" This does check dependencies. Failure ⇒ the pod is removed from the Service endpoints (pulled from the load balancer) with no restart, which is exactly right for a transient dependency blip or during shutdown.
    - Startup: guards slow-booting apps. Liveness and readiness don't run until startup passes, so a long warmup isn't mistaken for a hang.
- Each readiness check needs its own timeout via `context.WithTimeout`, so one hung dependency can't stall the whole probe → [[context]]. This only helps if the check honors its ctx. Run them, collect results, report the worst.
- Graceful shutdown order matters: on `SIGTERM`, first flip readiness to false (so the LB stops sending new traffic), wait long enough for the LB to observe the failing probe (roughly `periodSeconds` × `failureThreshold`), then call `srv.Shutdown(ctx)` to finish in-flight requests. The reverse order closes the listener while the LB is still routing to the pod, so new requests are refused.
- Drive shutdown from `signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM)`. `Shutdown` closes the listeners, closes idle connections, and waits for active ones to finish until the ctx deadline. If the deadline passes first, it returns the context's error without force-closing anything; call `srv.Close()` for that.
- Return JSON with a per-component breakdown (`{"status":"unavailable","components":{"postgres":"error: ..."}}`) so dashboards and humans can see which dep failed. Status code is the contract: 200 ready, 503 not ready.
- Don't make readiness too sensitive: a single optional cache being down shouldn't pull the whole service. Distinguish hard deps (DB down ⇒ 503) from soft ones (cache down ⇒ report degraded, stay ready).
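A minimal sketch of the worst-wins aggregation with a hard/soft split, as described in the notes above. The `Check`, `Report`, `Evaluate`, and `probe` names, the `"degraded"` status string, and the stubbed postgres/redis checks are all assumptions made for illustration; they are not taken from the journal's example code.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Check is a hypothetical dependency probe. Hard marks deps the service cannot run without.
type Check struct {
	Hard bool
	Fn   func(ctx context.Context) error
}

// Report is the JSON body: an overall status plus one entry per component.
type Report struct {
	Status     string            `json:"status"`
	Components map[string]string `json:"components"`
}

// Evaluate runs each check under its own timeout and keeps the worst result:
// a failed hard dep gives "unavailable", a failed soft dep gives "degraded".
func Evaluate(ctx context.Context, checks map[string]Check, perCheck time.Duration) Report {
	rep := Report{Status: "ok", Components: make(map[string]string, len(checks))}
	for name, c := range checks {
		cctx, cancel := context.WithTimeout(ctx, perCheck)
		err := c.Fn(cctx)
		cancel() // release the timer now rather than deferring inside the loop
		switch {
		case err == nil:
			rep.Components[name] = "ok"
		case c.Hard:
			rep.Components[name] = "error: " + err.Error()
			rep.Status = "unavailable"
		default:
			rep.Components[name] = "degraded: " + err.Error()
			if rep.Status == "ok" { // never downgrade "unavailable"
				rep.Status = "degraded"
			}
		}
	}
	return rep
}

// Readyz maps the report onto the status-code contract: 503 only for a failed hard dep.
func Readyz(checks map[string]Check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		rep := Evaluate(r.Context(), checks, 2*time.Second)
		code := http.StatusOK
		if rep.Status == "unavailable" {
			code = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(rep)
	}
}

// probe wires a stubbed postgres (hard) and redis (soft) check, serves one
// request through httptest, and returns the status code and body.
func probe(dbUp, cacheUp bool) (int, string) {
	stub := func(up bool) func(context.Context) error {
		return func(context.Context) error {
			if up {
				return nil
			}
			return errors.New("connection refused")
		}
	}
	checks := map[string]Check{
		"postgres": {Hard: true, Fn: stub(dbUp)},
		"redis":    {Hard: false, Fn: stub(cacheUp)},
	}
	rec := httptest.NewRecorder()
	Readyz(checks)(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	for _, c := range [][2]bool{{true, true}, {true, false}, {false, true}} {
		code, body := probe(c[0], c[1])
		fmt.Printf("%d %s", code, body)
	}
}
```

The checks run one after another here, so the worst-case probe latency is the sum of the per-check timeouts. Running them concurrently would bound it to a single timeout, at the cost of a little more code.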
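💻 Code Examples
```go
package main

import (
	"context"
	"encoding/json"
	"net/http"
	"sync/atomic"
	"time"
)

// Checker holds the shutdown flag and the dependency checks (fields inferred from the usage below).
type Checker struct {
	shuttingDown atomic.Bool
	checks       map[string]func(context.Context) error
}

// Readyz aggregates dependency checks, each with its own deadline.
func (c *Checker) Readyz(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	if c.shuttingDown.Load() { // flipped first on SIGTERM so the LB drains
		w.WriteHeader(http.StatusServiceUnavailable)
		_, _ = w.Write([]byte(`{"status":"unavailable"}`))
		return
	}
	failed := map[string]string{} // component name -> error text
	for name, check := range c.checks {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		err := check(ctx)
		cancel() // release the timer right away instead of deferring in a loop
		if err != nil {
			failed[name] = "error: " + err.Error() // one hard dep down ⇒ NotReady (503), no restart
		}
	}
	if len(failed) > 0 {
		w.WriteHeader(http.StatusServiceUnavailable)
		_ = json.NewEncoder(w).Encode(map[string]any{"status": "unavailable", "components": failed})
		return
	}
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write([]byte(`{"status":"ok"}`))
}
```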
Full code: `examples/month-06/healthcheck/main.go` · Run: `go run ./examples/month-06/healthcheck`
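The snippet above covers only the readiness side. A rough sketch of the matching shutdown sequence (flip readiness, drain, then `Shutdown`) could look like the following. The function names `serveUntilDone` and `demo`, the ephemeral port, the drain and grace durations, and the demo timeout that stands in for a real `SIGTERM` are all assumptions for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// serveUntilDone serves /readyz until ctx is cancelled, then drains in the safe order.
// It returns the /readyz status code observed during the drain window.
func serveUntilDone(ctx context.Context, drain, grace time.Duration) (int, error) {
	var shuttingDown atomic.Bool
	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	ln, err := net.Listen("tcp", "127.0.0.1:0") // ephemeral port, demo only
	if err != nil {
		return 0, err
	}
	srv := &http.Server{Handler: mux}
	serveErr := make(chan error, 1)
	go func() { serveErr <- srv.Serve(ln) }()

	<-ctx.Done() // SIGTERM/SIGINT (or the demo timeout) arrived

	shuttingDown.Store(true) // 1. fail readiness so the LB stops routing here
	code := 0
	if resp, err := http.Get("http://" + ln.Addr().String() + "/readyz"); err == nil {
		code = resp.StatusCode // the server still answers, but now with 503
		resp.Body.Close()
	}
	time.Sleep(drain) // 2. give the LB time to observe the failing probe

	shCtx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	if err := srv.Shutdown(shCtx); err != nil { // 3. finish in-flight requests
		_ = srv.Close() // grace period expired: drop whatever is left
		return code, err
	}
	if err := <-serveErr; !errors.Is(err, http.ErrServerClosed) {
		return code, err
	}
	return code, nil
}

// demo wires the signal handling and runs one full serve/drain/shutdown cycle.
func demo() (int, error) {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()
	// Demo only: also cancel after 100ms so the example exits without a real signal.
	ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer cancel()
	return serveUntilDone(ctx, 50*time.Millisecond, 5*time.Second)
}

func main() {
	code, err := demo()
	fmt.Println("readyz during drain:", code, "shutdown error:", err)
}
```

In a real deployment the drain sleep would be sized to the readiness probe's period and failure threshold, and the whole sequence has to fit inside the pod's termination grace period.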
🏋️ Exercises / Practice
| Exercise | Status | Link |
|---|---|---|
| `health`: readiness aggregator (worst-wins) | ✅ | exercises/month-06/week-1/health |
🐛 Mistakes Made
- First put the DB check inside `/healthz` → a DB blip caused restart storms. Moved dep checks to `/readyz`.
- Shut the server down before flipping readiness, so requests the LB was still routing to the pod were refused. Fixed the order: flip readiness → drain → `Shutdown`.
❓ Open Questions
- Should readiness checks cache their result briefly to avoid hammering the DB when probes are frequent? (Yes — cache for a few seconds.)
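One way to cache a check result for a few seconds is a small wrapper with a TTL. This is a sketch under assumptions: the `CachedCheck` type, its fields, and the `simulate` helper are hypothetical names, and the fake clock exists only so the demo runs without sleeping.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// CachedCheck wraps a dependency check and reuses its last result for TTL.
// The type and its fields are illustrative, not from the journal's code.
type CachedCheck struct {
	Check func(ctx context.Context) error
	TTL   time.Duration
	Now   func() time.Time // injectable clock so the demo needs no sleeping

	mu      sync.Mutex
	last    time.Time
	lastErr error
	valid   bool
}

// Run returns the cached result while it is fresh, otherwise re-runs the check.
// Holding the mutex during the check also collapses concurrent probes into one call.
func (c *CachedCheck) Run(ctx context.Context) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.valid && c.Now().Sub(c.last) < c.TTL {
		return c.lastErr
	}
	c.lastErr = c.Check(ctx)
	c.last = c.Now()
	c.valid = true
	return c.lastErr
}

// simulate fires `probes` probes spaced stepSec apart on a fake clock and
// returns how many of them actually reached the dependency.
func simulate(probes, ttlSec, stepSec int) int {
	clock := time.Unix(0, 0)
	calls := 0
	c := &CachedCheck{
		Check: func(context.Context) error { calls++; return nil },
		TTL:   time.Duration(ttlSec) * time.Second,
		Now:   func() time.Time { return clock },
	}
	for i := 0; i < probes; i++ {
		_ = c.Run(context.Background())
		clock = clock.Add(time.Duration(stepSec) * time.Second)
	}
	return calls
}

func main() {
	// 10 probes, 1s apart, 3s TTL: only probes at t=0, 3, 6, 9 hit the dependency.
	fmt.Println(simulate(10, 3, 1)) // prints 4
}
```

The trade-off is that errors are cached too, so recovery is reported up to one TTL late. Keeping the TTL well below the time the probe needs to flip state limits that delay.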
🧠 Active Recall (answer without looking)
1. Q: Why must liveness not check the database?
   A: A failing liveness probe causes a restart. A DB outage isn't fixed by restarting the app, so dep-checking liveness creates pointless restart storms. Liveness should only prove the process itself can serve.
2. Q: In what order do you flip readiness vs. call `srv.Shutdown` on `SIGTERM`, and why?
   A: Flip readiness to false first so the load balancer stops sending new traffic, give it a moment to drain, then call `Shutdown` to finish in-flight requests. Reversing it drops requests that arrive during the gap.
🪶 Feynman Reflection
Liveness is "is the patient alive?" — if not, restart. Readiness is "is the patient ready to take visitors?" — if not, send visitors elsewhere but don't restart. Keeping those two questions separate is what prevents a sneeze (a brief DB blip) from triggering surgery (a restart loop).
🕳️ Knowledge Gaps
- Probe tuning (`initialDelaySeconds`, `failureThreshold`) for slow-starting JVM-like services. Mostly N/A for Go, but worth knowing.
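As a starting point for that gap, the arithmetic behind a startup probe's time budget can be sketched as below. The `startupBudget` helper is a hypothetical name, and the formula is an approximation that ignores `timeoutSeconds`; the 30 × 10 s = 300 s case is the example given in the Kubernetes probe documentation.

```go
package main

import (
	"fmt"
	"time"
)

// startupBudget approximates how long a container may take to start before the
// startup probe gives up: the initial delay plus failureThreshold probe periods.
func startupBudget(initialDelaySeconds, periodSeconds, failureThreshold int) time.Duration {
	return time.Duration(initialDelaySeconds+periodSeconds*failureThreshold) * time.Second
}

func main() {
	// failureThreshold=30 with periodSeconds=10 allows up to 30 * 10 = 300s.
	fmt.Println(startupBudget(0, 10, 30)) // prints 5m0s
}
```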
✅ Summary
I can implement liveness/readiness/startup with correct consequences, aggregate dependency checks with per-probe timeouts, and drain traffic gracefully on shutdown.
⏭️ Next Steps / Prep for Tomorrow
- Day 145: thread a correlation ID through context and into every log line.
| Time spent | Difficulty | Confidence |
|---|---|---|
| 90 min | 🟦🟦⬜⬜⬜ | 🟦🟦🟦🟦⬜ |
Suggested commit: feat(examples): liveness/readiness probes with graceful drain (day 144)