Day 165 — Capstone: Observability & Deploy

Month 6 · Week 4

🎯 Learning Objective

Make linkr observable (structured logs, RED metrics, distributed traces) and deployable (multi-stage image, compose stack, health probes, graceful shutdown) — wiring together everything from Weeks 1–2.

📚 Topics

  • The three pillars: log/slog · Prometheus metrics · OpenTelemetry traces
  • Deploy: distroless image · /healthz vs /readyz · signal.NotifyContext + Shutdown

📖 Reading / Sources

📝 Notes

  • Observability is middleware. Logging, metrics, and tracing bolt on as HTTP middleware / gRPC interceptors — the service stays clean. Each request flows through: traceID extract → log with context → time the handler → record metrics → propagate trace → [[http-middleware]].
  • slog is the spine: one JSON logger, With(slog.String("request_id", id)) per request, SetDefault so library code participates → [[structured-logging]].
  • RED metrics (Rate, Errors, Duration) per endpoint: a CounterVec for requests by code, a HistogramVec for latency. Expose /metrics; Prometheus scrapes it → [[metrics]].
  • Tracing propagates a W3C traceparent across REST→gRPC→DB hops; the OTel SDK exports spans to a collector. The trace-id ties logs, metrics, and spans together → [[trace-context]].
  • One correlation id threads through all three pillars: middleware mints/extracts it, stuffs it in the context, and a context-reading log handler stamps every line → [[correlation-id]].
  • Deploy lifecycle (from Week 2): static CGO_ENABLED=0 binary → distroless static:nonroot image → compose stack (app + postgres + redis) → /healthz (liveness, no deps) vs /readyz (readiness, checks DB+cache) → signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM) + srv.Shutdown(ctx) drains in-flight requests → [[graceful-shutdown]].
  • Readiness flips to false at shutdown before draining, so the load balancer stops sending new traffic while in-flight requests finish.
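The probe split and the shutdown ordering from the last two notes can be sketched with the standard library alone. This is a minimal sketch, not linkr's actual code: `newMux`, `probe`, `run`, `depsUp`, and `depsDown` are hypothetical names, the dependency check is a stand-in for pinging the DB and cache, and the 10-second drain budget is an assumption.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"net/http/httptest"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// ready is flipped to false at the start of shutdown so /readyz fails first.
var ready atomic.Bool

// depsUp and depsDown are stand-ins for a real DB+cache check (assumption).
func depsUp(context.Context) error   { return nil }
func depsDown(context.Context) error { return errors.New("db unreachable") }

// newMux wires the two probes around a dependency check.
func newMux(depsOK func(context.Context) error) *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: process-only, never touches dependencies.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: fails while shutting down or when a dependency is unavailable.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() || depsOK(r.Context()) != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe returns the status code a path answers with, without a real listener.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

// run serves until SIGINT/SIGTERM, then flips readiness and drains.
func run(addr string, depsOK func(context.Context) error) error {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	srv := &http.Server{Addr: addr, Handler: newMux(depsOK)}
	ready.Store(true)

	errc := make(chan error, 1)
	go func() { errc <- srv.ListenAndServe() }()

	select {
	case err := <-errc:
		return err // listener failed before any signal arrived
	case <-ctx.Done():
	}

	ready.Store(false) // stop advertising readiness before draining
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		return err
	}
	if err := <-errc; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}
```

Calling `run(":8080", depsUp)` from `main` would start the server. In a real deployment a short pause between `ready.Store(false)` and `srv.Shutdown` may be needed so the load balancer has time to observe the failing readiness probe; the right delay depends on the probe interval.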

💻 Code Examples

Observability middleware shape (RED + logging + trace), stdlib-expressible:

func observe(reg *Metrics, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rid := requestID(r) // extract or mint a correlation id
        ctx := context.WithValue(r.Context(), ridKey{}, rid)

        // Capture the status; it stays 200 if the handler never calls WriteHeader.
        sw := &statusWriter{ResponseWriter: w, code: http.StatusOK}
        next.ServeHTTP(sw, r.WithContext(ctx))

        dur := time.Since(start) // measure once so the log line and the metric agree
        slog.InfoContext(ctx, "request",
            slog.String("request_id", rid),
            slog.String("method", r.Method),
            slog.String("path", r.URL.Path),
            slog.Int("status", sw.code),
            slog.Duration("dur", dur),
        )
        // RED metrics. NOTE: r.URL.Path is high-cardinality for routes like
        // /links/{code}; pass the route pattern instead (see Mistakes Made).
        reg.observe(r.URL.Path, sw.code, dur)
    })
}
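The middleware above leans on three helpers it does not define: `ridKey`, `statusWriter`, and `requestID`. One plausible shape for them is sketched below. The `X-Request-ID` header name and the 8-byte random id are assumptions, and `capturedStatus` and `idFor` are hypothetical helpers that exist only to exercise the sketch without a running server.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"net/http"
	"net/http/httptest"
)

// ridKey is an unexported context key type, so no other package can collide with it.
type ridKey struct{}

// statusWriter records the status code the wrapped handler writes.
type statusWriter struct {
	http.ResponseWriter
	code int
}

// WriteHeader remembers the code before delegating to the real writer.
func (w *statusWriter) WriteHeader(code int) {
	w.code = code
	w.ResponseWriter.WriteHeader(code)
}

// requestID reuses an inbound X-Request-ID header (assumed name) or mints a new id.
func requestID(r *http.Request) string {
	if id := r.Header.Get("X-Request-ID"); id != "" {
		return id
	}
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b) // 16 hex characters
}

// capturedStatus runs a handler that writes status (or only a body when
// status is 0) through a statusWriter and returns the recorded code.
func capturedStatus(status int) int {
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if status != 0 {
			w.WriteHeader(status)
		}
		w.Write([]byte("ok"))
	})
	sw := &statusWriter{ResponseWriter: httptest.NewRecorder(), code: http.StatusOK}
	h.ServeHTTP(sw, httptest.NewRequest(http.MethodGet, "/", nil))
	return sw.code
}

// idFor returns the id requestID picks for a request carrying the given header value.
func idFor(header string) string {
	r := httptest.NewRequest(http.MethodGet, "/", nil)
	if header != "" {
		r.Header.Set("X-Request-ID", header)
	}
	return requestID(r)
}
```

Embedding `http.ResponseWriter` hides optional interfaces such as `http.Flusher`; handlers that stream responses would need the wrapper to forward those as well.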

The correlation, health, and graceful-shutdown building blocks from Weeks 1–2 are runnable: examples/month-06/correlation, examples/month-06/healthcheck, examples/month-06/graceful.
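For reference, the context-reading log handler mentioned in the notes can be sketched as a thin wrapper around any `slog.Handler`. This is an illustration under assumptions, not the code in examples/month-06/correlation: `ctxHandler`, `logLine`, and `hasField` are hypothetical names, and the `ridKey` type is redeclared here so the block stands alone.

```go
package main

import (
	"bytes"
	"context"
	"log/slog"
	"strings"
)

// ridKey mirrors the context key the middleware uses for the correlation id.
type ridKey struct{}

// ctxHandler wraps another slog.Handler and adds request_id from the context.
type ctxHandler struct{ slog.Handler }

func (h ctxHandler) Handle(ctx context.Context, rec slog.Record) error {
	if rid, ok := ctx.Value(ridKey{}).(string); ok {
		rec = rec.Clone() // avoid mutating state shared with other copies
		rec.AddAttrs(slog.String("request_id", rid))
	}
	return h.Handler.Handle(ctx, rec)
}

// WithAttrs and WithGroup must re-wrap, or derived loggers lose the behaviour.
func (h ctxHandler) WithAttrs(as []slog.Attr) slog.Handler {
	return ctxHandler{h.Handler.WithAttrs(as)}
}

func (h ctxHandler) WithGroup(name string) slog.Handler {
	return ctxHandler{h.Handler.WithGroup(name)}
}

// logLine logs one message and returns the JSON output. An empty rid means
// "no correlation id in the context".
func logLine(rid, msg string) string {
	var buf bytes.Buffer
	logger := slog.New(ctxHandler{slog.NewJSONHandler(&buf, nil)})
	ctx := context.Background()
	if rid != "" {
		ctx = context.WithValue(ctx, ridKey{}, rid)
	}
	logger.InfoContext(ctx, msg)
	return buf.String()
}

// hasField reports whether a JSON log line contains the string field key=val.
func hasField(line, key, val string) bool {
	return strings.Contains(line, `"`+key+`":"`+val+`"`)
}
```

The handler only sees the id when callers use the context-aware methods (`InfoContext`, `ErrorContext`, and so on); a plain `slog.Info` passes a background context and the line goes out unstamped.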

Prometheus instrumentation (third-party — snippet only):

var reqDur = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request latency in seconds, by route pattern and status code.",
    Buckets: prometheus.DefBuckets,
}, []string{"route", "code"})

func init() { prometheus.MustRegister(reqDur) }

// In the middleware (route is the registered pattern, e.g. /links/{code}):
//   reqDur.WithLabelValues(route, strconv.Itoa(code)).Observe(dur.Seconds())
// Expose the scrape endpoint:
//   mux.Handle("/metrics", promhttp.Handler())

Dockerfile (deploy — config, not Go; snippet only):

FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /linkr ./cmd/linkr

FROM gcr.io/distroless/static:nonroot
COPY --from=build /linkr /linkr
USER nonroot:nonroot
ENTRYPOINT ["/linkr"]

🏋️ Exercises / Practice

| Exercise | Status | Link |
| --- | --- | --- |
| Reuse Week-1 correlation/healthcheck building blocks | | examples/month-06/correlation |
| Reuse Week-2 graceful shutdown pattern | | examples/month-06/graceful |

🐛 Mistakes Made

  • Put DB connectivity checks in /healthz → a transient DB blip got the container killed by the orchestrator. Liveness must check only the process; dependency checks belong in /readyz.
  • Logged the raw Authorization header at debug → leaked a token. Added a ReplaceAttr redaction hook (from [[day-141]]).
  • High-cardinality metric label (the full path with the {code} interpolated) blew up Prometheus memory. Use the route pattern (/links/{code}), not the concrete path.
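The redaction fix from the second mistake can be sketched with `slog.HandlerOptions.ReplaceAttr`. This is a minimal illustration rather than the hook from [[day-141]]: the list of sensitive keys is an assumption, and `redact`, `logAttr`, and `contains` are hypothetical names.

```go
package main

import (
	"bytes"
	"log/slog"
	"strings"
)

// sensitive lists attribute keys whose values must never reach the log sink.
// The key names are assumptions; match them to whatever the handlers log.
var sensitive = map[string]bool{"authorization": true, "cookie": true, "password": true}

// redact is a slog ReplaceAttr hook that masks sensitive values.
func redact(groups []string, a slog.Attr) slog.Attr {
	if sensitive[strings.ToLower(a.Key)] {
		return slog.String(a.Key, "[REDACTED]")
	}
	return a
}

// logAttr logs a single key/value at debug level through the redacting
// JSON handler and returns the emitted line.
func logAttr(key, value string) string {
	var buf bytes.Buffer
	h := slog.NewJSONHandler(&buf, &slog.HandlerOptions{
		Level:       slog.LevelDebug,
		ReplaceAttr: redact,
	})
	slog.New(h).Debug("inbound", slog.String(key, value))
	return buf.String()
}

// contains is a small helper for checking the emitted line.
func contains(s, sub string) bool { return strings.Contains(s, sub) }
```

Matching on the attribute key only works when secrets are logged under predictable keys; a token embedded in a free-text message would still slip through, so the safer habit is not to log such headers at all.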

❓ Open Questions

  • Sampling rate for traces in "prod" — 100% is fine for the demo but unrealistic at scale; tail-based sampling is the real answer.

🧠 Active Recall (answer without looking)

  1. Q: Why must liveness (/healthz) ignore dependencies that readiness (/readyz) checks?

     A: Liveness answers "is the process wedged?"; a false here makes the orchestrator restart the container. If liveness checked the DB, a transient DB outage would trigger pointless restarts that don't fix anything. Readiness answers "should I get traffic right now?"; a false there just removes the pod from the load balancer until the dependency recovers.

  2. Q: What ties a log line, a metric, and a span together for one request?

     A: A shared correlation/trace id carried in the request context. Middleware extracts or mints it, the log handler stamps every line with it, and the tracer uses it as the trace-id propagated via the W3C traceparent header, so you can pivot from a slow metric to its logs to its trace.
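To make the traceparent part concrete: the header value has four dash-separated fields, `version-traceid-parentid-flags`, and the 32-hex-character trace-id is the piece that logs and spans share. The sketch below pulls it out by hand. It is illustrative only, since in linkr the OTel propagator would normally do this; `traceID` is a hypothetical helper, and it is deliberately simplified (it accepts uppercase hex and only the four-field layout).

```go
package main

import (
	"encoding/hex"
	"strings"
)

// traceID extracts the trace-id field from a W3C traceparent header value of
// the form "version-traceid-parentid-flags". It returns "" if the value is
// malformed or carries the invalid all-zero trace-id.
func traceID(traceparent string) string {
	parts := strings.Split(traceparent, "-")
	if len(parts) != 4 {
		return ""
	}
	// Field widths: 2, 32, 16, and 2 hex characters respectively.
	if len(parts[0]) != 2 || len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return ""
	}
	if _, err := hex.DecodeString(parts[1]); err != nil {
		return ""
	}
	if parts[1] == strings.Repeat("0", 32) {
		return ""
	}
	return parts[1]
}
```

A middleware could log this value as `trace_id` next to `request_id`, which is what lets a log search jump straight to the matching trace.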

🪶 Feynman Reflection

Observability is answering "what is my service doing right now?" from outside the process. Logs tell the story of one request, metrics aggregate the health of all requests, traces show one request's path across services — and a single correlation id is the thread that lets me jump between all three. Deploy is the other half: package the binary tiny and locked-down, tell the orchestrator how to know it's alive vs ready, and promise to finish in-flight work before exiting.

🕳️ Knowledge Gaps

  • Exemplars (linking a Prometheus histogram bucket to a trace id) — powerful but I haven't wired them; note for later.

✅ Summary

linkr is now observable (slog + RED metrics + OTel traces stitched by one correlation id) and deployable (distroless image, liveness/readiness split, graceful shutdown), composing the whole month's production skills.

⏭️ Next Steps / Prep for Tomorrow

  • Day 166: integration tests across the real stack (DB + cache) with testcontainers and httptest.

| Time spent | Difficulty | Confidence |
| --- | --- | --- |
| 90 min | 🟦🟦🟦⬜⬜ | 🟦🟦🟦⬜⬜ |

Suggested commit: feat(linkr): observability (slog/metrics/traces) + deploy (docker/health/shutdown) (day 165)