
Day 142 — Prometheus Metrics (client_golang)

Month 6 · Week 1

🎯 Learning Objective

Instrument a Go service with the four Prometheus metric types and expose them for scraping, using prometheus/client_golang correctly (registration, labels, histograms, the RED method).

📚 Topics

  • Counter · Gauge · Histogram · Summary; labels & cardinality
  • promauto, the default registry, promhttp.Handler, exposition format

📖 Reading / Sources

📝 Notes

  • Four metric types → [[metrics]]:
  • Counter — monotonically increasing total (requests, errors). Only Inc/Add; queried with rate(). Never goes down except on process restart (a reset, which rate() handles).
  • Gauge — a value that goes up and down (in-flight requests, queue depth, temperature). Set/Inc/Dec.
  • Histogram — bucketed observations (latency, payload size). Pre-defined buckets; gives you _count, _sum, and _bucket{le=...}; quantiles computed server-side with histogram_quantile() → aggregatable across instances.
  • Summary — client-side quantiles; the precomputed quantiles cannot be aggregated across instances (only _sum and _count can). Prefer histograms unless you need an exact local quantile.
  • Labels add a time series per combination. Keep label values bounded — never put user IDs, emails, or raw URLs in a label, or you get cardinality explosion → [[cardinality]]. Use the route template (/users/{id}), not the concrete path.
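  • Every metric must be registered exactly once. promauto registers with the default registry on creation; registering the same name a second time panics (plain Register returns an error instead of panicking). Define metrics once as package-level vars, e.g. in a dedicated metrics.go.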
  • Expose with promhttp.Handler() at /metrics; Prometheus scrapes it on an interval. The body is the text exposition format (# HELP, # TYPE, name{label="v"} value).
  • RED method for request-driven services: Rate, Errors, Duration — a counter for requests, a counter (or label) for errors, a histogram for latency. (USE — Utilization/Saturation/Errors — is the resource-side counterpart.)
  • Histogram buckets are cumulative (le = "less than or equal"); choose buckets around your SLO (e.g. 5ms…2.5s), not the defaults, for latency.
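The cumulative-bucket and server-side-quantile notes above can be sketched with the standard library alone. This is not client_golang code: the `histogram` type and the `merge` and `quantile` helpers are hypothetical names, the bucket bounds are made up, and the interpolation is a simplified version of the idea behind histogram_quantile().

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// histogram is a stdlib-only stand-in for a classic fixed-bucket histogram.
type histogram struct {
	bounds []float64 // finite upper bounds (le), sorted ascending
	counts []uint64  // per-bucket counts; the extra last slot is the +Inf bucket
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe counts v in the first bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	h.counts[sort.SearchFloat64s(h.bounds, v)]++
}

// cumulative returns counts the way _bucket{le="..."} exposes them:
// each entry includes every observation at or below its bound.
func (h *histogram) cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

// merge adds two replicas' cumulative counts; valid only for identical bounds.
func merge(a, b []uint64) []uint64 {
	out := make([]uint64, len(a))
	for i := range a {
		out[i] = a[i] + b[i]
	}
	return out
}

// quantile estimates q by linear interpolation inside the bucket that
// contains the target rank. Observations are assumed to be non-negative.
func quantile(q float64, bounds []float64, cum []uint64) float64 {
	total := cum[len(cum)-1]
	if total == 0 {
		return math.NaN()
	}
	rank := q * float64(total)
	i := sort.Search(len(cum), func(i int) bool { return float64(cum[i]) >= rank })
	if i == len(bounds) {
		return bounds[len(bounds)-1] // rank is in +Inf: report the highest finite bound
	}
	lower, prev := 0.0, 0.0
	if i > 0 {
		lower, prev = bounds[i-1], float64(cum[i-1])
	}
	inBucket := float64(cum[i]) - prev
	if inBucket == 0 {
		return lower
	}
	return lower + (bounds[i]-lower)*(rank-prev)/inBucket
}

func main() {
	bounds := []float64{0.1, 0.5, 1} // seconds; hypothetical buckets
	a, b := newHistogram(bounds), newHistogram(bounds)
	for _, v := range []float64{0.05, 0.05, 0.3, 0.1} {
		a.observe(v)
	}
	for _, v := range []float64{0.7, 2.0, 0.4, 0.2} {
		b.observe(v)
	}
	merged := merge(a.cumulative(), b.cumulative())
	fmt.Println(merged) // [3 6 7 8]; the last entry is the +Inf bucket
	fmt.Printf("p50 ~ %.3fs\n", quantile(0.5, bounds, merged))
}
```

Because the replicas share bucket bounds, their counts simply add, which is why histograms aggregate and per-instance Summary quantiles do not. The estimate is only as precise as the bucket that contains the quantile, hence the advice to place bounds around the SLO.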

💻 Code Examples

client_golang is third-party, so this is a snippet (no runnable stdlib example). The mechanics of a counter registry + exposition format are rebuilt with the stdlib in examples/month-05/metrics.

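package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // promauto registers each metric with the default registry on creation.
    httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests.",
    }, []string{"route", "method", "code"}) // bounded label values only

    httpDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency.",
        Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5}, // SLO-shaped
    }, []string{"route"})
)

// statusRecorder remembers the status code the wrapped handler writes.
type statusRecorder struct {
    http.ResponseWriter
    code int
}

func (sr *statusRecorder) WriteHeader(code int) {
    sr.code = code
    sr.ResponseWriter.WriteHeader(code)
}

// instrument records the RED signals for one route; pass the route template, not the concrete path.
func instrument(route string, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        sr := &statusRecorder{ResponseWriter: w, code: 200} // 200 if WriteHeader is never called
        next.ServeHTTP(sr, r)
        httpDuration.WithLabelValues(route).Observe(time.Since(start).Seconds())
        httpRequests.WithLabelValues(route, r.Method, strconv.Itoa(sr.code)).Inc()
    })
}

func main() {
    http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
    // Mount application handlers through instrument(...) and start the server here.
}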
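To make the exposition format from the notes concrete, here is a minimal stdlib-only sketch of rendering one metric family. It is an independent illustration, not the contents of examples/month-05/metrics and not how promhttp is implemented; `sample` and `render` are hypothetical names, and label escaping is approximated with Go's %q verb.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is one time series: a label set and its current value.
type sample struct {
	labels map[string]string
	value  float64
}

// render writes a metric family as text: a # HELP line, a # TYPE line,
// then one name{label="v"} value line per series.
func render(name, help, typ string, samples []sample) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s %s\n", name, typ)
	for _, s := range samples {
		keys := make([]string, 0, len(s.labels))
		for k := range s.labels {
			keys = append(keys, k)
		}
		sort.Strings(keys) // map order is random; sort for stable output
		pairs := make([]string, 0, len(keys))
		for _, k := range keys {
			pairs = append(pairs, fmt.Sprintf("%s=%q", k, s.labels[k]))
		}
		if len(pairs) == 0 {
			fmt.Fprintf(&b, "%s %g\n", name, s.value)
			continue
		}
		fmt.Fprintf(&b, "%s{%s} %g\n", name, strings.Join(pairs, ","), s.value)
	}
	return b.String()
}

func main() {
	fmt.Print(render("http_requests_total", "Total HTTP requests.", "counter", []sample{
		{labels: map[string]string{"route": "/users/{id}", "method": "GET", "code": "200"}, value: 3},
		{labels: map[string]string{"route": "/users/{id}", "method": "GET", "code": "500"}, value: 1},
	}))
}
```

Each distinct label combination becomes its own line in the scrape body, which is the same thing as its own time series on the Prometheus side.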

🏋️ Exercises / Practice

Exercise Status Link
(concept) rebuild a counter registry + exposition format examples/month-05/metrics

🐛 Mistakes Made

  • Put the raw request path (with IDs) in a label → cardinality blew up. Switched to the route template.
  • Reached for a Summary for latency; learned histograms aggregate across replicas and Summaries don't.
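The first mistake is easy to quantify with a small stdlib sketch that counts how many distinct series a "route" label would create. `templatize` is a hypothetical and deliberately naive normalizer (all-digit segments become {id}); in a real service the router's matched route template would be the better source, as in the instrument(route, ...) snippet.

```go
package main

import (
	"fmt"
	"strings"
)

// templatize replaces every all-digit path segment with {id}.
// Naive on purpose: it knows nothing about UUIDs, slugs, or real routes.
func templatize(path string) string {
	segs := strings.Split(path, "/")
	for i, s := range segs {
		if s != "" && strings.Trim(s, "0123456789") == "" {
			segs[i] = "{id}"
		}
	}
	return strings.Join(segs, "/")
}

// seriesCount returns how many distinct label values (and therefore
// time series) result from mapping each request path through label.
func seriesCount(paths []string, label func(string) string) int {
	seen := make(map[string]struct{})
	for _, p := range paths {
		seen[label(p)] = struct{}{}
	}
	return len(seen)
}

func main() {
	var paths []string
	for id := 1; id <= 1000; id++ {
		paths = append(paths, fmt.Sprintf("/users/%d", id))
	}
	raw := seriesCount(paths, func(p string) string { return p })
	tmpl := seriesCount(paths, templatize)
	fmt.Println(raw, tmpl) // 1000 1
}
```

With the raw path the series count grows with the number of users; with the template it stays at one per route, and any other labels multiply whichever number you chose.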

❓ Open Questions

  • Native histograms (the newer sparse-bucket type) vs classic fixed buckets — when to switch?

🧠 Active Recall (answer without looking)

  1. Q: Why prefer a Histogram over a Summary for request latency in a replicated service?
     A: Histogram buckets are exposed raw and combined server-side with histogram_quantile(), so you can aggregate across all replicas. Summary quantiles are computed client-side per instance and cannot be averaged or merged meaningfully.
  2. Q: What's the danger of using a user ID as a metric label value?
     A: Cardinality explosion: each distinct label value is a separate time series, so unbounded values create millions of series and OOM the scraper/TSDB. Labels must have bounded, low-cardinality values.

🪶 Feynman Reflection

A counter only climbs (you ask Prometheus for its rate); a gauge is a dial that moves both ways; a histogram drops each measurement into a bucket so you can ask "what fraction was under 100ms?" later. Labels slice each metric into separate lines — powerful, but each new value is a new line, so keep them few and bounded.
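The "only climbs, ask for its rate" idea, including the restart case, can be sketched in a few lines of stdlib Go. This is a simplification under stated assumptions: `increase` and `perSecond` are hypothetical helpers, the scrape values are invented, and the real rate() also extrapolates to the edges of the query window, which this sketch ignores.

```go
package main

import "fmt"

// increase sums a counter's growth across consecutive scraped samples.
// A drop between two samples is treated as a process restart: the counter
// began again from zero, so the whole new value counts as growth.
func increase(samples []float64) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			delta = samples[i] // reset detected
		}
		total += delta
	}
	return total
}

// perSecond averages the increase over the window length in seconds.
func perSecond(samples []float64, windowSeconds float64) float64 {
	return increase(samples) / windowSeconds
}

func main() {
	// Five scrapes 15s apart; the process restarted between 160 and 10.
	scrapes := []float64{100, 130, 160, 10, 40}
	fmt.Println(increase(scrapes)) // 100
	fmt.Printf("%.2f req/s\n", perSecond(scrapes, 60))
}
```

A gauge gets no such treatment: a drop in a gauge is just a lower value, which is why totals belong in counters and current levels belong in gauges.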

🕳️ Knowledge Gaps

  • Exemplars (linking a histogram sample to a trace ID) — ties into Day 143 tracing.

✅ Summary

I can choose the right metric type, instrument the RED signals with bounded labels, register metrics once, and expose /metrics for scraping.

⏭️ Next Steps / Prep for Tomorrow

  • Day 143: propagate a trace across services with OpenTelemetry and the W3C traceparent header.

Time spent | Difficulty | Confidence
90 min | 🟦🟦⬜⬜⬜ | 🟦🟦🟦⬜⬜

Suggested commit: docs(journal): prometheus metrics and the RED method (day 142)