
Day 146 — Dashboards & Alerting Basics

Month 6 · Week 1

🎯 Learning Objective

Turn raw metrics into useful dashboards and actionable alerts: the golden signals, basic PromQL, SLO-based alerting, and how to avoid alert fatigue.

📚 Topics

  • Four golden signals; PromQL rate/histogram_quantile; recording rules
  • Alerting on symptoms vs. causes; for: duration; burn-rate alerts

📖 Reading / Sources

📝 Notes

  • Four golden signals to put on every service dashboard → [[golden-signals]]: Latency (track successful and failed requests separately, since fast errors can hide slow successes), Traffic (request rate), Errors (rate/ratio), Saturation (how full the service is: CPU, memory, queue depth). RED (Day 142) is the request-side subset.
  • Core PromQL:
      • rate(http_requests_total[5m]): per-second request rate over a 5m window (counters are queried as rates, never raw).
      • error ratio: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])).
      • p99 latency: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)). Note the by (le) and that you rate the _bucket series.
  • Recording rules precompute expensive expressions on a schedule so dashboards/alerts read a cheap pre-aggregated series.
  • Alert on symptoms, not causes. Page on "users see errors / latency is high" (user-visible), not "CPU is 90%" (a cause that may be harmless). Cause metrics belong on dashboards for debugging, not pagers.
  • for: requires the condition to stay true for the whole duration before the alert fires, which suppresses flapping. Severity routing: page (wake a human, user impact now) vs ticket (look at it tomorrow).
  • SLO / error-budget alerting: define an SLO (e.g. 99.9% success), then alert on burn rate, i.e. how fast you're consuming the error budget (a burn rate of 1 spends exactly the whole budget over the SLO period). Multi-window burn-rate alerts (a fast 1h window and a slow 6h window) catch both sudden outages and slow bleeds with few false pages.
  • Avoid alert fatigue: every alert must be actionable and have a runbook; delete alerts nobody acts on. A noisy pager trains people to ignore it.
  • Dashboard hygiene: top row = the golden signals; use template variables for instance/route; show error ratio, not just count; annotate deploys.
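To make the recording-rule note concrete, here is a minimal sketch of a rule group that precomputes the request rate, error ratio, and p99 from the PromQL above. The group name, the 30s interval, the `job` grouping label, and the recorded series names (following the common level:metric:operations naming style) are my assumptions, not something taken from a real setup.

```yaml
groups:
  - name: http_recording            # hypothetical group name
    interval: 30s                   # assumed evaluation interval
    rules:
      # Per-job request rate, reused by the rules and dashboards below.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Per-job 5xx error ratio over 5m.
      - record: job:http_requests_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))

      # Per-job p99 latency; le must survive the aggregation
      # so histogram_quantile can see the bucket boundaries.
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

Dashboards and alerts can then query job:http_requests_errors:ratio_rate5m directly instead of re-evaluating the full expression on every refresh.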

💻 Code Examples

Alerting is config (PromQL + YAML), not Go — a representative rule:

groups:
  - name: slo
    rules:
      # Symptom-based alert on a fixed threshold (not a burn-rate rule): 5xx error ratio > 5% sustained for 10m.
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels: { severity: page }
        annotations:
          summary: "5xx error ratio above 5% for 10m"
          runbook: "https://runbooks/internal/high-error-rate"
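
For contrast, a rough sketch of what a multi-window burn-rate version might look like for a 99.9% SLO (error budget 0.001). The burn-rate factors 14.4 and 6, and the short confirmation windows (5m and 30m), are the commonly cited values from the Google SRE Workbook, not something I have tuned myself. The alert names and runbook URLs are hypothetical.

```yaml
groups:
  - name: slo-burn-rate             # hypothetical group name
    rules:
      # Fast burn: 14.4x budget consumption seen over 1h AND still true over 5m.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
        labels: { severity: page }
        annotations:
          summary: "Burning the 99.9% SLO error budget at more than 14.4x (1h and 5m windows)"
          runbook: "https://runbooks/internal/error-budget-fast-burn"

      # Slow burn: 6x budget consumption seen over 6h AND still true over 30m.
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[6h]))
              / sum(rate(http_requests_total[6h])) > (6 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[30m]))
              / sum(rate(http_requests_total[30m])) > (6 * 0.001)
          )
        labels: { severity: page }
        annotations:
          summary: "Burning the 99.9% SLO error budget at more than 6x (6h and 30m windows)"
          runbook: "https://runbooks/internal/error-budget-slow-burn"
```

The long window shows that a meaningful share of the budget is really gone; the short window makes the alert clear quickly once the errors stop, which is why this sketch has no for: clause.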

🏋️ Exercises / Practice

Exercise | Status | Link
(none; config/PromQL day, reuse week exercises) | | exercises/month-06/week-1

🐛 Mistakes Made

  • Alerted on raw counter value instead of rate() → meaningless. Counters must be rated.
  • Paged on high CPU (a cause) → noisy, non-actionable. Switched to symptom-based error-ratio alerts.

❓ Open Questions

  • How to pick burn-rate windows/thresholds for a 99.9% vs 99.99% SLO without over- or under-paging?
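
Not an answer, but the arithmetic behind the question can be sketched as follows, assuming a 30-day (720h) SLO period and the commonly cited burn-rate factors 14.4 and 6 (both are assumptions, not values from this journal):

```latex
% Burn rate B: observed error ratio relative to the error budget.
B = \frac{\text{error ratio}}{1 - \text{SLO}}

% Fraction of the budget consumed by burning at rate B for a window w
% within an SLO period P:
\text{budget consumed} = B \cdot \frac{w}{P}

% 99.9% SLO (budget 0.001), P = 720\,\text{h}:
B = 14.4,\; w = 1\,\text{h}: \quad \text{threshold} = 14.4 \times 0.001 = 0.0144, \quad \text{consumed} = \tfrac{14.4}{720} = 2\%
B = 6,\; w = 6\,\text{h}: \quad \text{threshold} = 6 \times 0.001 = 0.006, \quad \text{consumed} = \tfrac{36}{720} = 5\%

% 99.99% SLO (budget 0.0001), same burn-rate factors:
\text{thresholds} = 14.4 \times 0.0001 = 0.00144 \quad \text{and} \quad 6 \times 0.0001 = 0.0006
```

With the same factors, the share of budget consumed before paging is unchanged; only the absolute error-ratio thresholds shrink tenfold. That suggests the 99.99% case is more sensitive to a handful of failures at low traffic, which is probably where the over-paging risk comes from.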

🧠 Active Recall (answer without looking)

  1. Q: Why alert on symptoms (errors/latency) rather than causes (CPU)?
     A: Symptoms map to user impact and are almost always worth acting on; causes are often harmless (a CPU running near full utilization can be fine if requests are still served on time) and create non-actionable, fatigue-inducing pages. Causes belong on dashboards for diagnosis.
  2. Q: Write p99 latency from a histogram in PromQL (sketch).
     A: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)): rate the _bucket series, sum by (le), then apply histogram_quantile.

🪶 Feynman Reflection

A dashboard answers "is the service healthy right now?" with the four golden signals; an alert answers "should a human wake up?" Good alerts fire on what users feel (errors, slowness) and burn-rate against a promise (the SLO), not on internal numbers that wobble harmlessly.

🕳️ Knowledge Gaps

  • Hands-on multi-window burn-rate rules and Alertmanager routing/inhibition.
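
As a starting point for that hands-on work, a minimal Alertmanager sketch of the page-vs-ticket routing from the notes, plus one inhibition rule. The receiver names, the webhook URL, the placeholder routing key, and the `service` label are all hypothetical; only the severity values come from the notes above.

```yaml
route:
  receiver: ticket-queue            # default: anything not matched below becomes a ticket
  group_by: [alertname]
  routes:
    - matchers:
        - 'severity="page"'
      receiver: oncall-pager        # user impact now: wake a human

inhibit_rules:
  # While a page is firing for a service, mute ticket-level alerts
  # for the same service (assumes alerts carry a `service` label).
  - source_matchers:
      - 'severity="page"'
    target_matchers:
      - 'severity="ticket"'
    equal: [service]

receivers:
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: "<pagerduty-routing-key>"    # placeholder, not a real key
  - name: ticket-queue
    webhook_configs:
      - url: "http://ticketing.example.internal/hook"   # hypothetical endpoint
```

Routing on the severity label keeps the page/ticket decision in the alert rule itself, so Alertmanager only has to map labels to receivers.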

✅ Summary

I can map metrics to the golden signals, write basic PromQL (rate, error ratio, histogram quantiles), and design symptom-/SLO-based alerts that stay actionable.

⏭️ Next Steps / Prep for Tomorrow

  • Day 147: week review + spaced-repetition recall across the observability stack.

Time spent | Difficulty | Confidence
90 min | 🟦🟦⬜⬜⬜ | 🟦🟦🟦⬜⬜

Suggested commit: docs(journal): dashboards and SLO alerting basics (day 146)