Day 146: Dashboards & Alerting Basics
Month 6 · Week 1
🎯 Learning Objective
Turn raw metrics into useful dashboards and actionable alerts: the golden signals, basic PromQL, SLO-based alerting, and how to avoid alert fatigue.
📚 Topics
- Four golden signals; PromQL `rate`/`histogram_quantile`; recording rules
- Alerting on symptoms vs. causes; `for:` duration; burn-rate alerts
📖 Reading / Sources
- Google SRE — Monitoring distributed systems (four golden signals)
- PromQL basics
- Prometheus alerting rules
- Google SRE Workbook — Alerting on SLOs (burn rate)
📝 Notes
- Four golden signals to put on every service dashboard → [[golden-signals]]: Latency (and separate success vs error latency), Traffic (rate), Errors (rate/ratio), Saturation (how full — CPU, memory, queue depth). RED (Day 142) is the request-side subset.
- Core PromQL:
  - `rate(http_requests_total[5m])`: per-second request rate over a 5m window (counters are queried as rates, never raw).
  - Error ratio: `sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`.
  - p99 latency: `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))`. Note the `by (le)` and that you rate the `_bucket` series.
- Recording rules precompute expensive expressions on a schedule so dashboards/alerts read a cheap pre-aggregated series.
- Alert on symptoms, not causes. Page on "users see errors / latency is high" (user-visible), not "CPU is 90%" (a cause that may be harmless). Cause metrics belong on dashboards for debugging, not pagers.
- `for:` requires the condition to stay true for a duration before the alert fires, which suppresses flapping. Severity routing: `page` (wake a human, user impact now) vs `ticket` (look at it tomorrow).
- SLO / error-budget alerting: define an SLO (e.g. 99.9% success) and alert on burn rate, i.e. how fast you're consuming the error budget. Multi-window burn-rate alerts (fast 1h and slow 6h windows) catch both sudden outages and slow bleeds with few false pages.
- Avoid alert fatigue: every alert must be actionable and have a runbook; delete alerts nobody acts on. A noisy pager trains people to ignore it.
- Dashboard hygiene: top row = the golden signals; use template variables for `instance`/`route`; show error ratio, not just count; annotate deploys.
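
The recording-rules note could be sketched as a Prometheus rule file like the one below. It precomputes the error ratio and p99 expressions from these notes per `job`. The group name, the 30s interval, and the recorded series names are illustrative assumptions (the names follow the `level:metric:operations` convention from the Prometheus docs), not something taken from a real setup.

```yaml
groups:
  - name: http-recording        # hypothetical group name
    interval: 30s               # assumed evaluation interval
    rules:
      # Per-job 5xx error ratio over 5m, so dashboards and alerts read one cheap series.
      - record: job:http_requests_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
      # Per-job p99 latency; `le` must survive the aggregation for histogram_quantile.
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

A dashboard panel or alert would then query `job:http_requests_errors:ratio_rate5m` directly instead of re-evaluating the full expression on every refresh.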
💻 Code Examples
Alerting is config (PromQL + YAML), not Go. A representative rule:

```yaml
groups:
  - name: slo
    rules:
      # Symptom-based alert: 5xx error ratio > 5%, sustained for 10m.
      # (A static threshold, not yet a burn-rate rule.)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels: { severity: page }
        annotations:
          summary: "5xx error ratio above 5% for 10m"
          runbook: "https://runbooks/internal/high-error-rate"
```
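
The rule above compares the error ratio to a fixed 5% threshold. A multi-window burn-rate variant, as described in the notes, might look like the sketch below. It assumes a 99.9% SLO (error budget 0.001) and uses the burn-rate factors 14.4 and 6 that the SRE Workbook suggests for paging over a 30-day SLO period. The alert names are hypothetical, and the short 5m/30m companion windows follow the Workbook's pattern rather than anything stated in these notes.

```yaml
groups:
  - name: slo-burn-rate         # hypothetical group name
    rules:
      # Fast burn: long (1h) and short (5m) windows must both exceed 14.4x budget.
      - alert: ErrorBudgetBurnFast
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          and
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
        labels: { severity: page }
        annotations:
          summary: "Error budget burning at more than 14.4x (1h and 5m windows)"
      # Slow burn: 6h and 30m windows must both exceed 6x budget.
      - alert: ErrorBudgetBurnSlow
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h])) > (6 * 0.001)
          and
          sum(rate(http_requests_total{code=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m])) > (6 * 0.001)
        labels: { severity: page }
        annotations:
          summary: "Error budget burning at more than 6x (6h and 30m windows)"
```

The short window is what lets the alert clear soon after the error ratio recovers, so this sketch leaves out `for:`. Whether that trade-off suits a given service is a judgment call.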
🏋️ Exercises / Practice
| Exercise | Status | Link |
|---|---|---|
| (none — config/PromQL day; reuse week exercises) | — | exercises/month-06/week-1 |
🐛 Mistakes Made
- Alerted on a raw counter value instead of `rate()` → meaningless. Counters must be rated.
- Paged on high CPU (a cause) → noisy, non-actionable. Switched to symptom-based error-ratio alerts.
❓ Open Questions
- How to pick burn-rate windows/thresholds for a 99.9% vs 99.99% SLO without over- or under-paging?
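
One possible starting point, following the approach in the SRE Workbook chapter listed under Reading: fix the share $b$ of the error budget that a window of length $w$ may consume before paging, derive the burn-rate factor from the SLO period $T$, then scale by the error budget $1 - \text{SLO}$ to get an error-ratio threshold. The 30-day period and the "2% of budget in 1h" choice are the Workbook's example values, assumed here rather than tuned for any particular service.

$$
\text{burn rate} = \frac{b \cdot T}{w}, \qquad \text{threshold} = \text{burn rate} \times (1 - \text{SLO})
$$

$$
b = 0.02,\quad T = 720\,\text{h},\quad w = 1\,\text{h} \;\Rightarrow\; \text{burn rate} = \frac{0.02 \times 720}{1} = 14.4
$$

$$
\text{SLO} = 99.9\%:\ 14.4 \times 0.001 = 0.0144 \qquad \text{SLO} = 99.99\%:\ 14.4 \times 0.0001 = 0.00144
$$

The burn-rate factor is the same for both SLOs; only the error-ratio threshold shrinks (1.44% vs 0.144%). This suggests a 99.99% service needs enough traffic in the short window for such a small ratio to be measurable at all.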
🧠 Active Recall (answer without looking)
1. Q: Why alert on symptoms (errors/latency) rather than causes (CPU)?
   A: Symptoms map to user impact, so they are worth acting on; causes are often harmless (a host running at high CPU can be perfectly healthy) and create non-actionable, fatigue-inducing pages. Causes belong on dashboards for diagnosis.
2. Q: Write p99 latency from a histogram in PromQL (sketch).
   A: `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))`. Rate the `_bucket` series, sum `by (le)`, then apply `histogram_quantile`.
🪶 Feynman Reflection
A dashboard answers "is the service healthy right now?" with the four golden signals; an alert answers "should a human wake up?" Good alerts fire on what users feel (errors, slowness) and burn-rate against a promise (the SLO), not on internal numbers that wobble harmlessly.
🕳️ Knowledge Gaps
- Hands-on multi-window burn-rate rules and Alertmanager routing/inhibition.
✅ Summary
I can map metrics to the golden signals, write basic PromQL (rate, error ratio, histogram quantiles), and design symptom-/SLO-based alerts that stay actionable.
⏭️ Next Steps / Prep for Tomorrow
- Day 147: week review + spaced-repetition recall across the observability stack.
| Time spent | Difficulty | Confidence |
|---|---|---|
| 90 min | 🟦🟦⬜⬜⬜ | 🟦🟦🟦⬜⬜ |
Suggested commit: docs(journal): dashboards and SLO alerting basics (day 146)