ADR-0004: Observability triad + persistence/caching stack
- Status: Accepted
- Date: 2026-06-26
- Deciders: Backend team
Context
A multi-tenant backend needs to answer, for any reported issue, what happened, how often, and where the time went — and to do so per request and per tenant. It also needs a persistence and caching stack that is idiomatic, fast on the tenant-scoped access paths, and safe to evolve.
Decision
Observability — the three signals, correlated by request_id/trace_id:
- Logs: `log/slog` structured JSON (text in dev). One line per request with method, route, status, duration, `request_id`, and (when authenticated) `user_id`/`tenant_id`/`role`.
- Metrics: `prometheus/client_golang` RED metrics (`requests_total`, `request_duration_seconds`, `in_flight_requests`) labelled by the low-cardinality chi route template (never the raw path), plus Go/process collectors, on a private registry served at `/metrics`.
- Traces: OpenTelemetry with an OTLP/gRPC exporter to Jaeger; the whole router is wrapped with `otelhttp` so every request is a server span and context propagates inward. Tracing is a no-op when no endpoint is configured.
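The "one line per request" log described above can be sketched with the standard library alone. This is a minimal illustration, not the service's actual middleware: the `X-Request-ID` header name, the `RequestLogger` and `captureRequestLog` names, and logging the raw path in place of the chi route template are all assumptions made to keep the example self-contained.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// RequestLogger emits exactly one structured line per request.
// Assumption: the request ID arrives in the X-Request-ID header.
func RequestLogger(logger *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		logger.LogAttrs(r.Context(), slog.LevelInfo, "request",
			slog.String("method", r.Method),
			// Placeholder: the ADR logs the route template, which needs the router.
			slog.String("path", r.URL.Path),
			slog.Int("status", rec.status),
			slog.Duration("duration", time.Since(start)),
			slog.String("request_id", r.Header.Get("X-Request-ID")),
		)
	})
}

// captureRequestLog runs one request through the middleware and returns
// the decoded JSON log line, so the shape of the output can be inspected.
func captureRequestLog(method, target, requestID string) map[string]any {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	h := RequestLogger(logger, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusCreated)
	}))
	req := httptest.NewRequest(method, target, nil)
	req.Header.Set("X-Request-ID", requestID)
	h.ServeHTTP(httptest.NewRecorder(), req)

	var line map[string]any
	if err := json.Unmarshal(buf.Bytes(), &line); err != nil {
		panic(err)
	}
	return line
}

func main() {
	line := captureRequestLog("GET", "/projects", "req-123")
	fmt.Println(line["method"], line["status"], line["request_id"])
}
```

Because the same `request_id` value would also be attached to the trace and carried alongside the metrics, a single log line is enough to start the pivot the ADR describes.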
Persistence & caching:
- pgx/pgxpool as the driver/pool (no ORM): explicit, parameterized SQL; fast; first-class `context` and `pgconn.PgError` handling (`23505` → `ErrConflict`). A `TxManager` puts the active `pgx.Tx` in the context so use-cases compose multiple repository calls atomically (e.g. signup).
- golang-migrate SQL migrations (`*.up.sql`/`*.down.sql`), run as a separate one-shot step/init container, never inside the request path, to enable expand/contract, zero-downtime changes.
- Keyset (cursor) pagination on `(created_at DESC, id DESC)` with a matching composite index per tenant access path: results stay stable under concurrent inserts, and each page costs one index seek plus the page size regardless of depth (unlike `OFFSET`, which scans every skipped row).
- Redis cache-aside for the hot default listings, invalidated on write (a write deletes the tenant/project list key). Only the hot, unfiltered first page is cached, keeping invalidation exact and correct.
- Redis token-bucket rate limiter (atomic Lua) shared across replicas, with an in-memory fallback for local/dev.
- Async side-effects go through an `EventPublisher` port (today a logging publisher; tomorrow a Redis/asynq queue + worker), keeping slow work off the request path.
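To make the keyset pagination item concrete, the following sketch shows one way to encode the `(created_at, id)` position of the last row as an opaque cursor, together with the query shape it would feed. The `projects` table, its `name` column, the string ID type, and the `Cursor`/`EncodeCursor`/`DecodeCursor` names are assumptions for illustration; the ADR fixes only the sort key and the per-tenant composite index.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// listProjectsSQL is the follow-up page query. The row-value comparison
// matches the ORDER BY, so a composite index such as
// (tenant_id, created_at DESC, id DESC) can serve it with a single seek.
// The first page omits the (created_at, id) predicate.
// Table and column names are hypothetical.
const listProjectsSQL = `
SELECT id, name, created_at
FROM projects
WHERE tenant_id = $1
  AND (created_at, id) < ($2, $3)
ORDER BY created_at DESC, id DESC
LIMIT $4`

// Cursor identifies the last row of the previous page.
type Cursor struct {
	CreatedAtMicros int64  `json:"c"` // microseconds since the Unix epoch
	ID              string `json:"i"`
}

// EncodeCursor turns a position into an opaque, URL-safe token.
func EncodeCursor(c Cursor) string {
	raw, _ := json.Marshal(c) // a struct of int64 and string always marshals
	return base64.RawURLEncoding.EncodeToString(raw)
}

// DecodeCursor validates and parses a token supplied by the client.
func DecodeCursor(s string) (Cursor, error) {
	raw, err := base64.RawURLEncoding.DecodeString(s)
	if err != nil {
		return Cursor{}, fmt.Errorf("invalid cursor encoding: %w", err)
	}
	var c Cursor
	if err := json.Unmarshal(raw, &c); err != nil {
		return Cursor{}, fmt.Errorf("invalid cursor payload: %w", err)
	}
	return c, nil
}

func main() {
	last := Cursor{CreatedAtMicros: 1782432000000000, ID: "p_123"}
	token := EncodeCursor(last)
	next, err := DecodeCursor(token)
	fmt.Println(token, next, err)
	fmt.Println(listProjectsSQL)
}
```

Including `id` as a tiebreaker is what keeps pages stable when several rows share the same `created_at`.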
Consequences
Positive
- Given only a `request_id`, an operator can pivot from a log line to its trace and see which span (middleware, use-case, repo, DB) spent the time, while metrics show whether it is systemic.
- pgx + explicit SQL keeps queries reviewable and the tenant predicate visible; keyset pagination and per-tenant composite indexes keep read latency low.
- Cache-aside with precise invalidation avoids stale reads; the limiter is correct across horizontally-scaled instances.
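The cache-aside behaviour with exact invalidation can be illustrated as follows. This is a sketch under stated assumptions: an in-memory map stands in for Redis, a slice-backed store stands in for Postgres, and the key format and all type names (`Cache`, `ProjectStore`, `ProjectService`, `NewDemoService`) are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Cache is the minimal surface the sketch needs; a Redis client would sit behind it.
type Cache interface {
	Get(key string) ([]string, bool)
	Set(key string, val []string)
	Del(key string)
}

type memCache struct {
	mu sync.Mutex
	m  map[string][]string
}

func (c *memCache) Get(k string) ([]string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.m[k]
	return v, ok
}

func (c *memCache) Set(k string, v []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[k] = v
}

func (c *memCache) Del(k string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.m, k)
}

// ProjectStore stands in for the Postgres repository and counts reads.
type ProjectStore struct {
	mu       sync.Mutex
	byTenant map[string][]string
	Reads    int
}

func (s *ProjectStore) List(tenantID string) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.Reads++
	return append([]string(nil), s.byTenant[tenantID]...) // copy to avoid aliasing
}

func (s *ProjectStore) Insert(tenantID, name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.byTenant[tenantID] = append(s.byTenant[tenantID], name)
}

type ProjectService struct {
	cache Cache
	store *ProjectStore
}

// listKey names the single cached entry per tenant: the unfiltered first page.
func listKey(tenantID string) string { return "tenant:" + tenantID + ":projects:first-page" }

// ListFirstPage reads through the cache and fills it on a miss.
func (s *ProjectService) ListFirstPage(tenantID string) []string {
	key := listKey(tenantID)
	if v, ok := s.cache.Get(key); ok {
		return v
	}
	v := s.store.List(tenantID)
	s.cache.Set(key, v)
	return v
}

// Create writes to the store, then deletes the one key the write can affect.
func (s *ProjectService) Create(tenantID, name string) {
	s.store.Insert(tenantID, name)
	s.cache.Del(listKey(tenantID))
}

func NewDemoService() *ProjectService {
	return &ProjectService{
		cache: &memCache{m: map[string][]string{}},
		store: &ProjectStore{byTenant: map[string][]string{}},
	}
}

func main() {
	svc := NewDemoService()
	svc.Create("t1", "alpha")
	fmt.Println(svc.ListFirstPage("t1"), svc.ListFirstPage("t1"), svc.store.Reads)
	svc.Create("t1", "beta")
	fmt.Println(svc.ListFirstPage("t1"), svc.store.Reads)
}
```

Caching only one key per tenant is what makes the invalidation exact: a write never has to guess which filtered or paged variants might be affected.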
Negative / costs
- Three signals to operate (log pipeline, Prometheus, Jaeger) and a Redis dependency. Mitigated by graceful degradation: tracing and cache are optional; the app runs (slower) without Redis.
- Cache-aside only accelerates the hot path; filtered/paged queries hit Postgres directly (an accepted, measured trade-off).
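One of the degradation paths, the limiter's in-memory fallback for local/dev, might look like the token bucket below. It is a single-process sketch only: the production limiter described in the Decision runs as an atomic Lua script in Redis, and the `Limiter`, `NewLimiter`, and `fakeClock` names plus the capacity and refill parameters are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

type bucket struct {
	tokens float64
	last   time.Time
}

// Limiter is a per-key token bucket held in process memory.
// It is not shared across replicas, which is why it suits only local/dev.
type Limiter struct {
	mu           sync.Mutex
	capacity     float64
	refillPerSec float64
	now          func() time.Time // injected so tests can control time
	buckets      map[string]*bucket
}

func NewLimiter(capacity, refillPerSec float64, now func() time.Time) *Limiter {
	return &Limiter{
		capacity:     capacity,
		refillPerSec: refillPerSec,
		now:          now,
		buckets:      map[string]*bucket{},
	}
}

// Allow refills the bucket for the elapsed time, then spends one token if available.
func (l *Limiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := l.now()
	b, ok := l.buckets[key]
	if !ok {
		b = &bucket{tokens: l.capacity, last: now}
		l.buckets[key] = b
	}
	elapsed := now.Sub(b.last).Seconds()
	b.tokens = math.Min(l.capacity, b.tokens+elapsed*l.refillPerSec)
	b.last = now

	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// fakeClock is a manually advanced clock for deterministic checks.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time       { return c.t }
func (c *fakeClock) AdvanceSeconds(n int) { c.t = c.t.Add(time.Duration(n) * time.Second) }

func main() {
	lim := NewLimiter(5, 1, time.Now)
	for i := 0; i < 7; i++ {
		fmt.Println(i, lim.Allow("tenant-1"))
	}
}
```

The mutex plays the role that the Lua script plays in Redis: the refill and the spend happen as one indivisible step.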
Neutral
- The `EventPublisher` seam means moving email/webhook/notification work to a dedicated worker later is an adapter swap, not a refactor.
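The adapter swap can be illustrated with a small port-and-adapter sketch. Only the `EventPublisher` name and the logging publisher come from this ADR; the `Event` fields, the `Publish` signature, the `user.signed_up` event name, and the `RecordingPublisher` adapter are assumptions chosen for the example.

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
	"os"
)

// Event is a hypothetical envelope for an async side-effect.
type Event struct {
	Name     string
	TenantID string
	Payload  map[string]any
}

// EventPublisher is the port that use-cases depend on.
type EventPublisher interface {
	Publish(ctx context.Context, e Event) error
}

// LoggingPublisher is today's adapter: it only records that the event happened.
type LoggingPublisher struct{ Logger *slog.Logger }

func (p LoggingPublisher) Publish(ctx context.Context, e Event) error {
	p.Logger.InfoContext(ctx, "event published", "event", e.Name, "tenant_id", e.TenantID)
	return nil
}

// RecordingPublisher is a second adapter, standing in for a future queue-backed one.
type RecordingPublisher struct{ Events []Event }

func (p *RecordingPublisher) Publish(_ context.Context, e Event) error {
	p.Events = append(p.Events, e)
	return nil
}

// Signup depends only on the port, so swapping adapters leaves it untouched.
func Signup(ctx context.Context, pub EventPublisher, tenantID, email string) error {
	// Persisting the user inside a transaction is omitted from this sketch.
	return pub.Publish(ctx, Event{
		Name:     "user.signed_up",
		TenantID: tenantID,
		Payload:  map[string]any{"email": email},
	})
}

// runSignupDemo runs the use-case against the recording adapter.
func runSignupDemo() []Event {
	rec := &RecordingPublisher{}
	if err := Signup(context.Background(), rec, "tenant-1", "a@example.com"); err != nil {
		panic(err)
	}
	return rec.Events
}

func main() {
	pub := LoggingPublisher{Logger: slog.New(slog.NewJSONHandler(os.Stdout, nil))}
	_ = Signup(context.Background(), pub, "tenant-1", "a@example.com")
	fmt.Println(len(runSignupDemo()))
}
```

A queue-backed adapter would implement the same one-method interface, so the composition root is the only place that changes.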