ADR-0004: Observability triad + persistence/caching stack

  • Status: Accepted
  • Date: 2026-06-26
  • Deciders: Backend team

Context

A multi-tenant backend needs to answer, for any reported issue, what happened, how often, and where the time went — and to do so per request and per tenant. It also needs a persistence and caching stack that is idiomatic, fast on the tenant-scoped access paths, and safe to evolve.

Decision

Observability — the three signals, correlated by request_id/trace_id:

  • Logs: log/slog structured JSON (text in dev). One line per request with method, route, status, duration, request_id, and (when authenticated) user_id/tenant_id/role.
  • Metrics: prometheus/client_golang RED metrics (requests_total, request_duration_seconds, in_flight_requests) labelled by the low-cardinality chi route template (never the raw path), plus Go/process collectors, on a private registry served at /metrics.
  • Traces: OpenTelemetry with an OTLP/gRPC exporter to Jaeger; the whole router is wrapped with otelhttp so every request is a server span and context propagates inward. Tracing is a no-op when no endpoint is configured.
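As an illustration of the one-line-per-request log described above, here is a minimal sketch using only the standard library. The header name X-Request-Id, the context key, and the helper names (RequestLogger, RequestIDFrom, demo) are assumptions for this example, not names taken from the codebase. The sketch logs the raw path because the route template lookup depends on the router in use.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"os"
	"time"
)

// requestIDKey is a private context key (hypothetical name).
type requestIDKey struct{}

// RequestIDFrom returns the request ID stored by RequestLogger, if any.
func RequestIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(requestIDKey{}).(string)
	return id
}

// statusRecorder remembers the status code the handler wrote.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// RequestLogger logs one structured line per request and puts the
// request ID in the context so inner layers can attach it to their logs.
func RequestLogger(logger *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		id := r.Header.Get("X-Request-Id") // assumed header name
		ctx := context.WithValue(r.Context(), requestIDKey{}, id)
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

		next.ServeHTTP(rec, r.WithContext(ctx))

		logger.LogAttrs(ctx, slog.LevelInfo, "request",
			slog.String("method", r.Method),
			slog.String("path", r.URL.Path), // a real service would log the route template
			slog.Int("status", rec.status),
			slog.Duration("duration", time.Since(start)),
			slog.String("request_id", id),
		)
	})
}

// demo sends one request through the middleware and returns the decoded log line.
func demo() (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	h := RequestLogger(logger, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated)
	}))
	req := httptest.NewRequest(http.MethodPost, "/v1/projects", nil)
	req.Header.Set("X-Request-Id", "req-123")
	h.ServeHTTP(httptest.NewRecorder(), req)

	var line map[string]any
	err := json.Unmarshal(buf.Bytes(), &line)
	return line, err
}

func main() {
	line, err := demo()
	if err != nil {
		panic(err)
	}
	_ = json.NewEncoder(os.Stdout).Encode(line)
}
```

Swapping slog.NewJSONHandler for slog.NewTextHandler gives the text output mentioned for dev without touching the middleware.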

Persistence & caching:

  • pgx/pgxpool as the driver/pool (no ORM): explicit, parameterized SQL; fast; first-class context and pgconn.PgError handling (SQLSTATE 23505, unique_violation, maps to ErrConflict). A TxManager puts the active pgx.Tx in the context so use-cases compose multiple repository calls atomically (e.g. signup).
  • golang-migrate SQL migrations (*.up.sql/*.down.sql), run as a separate one-shot step/init container — never inside the request path — to enable expand/contract, zero-downtime changes.
  • Keyset (cursor) pagination on (created_at DESC, id DESC) with a matching composite index per tenant access path: stable under concurrent inserts, and the cost of fetching a page does not grow with depth (one index seek plus one page of rows, unlike OFFSET).
  • Redis cache-aside for the hot default listings, invalidated on write (a write deletes the tenant/project list key). Only the hot, unfiltered first page is cached, so a single key delete invalidates everything cached for that listing.
  • Redis token-bucket rate limiter (atomic Lua) shared across replicas, with an in-memory fallback for local/dev.
  • Async side-effects go through an EventPublisher port (today a logging publisher; tomorrow a Redis/asynq queue + worker), keeping slow work off the request path.
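The keyset pagination bullet above can be sketched as an opaque cursor plus a row-comparison query. This is a minimal sketch under stated assumptions: the projects table, its columns, the index name, and the Cursor/Encode/Decode names are hypothetical, chosen only to show the shape of the technique.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"time"
)

// Cursor identifies the last row of the previous page.
type Cursor struct {
	CreatedAt time.Time `json:"created_at"`
	ID        string    `json:"id"`
}

// Hypothetical schema. A matching index would be:
//   CREATE INDEX projects_tenant_keyset
//     ON projects (tenant_id, created_at DESC, id DESC);
const firstPageQuery = `
SELECT id, name, created_at
FROM projects
WHERE tenant_id = $1
ORDER BY created_at DESC, id DESC
LIMIT $2`

// The row comparison seeks directly to the position after the cursor,
// so page 1000 costs the same as page 2.
const nextPageQuery = `
SELECT id, name, created_at
FROM projects
WHERE tenant_id = $1
  AND (created_at, id) < ($2, $3)
ORDER BY created_at DESC, id DESC
LIMIT $4`

// Encode turns a cursor into an opaque, URL-safe token.
func Encode(c Cursor) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", fmt.Errorf("marshal cursor: %w", err)
	}
	return base64.RawURLEncoding.EncodeToString(b), nil
}

// Decode parses a token produced by Encode and rejects anything else.
func Decode(token string) (Cursor, error) {
	var c Cursor
	b, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return c, fmt.Errorf("decode cursor: %w", err)
	}
	if err := json.Unmarshal(b, &c); err != nil {
		return c, fmt.Errorf("parse cursor: %w", err)
	}
	return c, nil
}

// sampleCursor returns a fixed cursor for demonstration.
func sampleCursor() Cursor {
	return Cursor{CreatedAt: time.Date(2026, 6, 26, 12, 0, 0, 0, time.UTC), ID: "prj_42"}
}

func main() {
	token, err := Encode(sampleCursor())
	if err != nil {
		panic(err)
	}
	c, err := Decode(token)
	if err != nil {
		panic(err)
	}
	// The decoded values become $2 and $3 of nextPageQuery.
	fmt.Println(token)
	fmt.Println(c.ID, c.CreatedAt.Format(time.RFC3339))
}
```

Including id in both the ordering and the cursor breaks ties between rows with the same created_at, which is what keeps pages from skipping or repeating rows.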

Consequences

Positive

  • Given only a request_id, an operator can pivot from a log line to its trace and see which span (middleware, use-case, repo, DB) spent the time, while metrics show whether it is systemic.
  • pgx + explicit SQL keeps queries reviewable and the tenant predicate visible; keyset pagination and per-tenant composite indexes keep read latency low.
  • Cache-aside with precise invalidation avoids stale reads; because limiter state lives in Redis and is updated atomically in Lua, the limit holds across horizontally scaled instances.
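For reference, the token-bucket algorithm behind the limiter can be sketched in memory, roughly the shape a local/dev fallback might take. This is not the Redis/Lua implementation: the Limiter, NewLimiter, and fakeClock names are hypothetical, and the injected clock exists only to make the behaviour deterministic.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

// bucket holds the token count and the time it was last refilled.
type bucket struct {
	tokens float64
	last   time.Time
}

// Limiter is an in-memory token bucket keyed by an arbitrary string
// (for example a tenant ID). It is per-process, so it does not share
// state across replicas the way a Redis-backed limiter does.
type Limiter struct {
	mu       sync.Mutex
	rate     float64 // tokens added per second
	capacity float64 // maximum burst size
	now      func() time.Time
	buckets  map[string]*bucket
}

func NewLimiter(rate, capacity float64, now func() time.Time) *Limiter {
	return &Limiter{rate: rate, capacity: capacity, now: now, buckets: map[string]*bucket{}}
}

// Allow refills the bucket for key based on elapsed time, then tries
// to take one token. The mutex makes refill-and-take a single step.
func (l *Limiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	t := l.now()
	b, ok := l.buckets[key]
	if !ok {
		b = &bucket{tokens: l.capacity, last: t}
		l.buckets[key] = b
	}
	elapsed := t.Sub(b.last).Seconds()
	b.tokens = math.Min(l.capacity, b.tokens+elapsed*l.rate)
	b.last = t

	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// fakeClock is a manually advanced clock for demonstration.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time        { return c.t }
func (c *fakeClock) AdvanceSeconds(n int)  { c.t = c.t.Add(time.Duration(n) * time.Second) }

func main() {
	clk := &fakeClock{}
	l := NewLimiter(1, 2, clk.Now) // 1 token/s, burst of 2

	fmt.Println(l.Allow("tenant-a"), l.Allow("tenant-a"), l.Allow("tenant-a"))
	clk.AdvanceSeconds(1)
	fmt.Println(l.Allow("tenant-a"))
}
```

In production code the clock would be time.Now; the refill-then-take step is the part that has to be atomic, which is the role the Lua script plays on the Redis side.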

Negative / costs

  • Three signals to operate (log pipeline, Prometheus, Jaeger) and a Redis dependency. Mitigated by graceful degradation: tracing and cache are optional; the app runs (slower) without Redis.
  • Cache-aside only accelerates the hot path; filtered/paged queries hit Postgres directly (an accepted, measured trade-off).
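The trade-off above (cache the unfiltered first page, send everything else to Postgres) can be sketched as follows. All names here (Query, Service, memCache, fakeDB, the key format) are hypothetical, and the in-memory cache and fake database stand in for Redis and Postgres. Context and error handling are omitted to keep the sketch short.

```go
package main

import (
	"fmt"
	"sync"
)

// Query describes a listing request.
type Query struct {
	TenantID string
	Filter   string // empty means the default listing
	Cursor   string // empty means the first page
}

// cacheable reports whether q is the hot, unfiltered first page.
func (q Query) cacheable() bool { return q.Filter == "" && q.Cursor == "" }

// memCache stands in for Redis in this sketch.
type memCache struct {
	mu sync.Mutex
	m  map[string][]string
}

func newMemCache() *memCache { return &memCache{m: map[string][]string{}} }

func (c *memCache) Get(k string) ([]string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.m[k]
	return v, ok
}

func (c *memCache) Set(k string, v []string) { c.mu.Lock(); c.m[k] = v; c.mu.Unlock() }
func (c *memCache) Del(k string)             { c.mu.Lock(); delete(c.m, k); c.mu.Unlock() }

// fakeDB stands in for Postgres and counts reads.
type fakeDB struct {
	rows  map[string][]string
	Reads int
}

func newFakeDB() *fakeDB { return &fakeDB{rows: map[string][]string{}} }

func (d *fakeDB) Insert(tenantID, name string) {
	d.rows[tenantID] = append(d.rows[tenantID], name)
}

func (d *fakeDB) List(q Query) []string {
	d.Reads++
	var out []string
	for _, n := range d.rows[q.TenantID] {
		if q.Filter == "" || n == q.Filter {
			out = append(out, n)
		}
	}
	return out
}

type Service struct {
	cache *memCache
	db    *fakeDB
}

func NewService(c *memCache, d *fakeDB) *Service { return &Service{cache: c, db: d} }

func listKey(tenantID string) string { return "projects:list:" + tenantID }

// List serves the default first page from cache and everything else from the DB.
func (s *Service) List(q Query) []string {
	if !q.cacheable() {
		return s.db.List(q)
	}
	key := listKey(q.TenantID)
	if v, ok := s.cache.Get(key); ok {
		return v
	}
	v := s.db.List(q)
	s.cache.Set(key, v)
	return v
}

// Create writes to the DB, then deletes the one key that could be stale.
func (s *Service) Create(tenantID, name string) {
	s.db.Insert(tenantID, name)
	s.cache.Del(listKey(tenantID))
}

func main() {
	db := newFakeDB()
	svc := NewService(newMemCache(), db)
	svc.Create("t1", "alpha")
	svc.List(Query{TenantID: "t1"})
	svc.List(Query{TenantID: "t1"}) // served from cache
	svc.List(Query{TenantID: "t1", Filter: "alpha"}) // bypasses cache
	fmt.Println("db reads:", db.Reads)
}
```

Because only one key exists per tenant listing, a write never has to guess which filtered or paged variants might be cached.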

Neutral

  • The EventPublisher seam means moving email/webhook/notification work to a dedicated worker later is an adapter swap, not a refactor.
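To make the "adapter swap" concrete, here is a minimal sketch of what such a port and two adapters could look like. The Event fields, the Publish signature, SignupService, and MemoryPublisher are assumptions for illustration; the ADR names only the EventPublisher port and a logging publisher. MemoryPublisher stands in for a future queue-backed adapter.

```go
package main

import (
	"context"
	"log/slog"
	"os"
	"sync"
)

// Event is a hypothetical envelope for async side-effects.
type Event struct {
	Type     string
	TenantID string
	Payload  map[string]string
}

// EventPublisher is the port that use-cases depend on.
type EventPublisher interface {
	Publish(ctx context.Context, e Event) error
}

// LogPublisher is an adapter that only logs the event.
type LogPublisher struct{ Log *slog.Logger }

func (p LogPublisher) Publish(ctx context.Context, e Event) error {
	p.Log.InfoContext(ctx, "event published",
		slog.String("type", e.Type),
		slog.String("tenant_id", e.TenantID),
	)
	return nil
}

// MemoryPublisher is a second adapter that collects events in memory.
// A queue-backed adapter would enqueue a job here instead.
type MemoryPublisher struct {
	mu     sync.Mutex
	Events []Event
}

func (p *MemoryPublisher) Publish(_ context.Context, e Event) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.Events = append(p.Events, e)
	return nil
}

// SignupService is a hypothetical use-case that knows only the port.
type SignupService struct{ Events EventPublisher }

func (s SignupService) Signup(ctx context.Context, tenantID, email string) error {
	// Persisting the user is omitted; only the side-effect is shown.
	return s.Events.Publish(ctx, Event{
		Type:     "user.signed_up",
		TenantID: tenantID,
		Payload:  map[string]string{"email": email},
	})
}

// runSignup exercises the use-case with whichever adapter is supplied.
func runSignup(pub EventPublisher) error {
	svc := SignupService{Events: pub}
	return svc.Signup(context.Background(), "tenant-1", "user@example.com")
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stdout, nil))
	if err := runSignup(LogPublisher{Log: logger}); err != nil {
		panic(err)
	}
}
```

SignupService is identical in both cases; only the value passed to runSignup changes, which is the property the seam is meant to preserve.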