Table of Contents
- 06 — Redis Job Queue
- Overview
- Learning Objectives
- Requirements
- Architecture
- Suggested Project Layout
- Data Model / Database
- API Design
- Tech Stack
- Implementation Milestones
- Testing Strategy
- Deployment
- Documentation Deliverables
- Stretch Goals / Future Improvements
- Lessons-Learned Prompts
- Portfolio & Resume
06 — Redis Job Queue¶
Overview¶
This project builds a Redis-backed background job / worker system in Go: a producer/consumer task queue that lets you enqueue typed jobs from anywhere in your application and process them asynchronously on a pool of workers. It is the kind of infrastructure that sits behind "send the welcome email", "resize the uploaded image", "charge the customer", or "rebuild the search index" — work that must happen, but not on the request path.
The system is intentionally modeled on real production tooling. The canonical
open-source reference is hibiken/asynq, a
mature Redis-backed task queue used in production by many Go shops; other
references include Sidekiq (Ruby), Celery (Python), and River (Postgres-backed
Go). You will build a simplified version of this yourself so that you
understand, line by line, how reliability, retries, scheduling, and recovery
actually work — rather than treating the queue as a black box.
The hard part of a job queue is not "push a message onto a list and pop it off the other end". The hard part is everything around the edges:
- What happens when a worker crashes while holding a job? (recovery)
- What happens when a handler fails? (retries with backoff, then a DLQ)
- What happens when the same job is delivered twice? (idempotency)
- How do you run a job at 3am next Tuesday? (delayed / scheduled jobs)
- How do you know the system is healthy and keeping up? (metrics)
- How do you deploy a new version without dropping in-flight work? (graceful drain)
By the end you will have a small library (pkg/jobqueue) with an ergonomic
Client.Enqueue / Server.Register / Server.Run API, a worker binary, a
producer/enqueuer CLI, Prometheus metrics, and a Docker Compose stack that runs
Redis + Prometheus + Grafana.
Level: Advanced. This assumes you are comfortable with goroutines, channels,
context, interfaces, and encoding/json, and that you have used Redis before
(or are willing to learn its data structures as you go).
Learning Objectives¶
By completing this project you will be able to:
- Design a reliable queue on top of Redis that does not lose jobs when a worker process dies mid-flight.
- Explain and implement at-least-once delivery and why exactly-once is (practically) impossible without idempotent consumers.
- Write idempotent handlers and use Redis-based dedup keys to deduplicate work.
- Implement exponential backoff with jitter for retries, and a dead-letter queue (DLQ) for jobs that exhaust their attempts.
- Implement delayed / scheduled jobs using a Redis sorted set keyed by run-at timestamp, plus a scheduler goroutine that promotes due jobs.
- Implement stale-job recovery (re-queue jobs orphaned by a crashed worker)
using either an
LREM-from-processing pattern or Redis StreamsXAUTOCLAIM. - Build a bounded worker pool with backpressure and graceful shutdown that drains in-flight work before exiting.
- Instrument everything with Prometheus metrics: counters for processed / failed / retried jobs, gauges for queue depth, and a histogram for processing latency.
- Compare the two dominant Redis queue designs — Lists (
BRPOPLPUSH) vs Streams (consumer groups +XACK/XAUTOCLAIM) — and justify a choice in an ADR. - Test distributed-systems code with
miniredis(unit) andtestcontainers-go(integration), including a deliberate crash-recovery test.
Requirements¶
Functional¶
- Enqueue a job with a type + payload. A producer calls
client.Enqueue(ctx, NewTask("email:welcome", payload)). Thetypestring routes the job to a handler; thepayloadis opaque bytes (JSON). - Register typed handlers. A worker registers handlers by type, e.g.
mux.Handle("email:welcome", welcomeHandler). Unknown types are an error (job is failed/retried or sent to DLQ, per policy). - Retries with a max-attempts cap. When a handler returns an error, the job
is retried up to
MaxRetriestimes with exponential backoff + jitter between attempts. Each attempt incrementsAttemptand recordsLastError. - Dead-letter queue (DLQ). When a job exhausts
MaxRetries, it is moved to a DLQ (a Redis list/stream + metadata hash) rather than silently dropped, so it can be inspected, fixed, and optionally re-enqueued. - Scheduled / delayed jobs.
WithDelay(d)orWithRunAt(t)schedules a job for future execution. Delayed jobs live in a sorted set scored by run-at; a scheduler goroutine promotes due jobs to the pending queue. - Unique / idempotent jobs.
WithUniqueFor(ttl)prevents enqueueing a duplicate of the "same" job (by a uniqueness key derived from type+payload or a caller-supplied key) within a TTL window, using a RedisSET NXlock. - Queue priorities. Jobs can target named queues (e.g.
critical,default,low). Workers can be configured to drain higher-priority queues first (weighted / strict priority).
Non-Functional¶
- At-least-once delivery. Every successfully enqueued job is processed one or more times. Handlers MUST be idempotent. (We explicitly do not promise exactly-once.)
- No lost jobs on crash. If a worker is
kill -9'd while processing a job, that job MUST be recovered and re-processed. This is the core reliability guarantee and drives the reliable-queue / processing-list design. - Bounded concurrency per worker. Each worker process runs at most
Concurrencyhandlers at once. The pool must apply backpressure (do not dequeue more work than it can run). - Observability. Queue depth, processing rate, failure rate, retry rate, and processing latency must be exported as Prometheus metrics and graphable in Grafana.
- Graceful drain on shutdown. On
SIGTERM/SIGINT, the worker stops dequeuing new jobs, finishes in-flight jobs within a deadline, and exits cleanly. Anything not finished by the deadline must be safely recoverable.
Architecture¶
flowchart TD
subgraph Producers
P1[App / HTTP handler]
P2[Producer CLI / enqueuer]
P3[Cron trigger]
end
P1 -->|Enqueue| ENQ
P2 -->|Enqueue| ENQ
P3 -->|Enqueue| ENQ
ENQ{{Client.Enqueue}}
ENQ -->|immediate| PEND
ENQ -->|WithDelay / WithRunAt| SCHED
ENQ -->|WithUniqueFor| DEDUP[(SET NX dedup key)]
subgraph Redis
PEND[("pending list / stream\n(per queue: critical/default/low)")]
SCHED[("scheduled ZSET\nscore = run-at unix ms")]
PROC[("processing list\n(in-flight, per worker)")]
DLQ[("dead-letter queue\nlist + meta hash")]
META[("job hash\njob:{id} -> fields")]
end
subgraph Scheduler["Scheduler goroutine"]
TICK[tick every ~1s]
TICK -->|ZRANGEBYSCORE now| SCHED
SCHED -->|promote due jobs| PEND
end
subgraph WorkerPool["Worker pool (bounded concurrency)"]
DEQ[Dequeue\nBRPOPLPUSH pending -> processing]
DISP[Dispatch by type\nServeMux]
H[Handler]
ACK{handler result?}
end
PEND --> DEQ
DEQ --> PROC
DEQ --> DISP
DISP --> H
H --> ACK
ACK -->|success| ACKOK[LREM from processing\nDEL job hash]
ACKOK --> PROC
ACK -->|error & attempt < max| BACKOFF[compute backoff + jitter\nset RunAt = now + delay]
BACKOFF --> SCHED
BACKOFF -->|LREM from processing| PROC
ACK -->|error & attempt >= max| TODLQ[move to DLQ\nrecord LastError]
TODLQ --> DLQ
TODLQ -->|LREM from processing| PROC
subgraph Recovery["Recovery process (janitor)"]
SCAN[scan stale processing entries\n/ XAUTOCLAIM idle > visibility timeout]
SCAN -->|re-queue orphaned jobs| PEND
PROC --> SCAN
end
subgraph Metrics
H -.observe latency.-> HIST[processing_latency histogram]
ACKOK -.inc.-> CPROC[jobs_processed_total]
BACKOFF -.inc.-> CRETRY[jobs_retried_total]
TODLQ -.inc.-> CFAIL[jobs_failed_total]
PEND -.len.-> GDEPTH[queue_depth gauge]
end
Key flows:
- Enqueue writes job metadata and pushes the job id onto either the pending list (immediate) or the scheduled ZSET (delayed). Unique jobs first try to acquire a dedup key.
- Scheduler periodically promotes jobs whose run-at has passed from the ZSET into the pending list.
- Dequeue atomically moves a job id from
pendingtoprocessing(BRPOPLPUSH), guaranteeing the job is never "in limbo" — it is always in exactly one list. - Ack/Retry/DLQ: on success, the worker removes the id from
processing; on a retryable failure, it schedules a backoff retry; on terminal failure, it moves the job to the DLQ. - Recovery finds job ids that have sat in a
processinglist longer than the visibility timeout (because their worker died) and re-queues them.
Suggested Project Layout¶
Follows golang-standards/project-layout.
06-job-queue-redis/
├── cmd/
│ ├── worker/
│ │ └── main.go # builds a Server, registers handlers, Run()
│ └── producer/
│ └── main.go # enqueuer CLI: enqueue/inspect/dlq/requeue
├── internal/
│ ├── queue/
│ │ ├── broker.go # Broker interface + Redis implementation
│ │ ├── broker_redis.go # BRPOPLPUSH / Lua scripts, key layout
│ │ ├── job.go # Job struct, (de)serialization, status
│ │ ├── scheduler.go # ZSET scheduler: promote due jobs
│ │ ├── recovery.go # janitor: re-queue stale in-flight jobs
│ │ ├── retry.go # backoff + jitter policy
│ │ └── keys.go # Redis key naming helpers
│ ├── worker/
│ │ ├── pool.go # bounded worker pool, graceful drain
│ │ └── processor.go # dequeue -> dispatch -> ack/retry/dlq
│ ├── metrics/
│ │ └── metrics.go # Prometheus collectors + registration
│ └── admin/
│ └── admin.go # queue depth / DLQ inspection helpers
├── pkg/
│ └── jobqueue/ # PUBLIC API (importable by other projects)
│ ├── client.go # Client.Enqueue + Options
│ ├── server.go # Server.Register / Run / Shutdown
│ ├── task.go # Task, Job, Handler, ServeMux
│ └── options.go # WithQueue/WithMaxRetries/WithDelay/...
├── examples/
│ ├── email/ # example handler + enqueuer
│ └── image-resize/ # example with larger payloads
├── deploy/
│ ├── docker-compose.yml # redis + prometheus + grafana + worker
│ ├── prometheus.yml
│ └── grafana/
│ └── dashboards/jobqueue.json
├── Dockerfile # multi-stage build for worker + producer
├── go.mod
├── go.sum
├── README.md
└── SPEC.md
internal/ holds the engine; pkg/jobqueue/ is the thin, stable public surface
that re-exports the types other applications should depend on.
Data Model / Database¶
The Job¶
// Status is the lifecycle state of a job.
type Status string
const (
StatusPending Status = "pending" // waiting in a queue
StatusScheduled Status = "scheduled" // in the ZSET, not yet due
StatusActive Status = "active" // in a processing list, being run
StatusRetry Status = "retry" // failed, scheduled for another try
StatusCompleted Status = "completed" // succeeded (usually then deleted)
StatusDead Status = "dead" // exhausted retries -> DLQ
)
// Job is the full record persisted in Redis.
type Job struct {
ID string `json:"id"` // ULID/UUID
Type string `json:"type"` // routes to a handler
Payload json.RawMessage `json:"payload"` // opaque task data
Queue string `json:"queue"` // e.g. "default"
MaxRetries int `json:"max_retries"` // attempt cap
Attempt int `json:"attempt"` // attempts so far
RunAt time.Time `json:"run_at"` // earliest exec time
Status Status `json:"status"`
UniqueKey string `json:"unique_key,omitempty"`
CreatedAt time.Time `json:"created_at"`
LastError string `json:"last_error,omitempty"`
}
Jobs are serialized as JSON (compact, debuggable, language-agnostic). For a
performance stretch goal you could swap to msgpack or protobuf behind the
serializer.
Redis key layout¶
| Purpose | Type | Key | Notes |
|---|---|---|---|
| Job metadata | HASH | jq:job:{id} |
full Job fields (or a single JSON field) |
| Pending queue | LIST | jq:queue:{name} |
job ids, LPUSH in / BRPOPLPUSH out |
| Processing (in-flight) | LIST | jq:processing:{name}:{workerID} |
ids currently held by a worker |
| Scheduled / delayed | ZSET | jq:scheduled |
member = id, score = run-at unix ms |
| Retry (backoff waiting) | ZSET | jq:retry |
same shape as scheduled |
| Dead-letter queue | LIST | jq:dlq |
dead job ids |
| DLQ metadata | HASH | jq:job:{id} (status=dead) |
retained for inspection |
| Uniqueness lock | STRING | jq:unique:{hash} |
SET NX EX ttl, holds job id |
| Heartbeat / liveness | STRING | jq:worker:{id}:alive |
SET EX, used by recovery |
Reliable-queue pattern (Lists — recommended default)¶
The crux of "no lost jobs on crash" is that a job id is always in exactly one list. Dequeue uses an atomic move:
This atomically pops the id from the pending list and pushes it onto the worker's processing list, returning the id. Now:
- Success:
LREM jq:processing:... 1 {id}andDEL jq:job:{id}. - Retryable failure: compute backoff,
ZADD jq:retry <runAt> {id}, thenLREMit from processing. (Done atomically in a Lua script so the id is never in both retry and processing.) - Terminal failure:
LPUSH jq:dlq {id}, mark status=dead,LREMfrom processing. - Crash: the id stays in
jq:processing:default:{workerID}. The recovery janitor notices the worker's heartbeat has expired (or the entry is older than the visibility timeout) andRPOPLPUSH'es the id back onto the pending queue.
Multi-step ack/retry/DLQ transitions are written as Lua scripts (EVAL) so
they are atomic against concurrent workers and the scheduler.
RPOPLPUSH/BRPOPLPUSHare deprecated in favor ofLMOVE/BLMOVEin Redis 6.2+. UseBLMOVE ... RIGHT LEFTin new code; the semantics are identical. The doc and ADR should call this out.
Alternative: Redis Streams + consumer groups¶
Instead of lists you can use a Redis Stream per queue with a consumer group:
XADD jq:stream:default * id {id}to enqueue.XREADGROUP GROUP workers {consumer} COUNT n BLOCK t STREAMS jq:stream:default >to dequeue. Delivered-but-unacked entries sit in the consumer's Pending Entries List (PEL).XACK jq:stream:default workers {entryID}on success.XAUTOCLAIM jq:stream:default workers {consumer} {minIdleTime} 0to recover entries whose original consumer died (idle longer than the visibility timeout). This replaces the manual processing-list janitor.
Streams give you built-in recovery (PEL + XAUTOCLAIM), consumer-group
fan-out, and an inspectable backlog at the cost of more complex semantics and
trimming concerns. The ADR (below) compares the two; the recommended starting
point for learning is the Lists approach because the moving parts are
explicit.
API Design¶
The public library API (in pkg/jobqueue) is the deliverable other apps import.
It is deliberately close in spirit to asynq.
package jobqueue
import (
"context"
"time"
)
// ---- Producer side ----------------------------------------------------------
// Task is what a producer creates and enqueues.
type Task struct {
typ string
payload []byte
}
func NewTask(typ string, payload []byte) *Task { /* ... */ }
func (t *Task) Type() string { return t.typ }
func (t *Task) Payload() []byte { return t.payload }
// Client enqueues tasks. Safe for concurrent use.
type Client struct{ /* redis client, opts */ }
func NewClient(redisOpt RedisConnOpt) (*Client, error)
// Enqueue persists the task and returns info about the created job.
func (c *Client) Enqueue(ctx context.Context, task *Task, opts ...Option) (*JobInfo, error)
func (c *Client) Close() error
// JobInfo is returned from Enqueue (id, queue, state, run-at).
type JobInfo struct {
ID string
Queue string
Status Status
RunAt time.Time
}
// ---- Enqueue options --------------------------------------------------------
type Option interface{ apply(*enqueueConfig) }
func WithQueue(name string) Option // target queue / priority
func WithMaxRetries(n int) Option // attempt cap
func WithDelay(d time.Duration) Option // run after now+d
func WithRunAt(t time.Time) Option // run at an absolute time
func WithUniqueFor(ttl time.Duration) Option // dedup window (idempotent enqueue)
func WithUniqueKey(key string) Option // explicit dedup key
func WithTimeout(d time.Duration) Option // per-handler execution timeout
// ---- Consumer side ----------------------------------------------------------
// Job is the read-only view a handler receives.
type Job struct{ /* ... fields mirror the persisted Job ... */ }
func (j *Job) Type() string
func (j *Job) Payload() []byte
func (j *Job) Attempt() int // 0-based attempt count, for idempotency logic
// Handler processes one job. Returning nil acks; returning an error triggers
// retry/DLQ. Returning errors wrapped with SkipRetry sends straight to DLQ.
type Handler interface {
ProcessJob(ctx context.Context, job *Job) error
}
type HandlerFunc func(ctx context.Context, job *Job) error
func (f HandlerFunc) ProcessJob(ctx context.Context, j *Job) error { return f(ctx, j) }
// SkipRetry, wrapped around a handler error, marks it non-retryable.
var SkipRetry = errors.New("skip retry for this job")
// ServeMux routes by task type, like http.ServeMux.
type ServeMux struct{ /* ... */ }
func NewServeMux() *ServeMux
func (m *ServeMux) Handle(typ string, h Handler)
func (m *ServeMux) HandleFunc(typ string, fn HandlerFunc)
// Server runs the worker pool.
type Server struct{ /* ... */ }
type Config struct {
Concurrency int // max simultaneous handlers
Queues map[string]int // queue -> weight (priority)
ShutdownTimeout time.Duration // graceful drain deadline
RetryPolicy RetryFunc // backoff strategy
}
func NewServer(redisOpt RedisConnOpt, cfg Config) *Server
// Run blocks, processing jobs until ctx is cancelled or a signal arrives,
// then drains in-flight work within ShutdownTimeout.
func (s *Server) Run(mux *ServeMux) error
func (s *Server) Shutdown(ctx context.Context) error
// RetryFunc computes the delay before the next attempt.
type RetryFunc func(attempt int, err error, job *Job) time.Duration
Typical producer:
client, _ := jobqueue.NewClient(jobqueue.RedisConnOpt{Addr: "localhost:6379"})
payload, _ := json.Marshal(WelcomeEmail{UserID: 42})
info, _ := client.Enqueue(ctx, jobqueue.NewTask("email:welcome", payload),
jobqueue.WithQueue("critical"),
jobqueue.WithMaxRetries(5),
jobqueue.WithUniqueFor(10*time.Minute),
)
Typical worker:
mux := jobqueue.NewServeMux()
mux.HandleFunc("email:welcome", func(ctx context.Context, j *jobqueue.Job) error {
var p WelcomeEmail
if err := json.Unmarshal(j.Payload(), &p); err != nil {
return fmt.Errorf("%w: bad payload: %v", jobqueue.SkipRetry, err)
}
return sendWelcomeEmail(ctx, p.UserID) // idempotent
})
srv := jobqueue.NewServer(redisOpt, jobqueue.Config{
Concurrency: 20,
Queues: map[string]int{"critical": 6, "default": 3, "low": 1},
ShutdownTimeout: 30 * time.Second,
})
log.Fatal(srv.Run(mux))
Admin surface (CLI + HTTP)¶
A small inspector for operators, exposed via the producer/admin CLI and an
optional HTTP endpoint:
producer enqueue --type email:welcome --queue default --payload '{"user":1}'
producer stats # depth per queue, scheduled count, active count
producer dlq list # list dead jobs (id, type, last_error, attempts)
producer dlq requeue <jobID> # move a dead job back to its queue
producer dlq purge # clear the DLQ
HTTP (read-only) for dashboards/health:
GET /healthz -> 200 if worker is alive and Redis reachable
GET /stats -> JSON: per-queue depth, scheduled, active, dlq
GET /metrics -> Prometheus exposition (client_golang)
Tech Stack¶
| Concern | Choice |
|---|---|
| Language | Go 1.22+ (log/slog, errors.Join) |
| Redis client | github.com/redis/go-redis/v9 |
| Metrics | github.com/prometheus/client_golang |
| Structured logging | log/slog (stdlib) |
| Backoff | github.com/cenkalti/backoff/v4 or hand-rolled exponential + jitter |
| Unit tests (fake Redis) | github.com/alicebob/miniredis/v2 |
| Integration tests | github.com/testcontainers/testcontainers-go (+ modules/redis) |
| Assertions | github.com/stretchr/testify |
| IDs | github.com/oklog/ulid/v2 or github.com/google/uuid |
| CLI | stdlib flag, or github.com/spf13/cobra for subcommands |
Note:
miniredissupports LuaEVAL, lists, sorted sets, and hashes, so the reliable-queue logic can be unit tested without a real server. It does not faithfully emulateBLMOVEblocking timeouts, so blocking-dequeue behavior is best covered by the testcontainers integration tests.
Reference implementation to study (do not vendor — build your own):
github.com/hibiken/asynq.
Implementation Milestones¶
- Project skeleton —
go.mod, layout, Redis connection helper, slog setup. - Basic enqueue / dequeue —
Client.Enqueue-> pending list; workerBLMOVEloop -> processing list -> ack withLREM. Job persisted as a hash. - Typed handler dispatch —
ServeMux.Handle, route byType, unknown type handling, per-job timeout viacontext.WithTimeout. - Retries + backoff — exponential backoff with full jitter,
MaxRetriescap,LastErrorcapture,SkipRetrysentinel, retry ZSET, atomic Lua transition from processing -> retry. - Dead-letter queue — terminal failures moved to DLQ with metadata;
admin
dlq list/requeue/purge. - Delayed jobs + scheduler —
WithDelay/WithRunAt, scheduled ZSET, scheduler goroutine promoting due jobs (ZRANGEBYSCORE+ atomic move). - Stale-job recovery — worker heartbeats, janitor that re-queues
orphaned in-flight ids past the visibility timeout (or
XAUTOCLAIM). - Unique / idempotent enqueue —
SET NX EXdedup keys,WithUniqueFor, release on completion. - Queue priorities — weighted dequeue across
critical/default/low. - Metrics — counters, gauges, histogram;
/metricsendpoint; queue-depth collector. - Graceful shutdown — signal handling, stop dequeue, drain within deadline, requeue-on-timeout safety.
- Admin / inspector — CLI + read-only HTTP
/stats,/healthz. - Polish — examples, README, Docker Compose, Grafana dashboard.
Testing Strategy¶
- Unit tests against
miniredis— fast, in-process. Cover key layout, serialization, scheduler promotion logic, backoff computation, retry/DLQ Lua transitions, dedup-key behavior, and admin queries. Run on every commit. - Integration tests against real Redis via
testcontainers-go— spin up a throwawayredis:7container, exercise full enqueue -> process -> ack and the blockingBLMOVEdequeue path that miniredis can't fully emulate. Verify priority ordering and scheduler timing against a real server. - Crash-recovery test (the headline test) — enqueue a job; start a worker whose handler blocks/sleeps; forcibly cancel/kill the worker (or simulate by abandoning the processing-list entry and expiring the heartbeat) while the job is in-flight; start the recovery janitor + a fresh worker; assert the job is re-queued and ultimately processed. This proves "no lost jobs on crash".
- Idempotency tests — deliver the same job twice (force a duplicate
delivery) and assert the handler's observable side effect happens once when
guarded by a dedup key /
Attempt-aware logic. - At-least-once / retry tests — handler fails N-1 times then succeeds;
assert exactly the expected number of attempts and backoff scheduling; handler
always fails -> assert it lands in the DLQ with
LastError. - Race detector — run the worker-pool and scheduler tests with
go test -raceto catch data races in the concurrent dequeue/ack paths. - Metrics assertions — use
prometheus/client_golang'stestutil(CollectAndCompare,ToFloat64) to assertjobs_processed_total,jobs_failed_total,jobs_retried_total, and queue-depth gauge move as expected. - Graceful-shutdown test — start processing, send shutdown, assert in-flight jobs finish and no new jobs are dequeued after the drain begins.
Target: meaningful coverage of internal/queue and internal/worker; CI runs
unit tests always and integration tests behind a build tag / when Docker is
available.
Deployment¶
- Docker images — a multi-stage
Dockerfilebuilds static binaries for bothcmd/workerandcmd/producerfrom one source tree (build target selectable via build arg or two stages). Final images arescratch/distroless-based. - docker-compose (
deploy/docker-compose.yml) brings up: redis:7(with anappendonly yes/ AOF note for durability),- one or more
workerservices, prometheusscraping each worker's/metrics,grafanaprovisioned with the jobqueue dashboard.- Horizontal scaling — workers are stateless; run
Nreplicas (docker compose up --scale worker=4or a KubernetesDeployment). They coordinate purely through Redis; each has a unique worker id for its processing list and heartbeat. - Concurrency config —
Concurrencyand per-queue weights are set via env (JQ_CONCURRENCY,JQ_QUEUES=critical:6,default:3,low:1). Tune total throughput = replicas × concurrency. - Liveness via metrics / health —
/healthzfor liveness/readiness probes; alert in Prometheus whenqueue_depthgrows unbounded (consumers can't keep up) or whenjobs_failed_totalrate spikes (DLQ filling). - Operational notes — Redis is the single point of failure; document running Redis with AOF persistence and/or replication, and the visibility-timeout vs handler-duration tradeoff (too short -> duplicate processing of slow jobs; too long -> slow recovery after a crash).
Documentation Deliverables¶
- README — what it is, quickstart (
docker compose up), a runnable producer + worker example, configuration reference, and the admin CLI usage. - Delivery-semantics doc — a short
docs/delivery-semantics.mdstating plainly: this system is at-least-once; handlers must be idempotent; here is why exactly-once is not offered and how dedup keys narrow the window. - ADR: Redis Lists vs Streams —
docs/adr/0001-lists-vs-streams.mdcomparingBRPOPLPUSH/BLMOVEreliable lists against Streams consumer groups (XREADGROUP+XACK+XAUTOCLAIM) on: recovery ergonomics, inspectability, memory/trimming, fan-out, and complexity — with the decision and its consequences. - Grafana dashboard JSON —
deploy/grafana/dashboards/jobqueue.jsonwith panels for queue depth per queue, processing rate, failure/retry rate, p50/p95 latency, and DLQ size. - Godoc — package-level doc comments on
pkg/jobqueuewith the producer and worker examples above as runnableExamplefunctions.
Stretch Goals / Future Improvements¶
- Rate limiting per queue — token-bucket (Redis-backed) so a queue processes at most R jobs/sec regardless of worker count (e.g. respect a third-party API limit).
- Job dependencies / workflows — fan-out/fan-in: a job that enqueues children and a "group" that completes only when all children finish (chains, groups, chords).
- Cron-style periodic jobs — a scheduler component that enqueues jobs on a
cron spec (
@every 5m,0 3 * * *) with leader election so only one worker enqueues each tick. - Web dashboard — a small UI (like asynqmon) over the admin API to browse queues, scheduled jobs, and the DLQ, and to requeue/cancel.
- Exactly-once-ish via dedup keys — persist a processed-job marker keyed by job id with a TTL so re-deliveries become no-ops at the framework level (still fundamentally at-least-once, but with framework-level dedup).
- Pluggable broker backends — formalize the
Brokerinterface and add a second backend (e.g. PostgresSELECT ... FOR UPDATE SKIP LOCKED, or in-memory for tests) to prove the abstraction. - OpenTelemetry tracing — propagate a trace context from enqueue through processing so a job's whole life is one distributed trace; export spans + the existing Prometheus metrics via OTel.
Lessons-Learned Prompts¶
- Why is exactly-once delivery practically impossible in a system like this, and what does at-least-once force every handler author to do?
- Walk through exactly what happens to an in-flight job when a worker is
kill -9'd. Which Redis key holds it, who notices, and how is it recovered? What could still go wrong (e.g. the side effect already happened)? - How did you make a handler idempotent? Where does the dedup key live, what is its TTL, and what is the failure mode if the TTL is too short or too long?
- Why exponential backoff with jitter rather than a fixed retry delay or plain exponential backoff? What problem does the jitter solve under load?
- What is your "visibility timeout" (how long a job may sit in processing before recovery re-queues it), and how does its value trade off duplicate processing against recovery latency?
- How does the worker pool apply backpressure when handlers are slower than the enqueue rate, and how would you detect and alert on a queue that is falling behind?
Portfolio & Resume¶
Resume Bullets¶
- Built a Redis-backed distributed job queue in Go (
go-redis/v9) providing at-least-once delivery with a reliable-queue (BLMOVEprocessing-list) design, automatic crash recovery of orphaned in-flight jobs, and a dead-letter queue for poison messages. - Implemented exponential-backoff-with-jitter retries, delayed/scheduled jobs
via a Redis sorted set, idempotent (deduplicated) enqueue, and per-queue
priorities behind an ergonomic
Enqueue/Register/Runlibrary API. - Instrumented the system with Prometheus metrics (throughput, failure & retry
rates, queue-depth gauges, processing-latency histogram) and shipped a Docker
Compose stack with Grafana dashboards; validated reliability with
testcontainers-gointegration and crash-recovery tests under-race.
Interview Talking Points¶
- The reliable-queue pattern: why an atomic
BLMOVE pending -> processing(formerlyBRPOPLPUSH) keeps a job in exactly one list and makes crash recovery possible, vs a naiveLPOPthat loses jobs on crash. - At-least-once vs exactly-once: why you chose at-least-once, why exactly-once is a myth without idempotent consumers, and how dedup keys narrow (but don't close) the duplicate window.
- Idempotency: concrete strategies — natural idempotency, dedup markers
keyed by job id, and
Attempt-aware handlers. - Dead-letter queues: why poison messages must be quarantined rather than retried forever, and how operators inspect and requeue them.
- Backoff + jitter: preventing thundering-herd retries against a failing downstream by spreading retry times.
- Observability & backpressure: which metrics tell you the system is healthy, how queue depth signals that consumers can't keep up, and how the bounded worker pool applies backpressure.
- Lists vs Streams: the tradeoffs that drove the ADR, and when Streams'
built-in
XAUTOCLAIMrecovery would be worth the added complexity.