06 — Redis Job Queue

Overview

This project builds a Redis-backed background job / worker system in Go: a producer/consumer task queue that lets you enqueue typed jobs from anywhere in your application and process them asynchronously on a pool of workers. It is the kind of infrastructure that sits behind "send the welcome email", "resize the uploaded image", "charge the customer", or "rebuild the search index" — work that must happen, but not on the request path.

The system is intentionally modeled on real production tooling. The canonical open-source reference is hibiken/asynq, a mature Redis-backed task queue used in production by many Go shops; other references include Sidekiq (Ruby), Celery (Python), and River (Postgres-backed Go). You will build a simplified version of this yourself so that you understand, line by line, how reliability, retries, scheduling, and recovery actually work — rather than treating the queue as a black box.

The hard part of a job queue is not "push a message onto a list and pop it off the other end". The hard part is everything around the edges:

  • What happens when a worker crashes while holding a job? (recovery)
  • What happens when a handler fails? (retries with backoff, then a DLQ)
  • What happens when the same job is delivered twice? (idempotency)
  • How do you run a job at 3am next Tuesday? (delayed / scheduled jobs)
  • How do you know the system is healthy and keeping up? (metrics)
  • How do you deploy a new version without dropping in-flight work? (graceful drain)

By the end you will have a small library (pkg/jobqueue) with an ergonomic Client.Enqueue / Server.Register / Server.Run API, a worker binary, a producer/enqueuer CLI, Prometheus metrics, and a Docker Compose stack that runs Redis + Prometheus + Grafana.

Level: Advanced. This assumes you are comfortable with goroutines, channels, context, interfaces, and encoding/json, and that you have used Redis before (or are willing to learn its data structures as you go).

Learning Objectives

By completing this project you will be able to:

  • Design a reliable queue on top of Redis that does not lose jobs when a worker process dies mid-flight.
  • Explain and implement at-least-once delivery and why exactly-once is (practically) impossible without idempotent consumers.
  • Write idempotent handlers and use Redis-based dedup keys to deduplicate work.
  • Implement exponential backoff with jitter for retries, and a dead-letter queue (DLQ) for jobs that exhaust their attempts.
  • Implement delayed / scheduled jobs using a Redis sorted set keyed by run-at timestamp, plus a scheduler goroutine that promotes due jobs.
  • Implement stale-job recovery (re-queue jobs orphaned by a crashed worker) using either an LREM-from-processing pattern or Redis Streams XAUTOCLAIM.
  • Build a bounded worker pool with backpressure and graceful shutdown that drains in-flight work before exiting.
  • Instrument everything with Prometheus metrics: counters for processed / failed / retried jobs, gauges for queue depth, and a histogram for processing latency.
  • Compare the two dominant Redis queue designs — Lists (BRPOPLPUSH) vs Streams (consumer groups + XACK/XAUTOCLAIM) — and justify a choice in an ADR.
  • Test distributed-systems code with miniredis (unit) and testcontainers-go (integration), including a deliberate crash-recovery test.

Requirements

Functional

  • Enqueue a job with a type + payload. A producer calls client.Enqueue(ctx, NewTask("email:welcome", payload)). The type string routes the job to a handler; the payload is opaque bytes (JSON).
  • Register typed handlers. A worker registers handlers by type, e.g. mux.Handle("email:welcome", welcomeHandler). Unknown types are an error (job is failed/retried or sent to DLQ, per policy).
  • Retries with a max-attempts cap. When a handler returns an error, the job is retried up to MaxRetries times with exponential backoff + jitter between attempts. Each attempt increments Attempt and records LastError.
  • Dead-letter queue (DLQ). When a job exhausts MaxRetries, it is moved to a DLQ (a Redis list/stream + metadata hash) rather than silently dropped, so it can be inspected, fixed, and optionally re-enqueued.
  • Scheduled / delayed jobs. WithDelay(d) or WithRunAt(t) schedules a job for future execution. Delayed jobs live in a sorted set scored by run-at; a scheduler goroutine promotes due jobs to the pending queue.
  • Unique / idempotent jobs. WithUniqueFor(ttl) prevents enqueueing a duplicate of the "same" job (by a uniqueness key derived from type+payload or a caller-supplied key) within a TTL window, using a Redis SET NX lock.
  • Queue priorities. Jobs can target named queues (e.g. critical, default, low). Workers can be configured to drain higher-priority queues first (weighted / strict priority).
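
The uniqueness key mentioned above can be derived in many ways. The following is a minimal sketch, assuming a SHA-256 digest over the type and the raw payload bytes; the function name uniqueKey and the NUL separator are illustrative choices, not part of the spec. The resulting key would then be claimed with SET key jobID NX EX ttl.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// uniqueKey derives a dedup key from the task type and payload.
// The "jq:unique:" prefix follows the key layout in this document;
// the hashing scheme itself is an assumption of this sketch.
func uniqueKey(typ string, payload []byte) string {
	h := sha256.New()
	h.Write([]byte(typ))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	h.Write(payload)
	return "jq:unique:" + hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := uniqueKey("email:welcome", []byte(`{"user_id":42}`))
	b := uniqueKey("email:welcome", []byte(`{"user_id":42}`))
	c := uniqueKey("email:welcome", []byte(`{"user_id":43}`))
	fmt.Println(a == b, a == c) // true false
}
```

Hashing raw bytes means two payloads that differ only in JSON whitespace or field order count as different jobs. A caller who needs semantic equality can pass an explicit key via WithUniqueKey instead.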

Non-Functional

  • At-least-once delivery. Every successfully enqueued job is processed one or more times. Handlers MUST be idempotent. (We explicitly do not promise exactly-once.)
  • No lost jobs on crash. If a worker is kill -9'd while processing a job, that job MUST be recovered and re-processed. This is the core reliability guarantee and drives the reliable-queue / processing-list design.
  • Bounded concurrency per worker. Each worker process runs at most Concurrency handlers at once. The pool must apply backpressure (do not dequeue more work than it can run).
  • Observability. Queue depth, processing rate, failure rate, retry rate, and processing latency must be exported as Prometheus metrics and graphable in Grafana.
  • Graceful drain on shutdown. On SIGTERM/SIGINT, the worker stops dequeuing new jobs, finishes in-flight jobs within a deadline, and exits cleanly. Anything not finished by the deadline must be safely recoverable.

Architecture

flowchart TD
    subgraph Producers
        P1[App / HTTP handler]
        P2[Producer CLI / enqueuer]
        P3[Cron trigger]
    end

    P1 -->|Enqueue| ENQ
    P2 -->|Enqueue| ENQ
    P3 -->|Enqueue| ENQ

    ENQ{{Client.Enqueue}}
    ENQ -->|immediate| PEND
    ENQ -->|WithDelay / WithRunAt| SCHED
    ENQ -->|WithUniqueFor| DEDUP[(SET NX dedup key)]

    subgraph Redis
        PEND[("pending list / stream\n(per queue: critical/default/low)")]
        SCHED[("scheduled ZSET\nscore = run-at unix ms")]
        PROC[("processing list\n(in-flight, per worker)")]
        DLQ[("dead-letter queue\nlist + meta hash")]
        META[("job hash\njob:{id} -> fields")]
    end

    subgraph Scheduler["Scheduler goroutine"]
        TICK[tick every ~1s]
        TICK -->|ZRANGEBYSCORE now| SCHED
        SCHED -->|promote due jobs| PEND
    end

    subgraph WorkerPool["Worker pool (bounded concurrency)"]
        DEQ[Dequeue\nBRPOPLPUSH pending -> processing]
        DISP[Dispatch by type\nServeMux]
        H[Handler]
        ACK{handler result?}
    end

    PEND --> DEQ
    DEQ --> PROC
    DEQ --> DISP
    DISP --> H
    H --> ACK
    ACK -->|success| ACKOK[LREM from processing\nDEL job hash]
    ACKOK --> PROC

    ACK -->|error & attempt < max| BACKOFF[compute backoff + jitter\nset RunAt = now + delay]
    BACKOFF --> SCHED
    BACKOFF -->|LREM from processing| PROC

    ACK -->|error & attempt >= max| TODLQ[move to DLQ\nrecord LastError]
    TODLQ --> DLQ
    TODLQ -->|LREM from processing| PROC

    subgraph Recovery["Recovery process (janitor)"]
        SCAN[scan stale processing entries\n/ XAUTOCLAIM idle > visibility timeout]
        SCAN -->|re-queue orphaned jobs| PEND
        PROC --> SCAN
    end

    subgraph Metrics
        H -.observe latency.-> HIST[processing_latency histogram]
        ACKOK -.inc.-> CPROC[jobs_processed_total]
        BACKOFF -.inc.-> CRETRY[jobs_retried_total]
        TODLQ -.inc.-> CFAIL[jobs_failed_total]
        PEND -.len.-> GDEPTH[queue_depth gauge]
    end

Key flows:

  1. Enqueue writes job metadata and pushes the job id onto either the pending list (immediate) or the scheduled ZSET (delayed). Unique jobs first try to acquire a dedup key.
  2. Scheduler periodically promotes jobs whose run-at has passed from the ZSET into the pending list.
  3. Dequeue atomically moves a job id from pending to processing (BRPOPLPUSH), guaranteeing the job is never "in limbo" — it is always in exactly one list.
  4. Ack/Retry/DLQ: on success, the worker removes the id from processing; on a retryable failure, it schedules a backoff retry; on terminal failure, it moves the job to the DLQ.
  5. Recovery finds job ids that have sat in a processing list longer than the visibility timeout (because their worker died) and re-queues them.
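
The dequeue, ack, and recovery flows above can be modeled without Redis to make the invariant visible. The following in-memory stand-in is purely illustrative (memQueue and its methods are hypothetical names). In the real broker each method corresponds to a single atomic Redis command or Lua script.

```go
package main

import "fmt"

// memQueue is an in-memory stand-in for the pending and processing lists.
// It only illustrates the state transitions; it is not safe for concurrent
// use and does not talk to Redis.
type memQueue struct {
	pending    []string            // like jq:queue:{name}
	processing map[string][]string // workerID -> in-flight job ids
}

func newMemQueue() *memQueue {
	return &memQueue{processing: make(map[string][]string)}
}

func (q *memQueue) enqueue(id string) { q.pending = append(q.pending, id) }

// dequeue moves the oldest pending id onto the worker's processing list,
// so the id is always in exactly one of the two places.
func (q *memQueue) dequeue(worker string) (string, bool) {
	if len(q.pending) == 0 {
		return "", false
	}
	id := q.pending[0]
	q.pending = q.pending[1:]
	q.processing[worker] = append(q.processing[worker], id)
	return id, true
}

// ack removes a finished id from the worker's processing list (like LREM).
func (q *memQueue) ack(worker, id string) {
	ids := q.processing[worker]
	for i, v := range ids {
		if v == id {
			q.processing[worker] = append(ids[:i], ids[i+1:]...)
			return
		}
	}
}

// recoverWorker re-queues everything a dead worker was still holding.
func (q *memQueue) recoverWorker(worker string) int {
	ids := q.processing[worker]
	q.pending = append(q.pending, ids...)
	delete(q.processing, worker)
	return len(ids)
}

func main() {
	q := newMemQueue()
	q.enqueue("job-1")
	q.enqueue("job-2")

	first, _ := q.dequeue("w1")
	second, _ := q.dequeue("w1")
	q.ack("w1", first) // job-1 finished normally

	// w1 "crashes" while still holding job-2; the janitor re-queues it.
	recovered := q.recoverWorker("w1")
	again, _ := q.dequeue("w2")
	fmt.Println(recovered, again == second) // 1 true
}
```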

Suggested Project Layout

Follows golang-standards/project-layout.

06-job-queue-redis/
├── cmd/
│   ├── worker/
│   │   └── main.go            # builds a Server, registers handlers, Run()
│   └── producer/
│       └── main.go            # enqueuer CLI: enqueue/inspect/dlq/requeue
├── internal/
│   ├── queue/
│   │   ├── broker.go          # Broker interface + Redis implementation
│   │   ├── broker_redis.go    # BRPOPLPUSH / Lua scripts, key layout
│   │   ├── job.go             # Job struct, (de)serialization, status
│   │   ├── scheduler.go       # ZSET scheduler: promote due jobs
│   │   ├── recovery.go        # janitor: re-queue stale in-flight jobs
│   │   ├── retry.go           # backoff + jitter policy
│   │   └── keys.go            # Redis key naming helpers
│   ├── worker/
│   │   ├── pool.go            # bounded worker pool, graceful drain
│   │   └── processor.go       # dequeue -> dispatch -> ack/retry/dlq
│   ├── metrics/
│   │   └── metrics.go         # Prometheus collectors + registration
│   └── admin/
│       └── admin.go           # queue depth / DLQ inspection helpers
├── pkg/
│   └── jobqueue/              # PUBLIC API (importable by other projects)
│       ├── client.go          # Client.Enqueue + Options
│       ├── server.go          # Server.Register / Run / Shutdown
│       ├── task.go            # Task, Job, Handler, ServeMux
│       └── options.go         # WithQueue/WithMaxRetries/WithDelay/...
├── examples/
│   ├── email/                 # example handler + enqueuer
│   └── image-resize/          # example with larger payloads
├── deploy/
│   ├── docker-compose.yml     # redis + prometheus + grafana + worker
│   ├── prometheus.yml
│   └── grafana/
│       └── dashboards/jobqueue.json
├── Dockerfile                 # multi-stage build for worker + producer
├── go.mod
├── go.sum
├── README.md
└── SPEC.md

internal/ holds the engine; pkg/jobqueue/ is the thin, stable public surface that re-exports the types other applications should depend on.

Data Model / Database

The Job

// Status is the lifecycle state of a job.
type Status string

const (
    StatusPending    Status = "pending"    // waiting in a queue
    StatusScheduled  Status = "scheduled"  // in the ZSET, not yet due
    StatusActive     Status = "active"     // in a processing list, being run
    StatusRetry      Status = "retry"      // failed, scheduled for another try
    StatusCompleted  Status = "completed"  // succeeded (usually then deleted)
    StatusDead       Status = "dead"       // exhausted retries -> DLQ
)

// Job is the full record persisted in Redis.
type Job struct {
    ID         string          `json:"id"`          // ULID/UUID
    Type       string          `json:"type"`        // routes to a handler
    Payload    json.RawMessage `json:"payload"`     // opaque task data
    Queue      string          `json:"queue"`       // e.g. "default"
    MaxRetries int             `json:"max_retries"` // attempt cap
    Attempt    int             `json:"attempt"`     // attempts so far
    RunAt      time.Time       `json:"run_at"`      // earliest exec time
    Status     Status          `json:"status"`
    UniqueKey  string          `json:"unique_key,omitempty"`
    CreatedAt  time.Time       `json:"created_at"`
    LastError  string          `json:"last_error,omitempty"`
}

Jobs are serialized as JSON (compact, debuggable, language-agnostic). For a performance stretch goal you could swap to msgpack or protobuf behind the serializer.

Redis key layout

| Purpose | Type | Key | Notes |
| --- | --- | --- | --- |
| Job metadata | HASH | jq:job:{id} | full Job fields (or a single JSON field) |
| Pending queue | LIST | jq:queue:{name} | job ids, LPUSH in / BRPOPLPUSH out |
| Processing (in-flight) | LIST | jq:processing:{name}:{workerID} | ids currently held by a worker |
| Scheduled / delayed | ZSET | jq:scheduled | member = id, score = run-at unix ms |
| Retry (backoff waiting) | ZSET | jq:retry | same shape as scheduled |
| Dead-letter queue | LIST | jq:dlq | dead job ids |
| DLQ metadata | HASH | jq:job:{id} (status=dead) | retained for inspection |
| Uniqueness lock | STRING | jq:unique:{hash} | SET NX EX ttl, holds job id |
| Heartbeat / liveness | STRING | jq:worker:{id}:alive | SET EX, used by recovery |
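
Centralizing key construction (the role of keys.go in the suggested layout) keeps this naming consistent across the broker, scheduler, and janitor. A minimal sketch follows, with hypothetical helper names that follow the key layout listed above:

```go
package main

import "fmt"

// Fixed keys from the key layout.
const (
	scheduledKey = "jq:scheduled"
	retryKey     = "jq:retry"
	dlqKey       = "jq:dlq"
)

// jobKey is the hash holding a job's metadata.
func jobKey(id string) string { return "jq:job:" + id }

// queueKey is the pending list for a named queue.
func queueKey(name string) string { return "jq:queue:" + name }

// processingKey is the in-flight list owned by one worker for one queue.
func processingKey(queue, workerID string) string {
	return fmt.Sprintf("jq:processing:%s:%s", queue, workerID)
}

// uniqueLockKey is the SET NX lock for a dedup hash.
func uniqueLockKey(hash string) string { return "jq:unique:" + hash }

// heartbeatKey is the expiring liveness marker for a worker.
func heartbeatKey(workerID string) string {
	return fmt.Sprintf("jq:worker:%s:alive", workerID)
}

func main() {
	fmt.Println(queueKey("default"))
	fmt.Println(processingKey("default", "w1"))
	fmt.Println(heartbeatKey("w1"))
	fmt.Println(jobKey("job-1"), uniqueLockKey("abc"), scheduledKey, retryKey, dlqKey)
}
```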

The crux of "no lost jobs on crash" is that a job id is always in exactly one list. Dequeue uses an atomic move:

BRPOPLPUSH jq:queue:default jq:processing:default:{workerID} <timeout>

This atomically pops the id from the pending list and pushes it onto the worker's processing list, returning the id. Now:

  • Success: LREM jq:processing:... 1 {id} and DEL jq:job:{id}.
  • Retryable failure: compute backoff, ZADD jq:retry <runAt> {id}, then LREM it from processing. (Done atomically in a Lua script so the id is never in both retry and processing.)
  • Terminal failure: LPUSH jq:dlq {id}, mark status=dead, LREM from processing.
  • Crash: the id stays in jq:processing:default:{workerID}. The recovery janitor notices the worker's heartbeat has expired (or the entry is older than the visibility timeout) and RPOPLPUSH'es the id back onto the pending queue.

Multi-step ack/retry/DLQ transitions are written as Lua scripts (EVAL) so they are atomic against concurrent workers and the scheduler.

RPOPLPUSH/BRPOPLPUSH are deprecated in favor of LMOVE/BLMOVE in Redis 6.2+. Use BLMOVE ... RIGHT LEFT in new code; the semantics are identical. The doc and ADR should call this out.

Alternative: Redis Streams + consumer groups

Instead of lists you can use a Redis Stream per queue with a consumer group:

  • XADD jq:stream:default * id {id} to enqueue.
  • XREADGROUP GROUP workers {consumer} COUNT n BLOCK t STREAMS jq:stream:default > to dequeue. Delivered-but-unacked entries sit in the consumer's Pending Entries List (PEL).
  • XACK jq:stream:default workers {entryID} on success.
  • XAUTOCLAIM jq:stream:default workers {consumer} {minIdleTime} 0 to recover entries whose original consumer died (idle longer than the visibility timeout). This replaces the manual processing-list janitor.

Streams give you built-in recovery (PEL + XAUTOCLAIM), consumer-group fan-out, and an inspectable backlog at the cost of more complex semantics and trimming concerns. The ADR (below) compares the two; the recommended starting point for learning is the Lists approach because the moving parts are explicit.

API Design

The public library API (in pkg/jobqueue) is the deliverable other apps import. It is deliberately close in spirit to asynq.

package jobqueue

import (
    "context"
    "errors"
    "time"
)

// ---- Producer side ----------------------------------------------------------

// Task is what a producer creates and enqueues.
type Task struct {
    typ     string
    payload []byte
}

func NewTask(typ string, payload []byte) *Task { /* ... */ }
func (t *Task) Type() string                   { return t.typ }
func (t *Task) Payload() []byte                 { return t.payload }

// Client enqueues tasks. Safe for concurrent use.
type Client struct{ /* redis client, opts */ }

func NewClient(redisOpt RedisConnOpt) (*Client, error)

// Enqueue persists the task and returns info about the created job.
func (c *Client) Enqueue(ctx context.Context, task *Task, opts ...Option) (*JobInfo, error)

func (c *Client) Close() error

// JobInfo is returned from Enqueue (id, queue, state, run-at).
type JobInfo struct {
    ID     string
    Queue  string
    Status Status
    RunAt  time.Time
}

// ---- Enqueue options --------------------------------------------------------

type Option interface{ apply(*enqueueConfig) }

func WithQueue(name string) Option            // target queue / priority
func WithMaxRetries(n int) Option             // attempt cap
func WithDelay(d time.Duration) Option        // run after now+d
func WithRunAt(t time.Time) Option            // run at an absolute time
func WithUniqueFor(ttl time.Duration) Option  // dedup window (idempotent enqueue)
func WithUniqueKey(key string) Option         // explicit dedup key
func WithTimeout(d time.Duration) Option      // per-handler execution timeout

// ---- Consumer side ----------------------------------------------------------

// Job is the read-only view a handler receives.
type Job struct{ /* ... fields mirror the persisted Job ... */ }

func (j *Job) Type() string
func (j *Job) Payload() []byte
func (j *Job) Attempt() int   // 0-based attempt count, for idempotency logic

// Handler processes one job. Returning nil acks; returning an error triggers
// retry/DLQ. Returning errors wrapped with SkipRetry sends straight to DLQ.
type Handler interface {
    ProcessJob(ctx context.Context, job *Job) error
}
type HandlerFunc func(ctx context.Context, job *Job) error
func (f HandlerFunc) ProcessJob(ctx context.Context, j *Job) error { return f(ctx, j) }

// SkipRetry, wrapped around a handler error, marks it non-retryable.
var SkipRetry = errors.New("skip retry for this job")

// ServeMux routes by task type, like http.ServeMux.
type ServeMux struct{ /* ... */ }
func NewServeMux() *ServeMux
func (m *ServeMux) Handle(typ string, h Handler)
func (m *ServeMux) HandleFunc(typ string, fn HandlerFunc)

// Server runs the worker pool.
type Server struct{ /* ... */ }

type Config struct {
    Concurrency     int            // max simultaneous handlers
    Queues          map[string]int // queue -> weight (priority)
    ShutdownTimeout time.Duration  // graceful drain deadline
    RetryPolicy     RetryFunc      // backoff strategy
}

func NewServer(redisOpt RedisConnOpt, cfg Config) *Server

// Run blocks, processing jobs until Shutdown is called or a termination
// signal (SIGTERM/SIGINT) arrives, then drains in-flight work within
// ShutdownTimeout.
func (s *Server) Run(mux *ServeMux) error
func (s *Server) Shutdown(ctx context.Context) error

// RetryFunc computes the delay before the next attempt.
type RetryFunc func(attempt int, err error, job *Job) time.Duration

Typical producer:

client, err := jobqueue.NewClient(jobqueue.RedisConnOpt{Addr: "localhost:6379"})
if err != nil {
    log.Fatal(err)
}
payload, _ := json.Marshal(WelcomeEmail{UserID: 42})
info, err := client.Enqueue(ctx, jobqueue.NewTask("email:welcome", payload),
    jobqueue.WithQueue("critical"),
    jobqueue.WithMaxRetries(5),
    jobqueue.WithUniqueFor(10*time.Minute),
)
if err != nil {
    log.Fatal(err)
}
log.Printf("enqueued job %s on queue %s", info.ID, info.Queue)

Typical worker:

mux := jobqueue.NewServeMux()
mux.HandleFunc("email:welcome", func(ctx context.Context, j *jobqueue.Job) error {
    var p WelcomeEmail
    if err := json.Unmarshal(j.Payload(), &p); err != nil {
        return fmt.Errorf("%w: bad payload: %v", jobqueue.SkipRetry, err)
    }
    return sendWelcomeEmail(ctx, p.UserID) // idempotent
})

srv := jobqueue.NewServer(redisOpt, jobqueue.Config{
    Concurrency:     20,
    Queues:          map[string]int{"critical": 6, "default": 3, "low": 1},
    ShutdownTimeout: 30 * time.Second,
})
// Only treat a non-nil error as fatal; a clean drain returns nil.
if err := srv.Run(mux); err != nil {
    log.Fatal(err)
}

Admin surface (CLI + HTTP)

A small inspector for operators, exposed via the producer/admin CLI and an optional HTTP endpoint:

producer enqueue --type email:welcome --queue default --payload '{"user":1}'
producer stats                  # depth per queue, scheduled count, active count
producer dlq list               # list dead jobs (id, type, last_error, attempts)
producer dlq requeue <jobID>    # move a dead job back to its queue
producer dlq purge              # clear the DLQ

HTTP (read-only) for dashboards/health:

GET /healthz            -> 200 if worker is alive and Redis reachable
GET /stats              -> JSON: per-queue depth, scheduled, active, dlq
GET /metrics            -> Prometheus exposition (client_golang)

Tech Stack

| Concern | Choice |
| --- | --- |
| Language | Go 1.22+ (log/slog, errors.Join) |
| Redis client | github.com/redis/go-redis/v9 |
| Metrics | github.com/prometheus/client_golang |
| Structured logging | log/slog (stdlib) |
| Backoff | github.com/cenkalti/backoff/v4 or hand-rolled exponential + jitter |
| Unit tests (fake Redis) | github.com/alicebob/miniredis/v2 |
| Integration tests | github.com/testcontainers/testcontainers-go (+ modules/redis) |
| Assertions | github.com/stretchr/testify |
| IDs | github.com/oklog/ulid/v2 or github.com/google/uuid |
| CLI | stdlib flag, or github.com/spf13/cobra for subcommands |

Note: miniredis supports Lua EVAL, lists, sorted sets, and hashes, so the reliable-queue logic can be unit tested without a real server. It does not faithfully emulate BLMOVE blocking timeouts, so blocking-dequeue behavior is best covered by the testcontainers integration tests.

Reference implementation to study (do not vendor — build your own): github.com/hibiken/asynq.

Implementation Milestones

  • Project skeleton: go.mod, layout, Redis connection helper, slog setup.
  • Basic enqueue / dequeue: Client.Enqueue -> pending list; worker BLMOVE loop -> processing list -> ack with LREM. Job persisted as a hash.
  • Typed handler dispatch: ServeMux.Handle, route by Type, unknown type handling, per-job timeout via context.WithTimeout.
  • Retries + backoff: exponential backoff with full jitter, MaxRetries cap, LastError capture, SkipRetry sentinel, retry ZSET, atomic Lua transition from processing -> retry.
  • Dead-letter queue: terminal failures moved to DLQ with metadata; admin dlq list/requeue/purge.
  • Delayed jobs + scheduler: WithDelay/WithRunAt, scheduled ZSET, scheduler goroutine promoting due jobs (ZRANGEBYSCORE + atomic move).
  • Stale-job recovery: worker heartbeats, janitor that re-queues orphaned in-flight ids past the visibility timeout (or XAUTOCLAIM).
  • Unique / idempotent enqueue: SET NX EX dedup keys, WithUniqueFor, release on completion.
  • Queue priorities: weighted dequeue across critical/default/low.
  • Metrics: counters, gauges, histogram; /metrics endpoint; queue-depth collector.
  • Graceful shutdown: signal handling, stop dequeue, drain within deadline, requeue-on-timeout safety.
  • Admin / inspector: CLI + read-only HTTP /stats, /healthz.
  • Polish: examples, README, Docker Compose, Grafana dashboard.

Testing Strategy

  • Unit tests against miniredis — fast, in-process. Cover key layout, serialization, scheduler promotion logic, backoff computation, retry/DLQ Lua transitions, dedup-key behavior, and admin queries. Run on every commit.
  • Integration tests against real Redis via testcontainers-go — spin up a throwaway redis:7 container, exercise full enqueue -> process -> ack and the blocking BLMOVE dequeue path that miniredis can't fully emulate. Verify priority ordering and scheduler timing against a real server.
  • Crash-recovery test (the headline test) — enqueue a job; start a worker whose handler blocks/sleeps; forcibly cancel/kill the worker (or simulate by abandoning the processing-list entry and expiring the heartbeat) while the job is in-flight; start the recovery janitor + a fresh worker; assert the job is re-queued and ultimately processed. This proves "no lost jobs on crash".
  • Idempotency tests — deliver the same job twice (force a duplicate delivery) and assert the handler's observable side effect happens once when guarded by a dedup key / Attempt-aware logic.
  • At-least-once / retry tests — handler fails N-1 times then succeeds; assert exactly the expected number of attempts and backoff scheduling; handler always fails -> assert it lands in the DLQ with LastError.
  • Race detector — run the worker-pool and scheduler tests with go test -race to catch data races in the concurrent dequeue/ack paths.
  • Metrics assertions — use prometheus/client_golang's testutil (CollectAndCompare, ToFloat64) to assert jobs_processed_total, jobs_failed_total, jobs_retried_total, and queue-depth gauge move as expected.
  • Graceful-shutdown test — start processing, send shutdown, assert in-flight jobs finish and no new jobs are dequeued after the drain begins.

Target: meaningful coverage of internal/queue and internal/worker; CI runs unit tests always and integration tests behind a build tag / when Docker is available.

Deployment

  • Docker images: a multi-stage Dockerfile builds static binaries for both cmd/worker and cmd/producer from one source tree (build target selectable via build arg or two stages). Final images are scratch/distroless-based.
  • docker-compose (deploy/docker-compose.yml) brings up:
      • redis:7 (with an appendonly yes / AOF note for durability),
      • one or more worker services,
      • prometheus scraping each worker's /metrics,
      • grafana provisioned with the jobqueue dashboard.
  • Horizontal scaling: workers are stateless; run N replicas (docker compose up --scale worker=4 or a Kubernetes Deployment). They coordinate purely through Redis; each has a unique worker id for its processing list and heartbeat.
  • Concurrency config: Concurrency and per-queue weights are set via env (JQ_CONCURRENCY, JQ_QUEUES=critical:6,default:3,low:1). Tune total throughput = replicas × concurrency.
  • Liveness via metrics / health: /healthz for liveness/readiness probes; alert in Prometheus when queue_depth grows unbounded (consumers can't keep up) or when jobs_failed_total rate spikes (DLQ filling).
  • Operational notes: Redis is the single point of failure; document running Redis with AOF persistence and/or replication, and the visibility-timeout vs handler-duration tradeoff (too short -> duplicate processing of slow jobs; too long -> slow recovery after a crash).
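
Parsing the JQ_QUEUES format shown above into the Config.Queues map is straightforward. The sketch below is a minimal version; parseQueues is a hypothetical helper, and rejecting non-positive weights is a design choice of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseQueues parses a spec such as "critical:6,default:3,low:1" into a
// queue -> weight map. Whitespace around entries is ignored.
func parseQueues(spec string) (map[string]int, error) {
	queues := make(map[string]int)
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		name, weightStr, ok := strings.Cut(part, ":")
		if !ok || name == "" {
			return nil, fmt.Errorf("invalid queue entry %q: want name:weight", part)
		}
		weight, err := strconv.Atoi(weightStr)
		if err != nil || weight <= 0 {
			return nil, fmt.Errorf("invalid weight in %q: must be a positive integer", part)
		}
		queues[name] = weight
	}
	if len(queues) == 0 {
		return nil, errors.New("no queues configured")
	}
	return queues, nil
}

func main() {
	q, err := parseQueues("critical:6,default:3,low:1")
	fmt.Println(q, err) // map[critical:6 default:3 low:1] <nil>
}
```

In cmd/worker this would be fed from os.Getenv("JQ_QUEUES"), with a sensible default when the variable is unset.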

Documentation Deliverables

  • README: what it is, quickstart (docker compose up), a runnable producer + worker example, configuration reference, and the admin CLI usage.
  • Delivery-semantics doc: a short docs/delivery-semantics.md stating plainly that this system is at-least-once, that handlers must be idempotent, why exactly-once is not offered, and how dedup keys narrow the window.
  • ADR (Redis Lists vs Streams): docs/adr/0001-lists-vs-streams.md comparing BRPOPLPUSH/BLMOVE reliable lists against Streams consumer groups (XREADGROUP + XACK + XAUTOCLAIM) on recovery ergonomics, inspectability, memory/trimming, fan-out, and complexity, with the decision and its consequences.
  • Grafana dashboard JSON: deploy/grafana/dashboards/jobqueue.json with panels for queue depth per queue, processing rate, failure/retry rate, p50/p95 latency, and DLQ size.
  • Godoc: package-level doc comments on pkg/jobqueue with the producer and worker examples above as runnable Example functions.

Stretch Goals / Future Improvements

  • Rate limiting per queue — token-bucket (Redis-backed) so a queue processes at most R jobs/sec regardless of worker count (e.g. respect a third-party API limit).
  • Job dependencies / workflows — fan-out/fan-in: a job that enqueues children and a "group" that completes only when all children finish (chains, groups, chords).
  • Cron-style periodic jobs — a scheduler component that enqueues jobs on a cron spec (@every 5m, 0 3 * * *) with leader election so only one worker enqueues each tick.
  • Web dashboard — a small UI (like asynqmon) over the admin API to browse queues, scheduled jobs, and the DLQ, and to requeue/cancel.
  • Exactly-once-ish via dedup keys — persist a processed-job marker keyed by job id with a TTL so re-deliveries become no-ops at the framework level (still fundamentally at-least-once, but with framework-level dedup).
  • Pluggable broker backends — formalize the Broker interface and add a second backend (e.g. Postgres SELECT ... FOR UPDATE SKIP LOCKED, or in-memory for tests) to prove the abstraction.
  • OpenTelemetry tracing — propagate a trace context from enqueue through processing so a job's whole life is one distributed trace; export spans + the existing Prometheus metrics via OTel.

Lessons-Learned Prompts

  1. Why is exactly-once delivery practically impossible in a system like this, and what does at-least-once force every handler author to do?
  2. Walk through exactly what happens to an in-flight job when a worker is kill -9'd. Which Redis key holds it, who notices, and how is it recovered? What could still go wrong (e.g. the side effect already happened)?
  3. How did you make a handler idempotent? Where does the dedup key live, what is its TTL, and what is the failure mode if the TTL is too short or too long?
  4. Why exponential backoff with jitter rather than a fixed retry delay or plain exponential backoff? What problem does the jitter solve under load?
  5. What is your "visibility timeout" (how long a job may sit in processing before recovery re-queues it), and how does its value trade off duplicate processing against recovery latency?
  6. How does the worker pool apply backpressure when handlers are slower than the enqueue rate, and how would you detect and alert on a queue that is falling behind?

Portfolio & Resume

Resume Bullets

  • Built a Redis-backed distributed job queue in Go (go-redis/v9) providing at-least-once delivery with a reliable-queue (BLMOVE processing-list) design, automatic crash recovery of orphaned in-flight jobs, and a dead-letter queue for poison messages.
  • Implemented exponential-backoff-with-jitter retries, delayed/scheduled jobs via a Redis sorted set, idempotent (deduplicated) enqueue, and per-queue priorities behind an ergonomic Enqueue / Register / Run library API.
  • Instrumented the system with Prometheus metrics (throughput, failure & retry rates, queue-depth gauges, processing-latency histogram) and shipped a Docker Compose stack with Grafana dashboards; validated reliability with testcontainers-go integration and crash-recovery tests under -race.

Interview Talking Points

  • The reliable-queue pattern: why an atomic BLMOVE pending -> processing (formerly BRPOPLPUSH) keeps a job in exactly one list and makes crash recovery possible, vs a naive LPOP that loses jobs on crash.
  • At-least-once vs exactly-once: why you chose at-least-once, why exactly-once is a myth without idempotent consumers, and how dedup keys narrow (but don't close) the duplicate window.
  • Idempotency: concrete strategies — natural idempotency, dedup markers keyed by job id, and Attempt-aware handlers.
  • Dead-letter queues: why poison messages must be quarantined rather than retried forever, and how operators inspect and requeue them.
  • Backoff + jitter: preventing thundering-herd retries against a failing downstream by spreading retry times.
  • Observability & backpressure: which metrics tell you the system is healthy, how queue depth signals that consumers can't keep up, and how the bounded worker pool applies backpressure.
  • Lists vs Streams: the tradeoffs that drove the ADR, and when Streams' built-in XAUTOCLAIM recovery would be worth the added complexity.