Day 137 — Retries & Dead-Letter

Month 5 · Week 4

🎯 Learning Objective

Make job processing resilient: retry transient failures with exponential backoff + jitter under a budget, classify permanent failures, and route exhausted jobs to a dead-letter queue.

📚 Topics

  • Exponential backoff with a cap; full jitter (thundering herd)
  • Retry budget; transient vs permanent error classification
  • Dead-letter queue (DLQ); idempotency for at-least-once delivery

📖 Reading / Sources

📝 Notes

  • Not every error should be retried. Classify first: a transient failure (timeout, Unavailable, 503) is worth retrying; a permanent one (malformed payload, InvalidArgument, 400) will never succeed, so retrying only burns the budget. Dead-letter it immediately → [[error-classification]].
  • Exponential backoff: wait base * 2^attempt, clamped to a max ceiling. Compute it overflow-safe (double iteratively, cap in-loop) so a large attempt count doesn't wrap a time.Duration → [[backoff]].
  • Add jitter. Without it, many clients that failed together retry in lockstep and re-spike the downstream (the thundering herd). "Full jitter" picks a uniform delay in [0, backoff], decorrelating retries → [[jitter]] [[thundering-herd]].
  • A retry budget (max attempts) bounds the work; after it's spent, move the job to a dead-letter queue for human inspection instead of looping forever or silently dropping it → [[dead-letter-queue]].
  • Wrap the final cause with %w so the DLQ record stays inspectable via errors.Is/errors.As; check sentinels with errors.Is(err, ErrPermanent) rather than string matching → [[error-wrapping]].
  • Retries make delivery at-least-once, so handlers must be idempotent: dedupe on a job ID / idempotency key so re-processing the same job twice is harmless → [[idempotency]] [[at-least-once]].
  • Respect context during the wait: sleep with a time.Timer inside a select on ctx.Done() so a shutdown interrupts the backoff immediately instead of blocking for seconds → [[context]].
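
The classification and wrapping notes above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the examples directory: ErrPermanent is the sentinel named in the notes, while classify, shouldRetry, and the status-code mapping are hypothetical and chosen only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPermanent is the sentinel named in the notes: an error wrapping it
// should go straight to the dead-letter queue.
var ErrPermanent = errors.New("permanent failure")

// classify (hypothetical) turns an HTTP-style status code into an error.
// Assumed mapping for this sketch: 400 is permanent, everything else is
// treated as possibly transient.
func classify(status int) error {
	if status == 400 {
		return fmt.Errorf("status %d: %w", status, ErrPermanent)
	}
	return fmt.Errorf("status %d", status)
}

// shouldRetry (hypothetical) reports whether another attempt is worthwhile:
// there must be an error, and it must not be marked permanent.
func shouldRetry(err error) bool {
	return err != nil && !errors.Is(err, ErrPermanent)
}

func main() {
	for _, status := range []int{400, 503} {
		// Wrapping again with %w keeps the sentinel reachable for
		// errors.Is, so no string matching is needed.
		wrapped := fmt.Errorf("job 42: %w", classify(status))
		fmt.Printf("%v | retry=%t permanent=%t\n",
			wrapped, shouldRetry(wrapped), errors.Is(wrapped, ErrPermanent))
	}
}
```

Treating every unclassified error as retryable matches the retry loop in this entry; a stricter policy would retry only errors explicitly marked transient.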

💻 Code Examples

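// Retry with overflow-safe backoff + full jitter + context-aware wait.
// Runnable end-to-end in examples/month-05/retry.
// Imports assumed: context, errors, fmt, math/rand, time.
// handle, fullJitter, ErrPermanent, and core.Job are assumed to be defined elsewhere in the example.
func processWithRetry(ctx context.Context, j core.Job, maxAttempts int, r *rand.Rand) error {
    // maxDelay and maxAttempts avoid shadowing the cap and max builtins.
    const base, maxDelay = 100 * time.Millisecond, 2 * time.Second
    if maxAttempts < 1 {
        // Without this guard the loop never runs and a nil error gets wrapped.
        return fmt.Errorf("maxAttempts must be >= 1, got %d", maxAttempts)
    }
    var last error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        last = handle(ctx, j)
        if last == nil {
            return nil // success
        }
        if errors.Is(last, ErrPermanent) {
            return last // don't retry; the caller dead-letters
        }
        if attempt == maxAttempts-1 {
            break // don't sleep after the final try
        }
        // Uniform in [0, min(maxDelay, base*2^attempt)].
        delay := fullJitter(attempt, base, maxDelay, r)
        t := time.NewTimer(delay)
        select {
        case <-t.C:
        case <-ctx.Done():
            t.Stop()
            return ctx.Err()
        }
    }
    return fmt.Errorf("exhausted %d attempts: %w", maxAttempts, last) // → DLQ
}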

Full backoff schedule, jitter, and DLQ hand-off are runnable: examples/month-05/retry · Run: go run ./examples/month-05/retry
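
The retry loop above calls a fullJitter helper that is not shown in this entry. One way it could look is sketched below, assuming math/rand and the same argument order as the call site; backoff and naiveBackoff are hypothetical helper names, and the real implementation in the examples directory may differ. The naive shift is included only to show the overflow that the iterative version avoids.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// backoff returns base*2^attempt clamped to ceiling. It doubles one step
// at a time and stops before a doubling would pass the ceiling, so the
// value can never wrap around.
func backoff(attempt int, base, ceiling time.Duration) time.Duration {
	if base <= 0 || ceiling <= 0 {
		return 0
	}
	if base > ceiling {
		return ceiling
	}
	d := base
	for i := 0; i < attempt; i++ {
		if d > ceiling/2 { // 2*d would exceed the ceiling
			return ceiling
		}
		d *= 2
	}
	return d
}

// fullJitter picks a delay uniformly in [0, backoff(attempt)].
func fullJitter(attempt int, base, ceiling time.Duration, r *rand.Rand) time.Duration {
	d := backoff(attempt, base, ceiling)
	if d <= 0 {
		return 0
	}
	if d == math.MaxInt64 { // d+1 would overflow; Int63 already covers [0, MaxInt64]
		return time.Duration(r.Int63())
	}
	return time.Duration(r.Int63n(int64(d) + 1))
}

// naiveBackoff is the overflow-prone version: shifting wraps to a
// negative duration once base*2^attempt no longer fits in an int64.
func naiveBackoff(attempt int, base time.Duration) time.Duration {
	return base << attempt
}

func main() {
	const base, ceiling = 100 * time.Millisecond, 2 * time.Second
	r := rand.New(rand.NewSource(1)) // fixed seed only to make the demo repeatable
	for attempt := 0; attempt <= 6; attempt++ {
		fmt.Printf("attempt %d: backoff=%v jittered=%v\n",
			attempt, backoff(attempt, base, ceiling), fullJitter(attempt, base, ceiling, r))
	}
	fmt.Println("naive shift at attempt 37:", naiveBackoff(37, base))
}
```

With a 100 ms base, the naive shift first goes negative at attempt 37, which a timer treats as "fire immediately".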

🏋️ Exercises / Practice

Exercise Status Link
Exponential backoff + full jitter (overflow-safe) exercises/month-05/week-4/backoff
Retry budget + dead-letter routing + error wrapping exercises/month-05/week-4/deadletter

🐛 Mistakes Made

  • Computed backoff as base << attempt and a high attempt count overflowed to a negative duration → instant retries. Switched to an iterative double-and-cap that can't overflow.
  • Retried an InvalidArgument (bad payload) the full budget before dead-lettering. Added permanent-error classification to short-circuit.
  • Slept with time.Sleep(delay); shutdown then hung for seconds. Replaced with a Timer + select on ctx.Done().
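
The Timer + select fix from the last item can be pulled into a small helper. A minimal sketch follows; sleepCtx and waitWithDeadline are hypothetical names, and the context timeout stands in for a real shutdown signal.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepCtx waits for d or until ctx is done, whichever comes first.
// It returns nil after a full wait and ctx.Err() when interrupted.
func sleepCtx(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop() // release the timer if ctx wins the race
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// waitWithDeadline is a demo helper: it runs sleepCtx under a context
// that expires after timeout.
func waitWithDeadline(timeout, d time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return sleepCtx(ctx, d)
}

func main() {
	start := time.Now()
	err := waitWithDeadline(20*time.Millisecond, 5*time.Second)
	fmt.Printf("interrupted: err=%v, returned well before 5s: %t\n",
		err, time.Since(start) < time.Second)

	fmt.Println("full wait: err =", waitWithDeadline(time.Second, 10*time.Millisecond))
}
```

Unlike time.Sleep, the wait returns as soon as the context is cancelled, so a shutdown is not held up by a pending backoff.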

❓ Open Questions

  • Should the DLQ live in Redis (a separate list) or a durable store, and what re-drive (replay) tooling does it need?

🧠 Active Recall (answer without looking)

  1. Q: Why add jitter to exponential backoff, and what does "full jitter" compute?
     A: Without jitter, clients that failed at the same time retry at the same instants and re-overload the downstream (thundering herd). Full jitter picks a random delay uniformly in [0, min(cap, base*2^attempt)], spreading retries out so load is smoothed.
  2. Q: Retries give at-least-once delivery. What property must the handler have, and how do you get it?
     A: Idempotency: processing the same job more than once must have the same effect as processing it once. Achieve it by deduping on a stable job ID / idempotency key (e.g. an INSERT … ON CONFLICT DO NOTHING, or a "seen" set) so a re-delivered job is a no-op.
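
The "seen" set from the second answer can be sketched in memory as below. This is an illustration only: ledger, newLedger, and credit are hypothetical names, and a real handler would keep the dedupe record in a durable store and commit it atomically with the side effect (for example in one database transaction).

```go
package main

import (
	"fmt"
	"sync"
)

// ledger applies each credit at most once per job ID. The applied set and
// the balance share one mutex, so recording the ID and performing the
// effect happen as a single step.
type ledger struct {
	mu      sync.Mutex
	applied map[string]bool
	balance int
}

func newLedger() *ledger {
	return &ledger{applied: make(map[string]bool)}
}

// credit adds amount for jobID and reports whether it was applied.
// A repeated jobID is a no-op and returns false.
func (l *ledger) credit(jobID string, amount int) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.applied[jobID] {
		return false // duplicate delivery: already processed
	}
	l.applied[jobID] = true
	l.balance += amount
	return true
}

func main() {
	l := newLedger()
	fmt.Println("first delivery applied:", l.credit("job-1", 50))
	fmt.Println("redelivery applied:", l.credit("job-1", 50))
	fmt.Println("other job applied:", l.credit("job-2", 25))
	fmt.Println("balance:", l.balance)
}
```

The second delivery of job-1 changes nothing, which is what makes a retry after an ambiguous failure safe.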

🪶 Feynman Reflection

Retrying is like re-knocking on a door that didn't answer — but you wait longer each time (backoff) and at a slightly random moment (jitter) so a crowd doesn't all knock together. You only re-knock if it's worth it (transient error), and after a fixed number of tries you slip a note under the door for someone to look at later (dead-letter) instead of knocking forever. And because you might knock twice, whoever's inside must handle a repeat knock gracefully (idempotency).

🕳️ Knowledge Gaps

  • Choosing max-attempts and cap values from real downstream SLOs rather than guessing.

✅ Summary

I can classify transient vs permanent failures, retry with overflow-safe exponential backoff and full jitter under a budget, interrupt the wait on context cancellation, and dead-letter exhausted jobs with a wrapped, inspectable error — keeping at-least-once delivery safe via idempotent handlers.

⏭️ Next Steps / Prep for Tomorrow

  • Day 138: lock the behavior down with tests and expose operational visibility via metrics.

Time spent: 90 min · Difficulty: 🟦🟦🟦⬜⬜ · Confidence: 🟦🟦🟦⬜⬜

Suggested commit: feat(worker): retry with backoff/jitter & dead-letter queue (day 137)