Day 137: Retries & Dead-Letter
Month 5 · Week 4
🎯 Learning Objective
Make job processing resilient: retry transient failures with exponential backoff + jitter under a budget, classify permanent failures, and route exhausted jobs to a dead-letter queue.
📚 Topics
- Exponential backoff with a cap; full jitter (thundering herd)
- Retry budget; transient vs permanent error classification
- Dead-letter queue (DLQ); idempotency for at-least-once delivery
📖 Reading / Sources
- AWS Builders' Library: Timeouts, retries, and backoff with jitter
- Go blog: Error handling and Go (`%w`, sentinels)
- `errors` package (`Is`/`As`/`Unwrap`)
- Enterprise Integration Patterns: Dead Letter Channel
📝 Notes
- Not every error should be retried. Classify first: a transient failure (timeout, `Unavailable`, 503) is worth retrying; a permanent one (malformed payload, `InvalidArgument`, 400) will never succeed, so retrying just burns the budget. Dead-letter it immediately → [[error-classification]].
- Exponential backoff: wait `base * 2^attempt`, clamped to a `max` ceiling. Compute it overflow-safe (double iteratively, cap in-loop) so a large attempt count doesn't wrap a `time.Duration` → [[backoff]].
- Add jitter. Without it, many clients that failed together retry in lockstep and re-spike the downstream (the thundering herd). "Full jitter" picks a uniform delay in `[0, backoff]`, decorrelating retries → [[jitter]] [[thundering-herd]].
- A retry budget (max attempts) bounds the work; after it's spent, move the job to a dead-letter queue for human inspection instead of looping forever or silently dropping it → [[dead-letter-queue]].
- Wrap the final cause with `%w` so the DLQ record stays inspectable via `errors.Is`/`errors.As`; check sentinels with `errors.Is(err, ErrPermanent)` rather than string matching → [[error-wrapping]].
- Retries make delivery at-least-once, so handlers must be idempotent: dedupe on a job ID / idempotency key so processing the same job twice is harmless → [[idempotency]] [[at-least-once]].
- Respect context during the wait: sleep with a `time.Timer` inside a `select` on `ctx.Done()` so a shutdown interrupts the backoff immediately instead of blocking for seconds → [[context]].
💻 Code Examples
    // Retry with overflow-safe backoff + full jitter + context-aware wait.
    // Runnable end-to-end in examples/month-05/retry.
    // Needs: context, errors, fmt, math/rand, time. handle, fullJitter, and
    // ErrPermanent are assumed to be defined alongside this function.
    func processWithRetry(ctx context.Context, j core.Job, maxAttempts int, r *rand.Rand) error {
        const (
            base     = 100 * time.Millisecond
            maxDelay = 2 * time.Second
        )
        if maxAttempts < 1 {
            return fmt.Errorf("maxAttempts must be >= 1, got %d", maxAttempts)
        }
        var last error
        for attempt := 0; attempt < maxAttempts; attempt++ {
            last = handle(ctx, j)
            if last == nil {
                return nil // success
            }
            if errors.Is(last, ErrPermanent) {
                return last // don't retry; the caller dead-letters
            }
            if attempt == maxAttempts-1 {
                break // don't sleep after the final try
            }
            delay := fullJitter(attempt, base, maxDelay, r) // uniform in [0, min(maxDelay, base*2^attempt)]
            t := time.NewTimer(delay)
            select {
            case <-t.C: // backoff elapsed; try again
            case <-ctx.Done():
                t.Stop()
                return ctx.Err() // shutdown interrupts the wait
            }
        }
        return fmt.Errorf("exhausted %d attempts: %w", maxAttempts, last) // → DLQ
    }
Full backoff schedule, jitter, and DLQ hand-off are runnable: `examples/month-05/retry` · Run: `go run ./examples/month-05/retry`
🏋️ Exercises / Practice
| Exercise | Status | Link |
|---|---|---|
| Exponential backoff + full jitter (overflow-safe) | ✅ | exercises/month-05/week-4/backoff |
| Retry budget + dead-letter routing + error wrapping | ✅ | exercises/month-05/week-4/deadletter |
🐛 Mistakes Made
- Computed backoff as `base << attempt` and a high attempt count overflowed to a negative duration → instant retries. Switched to an iterative double-and-cap that can't overflow.
- Retried an `InvalidArgument` (bad payload) the full budget before dead-lettering. Added permanent-error classification to short-circuit.
- Slept with `time.Sleep(delay)`; shutdown then hung for seconds. Replaced with a `Timer` + `select` on `ctx.Done()`.
❓ Open Questions
- Should the DLQ live in Redis (a separate list) or a durable store, and what re-drive (replay) tooling does it need?
🧠 Active Recall (answer without looking)
- Q: Why add jitter to exponential backoff, and what does "full jitter" compute?
  A: Without jitter, clients that failed at the same time retry at the same instants and re-overload the downstream (thundering herd). Full jitter picks a random delay uniformly in `[0, min(max, base*2^attempt)]`, spreading retries out so load is smoothed.
- Q: Retries give at-least-once delivery. What property must the handler have, and how do you get it?
  A: Idempotency: processing the same job more than once must have the same effect as once. Achieve it by deduping on a stable job ID / idempotency key (e.g. an `INSERT … ON CONFLICT DO NOTHING`, or a "seen" set) so a re-delivered job is a no-op.
🪶 Feynman Reflection
Retrying is like re-knocking on a door that didn't answer — but you wait longer each time (backoff) and at a slightly random moment (jitter) so a crowd doesn't all knock together. You only re-knock if it's worth it (transient error), and after a fixed number of tries you slip a note under the door for someone to look at later (dead-letter) instead of knocking forever. And because you might knock twice, whoever's inside must handle a repeat knock gracefully (idempotency).
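Handling a repeat knock gracefully can be sketched with a "seen" set keyed by job ID. Everything here is illustrative: `Deduper` and `Once` are hypothetical names, and the in-memory map stands in for whatever durable, shared store a real worker pool would need.

```go
package main

import "sync"

// Deduper remembers job IDs that have already been processed successfully.
// The map is in-memory only (assumption for illustration); it is lost on restart
// and is not shared between processes.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// NewDeduper returns an empty Deduper.
func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// Once runs fn only if jobID has not completed before, and reports whether fn ran.
// The ID is recorded only after fn succeeds, so a failed job stays retryable.
// Holding the lock while fn runs serializes all jobs; that keeps the sketch
// short but would be too coarse for a real worker.
func (d *Deduper) Once(jobID string, fn func() error) (bool, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, done := d.seen[jobID]; done {
		return false, nil // duplicate delivery: no-op
	}
	if err := fn(); err != nil {
		return true, err
	}
	d.seen[jobID] = struct{}{}
	return true, nil
}
```

Delivering the same job ID twice runs the side effect once; the second call returns `false, nil` without invoking the handler.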
🕳️ Knowledge Gaps
- Choosing max-attempts and cap values from real downstream SLOs rather than guessing.
✅ Summary
I can classify transient vs permanent failures, retry with overflow-safe exponential backoff and full jitter under a budget, interrupt the wait on context cancellation, and dead-letter exhausted jobs with a wrapped, inspectable error — keeping at-least-once delivery safe via idempotent handlers.
⏭️ Next Steps / Prep for Tomorrow
- Day 138: lock the behavior down with tests and expose operational visibility via metrics.
| Time spent | Difficulty | Confidence |
|---|---|---|
| 90 min | 🟦🟦🟦⬜⬜ | 🟦🟦🟦⬜⬜ |
Suggested commit: feat(worker): retry with backoff/jitter & dead-letter queue (day 137)