u/Signal-Shoe-6670

Retry amplification is one of the most common failure modes I see in event-driven systems
▲ 26 r/Backend

Retry amplification is one of the most common failure modes I see in event-driven systems

One fundamental pattern I keep running into in distributed systems is what I’d call “retry amplification.”

A downstream service starts failing intermittently, and upstream systems retry aggressively.

Instead of containing the issue, you end up with:

- duplicate processing

- inconsistent state

- cascading load across services

The original failure wasn’t the problem (at high volumes failures will always be there). The issue is the retry strategy amplifies the failure.

What’s worked better in practice:

- idempotency at service boundaries

- staged retry queues with backoff

- dead letter queues for persistent failures

- making workflow state explicit so it can resume cleanly

Curious how others handle it at scale besides these ways.

I wrote a more detailed breakdown here (including a simple Excalidraw diagram):

https://norafoundry.dev/papers/how-i-evaluate-distributed-systems

u/Signal-Shoe-6670 — 1 day ago