Azure’s infrastructure is genuinely reliable. That’s exactly the problem. The more stable the platform, the easier it is to mistake platform health for system health, and that gap is where the expensive outages live. Availability is an architectural choice, not a SKU.
Retries are load, not safety. Without exponential backoff and jitter, your retry logic doesn’t protect against outages, it causes them. This post covers the mechanics of retry storms, five anti-patterns found in real production code, and what correct retry design actually looks like across layered Azure architectures.