The Architecture Review You Actually Need (But Never Get)

March 31, 2026 in Architecture, Engineering Practice | Reading time: 9 minutes

Most architecture reviews are compliance theatre. They arrive late, check the wrong things, and produce feedback nobody acts on. Here’s what a review that actually challenges a design looks like, and why the problem is structural, not personal.

Everyone Owns Cloud Security. That's Why Nobody Does.

March 24, 2026 in Security, Architecture | Reading time: 9 minutes

165 organisations got hit in the Snowflake breach, and the attackers used nothing novel: just stolen credentials, no MFA, and nobody watching. The shared responsibility model didn’t fail technically. It failed organisationally. Security wrote the policy. Engineering assumed someone reviewed it. The platform team figured ‘managed’ meant secured. Procurement filed the SOC 2 and called it done. Nobody lied. Nobody was negligent. They just each assumed someone else had it.

Azure Will Stay Up. Your System Is a Different Story.

February 17, 2026 in Azure, Architecture | Reading time: 15 minutes

Azure’s infrastructure is genuinely reliable. That’s exactly the problem. The more stable the platform, the easier it is to mistake platform health for system health, and that gap is where the expensive outages live. Availability is an architectural choice, not a SKU.

Your DR Plan Has Never Been Tested

February 10, 2026 in Azure, Architecture, Site Reliability Engineering | Reading time: 17 minutes

Most Azure DR tests confirm the secondary came up. They don’t confirm your RTO is real, your RPO commitment holds under load, or that failback won’t silently destroy the incident window. Here’s how to test DR honestly, with exit criteria that actually prove the plan works.

The Hidden Cost of 'Retry Everything': How Naive Retry Logic Creates a Self-Inflicted DDoS

February 3, 2026 in Azure, Architecture, Site Reliability Engineering | Reading time: 21 minutes

Retries are load, not safety. Without exponential backoff and jitter, your retry logic doesn’t protect against outages; it causes them. This post covers the mechanics of retry storms, five anti-patterns found in real production code, and what correct retry design actually looks like across layered Azure architectures.

Autoscaling Is Not a Recovery Strategy

January 20, 2026 in Azure, Architecture, Site Reliability Engineering | Reading time: 20 minutes

Autoscaling is not a recovery strategy. It’s an elasticity tool, and knowing the difference is what separates teams that survive incidents from teams that just watch their instance count go up while users experience the outage anyway.