KQL for Adults: Writing Queries That Don't Lie to You

March 17, 2026 in Azure, Site Reliability Engineering | Reading time: 8 minutes

Most KQL running in production is subtly wrong. It uses the wrong operators, leaves subqueries unscoped, and drives alert rules that silently miss events because of ingestion latency. Here’s how to write queries you can actually defend.

Why Your Application Gateway Logs Don't Tell the Whole Story (Until You Correlate Them)

March 10, 2026 in Azure, Site Reliability Engineering | Reading time: 8 minutes

Access logs, firewall logs, backend health, and metrics each tell a partial truth about what Application Gateway is doing. Here’s how they mislead you in isolation, and the KQL that fixes that.

Your Alerts Are a Product. They're Just a Bad One.

February 24, 2026 in Azure, Systems Design, Site Reliability Engineering | Reading time: 14 minutes

Alert fatigue isn’t a people problem; it’s a product design failure. Your on-call engineers are the users. Here’s why noisy alerts are biologically inevitable under bad design, and what treating alerting as a product actually looks like.

Your DR Plan Has Never Been Tested

February 10, 2026 in Azure, Architecture, Site Reliability Engineering | Reading time: 17 minutes

Most Azure DR tests confirm the secondary came up. They don’t confirm your RTO is real, your RPO commitment holds under load, or that failback won’t silently destroy the incident window. Here’s how to test DR honestly, with exit criteria that actually prove the plan works.

The Hidden Cost of 'Retry Everything': How Naive Retry Logic Creates a Self-Inflicted DDoS

February 3, 2026 in Azure, Architecture, Site Reliability Engineering | Reading time: 21 minutes

Retries are load, not safety. Without exponential backoff and jitter, your retry logic doesn’t protect against outages; it causes them. This post covers the mechanics of retry storms, five anti-patterns found in real production code, and what correct retry design actually looks like across layered Azure architectures.

Autoscaling Is Not a Recovery Strategy

January 20, 2026 in Azure, Architecture, Site Reliability Engineering | Reading time: 20 minutes

Autoscaling is an elasticity tool, not a recovery strategy. Knowing the difference is what separates teams that survive incidents from teams that just watch their instance count go up while users experience the outage anyway.