The Azure SRE Agent has been running against our production AKS cluster for one quarter. Three months. About 90 incidents. The agent has been on every one of them as a co-investigator, and we now have enough data to talk about what it actually contributes versus what we wanted it to contribute.
Short version: it's an excellent first responder and a mediocre root-cause finder. Both halves of that sentence matter.
The setup, briefly
The Azure SRE Agent integrates with Azure Monitor, App Insights, AKS, and our incident tooling. When an alert fires, the agent picks it up, runs a battery of diagnostic queries (pod state, recent deploys, error rate trends, log signatures), and posts a structured triage summary to the on-call channel within 30-45 seconds.
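For flavor, here is the kind of error-rate check we're talking about: a minimal sketch using the azure-monitor-query SDK, not the agent's internals. The workspace ID and the KQL itself are placeholders for our setup.

```python
# Sketch: per-service 5xx rate over the last hour, the kind of check the
# agent runs on alert pickup. Workspace ID and KQL are placeholders.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"

KQL = """
AppRequests
| where TimeGenerated > ago(1h)
| summarize total = count(), errors = countif(ResultCode startswith "5")
    by AppRoleName, bin(TimeGenerated, 5m)
| extend error_rate = todouble(errors) / total
| order by TimeGenerated desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(
    workspace_id=WORKSPACE_ID, query=KQL, timespan=timedelta(hours=1)
)

for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```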
The summary always has the same shape: what's broken, what changed in the last 24 hours that might be related, what the agent is "most worried about", and a suggested next action. The on-call engineer gets that summary before they've even opened their laptop.
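We ended up mirroring that shape in our own tooling. A sketch of the structure; the field names are our labels for the four sections, not the agent's actual schema:

```python
# Our internal mirror of the agent's triage summary.
# Field names are our labels, not the agent's schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TriageSummary:
    incident_id: str
    posted_at: datetime
    whats_broken: str                  # the symptom, as the agent states it
    recent_changes: list[str] = field(default_factory=list)  # deploys/config, last 24h
    most_worried_about: str = ""       # the agent's top hypothesis; one of many
    suggested_next_action: str = ""
```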
That's the win. The summary is the win.
The wins, with numbers
Time-to-first-investigation dropped from a median of 4m 12s to 38 seconds. That's not "the incident is resolved faster" — it's "the on-call has context faster". Resolution time hasn't moved much, but the experience of being on call is dramatically less stressful when the first 4 minutes aren't spent figuring out which dashboard to open.
It catches deploy-vs-incident correlation we miss. Out of 90 incidents, the agent flagged 17 as "very likely caused by deploy X within the last Y minutes." We confirmed 14 of them. That's an 82% precision rate, which is honestly better than ours during 3am alerts.
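The correlation logic is simple in principle. A sketch of the window check; the 45-minute window is my guess at roughly what the agent does, not documented behavior:

```python
# Flag deploys that landed shortly before an incident started.
# The 45-minute window is a guess, not the agent's documented behavior.
from datetime import datetime, timedelta

SUSPECT_WINDOW = timedelta(minutes=45)

def suspect_deploys(incident_start: datetime,
                    deploys: list[tuple[str, datetime]]) -> list[str]:
    """Return deploy names that finished within the window before the incident."""
    return [
        name for name, finished_at in deploys
        if timedelta(0) <= incident_start - finished_at <= SUSPECT_WINDOW
    ]
```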
It writes the postmortem timeline. This was unexpected. The agent's triage messages, in order, form a credible incident timeline. After we stopped having to write timelines from scratch, postmortem authoring time went from ~2 hours to ~30 minutes. Quality went up, not down — the agent's timestamps are precise in a way ours never were.
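Turning those messages into a timeline is nearly mechanical. A sketch of what we paste into the postmortem doc, assuming each message carries a timestamp and text:

```python
# Render ordered triage messages as a postmortem timeline section.
from datetime import datetime

def render_timeline(messages: list[tuple[datetime, str]]) -> str:
    lines = ["## Timeline", ""]
    for ts, text in sorted(messages, key=lambda m: m[0]):
        lines.append(f"- {ts:%H:%M:%S} UTC: {text}")
    return "\n".join(lines)
```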
The misses, also with numbers
It does not find root causes that require domain knowledge. The clearest example: a customer reported intermittent 500s on a specific endpoint. The agent ran the standard checks — pod health, error rate, dependency latency. All fine. The actual cause was a Postgres connection pool exhaustion that only triggered when a specific tenant's batch job ran against a specific shard. Finding that required someone who knows our tenancy architecture. The agent's triage report concluded "no anomalies detected" while the issue was actively happening.
It is overconfident about Kubernetes events. Out of 90 incidents, the agent flagged "Kubernetes node pressure" as the suspected cause 11 times. It was right twice. The other nine were red herrings — unrelated Warning events that happened to cluster around the incident window. The on-call learned to discount that specific signal, but for the first month it sent us down dead ends.
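The habit we built to counter it: when the agent blames node pressure, ask the nodes directly before chasing anything. A sketch with the official Kubernetes Python client:

```python
# Before trusting a "node pressure" hypothesis, check node conditions directly.
# Requires a kubeconfig with read access to the cluster.
from kubernetes import client, config

def nodes_under_pressure() -> list[str]:
    config.load_kube_config()  # config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    pressured = []
    for node in v1.list_node().items:
        for cond in node.status.conditions:
            # MemoryPressure / DiskPressure / PIDPressure are "False" on healthy nodes
            if cond.type.endswith("Pressure") and cond.status == "True":
                pressured.append(f"{node.metadata.name}: {cond.type}")
    return pressured
```

If this returns an empty list, the agent's node-pressure theory is a red herring and you can move on in about ten seconds.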
Cascading failures confuse it. When a downstream dependency degrades, the agent often flags the wrong service as the source. Three of our worst incidents last quarter had triage reports that pointed at the wrong layer. We caught it each time, but a less experienced on-call might have spent 20 minutes chasing the agent's lead before realizing it pointed the wrong way.
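Our manual heuristic for cascades is to keep walking downstream and blame the deepest degraded service rather than the first one that alarms. A sketch; the graph, error rates, and threshold are illustrative, and it assumes the dependency graph is acyclic:

```python
# Walk the dependency graph downstream from the alerting service and
# return the deepest degraded node: cascades usually start there.
# Assumes an acyclic graph; all inputs here are illustrative.

def deepest_degraded(service: str,
                     deps: dict[str, list[str]],
                     error_rate: dict[str, float],
                     threshold: float = 0.05) -> str:
    current = service
    while True:
        degraded_children = [
            d for d in deps.get(current, [])
            if error_rate.get(d, 0.0) > threshold
        ]
        if not degraded_children:
            return current  # nothing deeper is unhealthy; blame stops here
        # follow the unhealthiest child
        current = max(degraded_children, key=lambda d: error_rate[d])
```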
How we use the agent now (the workflow that emerged)
Always read the agent's triage. Never trust the conclusion. Treat the diagnostic summary as data — pod state, recent deploys, error rates. Treat the agent's "most worried about" line as one hypothesis among many.
Always ask: "what would the agent NOT have noticed?" This is the question that catches the Postgres-shard-tenant-batch-job class of incident. If the issue is anything specific to your business logic or your tenancy model, the agent isn't going to find it. You have to (a sketch of the kind of check that would have caught ours follows this list).
Always have the agent draft the postmortem timeline. This is its single highest-value contribution. Don't write timelines by hand anymore.
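For the connection-pool class specifically, the check that would have caught our worst miss is a direct look at pg_stat_activity per shard. A sketch; the DSNs, shard map, and pool limit are stand-ins for our real layout:

```python
# The check the agent couldn't run: per-shard Postgres connection counts.
# DSNs, shard map, and POOL_LIMIT are placeholders for our real layout.
import psycopg2

SHARDS = {
    "shard1": "postgresql://readonly@shard1.internal:5432/app",
    "shard4": "postgresql://readonly@shard4.internal:5432/app",
}
POOL_LIMIT = 100  # per-shard pool size; adjust to yours

def pool_saturation() -> dict[str, float]:
    usage = {}
    for name, dsn in SHARDS.items():
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT count(*) FROM pg_stat_activity WHERE state <> 'idle'"
                )
                (active,) = cur.fetchone()
        usage[name] = active / POOL_LIMIT
    return usage
```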
The thing the docs don't tell you
The agent's quality is heavily dependent on what telemetry you've wired into Azure Monitor. We had decent metrics, mediocre logs, and almost no distributed tracing when we started. The agent's triage was correspondingly thin.
When we instrumented OpenTelemetry traces on our service-to-service calls, the agent's ability to flag dependency-related incidents went up sharply. Same agent. Same prompts. More data.
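The wiring itself was small. A sketch using the azure-monitor-opentelemetry distro; the connection string is a placeholder (ours lives in Key Vault):

```python
# One-call OpenTelemetry setup that exports traces to Azure Monitor.
# The connection string is a placeholder.
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor(
    connection_string="InstrumentationKey=<key>;IngestionEndpoint=<endpoint>",
)

tracer = trace.get_tracer(__name__)

# Spans around service-to-service calls are what sharpened the agent's
# dependency triage; outbound HTTP via requests/urllib3 is auto-instrumented
# by the distro, and we added manual spans like this one for the rest.
with tracer.start_as_current_span("call-billing-service"):
    ...  # outbound call here
```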
If you're evaluating the SRE Agent and your tracing is weak, fix the tracing first. The agent without traces is a wrapper around kubectl describe. With traces, it's a junior SRE.
The cost question
Roughly $180/month for our usage. That's noise compared to the engineer-time saved on triage and postmortems. We did not do a rigorous ROI analysis. The on-call engineers asked us not to take it away. That was the analysis.
What I'd do next quarter
Build a custom prompt extension that injects our architecture diagram and tenancy model into the agent's context on every incident. The agent doesn't know that tenant 2847 lives on shard 4 and shares a connection pool with tenant 9001. If we feed it that, my hypothesis is that the "no anomalies detected" failure mode goes away.
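Purely hypothetical, since there's no public extension API for this yet, but the payload I have in mind looks something like:

```python
# Hypothetical: the tenancy context we'd inject into the agent's prompt.
# No public extension API exists for this yet; this is just the payload shape.
TENANCY = {
    "tenant:2847": {"shard": "shard4", "pool": "pool-a"},
    "tenant:9001": {"shard": "shard4", "pool": "pool-a"},  # shares the pool above
}

def tenancy_context(tenant_ids: list[str]) -> str:
    lines = ["Tenancy facts relevant to this incident:"]
    for t in tenant_ids:
        info = TENANCY.get(t)
        if info:
            lines.append(f"- {t} lives on {info['shard']} and uses {info['pool']}")
    return "\n".join(lines)
```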
I would not let the agent take any remediation action automatically. The Azure SRE Agent docs hint at this capability for the future. Even when it lands, we'll wire it through a Logic Apps approval gate (see article #20 — Human-in-the-loop approval gates). The cost of a false positive in remediation is too high.
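The gate contract itself is simple, whatever the transport; a generic sketch (the Logic Apps version is the subject of article #20):

```python
# The contract any auto-remediation would have to satisfy: propose, wait
# for a human, only then execute. Transport (Logic Apps, chat, etc.) varies.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedAction:
    description: str       # e.g. "restart deployment payments-api"
    execute: Callable[[], None]

def run_with_approval(action: ProposedAction,
                      request_approval: Callable[[str], bool]) -> bool:
    """request_approval blocks until a human approves or rejects."""
    if request_approval(action.description):
        action.execute()
        return True
    return False
```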
The honest conclusion
The agent has not made my team faster at solving incidents. It has made my team less tired during incidents, more accurate at correlating deploys, and dramatically faster at writing postmortems. Those three things are not what was promised in the marketing, but they're more valuable than what was promised.
Call it a 3.5-minute reduction in time-to-first-investigation, applied across 90 incidents a quarter at $200/hour of engineer time: roughly $1,000 a quarter in triage time alone, against about $540 in agent costs over the same period, and that's before counting the postmortem savings. Everything after is gravy.
