Our P2 incident playbook used to be a 14-step Confluence page. It pointed at four Azure portals, three KQL queries, two PowerShell scripts, and a Slack channel. The first 12 minutes of every incident were spent fetching the same context, in the same order, by whichever engineer drew the short straw.
We replaced the first 12 minutes with a single chat message. The infra: GitHub Copilot agent mode in VS Code, three internal MCP servers (Azure Monitor, Azure DevOps, AKS), and a small set of .github/instructions files that describe how to triage. This is the wiring and the four things we tightened along the way.
The shape
Slack alert ──▶ PagerDuty ──▶ Engineer opens VS Code
                                         │
                                         ▼
                        Copilot agent mode (Claude / GPT-5)
                                         │
              ┌──────────────────────────┼──────────────────────────┐
              ▼                          ▼                          ▼
         monitor-mcp                  ado-mcp                    aks-mcp
     (KQL, alerts, logs)       (work items, deploys)       (events, pods, top)
Three MCP servers, each thin (~400 LOC), each scoped to one platform. The agent picks tools by description, the engineer asks one question, the runbook starts itself.
The triage instruction file
<!-- .github/instructions/incident-triage.instructions.md -->
You are helping an on-call engineer triage a Severity 2 incident.
When the engineer says "triage <alertId>" or shares an alert link, do this:
1. Call `monitor.get_alert` with the alert ID.
2. Read the alert's signal type and resource ID.
3. If the resource is an AKS workload, call `aks.recent_events`
   on that namespace for the last 30 minutes and `aks.top_pods`
   for memory/CPU.
4. Call `ado_deployments_recent` for the same target with a
   60-minute window to check whether a deployment ran recently.
5. Call `monitor_run_saved_kql` with the saved query
   `IncidentTriage_${signalType}` if one exists.
Return a short summary of facts (no recommendations yet) in this shape:
- **Alert:** title, severity, fired at
- **Resource:** kind, name, region
- **Recent deploy:** yes/no, who, when
- **AKS state:** unhealthy pod count, top events
- **KQL:** one-line summary of the query result
Five steps, all retrieval. No recommendations: that's the deliberate choice. The engineer makes the call; the agent fetches the facts.
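If you want to check the agent's output against that shape (in an eval, say), the summary can be modelled as a plain type plus a formatter. This is a minimal sketch: the `TriageFacts` interface, its field names, and `formatTriageFacts` are hypothetical and only mirror the bullet list in the instruction file.

```typescript
// Hypothetical model of the retrieval-only summary; field names are illustrative.
interface TriageFacts {
  alert: { title: string; severity: string; firedAt: string };
  resource: { kind: string; name: string; region: string };
  recentDeploy?: { who: string; when: string }; // absent means "no recent deploy"
  aks?: { unhealthyPods: number; topEvents: string[] }; // only for AKS workloads
  kqlSummary?: string; // only when a saved query exists for the signal type
}

// Renders the facts in the same bullet shape the instruction file asks for.
function formatTriageFacts(f: TriageFacts): string {
  const lines: string[] = [
    `- **Alert:** ${f.alert.title}, ${f.alert.severity}, fired at ${f.alert.firedAt}`,
    `- **Resource:** ${f.resource.kind}, ${f.resource.name}, ${f.resource.region}`,
    f.recentDeploy
      ? `- **Recent deploy:** yes, ${f.recentDeploy.who}, ${f.recentDeploy.when}`
      : `- **Recent deploy:** no`,
  ];
  if (f.aks) {
    const events = f.aks.topEvents.length > 0 ? f.aks.topEvents.join("; ") : "no notable events";
    lines.push(`- **AKS state:** ${f.aks.unhealthyPods} unhealthy pods, ${events}`);
  }
  if (f.kqlSummary) {
    lines.push(`- **KQL:** ${f.kqlSummary}`);
  }
  return lines.join("\n");
}
```

Optional fields map onto the conditional steps: a non-AKS resource simply has no AKS line, rather than an invented one.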
The MCP tool that pulls saved KQL
{
  name: "monitor_run_saved_kql",
  description:
    "Runs a query saved in the Log Analytics workspace by name. " +
    "Use names like `IncidentTriage_Availability` or `IncidentTriage_Latency`.",
  inputSchema: {
    type: "object",
    required: ["workspaceId", "queryName"],
    properties: {
      workspaceId: { type: "string" },
      queryName: { type: "string" },
      bindings: { type: "object", description: "params for parameterised queries" },
    },
  },
}
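// Assumes `credential` is a TokenCredential from @azure/identity and that
// args.workspaceId is the workspace's full ARM resource ID (leading "/").
async function monitorRunSavedKql(args: SavedKqlArgs) {
  const armToken = await credential.getToken("https://management.azure.com/.default");
  const laToken = await credential.getToken("https://api.loganalytics.io/.default");

  // Small helper for ARM GETs that fails loudly instead of parsing an error body.
  const armGet = async (path: string) => {
    const r = await fetch(`https://management.azure.com${path}?api-version=2020-08-01`, {
      headers: { Authorization: `Bearer ${armToken.token}` },
    });
    if (!r.ok) throw new Error(`ARM GET ${path} failed: ${r.status}`);
    return r.json();
  };

  // Load the query body from the workspace's savedSearches. The last path
  // segment is the saved search ID (assumed here to match the query name).
  const saved = await armGet(
    `${args.workspaceId}/savedSearches/${encodeURIComponent(args.queryName)}`
  );
  const queryBody = applyBindings(saved.properties.query, args.bindings);

  // The query endpoint needs the workspace GUID (customerId), which lives on
  // the workspace resource, not on the saved search.
  const workspace = await armGet(args.workspaceId);

  // Run it over the last hour
  const res = await fetch(
    `https://api.loganalytics.io/v1/workspaces/${workspace.properties.customerId}/query`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${laToken.token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ query: queryBody, timespan: "PT1H" }),
    }
  );
  if (!res.ok) throw new Error(`Log Analytics query failed: ${res.status}`);
  const data = await res.json();
  return summariseKqlResult(data);
}
Two steps: fetch the saved search, then run it (plus a workspace lookup to resolve the customer ID the query endpoint expects). The advantage: the KQL itself lives in the workspace, version-controlled by Bicep, owned by SRE. The MCP server is the engine, not the source of truth.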
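The two helpers the handler calls, `applyBindings` and `summariseKqlResult`, are not shown above. Here is a minimal sketch of what they might look like. It assumes a hypothetical `{{name}}` placeholder syntax in the saved query and the `tables` / `columns` / `rows` layout that the Log Analytics query API returns. The placeholder convention and the summary shape are illustrative choices, not a description of the production helpers.

```typescript
type BindingValue = string | number | boolean;

// Turn a binding into a KQL literal so values cannot break out of the query.
function toKqlLiteral(v: BindingValue): string {
  if (typeof v === "number") {
    if (!Number.isFinite(v)) throw new Error("non-finite number binding");
    return String(v);
  }
  if (typeof v === "boolean") return v ? "true" : "false";
  // Double-quoted KQL string: escape backslashes first, then quotes.
  return `"${v.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
}

// Replace {{name}} placeholders (hypothetical convention) with escaped literals.
function applyBindings(query: string, bindings: Record<string, BindingValue> = {}): string {
  return query.replace(/\{\{(\w+)\}\}/g, (_match, name: string) => {
    if (!Object.prototype.hasOwnProperty.call(bindings, name)) {
      throw new Error(`missing binding: ${name}`);
    }
    return toKqlLiteral(bindings[name]);
  });
}

interface KqlResponse {
  tables: { name: string; columns: { name: string; type: string }[]; rows: unknown[][] }[];
}

interface KqlSummary {
  rowCount: number;
  columns: string[];
  sample: Record<string, unknown>[];
}

// Keep only the row count, the column names, and a few rows from the first
// table, so the agent gets a compact result instead of the full payload.
function summariseKqlResult(data: KqlResponse, maxRows = 5): KqlSummary {
  const table = data.tables[0];
  if (!table) return { rowCount: 0, columns: [], sample: [] };
  const columns = table.columns.map((c) => c.name);
  const sample = table.rows.slice(0, maxRows).map((row) => {
    const record: Record<string, unknown> = {};
    columns.forEach((col, i) => {
      record[col] = row[i];
    });
    return record;
  });
  return { rowCount: table.rows.length, columns, sample };
}
```

Escaping bindings as literals matters here because the values come from the model, and an unescaped string would let a malformed argument rewrite the query.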
The deploy-correlation tool
{
  name: "ado_deployments_recent",
  description:
    "List Azure DevOps deployments to a target environment in the last N minutes. " +
    "Use to correlate an incident with a recent release.",
  inputSchema: {
    type: "object",
    required: ["project", "environment"],
    properties: {
      project: { type: "string" },
      environment: { type: "string" },
      lastMinutes: { type: "number", default: 60 },
    },
  },
}
The first time this returned a hit was the third real incident. The agent reported: "deploy payments-api/2026-04-12.3 rolled to prod 11 minutes before the alert fired, by name@example.com", and the engineer had a rollback window before they finished reading the alert.
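The correlation itself is simple arithmetic once the deployments are in hand. This sketch covers only that step, assuming a hypothetical `Deployment` record with ISO 8601 timestamps. It does not call the Azure DevOps API, and the field names are illustrative.

```typescript
// Hypothetical, simplified deployment record; not the Azure DevOps response shape.
interface Deployment {
  id: string;
  environment: string;
  finishedAt: string; // ISO 8601
  requestedBy: string;
}

interface CorrelatedDeployment extends Deployment {
  minutesBeforeAlert: number;
}

// Return deployments that finished within `lastMinutes` before the alert fired,
// closest to the alert first.
function deploysBeforeAlert(
  deploys: Deployment[],
  alertFiredAt: string,
  lastMinutes = 60
): CorrelatedDeployment[] {
  const fired = Date.parse(alertFiredAt);
  const windowMs = lastMinutes * 60000;
  const hits: CorrelatedDeployment[] = [];
  for (const d of deploys) {
    const deltaMs = fired - Date.parse(d.finishedAt);
    // Skip deploys that finished after the alert or outside the window.
    if (deltaMs >= 0 && deltaMs <= windowMs) {
      hits.push({ ...d, minutesBeforeAlert: Math.round(deltaMs / 60000) });
    }
  }
  return hits.sort((a, b) => a.minutesBeforeAlert - b.minutesBeforeAlert);
}
```

Deploys that finished after the alert are excluded on purpose: they cannot have caused it, and listing them would muddy the rollback decision.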
The four things we tightened
1. The triage instruction was too prescriptive
The first version told the agent to recommend a remediation step. It worked on simple incidents and was confidently wrong about half the time on complex ones. We cut it to retrieval-only. Recommendations come from the engineer reading the facts. Confidence calibration matters more than helpfulness during an active incident.
2. Tool descriptions were too long
I wrote tool descriptions like documentation. Models pick tools more reliably when descriptions are tight, declarative, and example-led:
- description: "Returns recent deployments. Useful for correlating an incident with a release. Filters by environment, time window, and pipeline name. Pagination supported."
+ description: "List Azure DevOps deployments to an environment in the last N minutes. Use to correlate an incident with a recent release."
The shorter version had a 22% better tool-pick rate on a fixed set of triage prompts.
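One way to measure a tool-pick rate on a fixed prompt set is to record which tool the agent called first for each prompt and compare it with the expected one. The `PickResult` shape and both functions below are hypothetical, not the harness behind the figure above. The sketch reports the change in percentage points and as a relative percentage, since the two are easy to confuse.

```typescript
// Hypothetical record of one eval prompt: which tool should have been picked,
// and which one the agent actually called first (null if it called none).
interface PickResult {
  prompt: string;
  expectedTool: string;
  pickedTool: string | null;
}

// Fraction of prompts where the agent picked the expected tool.
function toolPickRate(results: PickResult[]): number {
  if (results.length === 0) return 0;
  const hits = results.filter((r) => r.pickedTool === r.expectedTool).length;
  return hits / results.length;
}

// Compare two runs over the same prompt set.
function comparePickRates(before: PickResult[], after: PickResult[]) {
  const b = toolPickRate(before);
  const a = toolPickRate(after);
  return {
    before: b,
    after: a,
    absolutePoints: (a - b) * 100, // change in percentage points
    relativePct: b === 0 ? null : ((a - b) / b) * 100, // change relative to the old rate
  };
}
```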
3. The agent mode didn't see workflow context
Copilot agent mode runs in the editor, but incidents start in PagerDuty. We added a small VS Code extension that registers a URI handler, so paging an engineer now produces a link that opens VS Code with the alert ID prefilled in a Copilot prompt. From PagerDuty buzz to triage start: 15 seconds.
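The core of such a handler is turning the incoming link into a prompt string. A minimal sketch, assuming a hypothetical link of the form `vscode://<publisher>.<extension>/triage?alertId=...` and an alert ID made of simple characters. `UriLike` stands in for the `path` and `query` fields of `vscode.Uri` so that the function runs without the `vscode` module. The wiring in the trailing comment, including the chat command ID, is an assumption rather than the extension's actual code.

```typescript
// Stand-in for the two vscode.Uri fields used here (query has no leading "?").
interface UriLike {
  path: string;
  query: string;
}

// Build the prefilled prompt, or return undefined if the link is not a valid
// triage link. The strict character check keeps arbitrary text in a link from
// being pasted into the prompt.
function triagePromptFromUri(uri: UriLike): string | undefined {
  if (uri.path !== "/triage") return undefined;
  const alertId = new URLSearchParams(uri.query).get("alertId");
  if (!alertId || !/^[\w.:-]+$/.test(alertId)) return undefined;
  return `triage ${alertId}`;
}

// Inside the extension's activate(), the wiring might look roughly like this
// (command ID assumed, not verified against the extension):
//
//   vscode.window.registerUriHandler({
//     handleUri(uri) {
//       const query = triagePromptFromUri(uri);
//       if (query) vscode.commands.executeCommand("workbench.action.chat.open", { query });
//     },
//   });
```

Validating the ID before it reaches the prompt is worth the extra line: the link arrives from outside the editor, so it should be treated as untrusted input.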
4. Read-only is non-negotiable for the first iteration
Every tool the agent has during an incident is read-only. No restart, no scale, no rollback. We will eventually let the agent suggest an action by drafting a runbook step the engineer one-clicks, but we will not let the agent execute anything during a live incident. The blast-radius asymmetry is too steep.
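A cheap way to keep that constraint from eroding is a startup check in each MCP server that refuses to register anything not explicitly marked read-only. This is a sketch under stated assumptions: `ToolDef` is a simplified tool definition, `readOnlyHint` is the MCP tool annotation of that name, and the verb list is an illustrative guess. Annotations are only hints, so real enforcement still belongs in the read-only permissions of the identity the server runs as.

```typescript
// Simplified tool definition; `annotations.readOnlyHint` follows the MCP tool
// annotation of the same name.
interface ToolDef {
  name: string;
  description: string;
  annotations?: { readOnlyHint?: boolean };
}

// Illustrative list of verbs that should never appear in an incident tool name.
const MUTATING_VERBS = /(^|_)(restart|scale|rollback|delete|patch|apply|create|update)(_|$)/;

// Throws at startup if any tool is not explicitly read-only or looks mutating.
function assertReadOnly(tools: ToolDef[]): ToolDef[] {
  for (const t of tools) {
    if (t.annotations?.readOnlyHint !== true) {
      throw new Error(`${t.name}: not marked readOnlyHint`);
    }
    if (MUTATING_VERBS.test(t.name)) {
      throw new Error(`${t.name}: name suggests a mutating action`);
    }
  }
  return tools;
}
```

Failing at registration time means a mutating tool added by mistake breaks the deploy of the MCP server, not an incident.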
Numbers, after a quarter
- Median time-to-context (alert fired → engineer has all triage facts): 14m → 4m
- Wrong-rollback rate (rollbacks of an unrelated deploy): unchanged (still ~5%); the agent doesn't make the call, the engineer still does
- MCP server uptime during incidents: 99.4% (the 0.6% was a separate Container Apps revision rollout colliding with an incident)
What I'd do differently
Add a post-incident summarise tool that takes the alert ID and a short engineer narrative and emits a draft postmortem with a timeline pulled from the same MCP servers. We're prototyping this now; the savings on writing time alone justify the work. The risk is the same as with recommendations during the incident: the agent must summarise facts, not editorialise.
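The timeline part of that draft is mostly a merge of timestamped facts from the three servers. A rough sketch of that step only, assuming a hypothetical `TimelineEvent` shape. It is not the prototype, and it adds no interpretation to the facts it orders.

```typescript
// Hypothetical event shape: one timestamped fact from one MCP server.
interface TimelineEvent {
  at: string; // ISO 8601
  source: "monitor" | "ado" | "aks";
  fact: string;
}

// Merge per-source event lists into one chronological, plain-text timeline.
function buildTimeline(...sources: TimelineEvent[][]): string {
  const all = ([] as TimelineEvent[]).concat(...sources);
  all.sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
  return all.map((e) => `${e.at} [${e.source}] ${e.fact}`).join("\n");
}
```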
I would NOT skip the read-only constraint just because the team has gotten comfortable with the agent. The day the agent decides on its own to "scale down a noisy ReplicaSet" at 3am during an incident is the day someone files a P0 about the agent.
