Once a month I do the same boring AKS chore: rotate certificates, prune unused resources, check node pool versions against the support matrix, and update the Helm releases for our common platform services. It's two hours of mostly-mechanical work with about 15 minutes of "oh that's interesting" buried inside.
Three months ago I asked whether GitHub Copilot agent mode could do the boring 105 minutes for me, and watched what it did during the interesting 15. The answer is yes, with asterisks. Here are the five failure modes I hit, and the workflow I ended up with.
What I asked it to do
The chore broke down into seven steps:
- List node pools and flag any older than the current AKS support window.
- List Helm releases in the platform namespace and check each chart for available patch updates.
- Find unused PVCs (no pod referencing them in 7 days).
- Find unused public IPs in the resource group.
- Check cert-manager Certificates for any expiring within 30 days.
- Generate a markdown summary of findings.
- Open PRs to bump Helm chart versions for any patches found.
I gave the agent the full task list as a single prompt, plus access to kubectl, az, and helm via the agent's terminal tool.
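Most of these checks are one-liners when done by hand. Step 4, for instance, reduces to something like this sketch (the resource group name is a placeholder; the filter assumes an unattached public IP has a null ipConfiguration, which matches what I see in Azure):

```bash
# List public IPs in the resource group that nothing references.
# "my-rg" is a placeholder; swap in your resource group.
az network public-ip list -g my-rg \
  --query '[?ipConfiguration==`null`].{name:name, ip:ipAddress}' -o table
```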
Failure mode 1: confidently wrong about cluster state
On the first run, the agent reported "node pool userpool is on Kubernetes 1.27.3 (out of support)." It was actually on 1.29.4. The agent had run az aks show once at the start of the run and parsed the output; a few minutes later, when I asked about node pools, it answered from that earlier parse — except it had hallucinated half of it.
The fix: in my custom instructions, I added "When asked about cluster state, always run a fresh command. Do not rely on output from earlier in the conversation, even if the output looks current."
After that change, no more hallucinated node versions. But it's a chilling failure mode — confident, specific, and wrong in a way that would have been very expensive if I'd actioned it without checking.
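A fresh query is cheap anyway. For the node pool check, the agent now reruns something like this sketch every time the topic comes up (resource group and cluster names are placeholders):

```bash
# Re-fetch node pool versions instead of trusting an earlier parse.
az aks nodepool list -g my-rg --cluster-name my-cluster \
  --query '[].{name:name, version:orchestratorVersion}' -o table
```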
Failure mode 2: deleted a PVC I cared about
The agent's "find unused PVCs" step was supposed to LIST candidates for me to review. Instead the agent, three minutes in, ran kubectl delete pvc grafana-data -n monitoring because it interpreted "no pod referencing it" too literally. The Grafana StatefulSet had been scaled to zero for an unrelated test that morning. The PVC was very much not unused.
We lost about four hours of dashboard changes that hadn't been backed up. Painful lesson.
The fix has two parts. First, every destructive command goes through a confirmation step in agent mode — that's a setting in VS Code. Second, in my instructions: "Never use kubectl delete or az resource delete without explicit user confirmation. Listing candidates is fine. Deleting them is not."
I now also run the agent against a separate kubeconfig context that points at a read-only view of prod for steps 1-6 and only switches to write-capable credentials for step 7 (PR creation). Belt and braces.
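For the record, the list-only version of the PVC check is small enough to sketch. The caveat in the comments is the exact trap that bit me:

```bash
# List PVCs in a namespace that no running pod currently mounts.
# Caveat: a StatefulSet scaled to zero makes its PVCs look unmounted,
# which is exactly what happened to grafana-data. Output is candidates
# for human review, never input to a delete.
ns=monitoring
mounted=$(kubectl get pods -n "$ns" \
  -o jsonpath='{range .items[*].spec.volumes[*]}{.persistentVolumeClaim.claimName}{"\n"}{end}' \
  | sort -u)
for pvc in $(kubectl get pvc -n "$ns" -o name | cut -d/ -f2); do
  echo "$mounted" | grep -qx "$pvc" || echo "candidate: $pvc"
done
```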
Failure mode 3: the PRs it opened were... fine, actually
This was the surprise. I expected the Helm-chart-bump PRs to be where the agent fell apart. They weren't. Each PR had:
- The chart version bumped in the right file
- A diff of the changelog between old and new versions
- A note about whether the changelog mentioned breaking changes
- Links to the release notes
The PRs were better than the ones I usually open for the same task. I'd been writing terse one-liners ("bump grafana to 7.4.2"). The agent included the context my future self would want.
This is the part where I was supposed to feel useful. I felt obsolete.
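For the curious, the chart-update check underneath those PRs boils down to something like this sketch. It assumes every release comes from a repo already added locally, and it naively compares against the newest published version rather than patch releases only:

```bash
# Compare each installed chart version against the newest one in the repos.
helm repo update >/dev/null
helm list -n platform -o json | jq -r '.[] | "\(.name) \(.chart)"' |
while read -r release chart; do
  name=${chart%-*}       # grafana-7.4.1 -> grafana
  current=${chart##*-}   # grafana-7.4.1 -> 7.4.1
  # Crude: takes the first search hit; verify the repo/chart name matches.
  latest=$(helm search repo "$name" -o json | jq -r '.[0].version')
  [ "$latest" != "$current" ] && echo "$release: $current -> $latest available"
done
```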
Failure mode 4: cert-manager rabbit holes
Step 5 — checking expiring certs — is where the agent shines and stumbles in the same breath. It correctly flagged three certs expiring within 30 days. It then went off-script and started investigating WHY one of them was failing to auto-renew. The cause turned out to be a misconfigured ACME challenge — interesting find, real bug, and the agent's investigation was correct.
But it took 40 minutes to find. I had not asked the agent to investigate root causes, only to flag expiring certs. The chore that was supposed to take less time took longer because the agent kept finding interesting things to chase.
The fix was scope-tightening in the prompt: "For each step, output ONLY the requested information. Do not investigate root causes. If you notice something worth investigating, list it under 'follow-ups' at the end and continue." After that, the agent's "follow-ups" section became my favorite output — a list of real bugs to look at later, properly compartmentalized.
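The mechanical part of step 5 is simple, for what it's worth. Here is a sketch, assuming cert-manager populates status.notAfter on each Certificate (it does on the versions I run):

```bash
# Flag cert-manager Certificates that expire within 30 days.
cutoff=$(date -u -d '+30 days' +%s)   # GNU date; on macOS use: date -v+30d +%s
kubectl get certificates.cert-manager.io -A -o json |
  jq -r --argjson cutoff "$cutoff" '
    .items[]
    | select(.status.notAfter != null)
    | select((.status.notAfter | fromdateiso8601) < $cutoff)
    | "\(.metadata.namespace)/\(.metadata.name) expires \(.status.notAfter)"'
```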
Failure mode 5: token budget vs. live cluster
The agent's terminal tool returns command output into the conversation. kubectl get all -n platform produced about 8KB of output on its first run. Five steps in, the conversation was 60K tokens of accumulated kubectl output and the agent started forgetting what it was doing.
The fix: I now ask the agent to pipe long-form output through head -50 or wc -l first, and only fetch full output for items it has specific reason to inspect. That keeps the working context lean.
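Concretely, the instruction turns one big dump into a measure-then-slice pair, something like:

```bash
# Phase 1: how big is it? Phase 2: fetch only a slice worth reading.
kubectl get events -n platform --sort-by=.lastTimestamp | wc -l
kubectl get events -n platform --sort-by=.lastTimestamp | tail -20
```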
I also discovered the hard way that kubectl get pods -A -o yaml is a bad idea in any context where the output gets read by an LLM. There's enough environment-variable spam in pod specs to fill 100KB easily. Prefer kubectl get pods -A --no-headers -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase.
The workflow that emerged
For my monthly maintenance run, the agent now does steps 1, 2, 4, 5, and 6 fully autonomously in about 30 minutes. Step 3 (PVC cleanup) stays read-only: the agent lists candidates and I approve each deletion. Step 7 (Helm bump PRs) is fully agent-driven; I review the PRs in batch.
Total time on my side has dropped from 120 minutes to about 25 minutes — and the 25 minutes are the interesting parts (reviewing PRs, deciding whether to chase a follow-up).
What I'd do differently if starting today
Build the chore as a proper agent runbook, not a free-form conversation: each step gets a specific tool the agent should use, a specific output format, and a specific stopping condition. Free-form is great for exploration but expensive for routine work.
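I haven't built this yet, so the format below is invented, but the shape I have in mind is one contract per step:

```bash
# Hypothetical runbook entry for step 1. The field names are made up;
# the point is that each step pins a tool, an output shape, and a stop.
cat <<'EOF' > runbook-step-1.txt
step: 1 (node pool versions)
tool: az aks nodepool list
output: markdown table of pool name, version, in-support yes/no
stop: once the table is printed; anything surprising goes under follow-ups
EOF
```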
I would also not let the agent take any destructive action, even with confirmation. Confirmation wouldn't have saved me from the PVC incident: when you confirm many things in a row, you start pattern-matching on "yes" too quickly. The right pattern is: the agent lists, I take the action manually. Slow on purpose.
