We run AKS in three regions: West Europe, East US, and Australia East. The promise of GitOps with Azure Arc is "one Git repo, three clusters, drift gets reconciled automatically." The reality is more interesting and considerably more annoying than the marketing copy suggests.
This is the story of a 14-hour drift-reconciliation incident, what we learned about Azure Arc's reconciliation behavior across regions, and the architecture we ended up with that survives reality.
The setup that caused the incident
Three AKS clusters, all Arc-attached. One Git repo (platform/k8s-config) with a subfolder per cluster: clusters/weu/, clusters/eus/, clusters/aue/. Each contained a Flux v2 Kustomization manifest pointing that cluster's Arc agent at its own path.
The shared baseline (everything that should be identical across clusters — RBAC, network policies, common services) lived in clusters/shared/. Each cluster's Kustomization referenced clusters/shared/ as a base and overlaid its region-specific differences.
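For concreteness, here's roughly what each cluster's Flux Kustomization looked like. This is a sketch, not our exact manifest — under Arc's flux extension these objects are generated from the fluxConfiguration, and the names, namespace, and interval below are illustrative:

# Flux v2 Kustomization for West Europe (illustrative sketch)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: weu
  namespace: flux-system
spec:
  interval: 1m              # each cluster's agent pulls on its own schedule
  path: ./clusters/weu      # the per-cluster subfolder
  prune: true               # anything in the cluster that left Git gets deleted
  sourceRef:
    kind: GitRepository
    name: k8s-config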
Looks fine on paper. Was fine for six months. Then it wasn't.
The incident
A platform engineer (me, on a Tuesday) merged a PR that updated a NetworkPolicy in clusters/shared/. The intent was to allow egress from a specific namespace to a new internal service.
What happened:
- West Europe cluster picked up the change in ~90 seconds, applied it, traffic worked.
- East US cluster picked up the change in ~2 minutes, applied it, traffic worked.
- Australia East cluster picked up the change in ~4 minutes... but the apply failed. The cluster's Arc agent had drifted into a state where it couldn't reach the Git repo.
I didn't notice immediately. The PR merged, two clusters accepted, I went to lunch.
While I was at lunch, an automated maintenance script in Australia East ran. It expected the new NetworkPolicy to exist (because the script itself had been updated to use the new policy). The policy didn't exist. The script created its own version of it — slightly different from what the GitOps version would have looked like.
After lunch, the Arc agent in Australia East recovered (a transient network issue with the storage backing the Arc state). It pulled the latest Git state. It saw that the NetworkPolicy existed but didn't match what Git said it should be. It "fixed" it — overwriting the version the maintenance script had created.
The maintenance script had made downstream resources that depended on its specific version of the policy. Those resources started failing.
By the time I noticed (~2 hours later), Australia East had:
- The Git-reconciled version of the NetworkPolicy (correct from the GitOps perspective)
- A bunch of pods crash-looping because they expected the maintenance-script's version
- The maintenance script attempting to "fix" this by re-creating its version every 5 minutes
- Arc reconciling it back to Git's version every 90 seconds
A reconciliation loop, with neither party "winning." Australia East was effectively offline for the affected workloads. Total time to identify and resolve: ~14 hours, half of which was finding the right people in the right time zones.
What I learned about Arc reconciliation
Arc's reconciliation interval is not synchronous across regions. Each cluster's agent pulls on its own schedule. A merge to Git lands on different clusters at different times. For a NetworkPolicy change that affects active traffic, this asymmetry can mean a window where some clusters allow traffic and others don't.
Arc reconciliation does not have a "stop if downstream is broken" check. It applies what's in Git regardless of whether the apply will break things. There is no built-in way to say "if this change is going to cause >5% of pods in this namespace to crash, halt and alert."
Drift between Arc-reconciled state and out-of-band changes is a war between two controllers. Whichever runs more recently wins the moment, then loses it 90 seconds later. The system never converges.
The Arc agent's own state can drift. The Arc agent stores its reconciliation state in a backing store. If that store has issues (disk pressure, network partition, etc.), the agent can lose track of "what I last applied" and either re-apply incorrectly or stop applying altogether.
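One habit that came out of this: check the agent's own record of what it last applied, which lives in the Kustomization status. A trimmed, illustrative example of what healthy looks like:

# kubectl get kustomization shared -n flux-system -o yaml (trimmed, illustrative)
status:
  lastAppliedRevision: weu-shared-stable@sha1:9f2c1ab
  conditions:
    - type: Ready
      status: "True"
      reason: ReconciliationSucceeded

If lastAppliedRevision stops advancing while Git moves on, or Ready flips to False against a stale revision, the agent's state has drifted and it's time to look at the backing store.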
The architecture we moved to
After the incident, we changed three things:
1. Per-region staging windows. Changes to clusters/shared/ no longer apply simultaneously across regions. The Kustomization in each cluster now references a region-specific Git tag (e.g., weu-shared-stable). The CI pipeline moves these tags one region at a time, with a 15-minute delay between regions and an automated rollback if any cluster reports unhealthy after a tag bump.
# clusters/weu/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # remote base pinned to the region's stable tag, moved by CI
  - https://github.com/[YOUR-COMPANY]/k8s-config//clusters/shared?ref=weu-shared-stable
patches:
  - path: patch.yaml   # region-specific overrides on top of the shared base
The weu-shared-stable tag is moved by CI after validation. Same for eus-shared-stable and aue-shared-stable. Effectively a canary deploy of GitOps changes, region by region.
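The promotion step itself is small. A sketch of the CI job, assuming GitHub Actions; check-cluster-health.sh is a hypothetical wrapper around our per-region health checks, and everything except the region order and the 15-minute delay is illustrative:

# .github/workflows/promote-shared.yaml (sketch)
jobs:
  promote:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0          # need full history and tags to retag
      - name: Promote region by region
        run: |
          for region in weu eus aue; do
            prev="$(git rev-parse "${region}-shared-stable")"   # remember rollback point
            git tag -f "${region}-shared-stable" "$GITHUB_SHA"
            git push -f origin "${region}-shared-stable"
            sleep 900                                           # 15-minute soak per region
            if ! ./ci/check-cluster-health.sh "$region"; then   # hypothetical health gate
              git tag -f "${region}-shared-stable" "$prev"      # automated rollback
              git push -f origin "${region}-shared-stable"
              exit 1
            fi
          done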
2. Drift-detection alerts that fire when reconciliation is fighting. A Prometheus rule in each cluster watches the Flux controllers' reconciliation metrics. If the same resource is reconciled more than 5 times in 10 minutes, that's a fight, not normal reconciliation. Page the on-call.
- alert: ArcReconciliationFight
  # more than 5 reconciles of the "shared" Kustomization within 10 minutes
  expr: increase(flux_reconcile_duration_seconds_count{name="shared"}[10m]) > 5
  for: 5m
This would have caught the Australia East incident in minutes, not hours.
3. Maintenance scripts no longer create their own versions of things. This was the actual root cause of the incident. The maintenance script that "expected the new NetworkPolicy" should have failed loudly when the policy didn't exist, not invented its own. We audited every maintenance script and pulled out 11 places where scripts were creating resources that GitOps was supposed to own. They now check-and-fail instead of check-and-create.
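The replacement pattern is a few lines. The policy name and namespace here are hypothetical:

# check-and-fail: assert GitOps-owned state, never create it
if ! kubectl get networkpolicy allow-internal-egress -n payments >/dev/null 2>&1; then
  echo "expected NetworkPolicy not reconciled yet; GitOps owns it, refusing to create" >&2
  exit 1
fi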
What I'd recommend if you're starting
One region first. Run Arc + Flux on a single AKS cluster for at least a month. Get comfortable with the reconciliation lag, the drift-detection patterns, and the failure modes. Add regions one at a time after that.
Tag-based promotion across regions. From day 1, design your Git layout so that each cluster references a tag, not the main branch. The cost of doing this on day 1 is small. The cost of refactoring to it after an incident is large.
Automated rollback gates. Every Git-tag bump is gated on cluster health checks. If cluster-wide pod restart counts (kubectl get pods -A) climb more than 5% in the 10 minutes after a bump, the tag is rolled back automatically.
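An equivalent gate can be expressed as a Prometheus rule rather than parsed kubectl output. A sketch, assuming kube-state-metrics is scraped; the comparison maps the 5% threshold onto the prior 10-minute window:

# fires when restarts in the 10 minutes after a bump exceed the prior window by >5%
- alert: PostBumpRestartSpike
  expr: |
    sum(increase(kube_pod_container_status_restarts_total[10m]))
      > 1.05 * sum(increase(kube_pod_container_status_restarts_total[10m] offset 10m))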
Audit your "out of band" changes. Any script, controller, or human action that mutates cluster state outside Git is a future drift fight. List them. Reduce them. The ones you can't reduce, document them so the on-call knows what's GitOps and what isn't.
What didn't help
More dashboards. Adding more Grafana panels for Arc reconciliation didn't help us catch the Australia East incident faster. The signal was buried in normal-looking reconciliation activity. The alert (above) caught it faster than any dashboard would have.
Stricter PR review. The PR that started the incident was reviewed and approved by two engineers. There was nothing wrong with the PR. The problem was the maintenance-script behavior nobody knew about. PR review didn't fail; documentation failed.
The bigger picture
Multi-region GitOps with Arc works, but it doesn't work the way the docs imply. The docs describe a steady-state reconciliation. Real life is the transient states between reconciliations, multiplied by N regions, multiplied by every controller that thinks it owns part of cluster state.
If you have one AKS cluster, GitOps is straightforward. If you have three, you have a distributed system, and distributed systems are harder than people remember.
