No. 06 · DevOps · Aug 9, 2025 · 11 min read

Migrating an AKS Cluster Off Flux v2 to the New ArgoCD Extension Without Dropping Reconciliation

When the ArgoCD extension for AKS hit GA at KubeCon Europe 2026, we had four production AKS clusters running Flux v2 GitOps and a long-standing internal preference for ArgoCD's UI for application-team developers. The migration sounded straightforward in the announcement: add the extension, point it at the same Git repo, switch over.

It was not straightforward. Here's what we did, what broke, and what I'd recommend to anyone planning the same migration on a live cluster.

The premise

Both Flux v2 and ArgoCD reconcile a Git repo into a Kubernetes cluster. In theory, swapping one for the other is just changing the controller. In practice, the two tools have meaningfully different opinions about:

  1. What "owns" a resource. Flux uses kustomize.toolkit.fluxcd.io/name labels; ArgoCD uses app.kubernetes.io/instance. If both controllers think they own the same resource, they fight, each reverting the other's changes on every reconcile (see the label check after this list).
  2. What "deletion" means. Flux v2's prune: true and ArgoCD's prune policy look the same but interact with finalizers differently.
  3. Drift. ArgoCD's "auto-sync with self-heal" is roughly equivalent to Flux's reconciliation but the latency profiles differ.
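
A quick way to see who claims a resource, assuming a hypothetical deployment named payments-api: Flux-managed objects carry the kustomize.toolkit.fluxcd.io labels, ArgoCD-tracked ones carry app.kubernetes.io/instance (or ArgoCD's tracking annotation).

# Dump labels and annotations and look for both controllers' markers:
kubectl get deployment payments-api -n payments \
  -o jsonpath='{.metadata.labels}{"\n"}{.metadata.annotations}{"\n"}'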

A naive migration — install ArgoCD, point at repo, uninstall Flux — produces a 5-minute window where neither controller owns anything and a fresh deploy will create duplicate resources.

The plan we used

Four phases over four weeks, one phase per week:

Week 1 — Dev cluster, big-bang switchover. Acceptable downtime. Used the dev cluster as a learning vehicle.

Week 2 — Staging cluster, parallel run. Both controllers active simultaneously, one watching, one applying.

Week 3 — First prod cluster, parallel run + cutover. Same pattern as staging but with a real cutover decision based on observed behavior.

Week 4 — Three remaining prod clusters. Repeat the validated pattern.

The key insight that made this work was the "parallel run" idea: Flux still reconciles, ArgoCD watches but doesn't apply. Compare what each thinks the desired state is. Catch the divergences before flipping the switch.
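
Concretely, the comparison step looks roughly like this; a minimal sketch assuming the flux and argocd CLIs are installed and a hypothetical kustomization/Application named platform:

# What Flux believes the desired state is, rendered locally:
flux build kustomization platform --path ./clusters/prod > flux-desired.yaml

# What ArgoCD would change if it were allowed to sync:
argocd app diff platform

# Any non-empty diff here is a divergence to chase down before cutover.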

The setup, concretely

Install the AKS extension:

az k8s-extension create \
  --name argocd \
  --extension-type Microsoft.ArgoCD \
  --cluster-name aks-prod-eus-01 \
  --resource-group rg-prod-eus-platform \
  --cluster-type managedClusters \
  --auto-upgrade-minor-version true
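
Worth confirming the extension actually provisioned before touching anything else; a quick check (the --query shape is just one way to slice the output):

az k8s-extension show \
  --name argocd \
  --cluster-name aks-prod-eus-01 \
  --resource-group rg-prod-eus-platform \
  --cluster-type managedClusters \
  --query "{state:provisioningState, version:version}"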

Configure ArgoCD to watch the same Git repo Flux watches, but with syncPolicy: { automated: null }: manual sync only, for now. The Application CRs are created, and ArgoCD shows the cluster's live state against the repo's desired state. You can see drift visually, but ArgoCD won't act on it.
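
A minimal sketch of what one of those watch-only Applications looks like; the repo URL, path, and names are placeholders, not our real layout:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform                # hypothetical
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/gitops.git   # placeholder
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: platform
  # syncPolicy.automated deliberately absent: ArgoCD computes and shows
  # drift but applies nothing until someone triggers a manual sync.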

We let this run for a week on staging. Three findings emerged:

  1. Secret handling differed. Flux v2 had been decrypting SOPS-encrypted secrets via sops-age at sync time. ArgoCD has no equivalent built in; it wanted them handled by the Sealed Secrets controller in-cluster, so we deployed the controller and re-sealed a handful of secrets.

  2. Helm chart values were merged differently. Flux's HelmRelease merges values from the CR, a ConfigMap, and a per-environment overlay. ArgoCD's Application declares its values in one place, so we had to re-flatten our values setup: about 200 lines of YAML across 12 charts.

  3. CRDs without finalizers behaved differently on resource updates. ArgoCD's default sync would patch them; Flux had been doing a delete-and-recreate. Switching ArgoCD to Replace=true for a specific subset of CRDs got us aligned (annotation sketched after this list).
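
The fix for finding 3 is a per-resource annotation; a sketch with a hypothetical CRD name:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com     # hypothetical
  annotations:
    # Sync with kubectl-replace semantics instead of a patch, matching
    # the update behavior the cluster had been getting from Flux.
    argocd.argoproj.io/sync-options: Replace=true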

The cutover

After the parallel run validated, the cutover was four steps (commands sketched after the list):

  1. Set ArgoCD Application syncPolicy.automated.selfHeal: true.
  2. Disable Flux reconciliation (flux suspend kustomization for every kustomization).
  3. Wait 24 hours. Watch error rates, deploy success rate, drift detection.
  4. Uninstall Flux: flux uninstall --silent.
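
A minimal CLI sketch of steps 1, 2, and 4, with a hypothetical Application name:

# Step 1: flip the Application to self-healing automated sync.
kubectl patch application platform -n argocd --type merge \
  -p '{"spec":{"syncPolicy":{"automated":{"selfHeal":true}}}}'

# Step 2: stop Flux reconciling without uninstalling it.
flux suspend kustomization --all

# Rollback path during the 24-hour soak, if ArgoCD misbehaves:
#   flux resume kustomization --all

# Step 4: only after the soak period looks clean.
flux uninstall --silent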

The 24-hour wait with Flux suspended-but-installed is the safety net. If ArgoCD misbehaves, you re-enable Flux and you're back in a known-good state in 90 seconds.

What broke despite the plan

Helm hook ordering. Three of our charts have helm.sh/hook annotations for pre-install/post-install jobs. Flux respects these annotations as Helm would. ArgoCD translates them into its own hook phases and sync waves, which don't map one-to-one onto Helm's hook weights. One chart's post-install job ran before the deployment had actually rolled out, hit a not-yet-ready endpoint, and crash-looped.

The fix was adding argocd.argoproj.io/sync-wave: "5" annotations on the relevant resources to coordinate the ordering (see the sketch below). Took an hour to figure out, ten minutes to fix.
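
The annotation lands on the hook resource itself; a sketch with a hypothetical job:

apiVersion: batch/v1
kind: Job
metadata:
  name: post-install-smoke-test   # hypothetical
  annotations:
    helm.sh/hook: post-install
    # Push the job into a later wave so ArgoCD only runs it once the
    # resources in earlier waves report healthy.
    argocd.argoproj.io/sync-wave: "5"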

ResourceQuota fights. One namespace had a ResourceQuota set by Flux's HelmRelease, and ArgoCD started "correcting" the limits because it interpreted Flux's last-applied state differently. Adding argocd.argoproj.io/compare-options: IgnoreExtraneous to that resource fixed it. One-line annotation.

Slow first sync after cutover. ArgoCD took ~12 minutes to do its first full reconciliation after we flipped it to automated sync. It felt long, but the controller logs showed it was working, just slow given the volume of resources to inventory. By the third sync it was down to ~30 seconds.
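
If you hit the same thing, it's easier to watch from the CLI than to keep refreshing the UI; this assumes the extension installs ArgoCD into the argocd namespace:

# App-level sync and health status, updating live:
kubectl get applications -n argocd -w

# The controller's own view of the inventory work:
kubectl logs -n argocd statefulset/argocd-application-controller -f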

The numbers

  • Total engineering time across 4 clusters: ~28 hours over 4 weeks.
  • Production downtime: zero. The parallel-run pattern paid for itself.
  • Resources reconciled across all four clusters: about 1,400 unique objects.
  • Issues found in dev: 3 (sealed secrets, helm values, CRD update behavior).
  • Issues found in production that weren't surfaced in dev: 2 (helm hooks, resource quota fights).

The "issues found in prod that dev didn't surface" is the cost of having dev workloads that don't exercise the same patterns as prod. Worth budgeting an extra 4-6 hours per prod cluster for unknown unknowns.

Why we did it

ArgoCD's UI is significantly better for application teams than Flux's CLI-and-Grafana setup. Self-service triage of "why is my deploy stuck?" went from ten Slack messages to "look at the ArgoCD UI for your namespace."

The ROI is in reduced platform-team interrupt rate, not in any technical capability we didn't have before. Flux v2 is excellent. We didn't migrate because Flux was bad; we migrated because ArgoCD's developer surface fit our team better.

What I'd do differently

Run the parallel run for two weeks on staging, not one. The Helm hook ordering issue would have shown up in week two if we'd waited.

I would NOT migrate during a freeze window or a quarter-end. The 24-hour wait period assumes you can roll back if needed; rollback during a code freeze means coordinating with everyone who has merges queued.

AKS · GitOps · ArgoCD · Flux
