Azure Container Registry (ACR) is the default registry for AKS workloads, and for most teams it's the right call — managed, integrated with Entra ID, geo-replicated. We ran on ACR for two years before deliberately migrating to a self-hosted Harbor instance.
I want to be honest: we didn't migrate because Harbor was technically superior. We migrated because we wanted vulnerability scanning across multiple registries (ACR, GHCR, Docker Hub) with a single policy engine, and Harbor's project-scoped vulnerability gates fit that need better than Microsoft Defender for Cloud's container scanning at the time.
If you're considering a similar migration — or just curious whether the ACR-to-Harbor path is worth it for your team — here's the honest report on what changed.
The migration shape
Our setup before:
- ACR `crprodeus.azurecr.io` with ~140 images
- AKS pulls via the AKS-attached ACR integration (no PAT, no secret)
- GitHub Actions builds and pushes via federated credentials
- Microsoft Defender for Cloud scans on push
Our setup after:
- Harbor running on AKS in its own namespace, behind an internal load balancer
- AKS pulls from `harbor.platform.internal/<project>/<image>`
- GitHub Actions builds and pushes via OIDC-issued robot account tokens
- Trivy scanner running in Harbor with project-level vulnerability gates
The migration ran for six weeks. Three of those weeks were the actual move. The other three were the things nobody warns you about.
Week 1-2: standing up Harbor properly
Harbor on AKS has a Helm chart that gets you 80% of the way there. The other 20% is everything that breaks when Harbor itself goes down.
Specifically: Harbor stores its registry data in object storage. We pointed it at an Azure Storage account. The storage account needed to be private-endpoint-only (we don't expose internet-facing ports on anything that stores image layers). The private endpoint, in turn, needed VNet line-of-sight from the cluster running Harbor. Then Harbor's PostgreSQL database needed the same treatment.
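For reference, the storage wiring is a handful of Helm values. A minimal sketch, assuming the goharbor/harbor-helm chart — the account and container names below are placeholders, and the key names are worth verifying against your chart version:

```shell
# values-storage.yaml: point Harbor's registry at an Azure Blob container.
# Key names per the goharbor/harbor-helm chart; names here are placeholders.
cat > /tmp/values-storage.yaml <<'EOF'
persistence:
  imageChartStorage:
    type: azure
    azure:
      accountname: stharborprod      # the private-endpoint-only storage account
      accountkey: "<from-key-vault>" # injected at deploy time, never committed
      container: registry
EOF
# Applied with something like:
#   helm upgrade --install harbor harbor/harbor -n harbor -f values-storage.yaml
```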
The Harbor HA story is also more your problem than the ACR HA story. If your AKS cluster goes down, Harbor goes down with it. We mitigated by running Harbor on a separate node pool with explicit pod anti-affinity, plus geo-replication of the storage account. Not as robust as ACR's geo-replication. Adequate for our needs.
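The anti-affinity piece is the stock Kubernetes stanza, set per Harbor component. A sketch of the shape (the exact chart keys vary by component and chart version, so treat this as illustrative):

```shell
# values-ha.yaml fragment: keep each component's replicas off the same node.
# "core" shown as one example; the same stanza applies to the other components.
cat > /tmp/values-ha.yaml <<'EOF'
core:
  replicas: 2
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: harbor
          topologyKey: kubernetes.io/hostname
EOF
```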
Week 3: the registry migration itself
Migrating images from ACR to Harbor is a crane (or regctl) copy loop:

```shell
# Copy every repo:tag from ACR into Harbor's "migrated" project.
for repo in $(az acr repository list -n crprodeus -o tsv); do
  for tag in $(az acr repository show-tags -n crprodeus --repository "$repo" -o tsv); do
    crane copy "crprodeus.azurecr.io/${repo}:${tag}" \
               "harbor.platform.internal/migrated/${repo}:${tag}"
  done
done
```
About 6 hours of runtime for our 140 images and ~2,400 tags. We ran this in a CronJob in AKS so the bandwidth stayed in-region.
Validation step: we kept ACR live for two weeks after migration. Both registries had identical images. Pulls went to Harbor; if anything was missing from Harbor, the image pull would fail loudly rather than silently fall back to ACR. (We did NOT configure registry mirrors. Mirrors hide problems.)
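If you want the validation to be active rather than passive, a second pass can diff digests instead of re-copying. A sketch using our registry names — `crane digest` resolves the manifest digest without pulling layers, so this is cheap to run:

```shell
#!/usr/bin/env bash
# Compare manifest digests between ACR and Harbor; misses and drift are loud.
set -uo pipefail
for repo in $(az acr repository list -n crprodeus -o tsv); do
  for tag in $(az acr repository show-tags -n crprodeus --repository "$repo" -o tsv); do
    src=$(crane digest "crprodeus.azurecr.io/${repo}:${tag}")
    dst=$(crane digest "harbor.platform.internal/migrated/${repo}:${tag}" 2>/dev/null) \
      || { echo "MISSING  ${repo}:${tag}"; continue; }
    [ "$src" = "$dst" ] || echo "MISMATCH ${repo}:${tag}"
  done
done
```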
Week 4: workload cutover
Updating image references in YAML across all our Helm charts and ArgoCD applications was the boring part — about 80 files touched, mostly find-and-replace. The interesting part was authentication.
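The find-and-replace was mechanical enough to script. A self-contained demo of the rewrite — the file path and image name are illustrative, but the substitution is the real one:

```shell
# Demo of the reference rewrite on one sample manifest; the real run hit ~80 files.
mkdir -p /tmp/cutover-demo
cat > /tmp/cutover-demo/deployment.yaml <<'EOF'
    image: crprodeus.azurecr.io/payments/api:1.4.2
EOF
# One substitution: ACR host -> Harbor host plus the "migrated" project prefix.
sed -i 's#crprodeus\.azurecr\.io/#harbor.platform.internal/migrated/#g' \
  /tmp/cutover-demo/deployment.yaml
cat /tmp/cutover-demo/deployment.yaml
# -> image: harbor.platform.internal/migrated/payments/api:1.4.2
```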
ACR's AKS integration meant the kubelet handled auth invisibly. Harbor needed an imagePullSecret per namespace. We standardized on a single secret per namespace with a robot account that had read-only pull permissions to the relevant Harbor projects. Created and managed by ASO + a custom controller.
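The per-namespace setup is two kubectl commands. A sketch — the robot account name and namespace are made up, and the `robot$project+name` username format is Harbor's convention for project-scoped robots:

```shell
# Create the pull secret from a read-only Harbor robot account (names illustrative).
kubectl create secret docker-registry harbor-pull \
  --docker-server=harbor.platform.internal \
  --docker-username='robot$payments+pull' \
  --docker-password="${ROBOT_TOKEN}" \
  --namespace payments
# Attach it to the default service account so pod specs don't need per-app changes.
kubectl patch serviceaccount default --namespace payments \
  --patch '{"imagePullSecrets":[{"name":"harbor-pull"}]}'
```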
This is where most teams stumble. The secret rotation story for image-pull secrets is unglamorous and easy to get wrong. We rotate ours every 90 days via a CronJob that mints a new robot token, updates the secret, and re-rolls the affected workloads. About 50 lines of bash + kubectl. Boring, works.
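A trimmed sketch of what that rotation script looks like. The Harbor v2 robots endpoint is real, but the request body fields and response shape should be checked against your Harbor version, and the namespace is a placeholder:

```shell
#!/usr/bin/env bash
# Rotation sketch: mint a fresh project-scoped robot token, swap the secret, re-roll.
# Requires curl + jq; endpoint/body per the Harbor v2 API, verify for your version.
set -euo pipefail
NS=payments
NEW=$(curl -fsS -u "admin:${HARBOR_ADMIN_PASS}" \
  -H 'Content-Type: application/json' \
  -X POST https://harbor.platform.internal/api/v2.0/robots \
  -d '{"name":"pull-'"$(date +%Y%m%d)"'","duration":90,"level":"project",
       "permissions":[{"kind":"project","namespace":"'"$NS"'",
         "access":[{"resource":"repository","action":"pull"}]}]}')
kubectl create secret docker-registry harbor-pull \
  --docker-server=harbor.platform.internal \
  --docker-username="$(echo "$NEW" | jq -r .name)" \
  --docker-password="$(echo "$NEW" | jq -r .secret)" \
  --namespace "$NS" --dry-run=client -o yaml | kubectl apply -f -
# Re-roll workloads in the namespace so pods pick up the new token.
kubectl rollout restart deployment -n "$NS"
```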
Week 5-6: the things nobody warned us about
Pull-through latency under load. Harbor running on the same cluster as the workloads pulling from it: when the cluster gets busy, Harbor gets slow, which makes pulls slow, which slows down auto-scaling. We saw a 4-minute new-pod-startup time during a traffic spike that should have been 30 seconds. The workloads couldn't scale fast enough because they couldn't pull images fast enough.
The fix was running Harbor on a dedicated node pool with reserved capacity. Not glamorous. Necessary.
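The node pool itself is one az command plus a taint so nothing else lands there. Resource group and cluster names below are placeholders:

```shell
# Dedicated pool so Harbor never competes with workload autoscaling for CPU/IO.
az aks nodepool add \
  --resource-group rg-platform \
  --cluster-name aks-prod \
  --name harbor \
  --node-count 3 \
  --node-taints dedicated=harbor:NoSchedule \
  --labels pool=harbor
# Harbor's pods then get a matching toleration plus nodeSelector (pool=harbor)
# via the chart's tolerations/nodeSelector values.
```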
Build-time pushes choking on the network policy. Our GitHub Actions self-hosted runners (also on AKS) push to Harbor. Initially the network policy between the runners namespace and Harbor's namespace was too tight, and pushes timed out about 5% of the time on large images. Increased the connection timeout, opened the right ports, problem solved.
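The "right ports" fix boils down to a NetworkPolicy admitting the runners namespace into Harbor's. A sketch with our namespace names assumed (`gha-runners`, `harbor`):

```shell
# Ingress from the runners namespace to Harbor over TLS; names are ours.
cat > /tmp/allow-runner-push.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-runner-push
  namespace: harbor
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: gha-runners
      ports:
        - protocol: TCP
          port: 443
EOF
# kubectl apply -f /tmp/allow-runner-push.yaml
```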
Defender for Cloud's container scanning still ran on every push to ACR. Because we kept ACR active for two weeks during the cutover, Defender kept scanning. We were paying for two scanners. Trivial cost in our case (~$30 over two weeks), but worth knowing.
What got better
Multi-registry vulnerability gates. Harbor's project-level "block on critical vulnerabilities" policy works the same way regardless of where the image came from — Harbor scans on push and on pull. We now enforce the same security gate for ACR-built, GHCR-built, and Docker Hub mirror images. Couldn't do that before.
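The gate itself is project metadata, settable through the UI or the API. A sketch of the API form — the metadata field names (`auto_scan`, `prevent_vul`, `severity`) are from the Harbor v2 API as we used it, so verify them against your version:

```shell
# Project-level gate: scan on push, block pulls of images with critical vulns.
curl -fsS -u "admin:${HARBOR_ADMIN_PASS}" \
  -X PUT https://harbor.platform.internal/api/v2.0/projects/prod \
  -H 'Content-Type: application/json' \
  -d '{"metadata":{"auto_scan":"true","prevent_vul":"true","severity":"critical"}}'
```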
Image signing with cosign was easier. Harbor 2.x has native cosign integration via its replication policies. We turned on signature requirement for the prod project. Unsigned images don't get pulled. Setting the same up across ACR + a separate signing infrastructure would have been more complex.
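The CI-side signing step is small. A sketch with keyed signing — the image name is illustrative, and recent cosign releases want an explicit `--yes` for non-interactive runs:

```shell
# Sign after push; the private key comes from CI secrets via cosign's env:// ref.
IMAGE=harbor.platform.internal/prod/payments-api:1.4.2
cosign sign --yes --key env://COSIGN_PRIVATE_KEY "${IMAGE}"
# Verification on the consuming side (or in an admission hook):
cosign verify --key cosign.pub "${IMAGE}"
```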
Cost visibility per-team. Harbor projects map cleanly to teams. We can see exactly how much storage each team's images consume. ACR has this via tags but it's noisier.
What got worse
Operational burden. ACR was a managed service. Harbor is our problem. Patches, certs, storage capacity planning, HA — all on us. We added about 4-6 hours/month of platform-team toil for Harbor that didn't exist for ACR.
First-time pull latency from new regions. ACR with geo-replication serves pulls from the closest replica. Harbor in a single region serves all pulls. For a workload running in West Europe pulling from a Harbor in East US, that's measurable latency on first pull (subsequent pulls are cached at the node level). We mitigated by putting Harbor in our most-used region and accepting the trade-off.
The honest verdict
For most teams, ACR is the right choice. Don't migrate to Harbor unless you have a specific need that ACR can't meet — multi-registry vulnerability scanning, complex replication topologies, or a hard requirement to self-host the registry layer.
Our specific need was the first one. Harbor solved it. We pay the operational cost. Net positive for us; would not have been net positive for the team next door who didn't have the multi-registry need.
What I'd do differently
Provision Harbor on its own AKS cluster, separate from the workload cluster. The "Harbor and the workloads compete for resources" problem would have disappeared entirely with that separation. We co-located to save money, and the savings were eaten by the dedicated node pool we had to add anyway.
I would NOT skip the dual-registry validation period. Two weeks of paying for both was worth the peace of mind that the migration was clean.
