damionas
No. 33 · DevOps · Nov 11, 2025 · 9 min read

Karpenter on AKS vs Cluster Autoscaler vs Node Auto-Provisioning: The Workload Where Each Wins

The "what scales nodes on AKS" question used to have one answer: Cluster Autoscaler. Now there are three: Cluster Autoscaler (CA), AKS Node Auto-Provisioning (NAP, which is Karpenter underneath), and self-managed Karpenter on AKS. They look similar from a marketing slide and behave very differently under pressure.

This is the side-by-side I wish I'd had when we picked. We run all three across different clusters; the workload shape decides.

What each one is

Cluster Autoscaler. Watches pending pods and scales VMSS-backed node pools up to fit them. You define the node pools by hand. Scale-down is conservative: by default a node has to sit unneeded for 10 minutes before CA removes it. Predictable, well-understood, dumb.

AKS Node Auto-Provisioning. Microsoft-managed Karpenter, behind a stable API. You declare NodePool and AKSNodeClass resources, and NAP picks the cheapest VM SKU that fits the pending pods. No VMSS to manage. Bin-packing-aware.

Self-managed Karpenter. The same bin-packer, but you run the controller. More flexibility (custom controllers, custom node images), more responsibility (upgrades, workload identity setup, RBAC).

The decision matrix that's actually held up

| Workload shape | Pick |
| --- | --- |
| Stable, well-understood node pool sizes; spiky 2-4x bursts | Cluster Autoscaler |
| Mixed CPU/GPU/memory pods, tight cost target, GPU SKU diversity matters | NAP |
| Heterogeneous workloads + you need custom logic (Spot priority, regional affinity, custom image) | Self-managed Karpenter |
| You want one answer and don't want to evaluate | NAP (default for new clusters) |

The trap: every team I've seen evaluate "Karpenter vs CA" picks Karpenter on the bin-packing argument, then finds out their actual workloads were CA-shaped (stable pools with predictable headroom) and the savings were negligible.
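One way to keep that trap in view is to write the matrix down as a function and check the CA-shaped case first. This is only a rough sketch: the `Workload` fields and the `pick_scaler` name are made up for illustration, and the conditions are just the rows of the matrix above, not a tested heuristic.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All fields are hypothetical labels for the matrix rows, not real API fields.
    stable_pool_sizes: bool = False      # node pool sizes are well understood
    mixed_resource_shapes: bool = False  # CPU/GPU/memory-heavy pods side by side
    needs_custom_logic: bool = False     # Spot priority, regional affinity, custom image


def pick_scaler(w: Workload) -> str:
    # Check the CA-shaped case first: it is the one evaluations tend to skip.
    if w.stable_pool_sizes and not (w.mixed_resource_shapes or w.needs_custom_logic):
        return "Cluster Autoscaler"
    if w.needs_custom_logic:
        return "Self-managed Karpenter"
    # Mixed shapes, or no strong opinion either way: NAP is the default.
    return "NAP"


if __name__ == "__main__":
    print(pick_scaler(Workload(stable_pool_sizes=True)))
    print(pick_scaler(Workload(mixed_resource_shapes=True)))
    print(pick_scaler(Workload(mixed_resource_shapes=True, needs_custom_logic=True)))
```

The ordering is the point: a workload only reaches the Karpenter branches after the "is this actually a stable pool?" question has been answered.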

NAP setup (the path most teams should take)

# Enable NAP on an existing cluster (the same flag works with az aks create)
az aks update \
  --name aks-prod \
  --resource-group rg-aks \
  --node-provisioning-mode Auto

# Define a NodePool — what shapes are eligible
kubectl apply -f - <<'YAML'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [on-demand, spot]
        - key: kubernetes.io/os
          operator: In
          values: [linux]
        - key: karpenter.azure.com/sku-family
          operator: In
          values: [D, E]
      nodeClassRef:
        name: default
        group: karpenter.azure.com
        kind: AKSNodeClass
  limits:
    cpu: 100
    memory: 200Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
YAML

Three things to set explicitly, even if the defaults look fine:

  • karpenter.sh/capacity-type: adding spot cuts cost by ~70% for workloads that tolerate eviction. If the workload doesn't, list only on-demand.
  • sku-family: restricts the candidate VM SKUs. [D, E] = balanced + memory-optimised. Without it NAP can pick anything, including older Dv2/Dv3 generations that under-perform on the same dollar.
  • consolidationPolicy: WhenEmptyOrUnderutilized actively packs pods onto fewer, larger nodes. The more conservative WhenEmpty only removes nodes that have no pods left on them, which leaves money on the table.
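To make the effect of the sku-family requirement concrete, here is a toy model of "filter by requirements, then take the cheapest fit". It is a simplification under loud assumptions: only the `In` operator is modelled, the SKU names are invented, and the `cost` values are made-up relative units rather than Azure prices. The real scheduler weighs far more than this.

```python
def matches(labels: dict, requirements: list) -> bool:
    """True if a SKU's labels satisfy every requirement (only `In` is modelled)."""
    for req in requirements:
        if req["operator"] != "In":
            raise ValueError(f"operator {req['operator']!r} is not modelled in this sketch")
        if labels.get(req["key"]) not in req["values"]:
            return False
    return True


def cheapest_fit(skus: list, requirements: list, cpu_needed: int):
    """Pick the lowest-cost SKU that matches the requirements and has enough vCPUs."""
    candidates = [
        s for s in skus
        if matches(s["labels"], requirements) and s["cpu"] >= cpu_needed
    ]
    return min(candidates, key=lambda s: s["cost"], default=None)


FAMILY = "karpenter.azure.com/sku-family"
ARCH = "kubernetes.io/arch"

# Hypothetical SKUs; cost is an invented relative unit.
SKUS = [
    {"name": "d-small", "cpu": 4, "cost": 4.0, "labels": {FAMILY: "D", ARCH: "amd64"}},
    {"name": "e-small", "cpu": 4, "cost": 5.0, "labels": {FAMILY: "E", ARCH: "amd64"}},
    {"name": "f-small", "cpu": 4, "cost": 3.0, "labels": {FAMILY: "F", ARCH: "amd64"}},
    {"name": "d-large", "cpu": 16, "cost": 16.0, "labels": {FAMILY: "D", ARCH: "amd64"}},
]

REQUIREMENTS = [
    {"key": ARCH, "operator": "In", "values": ["amd64"]},
    {"key": FAMILY, "operator": "In", "values": ["D", "E"]},
]

if __name__ == "__main__":
    print(cheapest_fit(SKUS, REQUIREMENTS, 4)["name"])  # restricted to D and E
    print(cheapest_fit(SKUS, [], 4)["name"])            # no restriction at all
```

With the requirement in place the F-family SKU never becomes a candidate, even though it is the cheapest in this toy list; drop the requirement and it wins.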

Cluster Autoscaler, when it's still the right answer

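# Enable the autoscaler on the node pool
az aks nodepool update \
  --cluster-name aks-prod --resource-group rg-aks \
  --name workload \
  --enable-cluster-autoscaler \
  --min-count 3 --max-count 30

# scan-interval is a cluster-wide autoscaler profile setting, not a node pool flag
az aks update \
  --name aks-prod --resource-group rg-aks \
  --cluster-autoscaler-profile scan-interval=10s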

CA on a single workload pool is boring and predictable, exactly what an SLO-bound workload wants. We have three production clusters running stable pools (one for the API tier, one for batch, one for ingest), all CA, all sized for the 90th percentile traffic, all doing nothing exciting.

The argument for CA on these: the nodes look the same every day, the headroom is the same every day, the cost optimisation from bin-packing is rounding error compared to the operational simplicity. We don't pay for the option to be smart on a workload that doesn't need it.

Self-managed Karpenter, when you actually need it

We hit a case where NAP wasn't enough: a workload that needed to prefer Spot, fall back to on-demand inside 90 seconds if Spot was unavailable in the region, and round-robin across two zones for blast-radius reasons. NAP's interruption policy isn't that flexible. Self-managed Karpenter let us express the Spot-first, on-demand-fallback part as two weighted NodePools:

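apiVersion: karpenter.sh/v1
kind: NodePool
metadata: { name: spot-preferred }
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot]   # spot only at first
      taints:
        # Pods need a matching toleration to land on these nodes
        - key: prefer-spot
          effect: NoSchedule
      # nodeClassRef is required in karpenter.sh/v1; assumes an AKSNodeClass named "default"
      nodeClassRef:
        name: default
        group: karpenter.azure.com
        kind: AKSNodeClass
  weight: 100   # higher weight is tried first
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata: { name: on-demand-fallback }
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: [on-demand]
      # Same assumed node class as above
      nodeClassRef:
        name: default
        group: karpenter.azure.com
        kind: AKSNodeClass
  weight: 10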

Karpenter weights mean "try the higher weight first". A pending pod that tolerates the prefer-spot taint is offered to the spot-preferred pool first; if Spot is unavailable, the controller falls back to the on-demand pool within ~60-90 seconds.
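The ordering rule on its own fits in a few lines. This is a deliberately stripped-down sketch, not the controller's actual logic: `choose_pool` and the `capacity_type` field are hypothetical names, and taints, zones, and failed launches are all left out.

```python
def choose_pool(pools: list, available: set):
    """Return the name of the highest-weight pool whose capacity type is available."""
    for pool in sorted(pools, key=lambda p: p["weight"], reverse=True):
        if pool["capacity_type"] in available:
            return pool["name"]
    return None  # nothing can be provisioned right now


# Mirrors the two NodePools above; listed out of order on purpose.
POOLS = [
    {"name": "on-demand-fallback", "weight": 10, "capacity_type": "on-demand"},
    {"name": "spot-preferred", "weight": 100, "capacity_type": "spot"},
]

if __name__ == "__main__":
    print(choose_pool(POOLS, {"spot", "on-demand"}))  # Spot is there, so it wins
    print(choose_pool(POOLS, {"on-demand"}))          # Spot gone, fall back
```

The declaration order of the pools does not matter; only the weight does.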

This was 1.5 sprints of work to roll out cleanly: Helm chart, workload identity, controller observability. It wouldn't be worth it for a normal workload. It was worth it for that workload.

What broke first

NAP plus PDBs that were too strict. A PodDisruptionBudget with maxUnavailable: 0 blocks NAP's consolidation indefinitely. NAP logs a clear "blocked by PDB" event, but nothing else complains: the workload silently runs on more nodes than it needs. Audit your PDBs before turning on consolidation.
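A PDB audit can be as small as the sketch below. It assumes you feed it the JSON from `kubectl get pdb -A -o json`; the `blocking_pdbs` helper and the sample data are illustrative, not something we ship. It flags a budget if `spec.maxUnavailable` is zero or if `status.disruptionsAllowed` is currently zero, which also catches cases like minAvailable: 100%.

```python
import json


def blocking_pdbs(pdb_list: dict) -> list:
    """Return namespace/name of PDBs that currently allow zero voluntary disruptions."""
    blocked = []
    for item in pdb_list.get("items", []):
        spec = item.get("spec", {})
        status = item.get("status", {})
        # maxUnavailable is an int or a percentage string
        zero_max = spec.get("maxUnavailable") in (0, "0%")
        zero_allowed = status.get("disruptionsAllowed") == 0
        if zero_max or zero_allowed:
            meta = item["metadata"]
            blocked.append(f"{meta['namespace']}/{meta['name']}")
    return blocked


# Made-up sample in the shape of `kubectl get pdb -A -o json` output.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"namespace": "api", "name": "strict"},
   "spec": {"maxUnavailable": 0}, "status": {"disruptionsAllowed": 0}},
  {"metadata": {"namespace": "batch", "name": "relaxed"},
   "spec": {"maxUnavailable": 1}, "status": {"disruptionsAllowed": 1}},
  {"metadata": {"namespace": "ingest", "name": "min-all"},
   "spec": {"minAvailable": "100%"}, "status": {"disruptionsAllowed": 0}}
]}
""")

if __name__ == "__main__":
    for name in blocking_pdbs(SAMPLE):
        print(name)
```

Note that `disruptionsAllowed` is a point-in-time value: a budget can show zero during a rollout and recover afterwards, so a hit here is a prompt to look, not proof of a misconfiguration.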

CA scale-down stalled on a noisy DaemonSet. A logging agent rolled out across every node and bumped each node's "non-pod-evictable" footprint by 200MB. CA decided no node was scale-down-eligible. Fix: add the documented cluster-autoscaler.kubernetes.io/safe-to-evict: "true" annotation to the DaemonSet's pod template (it is a pod annotation, so it goes under spec.template.metadata.annotations), so CA does not treat those pods as a reason to keep the node.

Self-managed Karpenter upgrades. The controller has a tight contract with the AKS control plane on instance metadata. We had a stuck upgrade where the new controller couldn't enumerate VMSS instances because the workload identity it ran under had been re-rolled. NAP would have handled that for us. The flexibility tax is real.

Numbers, last quarter

  • NAP on the AI workloads cluster: 38% lower compute cost than the equivalent CA setup, mostly from Spot adoption + tighter consolidation
  • CA on the API tier: same cost-per-request as last year, but zero scale-related incidents
  • Self-managed Karpenter on the Spot-preferred batch workload: 62% cost reduction vs on-demand, ~3 incidents/quarter where Spot churn caused a longer tail latency than expected

What I'd do differently

Default new clusters to NAP. The "what node pool shape do I need" question is the wrong question for 80% of workloads; the right answer is "let the bin-packer figure it out." CA stays for stable, latency-sensitive tiers where predictability beats cost. Self-managed Karpenter is the option you reach for after you've outgrown NAP, not before.

I would NOT mix CA and NAP on the same cluster. Both can coexist on paper but the consolidation logic gets weird at the boundary, and the operational story (where did this node come from?) is harder to keep straight. One scaler per cluster.
