The "what scales nodes on AKS" question used to have one answer: Cluster Autoscaler. Now there are three: Cluster Autoscaler (CA), AKS Node Auto-Provisioning (NAP, which is Karpenter underneath), and self-managed Karpenter on AKS. They look similar from a marketing slide and behave very differently under pressure.
This is the side-by-side I wish I'd had when we picked. We run all three across different clusters; the workload shape decides.
What each one is
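Cluster Autoscaler. Watches for pending pods and scales VMSS-backed node pools up and down. You define the node pools by hand. Scale-down is conservative: by default a node has to look unneeded for 10 minutes before it is removed, and evicted pods get up to 10 minutes to terminate gracefully. Predictable, well-understood, dumb.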
AKS Node Auto-Provisioning. Microsoft-managed Karpenter, behind a stable API. You declare NodePool and NodeClass resources, NAP picks the cheapest VM SKU that fits the workload. No VMSS to manage. Bin-packing-aware.
Self-managed Karpenter. The same bin-packer, but you run the controller. More flexibility (custom controllers, custom VM images), more responsibility (upgrades, workload identity setup, RBAC).
The decision matrix that's actually held up
| Workload shape | Pick |
|---|---|
| Stable, well-understood node pool sizes; spiky 2-4x bursts | Cluster Autoscaler |
| Mixed CPU/GPU/memory pods, tight cost target, GPU SKU diversity matters | NAP |
| Heterogeneous workloads + you need custom logic (Spot priority, regional affinity, custom VM image) | Self-managed Karpenter |
| You want one answer and don't want to evaluate | NAP (default for new clusters) |
The trap: every team I've seen evaluate "Karpenter vs CA" picks Karpenter on the bin-packing argument and then finds out their actual workloads were CA-shaped (stable pools with predictable headroom) and the savings were negligible.
NAP setup (the path most teams should take)
# Enable NAP at cluster create or upgrade
az aks update \
  --name aks-prod \
  --resource-group rg-aks \
  --node-provisioning-mode Auto

# Define a NodePool: what shapes are eligible
kubectl apply -f - <<'YAML'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [on-demand, spot]
        - key: kubernetes.io/os
          operator: In
          values: [linux]
        - key: karpenter.azure.com/sku-family
          operator: In
          values: [D, E]
      nodeClassRef:
        name: default
        group: karpenter.azure.com
        kind: AKSNodeClass
  limits:
    cpu: 100
    memory: 200Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
YAML
Three things to set explicitly, even if the defaults look fine:
- karpenter.sh/capacity-type: adding spot cuts cost by ~70% for workloads that tolerate eviction. If the workload doesn't, list only on-demand.
- sku-family: restricts the candidate VM SKUs. [D, E] = balanced + memory-optimised. Without it NAP can pick anything, including older D2/D3 generations that under-perform on the same dollar.
- consolidationPolicy: WhenEmptyOrUnderutilized actively packs pods onto fewer, larger nodes. The alternative, WhenEmpty, only reclaims nodes that have gone completely empty, which is too conservative and leaves money on the table.
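For a workload that does not tolerate eviction but shares a NodePool that allows Spot, one option is to pin it at the pod level rather than narrowing the whole pool. A minimal sketch, assuming a hypothetical Deployment called payments-api with a placeholder image; only the nodeSelector on the karpenter.sh/capacity-type label is the point here:

```yaml
# Hypothetical workload: the name, labels, image and request sizes are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      nodeSelector:
        # Only schedule onto nodes provisioned as on-demand capacity.
        karpenter.sh/capacity-type: on-demand
      containers:
        - name: app
          image: example.com/payments-api:1.0
          resources:
            # Requests are what the bin-packer sizes nodes against, so set them.
            requests:
              cpu: 500m
              memory: 512Mi
```

Pods without the selector stay eligible for both capacity types, so the Spot savings still apply to everything else in the pool.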
Cluster Autoscaler, when it's still the right answer
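az aks nodepool update \
  --cluster-name aks-prod --resource-group rg-aks \
  --name workload \
  --enable-cluster-autoscaler \
  --min-count 3 --max-count 30

# scan-interval is a cluster-wide autoscaler profile setting, not a node pool flag
az aks update \
  --name aks-prod --resource-group rg-aks \
  --cluster-autoscaler-profile scan-interval=10s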
CA on a single workload pool is boring and predictable, exactly what an SLO-bound workload wants. We have three production clusters running stable pools (one for the API tier, one for batch, one for ingest), all CA, all sized for the 90th percentile traffic, all doing nothing exciting.
The argument for CA on these: the nodes look the same every day, the headroom is the same every day, the cost optimisation from bin-packing is rounding error compared to the operational simplicity. We don't pay for the option to be smart on a workload that doesn't need it.
Self-managed Karpenter, when you actually need it
We hit a case where NAP wasn't enough: a workload that needed to prefer Spot, fall back to on-demand inside 90 seconds if Spot was unavailable in the region, and round-robin across two zones for blast-radius reasons. NAP's interruption policy isn't that flexible. Self-managed Karpenter let us express the Spot preference as two weighted NodePools:
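apiVersion: karpenter.sh/v1
kind: NodePool
metadata: { name: spot-preferred }
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot]   # spot only at first
      taints:
        - key: prefer-spot   # pods need a matching toleration to land here
          effect: NoSchedule
      nodeClassRef:   # required in karpenter.sh/v1; assumes the AKSNodeClass named default
        name: default
        group: karpenter.azure.com
        kind: AKSNodeClass
  weight: 100   # higher weight is tried first
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata: { name: on-demand-fallback }
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: [on-demand]
      nodeClassRef:   # required in karpenter.sh/v1; assumes the AKSNodeClass named default
        name: default
        group: karpenter.azure.com
        kind: AKSNodeClass
  weight: 10   # lower weight: only used when the spot pool cannot satisfy the pod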
Karpenter weights mean "try the higher weight first". A pending pod is offered to the Spot pool first; if Spot capacity is unavailable, the controller falls back to the on-demand pool within ~60-90 seconds.
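Because the Spot pool carries a prefer-spot taint, only pods that tolerate it are candidates for that pool. A sketch of what the workload side might look like, assuming a hypothetical batch Deployment with a placeholder image; the toleration is the only part tied to the NodePools above:

```yaml
# Hypothetical workload: the name, labels, image and request sizes are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 4
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      tolerations:
        # Matches the prefer-spot taint on the spot-preferred NodePool.
        - key: prefer-spot
          operator: Exists
          effect: NoSchedule
      containers:
        - name: worker
          image: example.com/batch-worker:1.0
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
```

The pod deliberately sets no capacity-type selector: it tolerates the Spot pool but is not pinned to it, which leaves the on-demand pool available as the fallback.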
This was 1.5 sprints of work to roll out cleanly: Helm chart, workload identity, controller observability. It wouldn't be worth it for a normal workload. It was worth it for that workload.
What broke first
NAP plus PDBs that were too strict. A workload with maxUnavailable: 0 blocks NAP's consolidation indefinitely. NAP logs a clear "blocked by PDB" event but the workload silently runs on more nodes than it needs. Audit your PDBs before turning on consolidation.
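As an illustration of a budget that still protects the workload but lets consolidation make progress, here is a minimal sketch. The name and label selector are hypothetical, and the right value depends on how many replicas the workload can lose at once:

```yaml
# Hypothetical PDB: the name and the app label are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  # Allows one pod at a time to be evicted, so a node can eventually be drained.
  # maxUnavailable: 0 would block every voluntary eviction.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api
```

With a single replica, maxUnavailable: 1 offers no protection at all, so the audit usually ends up touching replica counts as well as the budgets.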
CA scale-down was blocked by a noisy DaemonSet. A logging agent rolled across every node and bumped each node's "non-pod-evictable" footprint by 200MB. CA decided no node was scale-down-eligible. Fix: annotate the DaemonSet's pod template with cluster-autoscaler.kubernetes.io/safe-to-evict: "true" (the documented annotation, which CA reads from pods) so CA can ignore those pods during scale-down decisions.
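The annotation is read from pods, so it belongs in the pod template metadata rather than on the DaemonSet object itself. A sketch, assuming a hypothetical log-agent DaemonSet in a logging namespace with a placeholder image:

```yaml
# Hypothetical DaemonSet: the name, namespace, labels and image are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
      annotations:
        # Set on the pod template so every agent pod carries it.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
        - name: agent
          image: example.com/log-agent:1.0
          resources:
            requests:
              memory: 200Mi
```

Changing the pod template triggers a rollout of the DaemonSet, so the annotation only takes effect once the agent pods have been replaced.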
Self-managed Karpenter upgrades. The controller has a tight contract with the AKS control plane on instance metadata. We had a stuck upgrade where the new controller couldn't enumerate VMSS instances because the workload identity it ran under had been re-rolled. NAP would have handled that for us. The flexibility tax is real.
Numbers, last quarter
- NAP on the AI workloads cluster: 38% lower compute cost than the equivalent CA setup, mostly from Spot adoption + tighter consolidation
- CA on the API tier: same cost-per-request as last year, but zero scale-related incidents
- Self-managed Karpenter on the Spot-preferred batch workload: 62% cost reduction vs on-demand, ~3 incidents/quarter where Spot churn caused a longer tail latency than expected
What I'd do differently
Default new clusters to NAP. The "what node pool shape do I need" question is the wrong question for 80% of workloads; the right answer is "let the bin-packer figure it out." CA stays for stable, latency-sensitive tiers where predictability beats cost. Self-managed Karpenter is the option you reach for after you've outgrown NAP, not before.
I would NOT mix CA and NAP on the same cluster. Both can coexist on paper but the consolidation logic gets weird at the boundary, and the operational story (where did this node come from?) is harder to keep straight. One scaler per cluster.
