NVIDIA gives you two ways to share a single GPU across multiple workloads on Kubernetes: time-slicing and MIG (Multi-Instance GPU). The first is software-based and flexible. The second is hardware-partitioned and rigid. Both are well documented. Neither set of docs tells you which one to actually pick for LLM inference on AKS, because the answer depends on what your traffic looks like.
We ran both options against the same workload — bursty LLM inference for an internal-tools chatbot — and the results were not what I expected. Here's the comparison and the decision framework.
The setup
Workload. Bursty LLM inference: about 70% of the day at 5-10 RPS, occasional spikes to 60+ RPS during business hours. Average prompt 1,200 tokens, average completion 500 tokens. Llama 3.1 8B running under vLLM. Latency target: p95 first-token under 500ms.
Hardware. A single A100 80GB GPU on a Standard_NC24ads_A100_v4 AKS node. The goal: serve the workload with reasonable efficiency on this single GPU.
The competing approaches:
- Time-slicing: NVIDIA's GPU operator advertises a single physical GPU as N "virtual" GPUs, and the GPU's scheduler time-slices the hardware across the pods' CUDA contexts. We tested with 4-way and 8-way slicing.
- MIG: The A100 80GB can be partitioned into up to 7 hardware-level slices (1g.10gb each, the smallest profile). Each MIG instance is fully isolated from the others. We tested with 7 instances.
- Single-pod baseline: one pod uses the whole GPU. The control case.
What time-slicing does for bursty traffic
With 4-way time-slicing, four vLLM pods share the GPU. Each pod thinks it has a full GPU. The NVIDIA scheduler interleaves their CUDA contexts. Throughput is lower than the single-pod baseline (because of context-switch overhead) but utilization is dramatically more even.
Steady-state results:
- Throughput: 38 RPS aggregate across 4 pods (vs 45 RPS for single-pod baseline)
- First-token latency p50: 280ms (vs 220ms baseline)
- First-token latency p95: 540ms (vs 410ms baseline)
Slightly worse on every metric, which is expected. The win comes during the spikes: when one pod's traffic spikes from 1 RPS to 15 RPS, it grabs more of the GPU's compute. The other pods at low traffic barely notice.
8-way time-slicing was worse — too much context-switch overhead. 4-way was the sweet spot for our workload.
What MIG does for bursty traffic
With MIG enabled (7 × 1g.10gb instances), each instance is a hardware-isolated GPU with 10GB of memory. We deployed 7 vLLM pods, one per MIG instance.
Two immediate problems:
Problem 1: Llama 3.1 8B doesn't fit in 10GB. The model weights alone are ~16GB in FP16 (8B parameters × 2 bytes each). We had to switch to an INT4-quantized version whose weights fit in 4-5GB. Quantization changed the latency and quality profile: the INT4 model was 15% slower on tokens/sec and lost ~3% on our eval set.
Problem 2: 7 isolated instances meant 7 cold KV caches. Each pod is its own world. When traffic surges to one pod, that pod's KV cache fills up and it starts rejecting new requests, even though the other six pods are nearly idle.
Steady-state results (with the INT4 model):
- Throughput: 28 RPS aggregate across 7 pods (vs 45 RPS baseline)
- First-token latency p50: 380ms (vs 220ms baseline)
- First-token latency p95: 720ms (vs 410ms baseline)
MIG was significantly worse for our workload. The hardware isolation meant no resource sharing during spikes — exactly the property bursty traffic punishes.
When MIG would have won
MIG isn't bad. It's just bad for bursty same-tenant traffic. It's the right call for:
Multi-tenant isolation. If your seven pods are seven different tenants and you need hardware-level guarantees that tenant A's burst can't affect tenant B's latency, MIG gives you that. Time-slicing doesn't.
Quality-of-service tiers. If you have premium and standard tiers and need to guarantee premium tier latency regardless of standard tier load, MIG is the path. Map premium to specific MIG instances.
Steady predictable workloads. If each of your seven pods has steady predictable traffic that fits comfortably in 10GB, MIG works fine. The fixed-allocation rigidity isn't a problem if the allocation matches actual usage.
The thing the docs don't say
Both options require driver and operator configuration that's nontrivial on AKS. Here's the lay of the land:
Time-slicing setup: Install the NVIDIA GPU operator with time-slicing config. Apply a ConfigMap that specifies the replica count. Pods request nvidia.com/gpu: 1 as normal; the scheduler maps to a slice automatically. Took us about 90 minutes.
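For reference, the time-slicing config has roughly this shape (a minimal sketch based on the operator's documented format; the ConfigMap name, namespace, and the "any" key are placeholders, and the operator's ClusterPolicy has to point its device plugin at this ConfigMap for it to take effect):

```yaml
# Sketch of a time-slicing config for the NVIDIA GPU operator's device plugin.
# Assumptions: namespace "gpu-operator", config name "time-slicing-config",
# and the "any" key are ours; wire it up via the ClusterPolicy's
# devicePlugin.config reference.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # 4-way slicing: the node advertises nvidia.com/gpu: 4
```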
MIG setup: Install the NVIDIA GPU operator with MIG enabled. Pick a MIG strategy (single for uniform instance sizes, mixed for varied ones) and the profile to apply. The node has to be drained and the GPU's MIG mode has to be set, which requires a node reboot in some configurations. Pods request nvidia.com/mig-1g.10gb: 1. Took us about 4 hours, including a stuck-driver issue we had to manually unwind.
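To make that resource request concrete, here's a minimal sketch of one pod pinned to one MIG slice. The per-profile resource name below is what the mixed strategy exposes; the pod name and image are placeholders, not our production manifest.

```yaml
# Minimal sketch: one inference pod per MIG instance (mixed strategy assumed,
# which exposes per-profile resource names; under the single strategy the
# slice is advertised as plain nvidia.com/gpu instead).
apiVersion: v1
kind: Pod
metadata:
  name: vllm-mig-0
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest   # placeholder serving image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1    # one hardware-isolated 10GB slice
```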
Documentation for both is in the NVIDIA GPU operator docs on AKS and the upstream operator README.
The decision framework
Pick time-slicing if:
- Traffic is bursty / spiky
- All pods are the same tenant or trust each other
- You need to share a GPU across pods that don't all need it simultaneously
- You're willing to accept ~10-15% throughput overhead for sharing
Pick MIG if:
- You need hardware isolation between workloads
- Your traffic is steady and predictable per workload
- You can fit each workload in the MIG instance size you've chosen
- You can tolerate the rigidity of fixed allocations
Pick neither — single pod per GPU if:
- One workload uses 60%+ of GPU capacity at peak
- You don't have multiple workloads competing for the same GPU
- You can scale horizontally (more nodes) instead of vertically (sharing)
What we ended up with
Three nodes. Each node runs a single vLLM pod with no GPU sharing. Horizontal scaling via cluster autoscaler.
This is the boring answer. It's also the cheapest answer at our actual traffic shape — the spikes that motivated us to investigate sharing in the first place are short enough that the autoscaler can react in time, and the steady-state utilization is high enough that we're not wasting much GPU capacity.
We use time-slicing on a fourth node that hosts non-LLM workloads (image embedding, smaller models) where the per-workload utilization is genuinely low and sharing makes sense.
We don't use MIG anywhere in production. The single workload that justified considering it (a multi-tenant LLM endpoint) eventually ran on dedicated nodes per tenant for compliance reasons.
What I'd do differently
Run the actual benchmark before assuming sharing will work. We spent two weeks on the MIG experiment based on the assumption that multi-instance GPUs would obviously be the right answer for our use case. The benchmark told us within a day that they weren't. Should have benchmarked first.
I would NOT skip the cluster autoscaler tuning if you're going single-pod-per-GPU. Default autoscale-down delay is 10 minutes, which is too long for cost-sensitive workloads. We tuned to 3 minutes for the LLM node pool. Significantly cheaper without affecting latency for the spikes that actually matter.
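For the concrete knob: on AKS the scale-down delay is set through the cluster autoscaler profile. A hedged sketch of the CLI call, with resource group and cluster names as placeholders, and 3m reflecting the value described above rather than a universal recommendation:

```sh
# Sketch: shorten the cluster autoscaler's scale-down delays on an AKS cluster.
# Placeholder names; adjust the timings to your own traffic shape.
az aks update \
  --resource-group my-rg \
  --name my-aks-cluster \
  --cluster-autoscaler-profile scale-down-unneeded-time=3m scale-down-delay-after-add=3m
```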
