"Self-host the model and save money" is one of those statements that's true at scale and false below it. The interesting question isn't whether self-hosting is cheaper — it's where the crossover point is for your specific workload.
We benchmarked vLLM running Llama 3.3 70B on AKS GPU node pools against Azure OpenAI's gpt-4o-mini. Same workload. Same evaluation set. Real numbers. Below 600K tokens/day, Azure OpenAI wins. Above 1.5M tokens/day, self-hosting wins. The middle is where the answer is "it depends."
Here's the math, the gotchas, and the reasons you'd choose one even if the math says the other.
The setup
Azure OpenAI option. gpt-4o-mini deployment in West Europe, pay-as-you-go. No PTU commitment.
vLLM option. Llama 3.3 70B running on a Standard_NC96ads_A100_v4 node pool (4× A100 80GB GPUs) on AKS. vLLM with tensor parallelism across the four GPUs. KAITO managing the deployment.
Both sit behind the same FastAPI gateway, which routes each request to one backend or the other based on a request header, so we can A/B real production traffic.
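For concreteness, here is a minimal sketch of that gateway. This is not our production code: the header name, environment variables, and upstream URLs are illustrative, auth headers are omitted, and the real gateway streams responses rather than buffering them.

```python
# Minimal sketch of the routing gateway. Header name, env vars, and URLs
# are illustrative; auth and response streaming are omitted for brevity.
import os

import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()

BACKENDS = {
    "aoai": os.environ["AOAI_CHAT_URL"],   # Azure OpenAI chat completions endpoint
    "vllm": os.environ["VLLM_CHAT_URL"],   # vLLM OpenAI-compatible endpoint on AKS
}

@app.post("/v1/chat/completions")
async def chat(request: Request) -> Response:
    # "X-LLM-Backend: vllm" routes to the self-hosted path; anything else
    # (including no header at all) falls through to Azure OpenAI.
    backend = request.headers.get("x-llm-backend", "aoai")
    upstream = BACKENDS.get(backend, BACKENDS["aoai"])

    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            upstream,
            content=await request.body(),
            headers={"content-type": "application/json"},
        )
    return Response(
        content=resp.content,
        status_code=resp.status_code,
        media_type="application/json",
    )
```

The point is that one header flips the backend, which is what makes the A/B comparison below possible.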
The token cost math
Azure OpenAI gpt-4o-mini pricing (as of this writing): ~$0.15 per million prompt tokens, ~$0.60 per million completion tokens.
For our workload: average prompt 800 tokens, average completion 400 tokens. Per request that's $0.000120 prompt + $0.000240 completion = $0.000360.
At 1M tokens/day total (roughly 833 requests at our token shape): $0.30/day, $9/month, $108/year.
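If you want to plug in your own token shape, the per-request math is just this (the rates and averages below are the ones from this post):

```python
# gpt-4o-mini pay-as-you-go, USD per 1M tokens (as of this writing)
PROMPT_RATE_PER_M = 0.15
COMPLETION_RATE_PER_M = 0.60

# Our average request shape
PROMPT_TOKENS = 800
COMPLETION_TOKENS = 400

cost_per_request = (
    PROMPT_TOKENS * PROMPT_RATE_PER_M + COMPLETION_TOKENS * COMPLETION_RATE_PER_M
) / 1_000_000

requests_per_day = 1_000_000 / (PROMPT_TOKENS + COMPLETION_TOKENS)  # ~833 at 1M tokens/day

print(f"per request:         ${cost_per_request:.6f}")                       # $0.000360
print(f"per day (1M tokens): ${cost_per_request * requests_per_day:.2f}")    # $0.30
```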
Self-hosted vLLM cost:
- Compute: NC96ads A100 v4 spot is ~$8.50/hour, on-demand is ~$23/hour. In production we use on-demand for SLA reasons.
- 24/7 cost on-demand: ~$16,800/month, $201,600/year.
- That's a fixed cost regardless of how many tokens flow through it.
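The fixed-cost side is nothing more than hours times rate; assuming roughly 730 hours in a month:

```python
ON_DEMAND_HOURLY = 23.0   # ~$/hour, Standard_NC96ads_A100_v4 on-demand
HOURS_PER_MONTH = 730     # 24 * 365 / 12

monthly = ON_DEMAND_HOURLY * HOURS_PER_MONTH   # ~$16.8K/month
annual = monthly * 12                          # ~$200K/year
print(f"${monthly:,.0f}/month, ${annual:,.0f}/year")
```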
The crossover. Azure OpenAI cost scales linearly with tokens. vLLM cost is fixed. They cross where the AOAI annual cost equals $201,600.
At gpt-4o-mini pricing for our prompt-completion mix, that's roughly 1.7M tokens/day.
If you instead compare against gpt-4o (about 6x the cost): the crossover drops to ~280K tokens/day.
So: gpt-4o-mini → vLLM is cheaper above 1.7M/day. gpt-4o → vLLM is cheaper above 280K/day.
What the math doesn't capture
Underutilization on the self-hosted side. Our actual production traffic averages 600K tokens/day, peaking at 2.5M/day. The vLLM node sits at ~25% utilization most of the time. Effective cost-per-token is dramatically worse than the back-of-napkin estimate if you don't have the volume to keep the GPUs busy.
We tried autoscaling the node pool to zero during low traffic. Cold-start time for a 70B model is roughly 8 minutes, and customer-facing latency went from "fine" to "unacceptable." We pulled the autoscaling and accepted the underutilization cost.
Spot instances. If your workload tolerates eviction, spot is 60-70% cheaper. Our customer-facing path doesn't tolerate it (a 30-second eviction notice isn't enough to keep user-facing chat sessions alive). Our internal batch-processing path does tolerate it; we run that on spot.
Networking egress. AOAI doesn't charge egress in the same region. Self-hosted does for any cross-region or cross-AZ traffic. For our workload this is ~$200/month — small relative to the GPU cost but worth noting.
Engineering time. vLLM deployment, monitoring, version management, and model swaps all take engineer time. We average ~6 hours/month on vLLM platform work that simply doesn't exist for Azure OpenAI. At a fully loaded engineer rate, that's another $1,500/month or so. Real cost.
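Stacking those line items on top of the headline GPU bill gives a more honest monthly figure for the self-hosted side (figures from above; the underutilization tax is not yet included):

```python
# Effective monthly cost of the self-hosted option, using the figures above.
GPU_ON_DEMAND = 16_800   # ~$/month, NC96ads A100 v4 running 24/7
EGRESS = 200             # ~$/month, cross-AZ / cross-region traffic
ENGINEERING = 1_500      # ~$/month, ~6 hours of vLLM platform work

effective_monthly = GPU_ON_DEMAND + EGRESS + ENGINEERING
print(f"~${effective_monthly:,}/month")   # ~$18,500/month before underutilization
```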
What the math really doesn't capture: quality
The model matters. gpt-4o-mini and Llama 3.3 70B are different models. They get different things right.
For our workload (a code-generation assistant for our internal platform team):
| Metric | gpt-4o-mini | Llama 3.3 70B vLLM |
|---|---|---|
| Code correctness (eval set, 200 problems) | 71% | 74% |
| First-token latency (p50) | 320ms | 180ms |
| Streaming tokens/sec (p50) | 88 | 150 |
| Instruction following (eval set, 100 problems) | 84% | 79% |
Llama 3.3 70B was modestly better at code, modestly worse at instruction following, and notably faster on both first-token latency and streaming throughput.
The instruction-following gap mattered for one specific user surface (a feature that asks the model to format output in a specific schema). We ended up routing schema-strict requests to gpt-4o-mini and letting code-completion requests go to vLLM. The router lives in our gateway; it adds about 5ms of latency.
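The policy itself is small. A sketch of it, with illustrative field names (the real gateway reads a request classification it already computes for other reasons):

```python
from dataclasses import dataclass

@dataclass
class LLMRequest:
    task: str                    # e.g. "code_completion", "chat", "summarize"
    schema_strict: bool = False  # caller requires output in an exact schema

def pick_backend(req: LLMRequest) -> str:
    # Schema-strict surfaces go to gpt-4o-mini: it followed formatting
    # instructions more reliably in our evals (84% vs 79%).
    if req.schema_strict:
        return "aoai"
    # Code completion goes to vLLM / Llama 3.3 70B: slightly better code
    # correctness and much lower first-token latency.
    if req.task == "code_completion":
        return "vllm"
    # Everything else defaults to the managed service.
    return "aoai"
```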
When you'd choose vLLM regardless of cost
Data residency. Some regulated workloads can't leave specific networks. Self-hosting on a customer-owned VNet is the only option.
Specific model needs. If you need a fine-tuned variant of an open model (Llama 3.3 with your domain LoRA, etc.), you can't get that from AOAI. KAITO + vLLM is the path.
Latency sensitivity to first token. vLLM running Llama 3.3 70B on dedicated A100s has lower first-token latency than AOAI's gpt-4o-mini. For a streaming chat experience, the perceived speed is meaningfully better. We have one customer-facing surface where this matters.
Throughput burst capacity. AOAI enforces rate limits at the deployment level, so spiky workloads can hit those limits even when daily volume is low. Self-hosted capacity scales with however many GPUs you provision.
When you'd choose AOAI regardless of cost
Variability of workload. If your token volume swings 10x week-over-week, AOAI's pay-as-you-go model is dramatically more efficient than paying for fixed-cost GPUs you don't use.
Brand-new project. Standing up a self-hosted stack takes a week minimum. AOAI is "deploy a model, get an API key, start in 20 minutes." For prototypes and new projects, the time savings are huge.
No GPU operations expertise. vLLM is excellent but it's still a service you operate. If your team doesn't have GPU-on-Kubernetes experience, the operational burden is real.
Strict SLA requirements. AOAI's 99.9% SLA is contractual. Your self-hosted vLLM SLA is whatever your SRE practice can deliver — probably less.
Our actual decision
We run a hybrid. The customer-facing code-completion surface uses vLLM (latency wins, sufficient daily volume to amortize the GPUs). The internal-tools and ad-hoc surfaces use AOAI (lower volume, lower latency requirements, easier to operate).
About 65% of our token volume runs through vLLM. About 35% through AOAI. Total monthly cost is ~$13,000 for vLLM (one node pool, mostly steady-state) and ~$2,400 for AOAI. Same architecture without vLLM would have been ~$22,000/month at gpt-4o-mini token rates for our actual volume.
The savings are real but smaller than they'd be if we ran 100% on vLLM. Some of that's the underutilization tax; some of that's the routing overhead; some is the engineering time. Net-net: about $9K/month savings at our scale, or roughly $108K/year — close to one engineer's loaded cost.
Was it worth it? Yes, but only because we have the GPU-ops experience to operate it. For a team without that experience, the pure AOAI path is correct even at higher token volumes.
What I'd do differently
Run AOAI exclusively for the first six months of any new RAG/LLM workload. Get to know the actual token volume, the actual variance, and the actual quality requirements. Then evaluate whether self-hosting makes sense. Most teams overestimate their volume and underestimate the operational cost of self-hosting; most are better off staying on AOAI.
I would NOT skip the routing layer. Even when self-hosting, having a path to fall back to AOAI for specific request types or during incidents is operationally valuable.
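A sketch of what that fallback looks like, assuming the same OpenAI-compatible request shape on both backends (function and variable names are illustrative, and auth is omitted):

```python
import httpx

async def complete_with_fallback(payload: dict, vllm_url: str, aoai_url: str) -> dict:
    """Try the self-hosted backend first; fall back to Azure OpenAI on failure.

    Illustrative sketch: real code would pass auth headers, tune per-backend
    timeouts, and emit a metric every time the fallback path fires.
    """
    async with httpx.AsyncClient() as client:
        try:
            resp = await client.post(vllm_url, json=payload, timeout=10.0)
            resp.raise_for_status()
            return resp.json()
        except httpx.HTTPError:
            # vLLM is down, overloaded, or slow: route to the managed service.
            resp = await client.post(aoai_url, json=payload, timeout=30.0)
            resp.raise_for_status()
            return resp.json()
```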
