No. 17 · Azure AI · Oct 25, 2025 · 10 min read

Fine-Tuning a Llama Variant With KAITO on AKS, Then Stress-Testing the Inference Endpoint

KAITO (Kubernetes AI Toolchain Operator) on AKS is the smoothest path I've found for "I want to fine-tune a Llama variant on my domain data and serve it as an inference endpoint." The whole pipeline — node provisioning, training, deployment, inference — fits into a few CRDs that anyone comfortable with Kubernetes can manage.

I fine-tuned a Llama 3.1 8B variant on a domain-specific dataset (about 14,000 examples), deployed the result via KAITO, and stress-tested the endpoint to find its actual production-ready throughput. Here's what worked, what didn't, and what I'd do differently.

What KAITO actually does

KAITO has two main CRDs:

  • Workspace — the inference deployment. Specify a model, GPU requirements, and KAITO handles node provisioning (via Karpenter), pulling the model image, deploying vLLM as the serving runtime, and exposing an endpoint.
  • RAGEngine — RAG-as-a-CRD; useful but not what I was doing here.

For fine-tuning specifically, KAITO has a tuning field on the Workspace CRD that takes a dataset, hyperparameters, and a base model. The operator runs the tune job, saves the resulting model artifacts back to a configured storage location, and (optionally) deploys the tuned model as the inference endpoint.

Total YAML for the whole pipeline: ~80 lines.

The fine-tune setup

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: domain-llama-finetuned
spec:
  resource:
    instanceType: Standard_NC24ads_A100_v4
    labelSelector:
      matchLabels:
        apps: domain-llama
  tuning:
    preset:
      name: llama-3-1-8b-instruct
    method: lora
    config: |
      learning_rate: 2e-4
      num_train_epochs: 3
      per_device_train_batch_size: 4
      gradient_accumulation_steps: 4
      lora_r: 16
      lora_alpha: 32
      lora_dropout: 0.05
    input:
      urls:
        - "https://[YOUR-STORAGE].blob.core.windows.net/datasets/domain-instructions.jsonl"
    output:
      image: "[YOUR-REGISTRY].azurecr.io/llama-domain-finetuned:v1"
      imagePushSecret: registry-creds

The dataset format is JSONL where each line is {"instruction": "...", "input": "...", "output": "..."}. We had this format from a previous fine-tune attempt; KAITO consumed it without modification.
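Before uploading a dataset like this to blob storage, it's cheap to validate that every line parses and carries the three expected fields. A minimal sketch (the field names come from the format above; the sample record is invented for illustration):

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl_line(line: str) -> dict:
    """Parse one dataset line and check it has the expected fields."""
    record = json.loads(line)  # raises on malformed JSON
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return record

# A made-up example record in the format KAITO consumed:
sample = '{"instruction": "Summarize the ticket.", "input": "Customer reports login failures.", "output": "Customer cannot log in."}'
record = validate_jsonl_line(sample)
```

Running this over the full file before kicking off a 4-hour tune job is cheaper than finding a malformed line 2 hours in.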

The first run took about 4 hours on a single A100 80GB. KAITO provisioned the node, ran the tune, and pushed the resulting LoRA weights to ACR as a Docker image. At that point the model is sitting in our registry, ready to be deployed.

The inference setup

Replace tuning with inference and the same Workspace CRD becomes a deployment:

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: domain-llama-inference
spec:
  resource:
    instanceCount: 2
    instanceType: Standard_NC24ads_A100_v4
    labelSelector:
      matchLabels:
        apps: domain-llama
  inference:
    preset:
      name: llama-3-1-8b-instruct
      presetOptions:
        image: "[YOUR-REGISTRY].azurecr.io/llama-domain-finetuned:v1"

KAITO provisions two A100 nodes, deploys vLLM with the fine-tuned model on each, and exposes a Kubernetes Service. The inference endpoint speaks the OpenAI API format, which is convenient — our existing client code worked unchanged.

Time from kubectl apply to first successful inference: about 12 minutes. Most of that is node provisioning + image pull. Subsequent rolling updates take 2-3 minutes.
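Because the endpoint speaks the OpenAI API format, calling it needs nothing beyond a standard chat-completions request. A stdlib-only sketch — the service hostname is hypothetical (yours depends on the Workspace name and namespace), and the model name assumes the preset above:

```python
import json
import urllib.request

# Hypothetical in-cluster service address; substitute your Workspace's Service.
ENDPOINT = "http://domain-llama-inference.default.svc.cluster.local/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "llama-3-1-8b-instruct") -> bytes:
    """Build an OpenAI-format chat completion payload."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": False,
    }
    return json.dumps(payload).encode("utf-8")

def chat(prompt: str) -> str:
    """Send one request to the KAITO endpoint and return the completion text."""
    req = urllib.request.Request(
        ENDPOINT,
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Client code written for any OpenAI-compatible server (including the Azure OpenAI SDK pointed at a custom base URL) works the same way.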

The stress test

I wrote a small Locust script to hammer the endpoint with realistic traffic patterns. 100 virtual users, ramp up over 60 seconds, sustain for 10 minutes, then ramp down.
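The actual Locust script isn't reproduced here, but a minimal stand-in with the same shape — concurrent users, latency collection, nearest-rank percentiles — looks like this. The `send_request` callable is whatever fires one inference request against your endpoint:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=95 for p95 latency."""
    ranked = sorted(samples)
    idx = max(0, int(round(p / 100 * len(ranked))) - 1)
    return ranked[idx]

def run_load(send_request, users=100, requests_per_user=60):
    """Fire requests from `users` concurrent workers, collecting latencies."""
    latencies = []
    def worker():
        for _ in range(requests_per_user):
            start = time.perf_counter()
            send_request()
            latencies.append(time.perf_counter() - start)  # list.append is thread-safe in CPython
    with ThreadPoolExecutor(max_workers=users) as pool:
        for _ in range(users):
            pool.submit(worker)
    return latencies
```

Locust adds the ramp-up/ramp-down schedule and a live dashboard on top of this; the measurement logic is the same.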

Results from the run:

  • Throughput at steady state: 47 requests/second across both replicas.
  • First-token latency p50: 220ms.
  • First-token latency p95: 410ms.
  • Streaming tokens/second p50: 145 tokens/sec.
  • Streaming tokens/second p95: 95 tokens/sec.
  • GPU utilization at steady state: 78% on both nodes.

The 78% utilization at 47 RPS suggests we're near the throughput ceiling for this configuration. Adding a third replica would buy us another 50% throughput at the cost of 50% more compute.
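The capacity math behind that claim, spelled out (assuming near-linear scaling, which is reasonable here since each replica serves independently):

```python
STEADY_STATE_RPS = 47   # measured across two replicas
REPLICAS = 2

per_replica_rps = STEADY_STATE_RPS / REPLICAS  # 23.5 RPS each
projected_three = per_replica_rps * 3          # ~70.5 RPS with a third replica

# One more replica: +50% throughput, +50% compute cost.
```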

The real test: degraded modes

Throughput in optimal conditions is only half the story. What does the system do under stress?

Test 1: prompts 4x longer than typical. Some users send 3K-token prompts where the average is 800 tokens. Throughput dropped from 47 RPS to 14 RPS, and latency p95 went from 410ms to 1.8s. KV cache pressure became the bottleneck. There's no easy fix at the serving layer, so we added per-user prompt-length limits at the gateway instead.
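A gateway-side limit doesn't need the model's exact tokenizer; a rough word-based estimate is enough to catch the outliers. A sketch — the limit value and the 1.3 tokens-per-word ratio are assumptions to tune against your own traffic:

```python
MAX_PROMPT_TOKENS = 1200  # hypothetical per-user limit; tune to your traffic

def approx_token_count(text: str) -> int:
    """Rough estimate: ~1.3 tokens per whitespace-separated word for English."""
    return int(len(text.split()) * 1.3)

def enforce_prompt_limit(prompt: str, limit: int = MAX_PROMPT_TOKENS) -> str:
    """Reject oversized prompts at the gateway before they hit the KV cache."""
    count = approx_token_count(prompt)
    if count > limit:
        raise ValueError(f"prompt too long: ~{count} tokens > {limit}")
    return prompt
```

Rejecting at the gateway keeps one user's 3K-token prompts from degrading p95 for everyone else.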

Test 2: simulated failure of one replica. Killed one of the two pods mid-run. KAITO restarted it within 30 seconds. During the 30-second window, the other replica handled 100% of traffic and degraded — p95 latency went to ~2s, some requests timed out. Recovery was clean once the second replica was back. Lesson: two replicas is the minimum and even that has a degradation window during incidents. For customer-facing surfaces, run three.

Test 3: cold start of the model. Killed both replicas, simulating a complete outage. Total recovery time: 12 minutes. Most of that is the model image pull + vLLM startup. KAITO doesn't currently offer a faster cold-start path. For our use case this is acceptable; for stricter SLA workloads it would be a problem.

What KAITO does well

The CRD interface. Two CRDs, sane defaults, good preset library. Compare to writing a vLLM Helm chart from scratch + handling node scheduling + handling model artifact distribution + handling fine-tune jobs separately. KAITO collapses all of that into something you can read in one screen.

Karpenter integration. Node provisioning is fast. KAITO + Karpenter together provision the right GPU SKU based on the model's resource needs. We didn't have to write any node-pool YAML.

OpenAI API compatibility. The endpoint speaks OpenAI's API format. Our existing client code (built for Azure OpenAI) worked against the KAITO endpoint with a single config change.

What KAITO doesn't do well yet

Cold-start performance. 12-minute recovery from full outage is too long for some workloads. The model image pull is the biggest contributor; KAITO could optimize this but doesn't yet.

Fine-tune observability. While the tune job is running, you get logs from the operator and the training pod. You don't get metrics like "tokens/sec during training" or "estimated time to completion." You guess from the log timestamps.

Multi-model serving. KAITO is one-model-per-Workspace. If you want to serve five model variants from one node pool with shared GPU memory (à la vLLM's multi-model serving), you're outside KAITO's happy path.

What I'd do differently

Run the full stress test before going live, not after. We did the inference deployment, declared victory, and discovered the long-prompt degraded mode in production a week later. Cost us a customer-facing incident. Should have stress-tested the system on day 1.

I would NOT use the LoRA fine-tune for the customer-facing surface without first running a full evaluation against the base model. We did this — the fine-tuned model was significantly better on our domain (87% vs 71% on the eval set) — but we did the eval AFTER deploying. Sequence matters; eval first, deploy second.

The portable lesson

KAITO turns "I want to fine-tune and serve a model on AKS" from a multi-week platform-engineering project into an afternoon. The compromise is that you're operating in KAITO's opinions, not yours. For most workloads — especially first attempts at self-hosted fine-tuned models — those opinions are sensible.

If you outgrow KAITO, you can swap to direct vLLM Helm charts and keep the rest of your platform. The lock-in is mild.

Tags: KAITO · AKS · Fine-tuning · Llama
