KAITO (Kubernetes AI Toolchain Operator) on AKS is the smoothest path I've found for "I want to fine-tune a Llama variant on my domain data and serve it as an inference endpoint." The whole pipeline — node provisioning, training, deployment, inference — fits into a few CRDs that anyone comfortable with Kubernetes can manage.
I fine-tuned a Llama 3.1 8B variant on a domain-specific dataset (about 14,000 examples), deployed the result via KAITO, and stress-tested the endpoint to find the throughput it can actually sustain in production. Here's what worked, what didn't, and what I'd do differently.
What KAITO actually does
KAITO has two main CRDs:
- Workspace: the inference deployment. You specify a model and GPU requirements, and KAITO handles node provisioning (via Karpenter), pulling the model image, deploying vLLM as the serving runtime, and exposing an endpoint.
- RAGEngine: RAG-as-a-CRD; useful, but not what I was doing here.
For fine-tuning specifically, KAITO has a tuning field on the Workspace CRD that takes a dataset, hyperparameters, and a base model. The operator runs the tune job, saves the resulting model artifacts back to a configured storage location, and (optionally) deploys the tuned model as the inference endpoint.
Total YAML for the whole pipeline: ~80 lines.
The fine-tune setup
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: domain-llama-finetuned
resource:
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: domain-llama
tuning:
  preset:
    name: llama-3-1-8b-instruct
  method: lora
  config: |
    learning_rate: 2e-4
    num_train_epochs: 3
    per_device_train_batch_size: 4
    gradient_accumulation_steps: 4
    lora_r: 16
    lora_alpha: 32
    lora_dropout: 0.05
  input:
    urls:
      - "https://[YOUR-STORAGE].blob.core.windows.net/datasets/domain-instructions.jsonl"
  output:
    image: "[YOUR-REGISTRY].azurecr.io/llama-domain-finetuned:v1"
    imagePushSecret: registry-creds
```
The dataset format is JSONL where each line is {"instruction": "...", "input": "...", "output": "..."}. We had this format from a previous fine-tune attempt; KAITO consumed it without modification.
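For shape, a single made-up line looks like this (the values are illustrative, not from our dataset):

```json
{"instruction": "Classify this support ticket by product area.", "input": "The dashboard times out whenever I export a report.", "output": "analytics"}
```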
The first run took about 4 hours on a single A100 80GB. KAITO provisioned the node, ran the tune, and pushed the resulting LoRA weights to ACR as a Docker image. At that point the model was sitting in our registry, ready to be deployed.
The inference setup
Replace tuning with inference and the same Workspace CRD becomes a deployment:
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: domain-llama-inference
resource:
  count: 2
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: domain-llama
inference:
  preset:
    name: llama-3-1-8b-instruct
    presetOptions:
      image: "[YOUR-REGISTRY].azurecr.io/llama-domain-finetuned:v1"
```
KAITO provisions two A100 nodes, deploys vLLM with the fine-tuned model on each, and exposes a Kubernetes Service. The inference endpoint speaks the OpenAI API format, which is convenient — our existing client code worked unchanged.
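To make that concrete, here's a minimal smoke test. Both the base URL and the model name are assumptions: point base_url at your Workspace's Service (or a kubectl port-forward address), and use whatever name vLLM reports at /v1/models.

```python
from openai import OpenAI

# base_url and model are placeholders, not KAITO defaults.
client = OpenAI(base_url="http://domain-llama-inference/v1", api_key="unused")

resp = client.chat.completions.create(
    model="llama-3-1-8b-instruct",
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence: ..."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```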
Time from kubectl apply to first successful inference: about 12 minutes. Most of that is node provisioning + image pull. Subsequent rolling updates take 2-3 minutes.
The stress test
I wrote a small Locust script to hammer the endpoint with realistic traffic patterns. 100 virtual users, ramp up over 60 seconds, sustain for 10 minutes, then ramp down.
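The script itself was nothing fancy. A trimmed sketch, not the exact file: the prompt corpus, model name, and host are placeholders, and the ramp/hold profile comes from the CLI flags.

```python
# locustfile.py -- run with something like:
#   locust -f locustfile.py --headless --host http://<your-endpoint> \
#          --users 100 --spawn-rate 2 --run-time 12m
import random
from locust import HttpUser, task, between

PROMPTS = [
    "Summarize this ticket: ...",   # placeholders for the real corpus
    "Classify this log line: ...",
]

class InferenceUser(HttpUser):
    wait_time = between(0.5, 2.0)  # think time between requests

    @task
    def chat_completion(self):
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "llama-3-1-8b-instruct",  # assumption: served model name
                "messages": [{"role": "user", "content": random.choice(PROMPTS)}],
                "max_tokens": 256,
            },
            name="chat_completions",  # group all requests under one stat line
        )
```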
Results from the run:
- Throughput at steady state: 47 requests/second across both replicas.
- First-token latency p50: 220ms.
- First-token latency p95: 410ms.
- Streaming throughput p50: 145 tokens/sec.
- Streaming throughput p95: 95 tokens/sec.
- GPU utilization at steady state: 78% on both nodes.
The 78% utilization at 47 RPS suggests we're near the throughput ceiling for this configuration. Adding a third replica would buy us another 50% throughput at the cost of 50% more compute.
The real test: degraded modes
Throughput in optimal conditions is only half the story. What does the system do under stress?
Test 1: prompts 4x longer than typical. Some users send 3K-token prompts where the average is 800 tokens. Throughput dropped from 47 RPS to 14 RPS, and latency p95 went from 410ms to 1.8s; KV cache pressure became the bottleneck. There's no easy fix on the serving side, so we added per-user prompt-length limits at the gateway (sketched below).
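For flavor, a sketch of that guard with hypothetical names throughout. Our real gateway is not FastAPI, the per-user bookkeeping is elided, and the token count is a character-based heuristic rather than the model's tokenizer:

```python
import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()
UPSTREAM = "http://domain-llama-inference/v1/chat/completions"  # assumption
MAX_PROMPT_TOKENS = 1600  # assumption: ~2x our observed 800-token average

def approx_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); enough to catch 3K-token prompts.
    return len(text) // 4

@app.post("/v1/chat/completions")
async def guarded_proxy(body: dict):
    prompt = " ".join(m.get("content", "") for m in body.get("messages", []))
    if approx_tokens(prompt) > MAX_PROMPT_TOKENS:
        raise HTTPException(status_code=413, detail="Prompt exceeds token budget")
    # Non-streaming only for brevity; the real gateway also proxies SSE streams.
    async with httpx.AsyncClient(timeout=120.0) as client:
        upstream = await client.post(UPSTREAM, json=body)
    return upstream.json()
```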
Test 2: simulated failure of one replica. I killed one of the two pods mid-run; Kubernetes restarted it within 30 seconds. During that window the surviving replica handled 100% of traffic and degraded: p95 latency went to ~2s and some requests timed out. Recovery was clean once the second replica was back. Lesson: two replicas is the minimum, and even that has a degradation window during incidents. For customer-facing surfaces, run three.
Test 3: cold start of the model. Killed both replicas, simulating a complete outage. Total recovery time: 12 minutes. Most of that is the model image pull + vLLM startup. KAITO doesn't currently offer a faster cold-start path. For our use case this is acceptable; for stricter SLA workloads it would be a problem.
What KAITO does well
The CRD interface. Two CRDs, sane defaults, a good preset library. Compare that to writing a vLLM Helm chart from scratch, plus separately handling node scheduling, model-artifact distribution, and fine-tune jobs. KAITO collapses all of that into something you can read on one screen.
Karpenter integration. Node provisioning is fast. KAITO + Karpenter together provision the right GPU SKU based on the model's resource needs. We didn't have to write any node-pool YAML.
OpenAI API compatibility. The endpoint speaks OpenAI's API format. Our existing client code (built for Azure OpenAI) worked against the KAITO endpoint with a single config change.
What KAITO doesn't do well yet
Cold-start performance. 12-minute recovery from full outage is too long for some workloads. The model image pull is the biggest contributor; KAITO could optimize this but doesn't yet.
Fine-tune observability. While the tune job is running, you get logs from the operator and the training pod. You don't get metrics like "tokens/sec during training" or "estimated time to completion." You guess from the log timestamps.
Multi-model serving. KAITO is one-model-per-Workspace. If you want to serve five model variants from one node pool with shared GPU memory (a la vLLM's multi-model serving), you're outside KAITO's happy path.
What I'd do differently
Run the full stress test before going live, not after. We did the inference deployment, declared victory, and discovered the long-prompt degraded mode in production a week later. Cost us a customer-facing incident. Should have stress-tested the system on day 1.
I would NOT ship a LoRA fine-tune to a customer-facing surface without first running a full evaluation against the base model. We did run that eval, and the fine-tuned model was significantly better on our domain (87% vs. 71% on our eval set), but we ran it AFTER deploying. Sequence matters: eval first, deploy second.
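If you want the minimal version of that eval loop, something like this works, assuming both models sit behind OpenAI-compatible endpoints. Exact-match scoring stands in for our real grader, and every name, path, and URL here is a placeholder:

```python
import json
from openai import OpenAI

def accuracy(base_url: str, examples: list[dict]) -> float:
    client = OpenAI(base_url=base_url, api_key="unused")
    hits = 0
    for ex in examples:
        resp = client.chat.completions.create(
            model="llama-3-1-8b-instruct",  # assumption: served model name
            messages=[{"role": "user",
                       "content": f"{ex['instruction']}\n\n{ex['input']}"}],
            max_tokens=256,
        )
        # Crude exact-match scoring; substitute your own grader.
        hits += resp.choices[0].message.content.strip() == ex["output"].strip()
    return hits / len(examples)

with open("eval-set.jsonl") as f:  # placeholder path
    examples = [json.loads(line) for line in f]

for name, url in [("base", "http://base-llama/v1"),               # placeholder
                  ("tuned", "http://domain-llama-inference/v1")]:  # placeholder
    print(f"{name}: {accuracy(url, examples):.1%}")
```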
The portable lesson
KAITO turns "I want to fine-tune and serve a model on AKS" from a multi-week platform-engineering project into an afternoon. The trade-off is that you're operating within KAITO's opinions, not your own. For most workloads, especially first attempts at self-hosted fine-tuned models, those opinions are sensible.
If you outgrow KAITO, you can swap to direct vLLM Helm charts and keep the rest of your platform. The lock-in is mild.
