No. 17 · Azure AI · Oct 25, 2025 · 10 min read

Fine-Tuning a Llama Variant With KAITO on AKS, Then Stress-Testing the Inference Endpoint

KAITO (Kubernetes AI Toolchain Operator) on AKS is the smoothest path I've found for "I want to fine-tune a Llama variant on my domain data and serve it as an inference endpoint." The whole pipeline — node provisioning, training, deployment, inference — fits into a few CRDs that anyone comfortable with Kubernetes can manage.

I fine-tuned a Llama 3.1 8B variant on a domain-specific dataset (about 14,000 examples), deployed the result via KAITO, and stress-tested the endpoint to find its actual production-ready throughput. Here's what worked, what didn't, and what I'd do differently.

What KAITO actually does

KAITO has two main CRDs:

  • Workspace — the inference deployment. Specify a model, GPU requirements, and KAITO handles node provisioning (via Karpenter), pulling the model image, deploying vLLM as the serving runtime, and exposing an endpoint.
  • RAGEngine — RAG-as-a-CRD; useful but not what I was doing here.

For fine-tuning specifically, KAITO has a tuning field on the Workspace CRD that takes a dataset, hyperparameters, and a base model. The operator runs the tune job, saves the resulting model artifacts back to a configured storage location, and (optionally) deploys the tuned model as the inference endpoint.

Total YAML for the whole pipeline: ~80 lines.

The fine-tune setup

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: domain-llama-finetuned
spec:
  resource:
    instanceType: Standard_NC24ads_A100_v4
    labelSelector:
      matchLabels:
        apps: domain-llama
  tuning:
    preset:
      name: llama-3-1-8b-instruct
    method: lora
    config: |
      learning_rate: 2e-4
      num_train_epochs: 3
      per_device_train_batch_size: 4
      gradient_accumulation_steps: 4
      lora_r: 16
      lora_alpha: 32
      lora_dropout: 0.05
    input:
      urls:
        - "https://[YOUR-STORAGE].blob.core.windows.net/datasets/domain-instructions.jsonl"
    output:
      image: "[YOUR-REGISTRY].azurecr.io/llama-domain-finetuned:v1"
      imagePushSecret: registry-creds

The dataset format is JSONL where each line is {"instruction": "...", "input": "...", "output": "..."}. We had this format from a previous fine-tune attempt; KAITO consumed it without modification.
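Before uploading a dataset like this to blob storage, it's cheap to validate that every line parses and carries the three expected fields. A minimal sketch (the field names come from the format above; the sample record is invented for illustration):

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl_line(line: str) -> dict:
    """Parse one dataset line and check it has the expected fields."""
    record = json.loads(line)  # raises on malformed JSON
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return record

# A made-up example record in the format KAITO consumed:
sample = '{"instruction": "Summarize the ticket.", "input": "Customer reports login failures.", "output": "Customer cannot log in."}'
record = validate_jsonl_line(sample)
```

Running this over the full file before kicking off a 4-hour tune job is cheaper than finding a malformed line 2 hours in.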

The first run took about 4 hours on a single A100 80GB. KAITO provisioned the node, ran the tune, and pushed the resulting LoRA weights to ACR as a Docker image. At that point the model is sitting in our registry, ready to be deployed.

The inference setup

Replace tuning with inference and the same Workspace CRD becomes a deployment:

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: domain-llama-inference
spec:
  resource:
    instanceCount: 2
    instanceType: Standard_NC24ads_A100_v4
    labelSelector:
      matchLabels:
        apps: domain-llama
  inference:
    preset:
      name: llama-3-1-8b-instruct
      presetOptions:
        image: "[YOUR-REGISTRY].azurecr.io/llama-domain-finetuned:v1"

KAITO provisions two A100 nodes, deploys vLLM with the fine-tuned model on each, and exposes a Kubernetes Service. The inference endpoint speaks the OpenAI API format, which is convenient — our existing client code worked unchanged.

Time from kubectl apply to first successful inference: about 12 minutes. Most of that is node provisioning + image pull. Subsequent rolling updates take 2-3 minutes.
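Because the endpoint speaks the OpenAI API format, calling it needs nothing beyond a standard chat-completions request. A stdlib-only sketch — the service hostname is hypothetical (yours depends on the Workspace name and namespace), and the model name assumes the preset above:

```python
import json
import urllib.request

# Hypothetical in-cluster service address; substitute your Workspace's Service.
ENDPOINT = "http://domain-llama-inference.default.svc.cluster.local/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "llama-3-1-8b-instruct") -> bytes:
    """Build an OpenAI-format chat completion payload."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": False,
    }
    return json.dumps(payload).encode("utf-8")

def chat(prompt: str) -> str:
    """Send one request to the KAITO endpoint and return the completion text."""
    req = urllib.request.Request(
        ENDPOINT,
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Client code written for any OpenAI-compatible server (including the Azure OpenAI SDK pointed at a custom base URL) works the same way.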

The stress test

I wrote a small Locust script to hammer the endpoint with realistic traffic patterns. 100 virtual users, ramp up over 60 seconds, sustain for 10 minutes, then ramp down.
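The actual Locust script isn't reproduced here, but a minimal stand-in with the same shape — concurrent users, latency collection, nearest-rank percentiles — looks like this. The `send_request` callable is whatever fires one inference request against your endpoint:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=95 for p95 latency."""
    ranked = sorted(samples)
    idx = max(0, int(round(p / 100 * len(ranked))) - 1)
    return ranked[idx]

def run_load(send_request, users=100, requests_per_user=60):
    """Fire requests from `users` concurrent workers, collecting latencies."""
    latencies = []
    def worker():
        for _ in range(requests_per_user):
            start = time.perf_counter()
            send_request()
            latencies.append(time.perf_counter() - start)  # list.append is thread-safe in CPython
    with ThreadPoolExecutor(max_workers=users) as pool:
        for _ in range(users):
            pool.submit(worker)
    return latencies
```

Locust adds the ramp-up/ramp-down schedule and a live dashboard on top of this; the measurement logic is the same.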

Results from the run:

  • Throughput at steady state: 47 requests/second across both replicas.
  • First-token latency p50: 220ms.
  • First-token latency p95: 410ms.
  • Streaming tokens/second p50: 145 tokens/sec.
  • Streaming tokens/second p95: 95 tokens/sec.
  • GPU utilization at steady state: 78% on both nodes.

The 78% utilization at 47 RPS suggests we're near the throughput ceiling for this configuration. Adding a third replica would buy us another 50% throughput at the cost of 50% more compute.
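The capacity math behind that claim, spelled out (assuming near-linear scaling, which is reasonable here since each replica serves independently):

```python
STEADY_STATE_RPS = 47   # measured across two replicas
REPLICAS = 2

per_replica_rps = STEADY_STATE_RPS / REPLICAS  # 23.5 RPS each
projected_three = per_replica_rps * 3          # ~70.5 RPS with a third replica

# One more replica: +50% throughput, +50% compute cost.
```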

The real test: degraded modes

Throughput in optimal conditions is only half the story. What does the system do under stress?

Test 1: prompts 4x longer than typical. Some users send 3K-token prompts where the average is 800 tokens. Throughput dropped from 47 RPS to 14 RPS, and latency p95 went from 410ms to 1.8s. KV cache pressure became the bottleneck. There's no easy fix at the serving layer, so we added per-user prompt-length limits at the gateway instead.
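A gateway-side limit doesn't need the model's exact tokenizer; a rough word-based estimate is enough to catch the outliers. A sketch — the limit value and the 1.3 tokens-per-word ratio are assumptions to tune against your own traffic:

```python
MAX_PROMPT_TOKENS = 1200  # hypothetical per-user limit; tune to your traffic

def approx_token_count(text: str) -> int:
    """Rough estimate: ~1.3 tokens per whitespace-separated word for English."""
    return int(len(text.split()) * 1.3)

def enforce_prompt_limit(prompt: str, limit: int = MAX_PROMPT_TOKENS) -> str:
    """Reject oversized prompts at the gateway before they hit the KV cache."""
    count = approx_token_count(prompt)
    if count > limit:
        raise ValueError(f"prompt too long: ~{count} tokens > {limit}")
    return prompt
```

Rejecting at the gateway keeps one user's 3K-token prompts from degrading p95 for everyone else.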

Test 2: simulated failure of one replica. Killed one of the two pods mid-run. KAITO restarted it within 30 seconds. During the 30-second window, the other replica handled 100% of traffic and degraded — p95 latency went to ~2s, some requests timed out. Recovery was clean once the second replica was back. Lesson: two replicas is the minimum and even that has a degradation window during incidents. For customer-facing surfaces, run three.

Test 3: cold start of the model. Killed both replicas, simulating a complete outage. Total recovery time: 12 minutes. Most of that is the model image pull + vLLM startup. KAITO doesn't currently offer a faster cold-start path. For our use case this is acceptable; for stricter SLA workloads it would be a problem.

What KAITO does well

The CRD interface. Two CRDs, sane defaults, good preset library. Compare to writing a vLLM Helm chart from scratch + handling node scheduling + handling model artifact distribution + handling fine-tune jobs separately. KAITO collapses all of that into something you can read in one screen.

Karpenter integration. Node provisioning is fast. KAITO + Karpenter together provision the right GPU SKU based on the model's resource needs. We didn't have to write any node-pool YAML.

OpenAI API compatibility. The endpoint speaks OpenAI's API format. Our existing client code (built for Azure OpenAI) worked against the KAITO endpoint with a single config change.

What KAITO doesn't do well yet

Cold-start performance. 12-minute recovery from full outage is too long for some workloads. The model image pull is the biggest contributor; KAITO could optimize this but doesn't yet.

Fine-tune observability. While the tune job is running, you get logs from the operator and the training pod. You don't get metrics like "tokens/sec during training" or "estimated time to completion." You guess from the log timestamps.

Multi-model serving. KAITO is one-model-per-Workspace. If you want to serve five model variants from one node pool with shared GPU memory (à la vLLM's multi-model serving), you're outside KAITO's happy path.

What I'd do differently

Run the full stress test before going live, not after. We did the inference deployment, declared victory, and discovered the long-prompt degraded mode in production a week later. Cost us a customer-facing incident. Should have stress-tested the system on day 1.

I would NOT use the LoRA fine-tune for the customer-facing surface without first running a full evaluation against the base model. We did this — the fine-tuned model was significantly better on our domain (87% vs 71% on the eval set) — but we did the eval AFTER deploying. Sequence matters; eval first, deploy second.

The portable lesson

KAITO turns "I want to fine-tune and serve a model on AKS" from a multi-week platform-engineering project into an afternoon. The compromise is that you're operating in KAITO's opinions, not yours. For most workloads — especially first attempts at self-hosted fine-tuned models — those opinions are sensible.

If you outgrow KAITO, you can swap to direct vLLM Helm charts and keep the rest of your platform. The lock-in is mild.

Tags: KAITO · AKS · Fine-tuning · Llama
