Notes from the Azure trenches.
A Microsoft MVP's field reports on Azure, DevOps, and the Azure AI stack — including Foundry. Real patterns from real production work, not slideware.
Featured field reports
All articles →

Recent dispatches
Air-Gapped Azure OpenAI With Private Endpoints: A Terraform Module That Actually Works
"Air-gapped" is a strong word for something running in a public cloud, but it's the right word for what regulated customers want: an Azure OpenAI deployment whose only network path is through their own VNet, with public access ful…
Killing API Keys in Azure OpenAI: A Managed Identity + Entra ID Migration Done Properly
Six months ago we had 14 services calling Azure OpenAI. All of them used API keys. The keys were stored in Azure Key Vault, fetched at startup, and rotated quarterly via a manual process that "everyone knew was fragile."
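For a preview of the end state, the keyless call shape with azure-identity and the openai package looks roughly like this; the endpoint, deployment name, and API version are placeholders, and the caller needs an RBAC role such as Cognitive Services OpenAI User:

```python
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Entra ID token provider scoped to Cognitive Services; in Azure this
# resolves to the service's managed identity, locally to your az login.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

client = AzureOpenAI(
    azure_endpoint="https://my-aoai-account.openai.azure.com",  # placeholder
    azure_ad_token_provider=token_provider,  # no api_key anywhere
    api_version="2024-06-01",  # placeholder
)

response = client.chat.completions.create(
    model="gpt-4o",  # the deployment name, not the model family
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```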
Five Gotchas When Wiring Azure DevOps MCP Server Into VS Code Copilot
The Azure DevOps MCP Server's setup docs make it look like a five-minute task. It is, if everything goes right. Most teams hit one or more of these five issues, lose an afternoon, and conclude the tool is "buggy" when really it's …
Edge RAG on Azure Arc From a Real Site Deployment: Latency, Hardware, Cost
For most workloads, "RAG in the cloud" is the right shape. For some workloads — regulated industries, manufacturing floors, retail stores, remote sites with weak connectivity — the data legally or practically can't leave the site.…
Per-Tenant Cost Attribution for Azure OpenAI Traffic Using APIM emit-token-metric-policy
Six months ago I couldn't tell you which tenant of our SaaS application was costing us the most in Azure OpenAI tokens. The number on the bill was real but unattributable.
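Once the policy is emitting metrics into Application Insights, the per-tenant rollup is one KQL query away. A sketch with azure-monitor-query; the workspace ID, the "Total Tokens" metric name, and the tenant_id dimension are assumptions that must match what your policy actually emits:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Placeholder Log Analytics workspace backing Application Insights.
WORKSPACE_ID = "00000000-0000-0000-0000-000000000000"

# Dimension name must match the policy configuration (here "tenant_id").
QUERY = """
customMetrics
| where name == "Total Tokens"
| extend tenant = tostring(customDimensions["tenant_id"])
| summarize total_tokens = sum(valueSum) by tenant
| order by total_tokens desc
"""

client = LogsQueryClient(DefaultAzureCredential())
result = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=30))

for table in result.tables:
    for row in table.rows:
        print(row)
```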
Time-Slicing vs MIG for Bursty LLM Inference Traffic on AKS GPU Node Pools
NVIDIA gives you two ways to share a single GPU across multiple workloads on Kubernetes: time-slicing and MIG (Multi-Instance GPU). The first is software-based and flexible. The second is hardware-partitioned and rigid.
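A quick way to see which mode a node pool is actually in (not from the post; assumes the kubernetes Python client and a working kubeconfig): time-slicing inflates the advertised nvidia.com/gpu count, while MIG in the mixed strategy advertises nvidia.com/mig-* resources instead.

```python
from kubernetes import client, config

# Assumes kubectl access to the AKS cluster.
config.load_kube_config()

for node in client.CoreV1Api().list_node().items:
    gpu_resources = {
        name: qty
        for name, qty in (node.status.allocatable or {}).items()
        if name.startswith("nvidia.com/")
    }
    if gpu_resources:
        # Time-slicing: nvidia.com/gpu count exceeds the physical GPUs.
        # MIG (mixed strategy): names like nvidia.com/mig-1g.10gb appear.
        print(node.metadata.name, gpu_resources)
```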
Fine-Tuning a Llama Variant With KAITO on AKS, Then Stress-Testing the Inference Endpoint
KAITO (Kubernetes AI Toolchain Operator) on AKS is the smoothest path I've found for "I want to fine-tune a Llama variant on my domain data and serve it as an inference endpoint." The whole pipeline — node provisioning, training, …
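For the stress-testing half, a stdlib-only sketch against an OpenAI-compatible completions endpoint; the URL, model name, and payload shape are assumptions (KAITO's inference presets typically expose a vLLM-style API, port-forwarded here to localhost):

```python
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholder for the workspace's in-cluster service, port-forwarded locally.
URL = "http://localhost:8080/v1/completions"
BODY = b'{"model": "llama", "prompt": "ping", "max_tokens": 16}'

def one_request(_):
    req = urllib.request.Request(
        URL, data=BODY, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=60) as resp:
        resp.read()
    return time.perf_counter() - start

# 8 concurrent workers, 100 requests total; crude but enough to surface
# queueing behavior before reaching for a real load-testing tool.
with ThreadPoolExecutor(max_workers=8) as pool:
    latencies = sorted(pool.map(one_request, range(100)))

print(f"p50 {statistics.median(latencies) * 1000:.0f} ms")
print(f"p95 {latencies[int(len(latencies) * 0.95)] * 1000:.0f} ms")
```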
A new issue lands when the work merits one.
No newsletter spam, no AI-generated filler. Just postmortems, patterns, and the occasional rant about private endpoints.