For most workloads, "RAG in the cloud" is the right shape. For some workloads — regulated industries, manufacturing floors, retail stores, remote sites with weak connectivity — the data legally or practically can't leave the site. Edge RAG with Azure Arc-enabled Kubernetes is one of the few production-ready answers.
We deployed Edge RAG to a single industrial site as a pilot four months ago. This is what the hardware looks like, what the latency profile actually is, and what it costs compared to keeping the same workload in the cloud.
The use case
A manufacturing facility has 30 years of accumulated process documentation, incident reports, equipment manuals, and tribal knowledge in scanned PDFs and SharePoint pages. The site's senior engineers retire faster than they're replaced; their replacements need access to that knowledge in a format they can query during shifts.
Cloud RAG was off the table for two reasons:
- Some of the docs contain process IP that the company won't allow to leave the site.
- The site's internet connection is a redundant pair of 50 Mbps links, which is fine for normal traffic but not for "every shift engineer asking the AI a question every 10 minutes" loads.
So: edge deployment. Run everything inside the site's network. Sync only telemetry and aggregate metrics back to the cloud.
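The sync side is deliberately boring: store locally, forward when the WAN cooperates. A minimal sketch of that shape (the endpoint and payload format here are placeholders for illustration, not our actual pipeline):

```python
import collections
import time

import requests

# Hypothetical cloud ingestion endpoint; the real one is whatever your
# telemetry pipeline exposes (Azure Monitor, an Event Hub gateway, etc.).
CLOUD_ENDPOINT = "https://telemetry.example.com/ingest"

# Bounded buffer: during a long WAN outage we drop the oldest metrics
# rather than grow without limit.
buffer = collections.deque(maxlen=10_000)

def record(metric: str, value: float) -> None:
    """Queue an aggregate metric locally; nothing leaves the site but numbers."""
    buffer.append({"metric": metric, "value": value, "ts": time.time()})

def flush() -> None:
    """Push buffered metrics when the WAN is up; keep them if it isn't."""
    while buffer:
        item = buffer[0]
        try:
            requests.post(CLOUD_ENDPOINT, json=item, timeout=5).raise_for_status()
            buffer.popleft()
        except requests.RequestException:
            return  # WAN down or flaky; retry on the next flush cycle
```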
The hardware
We landed on a 3-node hyperconverged setup, each node:
- Dell PowerEdge R760xa
- 2x Intel Xeon Gold 6448Y (32 cores each, 64 total)
- 256GB RAM
- 4x Samsung PM1743 NVMe (15TB usable per node, mirrored)
- 2x NVIDIA L40S GPU (48GB VRAM each)
- Dual 25GbE NICs
Total capital cost: ~$140K including networking, racking, and the on-site UPS. The L40S choice was deliberate — we needed enough VRAM to run a reasonably-sized model with good throughput, but didn't need A100/H100 budgets for a workload that peaks at maybe 30 RPS during shift change.
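The VRAM arithmetic is the calculation that drove the GPU choice, so it's worth showing. A back-of-envelope sketch for the model we ended up serving (Llama 3.1 70B, details in the next section); the FP8 weight assumption is ours for illustration, since a 70B model at FP16 wouldn't fit on two L40S in any case:

```python
# Back-of-envelope VRAM sizing for serving a dense LLM. Numbers are
# approximations; the FP8 quantization level is our assumption, not a
# documented detail of the deployment.
params_b = 70            # Llama 3.1 70B
bytes_per_param = 1.0    # FP8 quantized weights (FP16 would be 2.0)
weights_gb = params_b * bytes_per_param          # ~70 GB

gpu_vram_gb = 48         # NVIDIA L40S
tensor_parallel = 2      # two L40S per node
total_vram_gb = gpu_vram_gb * tensor_parallel    # 96 GB

kv_headroom_gb = total_vram_gb - weights_gb      # ~26 GB left for KV cache
print(f"weights: {weights_gb:.0f} GB, KV-cache headroom: {kv_headroom_gb:.0f} GB")

# At FP16 the weights alone (~140 GB) would exceed the 96 GB available,
# forcing quantization or more GPUs. Run this check before the purchase order.
```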
The three nodes form a Microsoft AKS Edge Essentials cluster, Azure Arc-attached for management. From the operator's perspective in the cloud control plane, it looks like any other AKS cluster. From the site's perspective, it's three boxes in a rack that survive a WAN outage.
The software stack
Built on Microsoft's Edge RAG package (which runs on Arc-enabled Kubernetes):
- Ingestion: Azure AI Document Intelligence running locally for OCR. Outputs go into a local Azure AI Search index (the on-prem version). On-prem Search has roughly half the feature set of cloud Search, but the basics are there.
- Vector store: same local Azure AI Search index, vector-enabled.
- Generation: Llama 3.1 70B running on the L40S GPUs via vLLM. We considered smaller models for the latency win; the 70B's quality won out for technical Q&A. (A minimal sketch of the query path follows this list.)
- Frontend: a stripped-down version of our cloud chat interface, deployed to the cluster, accessed via the site's intranet.
- Sync to cloud: aggregate metrics, no content. Telemetry only.
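To make the stack concrete, here's a minimal sketch of the query path: retrieve from the local index, then generate against the local vLLM endpoint (which speaks the OpenAI-compatible API). Endpoints, index and field names, the model identifier, and the embedding hook are placeholders, not our production code:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import OpenAI

# All names below are illustrative placeholders.
search = SearchClient(
    endpoint="https://search.plant.local",      # on-prem AI Search
    index_name="plant-docs",
    credential=AzureKeyCredential("<local-key>"),
)
llm = OpenAI(base_url="http://vllm.plant.local:8000/v1", api_key="unused")

def answer(question: str, embed) -> str:
    # `embed` is whatever local embedding model the index was built with.
    hits = search.search(
        search_text=question,
        vector_queries=[VectorizedQuery(
            vector=embed(question),
            k_nearest_neighbors=5,
            fields="contentVector",             # assumed vector field name
        )],
        top=5,
    )
    context = "\n\n".join(h["content"] for h in hits)
    resp = llm.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[
            {"role": "system",
             "content": "Answer from the plant documentation below.\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```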
Total deploy time on a fresh rack: about 4 days, most of which was waiting on the Azure AI Document Intelligence on-prem container to ingest the existing 12,000 documents.
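The ingestion loop itself is unremarkable. A sketch of the OCR step, assuming the on-prem container and placeholder endpoints; the container speaks the same REST API as the cloud service, so the standard SDK works against a local URL:

```python
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Endpoint and key are placeholders for the locally hosted container.
ocr = DocumentAnalysisClient(
    endpoint="http://doc-intel.plant.local:5000",
    credential=AzureKeyCredential("<container-key>"),
)

def extract_text(path: str) -> str:
    """OCR one document through the local container; returns extracted text."""
    with open(path, "rb") as f:
        poller = ocr.begin_analyze_document("prebuilt-read", document=f)
    return poller.result().content  # full text, layout-ordered
```

The extracted text then gets chunked, embedded, and uploaded to the local index; 12,000 documents through this path is what ate most of the 4 days.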
The latency profile
This is the part that shocked us in the right direction.
Cloud RAG (the system before edge):
- Median time-to-first-token: 1.8 seconds
- p95: 3.4 seconds
- The dominant contributor: the round trip from site to cloud, plus the cloud's inference latency.
Edge RAG:
- Median time-to-first-token: 380ms
- p95: 720ms
- Dominant contributor: vector search in local AI Search, then the L40S GPU's inference time.
The roughly 5x reduction comes from cutting out the round trip entirely. For a shift engineer querying the system at the start of a shift while machines are being prepped, that's the difference between "useful" and "I'll just ask a colleague."
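If you want to reproduce the measurement, a minimal time-to-first-token probe against a vLLM OpenAI-compatible endpoint looks like this (URL and model name are placeholders; a real harness would replay representative queries and aggregate percentiles):

```python
import time

from openai import OpenAI

llm = OpenAI(base_url="http://vllm.plant.local:8000/v1", api_key="unused")

def time_to_first_token(prompt: str) -> float:
    """Seconds from request start to the first streamed content token."""
    start = time.perf_counter()
    stream = llm.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")  # stream ended without a content token
```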
User satisfaction surveys (n=85 over the first three months) ranked the edge system "much better" than the cloud version they'd had access to during evaluation. The latency was the most-cited reason.
The cost comparison over a year
Cloud RAG (counterfactual):
- Azure OpenAI (gpt-4o) usage at projected volumes: ~$8K/year
- Azure AI Search basic tier: ~$3K/year
- Azure AI Document Intelligence at projected document refresh rates: ~$2K/year
- Network egress for content viewing on shift floor: ~$1K/year
- Total cloud annual operating cost: ~$14K
Edge RAG:
- Hardware: $140K capital, amortized over 5 years = $28K/year
- Power and cooling at site: ~$3K/year (3 nodes at ~600W average each, roughly doubled for cooling overhead, at $0.10/kWh)
- Microsoft Edge RAG license: ~$8K/year
- Site engineer time for ops: estimated 0.1 FTE = ~$15K/year
- Internet bandwidth for telemetry sync: negligible
- Total edge annual operating cost: ~$54K
Edge RAG is roughly 4x more expensive at our scale. The justification is not cost — it's data residency and latency. The customer paid the premium for those properties.
If we were operating 10 such sites, the per-site economics would shift: shared engineer time, bulk hardware purchasing, and shared license terms could bring the per-site cost down to maybe $35K/year. Still more expensive than cloud, but the gap narrows.
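Here's the arithmetic behind both totals, as something you can rerun with your own numbers. The multi-site knobs are our guesses at what shared engineer time, bulk purchasing, and shared license terms could look like; one plausible set lands near the $35K estimate:

```python
# Annual edge cost model matching the table above. Multi-site discount
# values are assumptions for illustration, not negotiated figures.
def edge_annual_cost(hw_discount=0.0, license_per_site=8_000,
                     ops_fte=0.10, fte_cost=150_000) -> float:
    hardware = 140_000 * (1 - hw_discount) / 5   # 5-year straight-line amortization
    power = 3_000                                # power and cooling
    return hardware + power + license_per_site + ops_fte * fte_cost

print(edge_annual_cost())              # ~54,000/year, matching the table
print(edge_annual_cost(
    hw_discount=0.15,                  # assumed bulk hardware discount
    license_per_site=4_000,            # assumed shared license terms
    ops_fte=0.03,                      # assumed shared on-call across sites
))                                     # ~35,300/year at multi-site scale
```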
What broke during the pilot
The Document Intelligence on-prem container had unexpected memory behavior. It allocates aggressively. Our initial sizing put it on a node with other workloads, and it OOMed three times in the first week. Moving it to a dedicated node solved the problem. The cloud version is managed and we'd never seen this; on-prem, you manage the resource envelope yourself.
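The shape of the fix, sketched with the Kubernetes Python client. Deployment name, namespace, node label, and memory numbers are placeholders for our deploy, not values from the Edge RAG package:

```python
from kubernetes import client, config

# Pin the Document Intelligence workload to its own node and give it an
# explicit memory envelope, so a bad allocation hits its own cap instead
# of taking neighbors down with a node-level OOM.
config.load_kube_config()
apps = client.AppsV1Api()

patch = {"spec": {"template": {"spec": {
    "nodeSelector": {"workload": "doc-intel"},   # label the dedicated node first
    "containers": [{
        "name": "document-intelligence",         # must match the actual container name
        "resources": {
            "requests": {"memory": "48Gi"},
            "limits":   {"memory": "64Gi"},
        },
    }],
}}}}

apps.patch_namespaced_deployment("doc-intel", "edge-rag", patch)
```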
The L40S thermal throttling. The original rack layout put both GPU-heavy nodes adjacent. They throttled under load because the front-of-rack air intake couldn't keep up. Re-racked them with cooling-friendly spacing. Zero issues since.
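What we run now is a dumb temperature watchdog that would have caught this on day one. A sketch; the 85°C threshold is our margin pick, not an NVIDIA figure, and the alert hook is yours to wire up:

```python
import time

import pynvml  # NVIDIA NVML bindings (pip install nvidia-ml-py)

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

while True:
    for i, h in enumerate(handles):
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        if temp >= 85:  # alert with margin, before throttling kicks in
            print(f"GPU {i}: {temp}C -- check rack airflow")  # hook alerting here
    time.sleep(30)
```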
WAN outage during a deploy. A two-day site WAN outage hit during a planned rolling update of the chat frontend. The Arc agent paused reconciliation gracefully (which is what you want), but the cluster's existing services kept running uninterrupted. When the WAN came back, the agent caught up. Total downtime for the user-facing service: zero. This was a happy outcome — we'd been worried about WAN-failure scenarios.
What's hard about Edge RAG that isn't hard about Cloud RAG
Operator skill availability. Cloud RAG operations is a specialized but common skill. Edge AKS operations on Arc-attached clusters with on-prem container services is rarer. Our site has one trained operator and one cloud-based backup. Both can manage the system, but the recruiting pipeline for new edge operators is thin.
Patch management. Cloud services patch themselves (mostly). On-prem you patch them. We have a quarterly patch window for the cluster components. The on-prem AI services have their own patch cadence we have to track.
Hardware lifecycle. GPUs die. Disks die. Network cards die. In the cloud this is invisible. At the edge, you have an SLA with the hardware vendor and you better hope it covers your scenario. We've replaced one disk in four months, which is roughly what the failure-rate calculations predicted. No GPU failures yet.
What it's worth
For the customer's specific use case (regulated content, latency-sensitive shift work, capable site team), Edge RAG was the right answer despite the cost premium. The pilot is being expanded to two additional sites this year; the per-site cost will come down via shared infrastructure and cumulative learning.
If you're considering Edge RAG, ask yourself:
- Is your data legally restricted from leaving the site?
- Is your latency requirement tighter than cloud round-trip + inference can deliver?
- Do you have site-local operations capability or budget to develop it?
- Can you justify ~3-5x cloud cost for the privacy/latency benefits?
If yes to all four, Edge RAG on Arc is the most production-ready Microsoft option I've seen. If no to any, stay in the cloud.
What I'd do differently
Run the latency benchmark before committing to the hardware spec. We over-provisioned the GPUs slightly because we didn't know what latency the L40S would deliver; the L40S turned out to be very capable for our model size, and a smaller GPU with similar VRAM would have worked. About $20K of hardware we didn't need to buy.
I would NOT skip the WAN-outage drill. We did one in the first month. It surfaced two issues with our monitoring (we couldn't see edge metrics in the cloud while WAN was down — fixed by adding local Grafana) and one issue with our incident process (the on-call escalation path assumed Slack was reachable from the site — fixed). These were free lessons during a planned drill that would have been expensive lessons during a real outage.