No. 14Azure AIOct 4, 202510 min read

Cost-Per-Query: Azure Agentic Retrieval in Foundry vs Hand-Rolled RAG at 100K Queries/Month

We benchmarked Azure AI Foundry's Agentic Retrieval against our hand-rolled RAG pipeline on the same workload, the same corpus, and the same evaluation set.

By Damilola Onadeinde

Microsoft MVP — DevOps

Share on X LinkedIn

We benchmarked Azure AI Foundry's Agentic Retrieval against our hand-rolled RAG pipeline on the same workload, the same corpus, and the same evaluation set. At 100K queries per month, the agentic option costs about 2.4x more per query but trades that cost for engineering time and operational complexity.

This is the cost breakdown, the quality comparison, and the decision framework for choosing between them.

The two systems

Hand-rolled. Azure AI Search vector index, custom retrieval logic (vector + keyword + reranking), Azure OpenAI gpt-4o for generation, custom prompt template, custom evaluation harness. About 3,200 lines of Python. Two engineers spent four months building it.

Foundry Agentic Retrieval. Azure AI Foundry agent with the AI Search Retrieval Agent tool. Same vector index. Configuration via Foundry portal + an SDK call to the agent. About 80 lines of Python.

Both connected to the same Azure AI Search index (24,000 documents, ~180,000 chunks).

Cost breakdown per 1,000 queries

Numbers below are from a 30-day production run, normalized to 1,000 queries:

Cost component	Hand-rolled	Foundry Agentic
Azure AI Search query units	$1.20	$1.20
Embedding for query	$0.40	$0.40
Generation tokens (gpt-4o)	$2.80	$4.50
Agent orchestration tokens	$0	$1.80
Azure AI Foundry tool-call overhead	$0	$0.60
Total per 1,000 queries	$4.40	$8.50
Annual cost at 100K/month	~$5,300	~$10,200

The agent path costs more in three places: more generation tokens (the agent thinks before it answers, and the thinking happens through gpt-4o), orchestration tokens (the framework's own context), and per-tool-call platform overhead.

Quality comparison

Same eval harness as the hand-rolled system (380 SME-written questions, scored on Groundedness, Relevance, and Coherence):

Metric	Hand-rolled	Foundry Agentic
Groundedness	0.78	0.81
Relevance	0.84	0.88
Coherence	0.91	0.92
p50 latency	2.1s	4.6s
p95 latency	4.3s	8.9s

Foundry wins on quality (modestly), loses on latency (notably), costs more.

The latency difference is structural: the agent makes multiple tool calls per query (sometimes querying the index multiple times with refined queries, sometimes asking clarifying questions in its own internal reasoning). That improves quality but doubles the wall-clock time.

Where Foundry wins beyond the numbers

Engineering time. The 80 lines of Python vs 3,200 lines is real. We could have built the agentic system in two engineer-days. We had four months in the hand-rolled one. If you're starting fresh and not committed to a hand-rolled investment, the time savings are massive.

Built-in observability. Foundry comes with traces, evals, and a UI for debugging individual queries. We had to build all of that ourselves on the hand-rolled side. The hand-rolled version's observability is good now but cost us roughly two engineer-weeks.

Built-in safety filters. Content filters, prompt-injection guardrails, hallucination detection — all configurable in the Foundry agent. We built equivalent functionality in the hand-rolled system but it's our problem to maintain.

Easier handoff. When the team rotates, a Foundry agent is simpler to onboard onto than 3,200 lines of bespoke retrieval logic. The platform team's preference is increasingly for things that are configuration, not code.

Where the hand-rolled system wins

Cost at scale. At 100K queries/month, Foundry costs $5K/year more. At 1M queries/month, it's $50K/year more. The cost crossover where engineering time savings stop justifying the per-query cost depends on your engineering rates and your scale.

Latency under SLA pressure. If you have a 3-second SLA, the hand-rolled p50 of 2.1s fits and Foundry's 4.6s does not. We have one customer-facing surface where this matters; it stays on the hand-rolled system.

Specific tuning levers. Want to change the rerank weights? Add a custom filter? Inject a per-tenant prompt? Hand-rolled lets you do that in code. Foundry lets you do most of it via configuration but not all of it. We've hit a couple of customizations that aren't possible in Foundry yet.

Cost predictability. Hand-rolled has predictable per-query cost. Foundry's agentic flow can take more or fewer tool calls per query — average is good, but the variance is real and shows up as occasional expensive queries.

The decision framework

Pick Foundry Agentic if:

You're starting from scratch (no existing investment)
Your query volume is < 250K/month
You don't have a tight latency SLA
Your engineering team is small and would rather configure than code
You value built-in observability and safety more than custom tuning

Pick hand-rolled if:

You already have an investment in custom retrieval
Your query volume is > 500K/month
You have a tight latency SLA (<3s)
You need specific tuning levers Foundry doesn't expose
You have engineering capacity for a custom system long-term

The middle case (250K-500K/month, no extreme SLA) is genuinely a judgment call. We made the call based on existing investment plus latency.

The hybrid we ended up with

We didn't pick one. Three of our six RAG-using applications run on Foundry Agentic (low volume, generous latency budget, internal-only). The other three run on the hand-rolled stack (high volume, customer-facing, custom tuning).

The hand-rolled team treats Foundry as a "preview of where we'll eventually be." The Foundry team treats hand-rolled as "the legacy fast path." Both attitudes are accurate.

When the Foundry latency improves (it's already gotten 30% better in the six months we've been on it), the case for hand-rolled gets weaker. We expect to migrate at least one of the high-volume surfaces in the next year if the latency continues to improve.

What I'd do differently

Run a 30-day side-by-side benchmark before committing to either path. We didn't do this; we picked hand-rolled first because Foundry didn't exist when we started, then added Foundry later for new workloads. A clean shootout would have given us more confidence faster.

I would NOT pick "agentic for everything" just because it's the new shape. The cost and latency penalties are real, and for some workloads they don't justify themselves.

RAGCostAgentic Retrieval

Conversation

Reactions & comments

Liked this? Tap a reaction. Want to push back, share a war story, or ask a follow-up? Drop a comment below — replies are threaded and markdown works.

Loading conversation…

More from Azure AI

See all →

Azure AINov 29, 2025

Killing API Keys in Azure OpenAI: A Managed Identity + Entra ID Migration Done Properly

Six months ago we had 14 services calling Azure OpenAI. All of them used API keys. The keys were stored in Azure Key Vault, fetched at startup, and rotated quarterly via a manual process that "everyone knew was fragile."

9 min read

Azure AINov 8, 2025

Per-Tenant Cost Attribution for Azure OpenAI Traffic Using APIM emit-token-metric-policy

Six months ago I couldn't tell you which tenant of our SaaS application was costing us the most in Azure OpenAI tokens. The number on the bill was real but unattributable.

9 min read

Azure AIOct 25, 2025

Fine-Tuning a Llama Variant With KAITO on AKS, Then Stress-Testing the Inference Endpoint

KAITO (Kubernetes AI Toolchain Operator) on AKS is the smoothest path I've found for "I want to fine-tune a Llama variant on my domain data and serve it as an inference endpoint." The whole pipeline — node provisioning, training, …

10 min read