No. 12 · Azure AI · Sep 20, 2025 · 9 min read

Building a Groundedness Eval Harness Around Azure AI Search Retrieval Agent

A RAG application that doesn't have an evaluation harness isn't a production system — it's a demo with extra steps. You can't tune what you can't measure, and "the model said something reasonable" is not a measurement.

We built an eval harness for our Azure AI Search Retrieval Agent six months ago. It runs nightly, costs about $4/night, and has caught three regressions before they reached users. Here's the architecture, the tricky decisions, and the lessons from running it for half a year.

What the harness does

Five things, in order, every night:

  1. Loads our golden eval set (380 questions with expected answers, written by SMEs).
  2. Runs each question through the production retrieval pipeline.
  3. Calls the production generation pipeline (Azure OpenAI gpt-4o with our system prompt).
  4. Scores each response using the Azure AI Foundry Groundedness, Relevance, and Coherence evaluators.
  5. Posts results to a Postgres table and a Grafana dashboard, plus a Slack message if anything regressed.

Total runtime: about 90 minutes for 380 questions. Total cost: ~$4 in tokens. Worth every cent.
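
Steps 2 through 4 are shown in the eval-loop code further down. Step 5 is plain plumbing, but for completeness, here is a minimal sketch of what the publish step could look like, assuming the evaluator outputs have already been flattened to plain numbers. The table name, column set, and environment variables are placeholders, not our actual schema:

import json, os
import psycopg2
import requests

def publish(run_date: str, results: list[dict], regressions: dict) -> None:
    # Persist one row per question for the Grafana dashboard to query.
    with psycopg2.connect(os.environ["EVAL_DB_DSN"]) as conn, conn.cursor() as cur:
        for row in results:
            cur.execute(
                "INSERT INTO rag_eval_results (run_date, question, groundedness, relevance, coherence)"
                " VALUES (%s, %s, %s, %s, %s)",
                (run_date, row["question"], row["groundedness"], row["relevance"], row["coherence"]),
            )
    # Only ping Slack when something actually regressed.
    if regressions:
        requests.post(
            os.environ["EVAL_SLACK_WEBHOOK"],
            json={"text": f"RAG eval regression on {run_date}: {json.dumps(regressions)}"},
        )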

The golden eval set is the entire game

You can fake every other component of an eval harness. You cannot fake the eval set. If your questions don't represent real user queries, your numbers don't represent real quality.

We built ours by:

  1. Pulling 4,000 actual user queries from production logs over a 30-day window.
  2. Sampling 500 stratified by topic (so the eval set isn't dominated by the most-common question).
  3. Having two SMEs from each domain area write the "correct answer" for each question, with a citation to the source document where it lives.
  4. Discarding questions where the SMEs couldn't agree on the answer (about 25% — these are inherently ambiguous and shouldn't be in the eval).
  5. Discarding questions whose answers didn't exist in the corpus (the SMEs would just write "no answer in available docs").

We ended up with 380 questions and roughly 12 hours of invested SME time. That's the most expensive part of the entire harness, and it's also the one part you can't replace.

The eval set lives in a Git repo (rag-eval/golden-set.jsonl). PRs against it require sign-off from the relevant SMEs. We treat it like the test suite for the entire RAG product, because that's what it is.
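
For the curious, each line of golden-set.jsonl is a small JSON object along these lines (the field names and the record itself are illustrative, not our exact schema):

{"id": "kb-0142", "topic": "operations", "question": "How do I rotate the storage account keys used by the indexer?", "expected_answer": "Rotate the key in the storage account, then update the data source connection string in the indexer configuration.", "source_doc": "ops/key-rotation.md"}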

The eval loop

import asyncio

from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
)

# retrieve() and generate() are the production retrieval/generation functions;
# eval_model_config points at the separate evaluator deployment (see below),
# and eval_set is the parsed golden-set.jsonl.

# Production-identical retrieval + generation
async def query_pipeline(question: str) -> dict:
    chunks = await retrieve(question, top_k=10)
    response = await generate(question, chunks)
    return {"question": question, "answer": response, "context": "\n\n".join(chunks)}

# Evaluators, pointed at the separate eval deployment
groundedness = GroundednessEvaluator(model_config=eval_model_config)
relevance = RelevanceEvaluator(model_config=eval_model_config)
coherence = CoherenceEvaluator(model_config=eval_model_config)

async def evaluate_one(item):
    result = await query_pipeline(item["question"])
    # The evaluators are synchronous callables; run them in worker threads
    # so the event loop stays free. Exact keyword arguments vary slightly
    # across azure-ai-evaluation versions.
    g = await asyncio.to_thread(groundedness, response=result["answer"], context=result["context"])
    r = await asyncio.to_thread(relevance, query=item["question"], response=result["answer"])
    c = await asyncio.to_thread(coherence, query=item["question"], response=result["answer"])
    return {**item, **result, "groundedness": g, "relevance": r, "coherence": c}

async def main():
    return await asyncio.gather(*(evaluate_one(item) for item in eval_set))

results = asyncio.run(main())

The harness uses a separate Azure OpenAI deployment (gpt-4o again, but in a different region) for the evaluators. This matters: never use the same deployment for generation and evaluation. The evaluator should be independent of the generator's specific quirks.
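
For completeness, eval_model_config is just the standard azure-ai-evaluation model configuration pointed at that second deployment. A minimal sketch, with the endpoint, deployment, and environment-variable names as placeholders:

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Separate deployment (different region) reserved for the evaluators.
eval_model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["EVAL_AOAI_ENDPOINT"],
    azure_deployment=os.environ.get("EVAL_AOAI_DEPLOYMENT", "gpt-4o-eval"),
    api_version="2024-06-01",
    api_key=os.environ["EVAL_AOAI_API_KEY"],
)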

What we measure and why

Groundedness. Of the model's response, what fraction is supported by the retrieved context? This is the single most important metric for a RAG system. A groundedness score below 0.7 means the model is hallucinating.

Relevance. Of the retrieved context, how relevant is it to the question? This catches retrieval-side regressions. If groundedness drops AND relevance drops, the retrieval changed. If groundedness drops but relevance is stable, the model started ignoring its context.

Coherence. Is the response itself well-formed? This catches generator-side regressions independent of retrieval.

Custom: citation accuracy. For each cited document in the response, does the cited document actually contain the cited claim? This is a custom evaluator we wrote in ~50 lines. It catches a specific failure where the model fabricates a citation that "looks right" but points to the wrong source.
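
Our real evaluator is LLM-assisted and somewhat longer, but the core idea fits in a sketch like the one below. The citation pattern, helper names, and the crude word-overlap support test are illustrative stand-ins for whatever citation format and support check you actually use:

import re

CITATION_RE = re.compile(r"\[(?P<doc_id>[\w\-./]+)\]")  # assumes "[doc-id]" style citations

def citation_accuracy(answer: str, context_by_doc: dict[str, str]) -> float:
    """Fraction of citations whose cited document appears to support the citing sentence.
    Returns 1.0 when the answer contains no citations."""
    cited = 0
    supported = 0
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        for match in CITATION_RE.finditer(sentence):
            cited += 1
            doc = context_by_doc.get(match.group("doc_id"), "").lower()
            claim = CITATION_RE.sub("", sentence).strip().lower()
            # Crude support test: every content word of the claim appears in the cited doc.
            words = [w for w in re.findall(r"[a-z0-9]+", claim) if len(w) > 3]
            if words and all(w in doc for w in words):
                supported += 1
    return supported / cited if cited else 1.0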

What the harness has caught

Regression 1: a chunking-config change dropped groundedness from 0.78 to 0.61. A platform engineer changed the semantic-chunking similarity threshold from 0.6 to 0.5 because the new value "felt better." The harness flagged the regression next morning. Reverted the change in 10 minutes. Without the harness we'd have shipped degraded quality and not noticed for weeks.

Regression 2: a model upgrade introduced citation hallucination. When gpt-4o was updated server-side, our citation-accuracy custom evaluator flagged a 12-point drop. The new model was citing documents that didn't contain the claims. We added a stronger system prompt ("Only cite documents whose content directly supports the cited claim. If unsure, do not cite.") and citation accuracy recovered.

Regression 3: a prompt template change tanked coherence. Someone added 200 tokens of corporate boilerplate to the system prompt. Coherence dropped because the model now spent its output budget repeating the boilerplate. Pulled the change.

In each case, the harness caught the regression within 24 hours. Without it, all three would have shipped to production users.

The cost-vs-coverage trade-off

Running the full harness on every PR would be ideal but costs $4 per run × 100 PRs/month = $400/month. We instead run:

  • Nightly: full 380-question run.
  • Per-PR: a 50-question subset, focused on the changed component (retrieval changes → retrieval-heavy questions; prompt changes → prompt-sensitive questions). About $0.50 per PR.
  • Weekly: full run plus a regression-detection comparison against the previous week. Auto-files a ticket if any metric drops by >5%.

This three-tier approach catches fast issues fast and slow issues thoroughly. The annual cost is around $1,400 in evaluation tokens. Compared to the engineer time saved by catching regressions before users do, it's free.
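
The weekly comparison itself is simple arithmetic. A sketch, assuming each run has already been flattened into a list of per-question rows with numeric scores (metric names illustrative):

from statistics import mean

ALERT_THRESHOLD = 0.05  # flag any metric that drops by more than 5% week over week
METRICS = ("groundedness", "relevance", "coherence", "citation_accuracy")

def find_regressions(previous: list[dict], current: list[dict]) -> dict[str, tuple[float, float]]:
    """Compare mean scores between two runs and return the metrics that regressed."""
    regressions = {}
    for metric in METRICS:
        prev_score = mean(row[metric] for row in previous)
        curr_score = mean(row[metric] for row in current)
        if prev_score > 0 and (prev_score - curr_score) / prev_score > ALERT_THRESHOLD:
            regressions[metric] = (prev_score, curr_score)
    return regressions

# In the weekly job: if find_regressions(...) is non-empty, auto-file the ticket
# and include the offending metrics in the Slack message.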

What I'd build differently

Add a synthetic-question generator to keep the eval set fresh. Right now the 380-question set is six months old. The product has changed; the corpus has grown. Some questions reference deprecated content. We're considering a quarterly process where we re-sample 50 questions from recent production logs and rotate them in.

I would NOT delete questions whose "correct answer" has changed because the corpus changed. Those questions are themselves evidence of how the system handles content evolution. Mark them, don't drop them.
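
In practice, "mark them" just means a couple of extra fields on the JSONL record, something like the illustrative fields below, so the nightly run can report those questions separately instead of silently dropping them:

{"id": "kb-0142", "answer_superseded": true, "superseded_on": "2025-08-14", "note": "corpus updated; expected answer now describes the new rotation flow"}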

The lesson

A RAG eval harness is not optional. The cost is one engineer-week of setup and ~$1,500/year in tokens. The benefit is the difference between "we think our system is good" and "we know our system is good, and we'll know within 24 hours when it isn't."

If you don't have one yet, start with 50 SME-written questions. That's enough to catch most regressions. Grow it from there.

RAG · Evaluations · Groundedness
