We ran four chunking strategies against the same 12,000-document corpus, scored each against the same 200-question evaluation set, and watched groundedness scores move from 0.41 to 0.78 by changing nothing but how the documents were chunked.
If you're tuning a RAG pipeline on Azure AI Search and wondering whether chunking strategy matters: yes, it does. Probably more than your retrieval algorithm. Here's what we tested and what won.
The setup, briefly
Azure AI Search as the vector store. Azure OpenAI gpt-4o as the generator. A 200-question eval set written by domain experts, where each question has a known-correct answer that lives in a specific paragraph of a specific document.
For each chunking strategy, we re-indexed the entire corpus, ran the 200 questions through the same retrieval-and-generation pipeline, and measured:
- Groundedness: how much of the model's response is supported by the retrieved chunks (Azure AI Foundry's Groundedness evaluator, scale 0-1).
- Recall@10: of the 200 questions, how many had the correct chunk in the top 10 results (see the sketch after this list).
- Token cost: total tokens spent across all 200 questions.
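For concreteness, Recall@10 is just a membership check over the eval set. A minimal sketch, assuming each question record carries the ID of the chunk that holds its gold answer; `retrieve_top_k` is a hypothetical stand-in for the real retrieval call:

```python
def recall_at_k(questions: list[dict], retrieve_top_k, k: int = 10) -> float:
    """Fraction of questions whose gold chunk shows up in the top-k results."""
    hits = 0
    for q in questions:
        result_ids = retrieve_top_k(q["text"], k=k)  # returns a list of chunk IDs
        if q["gold_chunk_id"] in result_ids:
            hits += 1
    return hits / len(questions)
```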
Strategy 1: fixed-size, no overlap (the baseline)
Chunks of 512 tokens, no overlap. The simplest possible approach.
- Groundedness: 0.41
- Recall@10: 62%
- Cost per 200 questions: $1.20
Failure mode: chunks regularly cut answers in half. A user asks "what is the maximum batch size?" The answer is "The system supports a maximum batch size of 1,024 records, configurable via the BATCH_SIZE environment variable." If the chunk boundary falls between "1,024" and "records", the retrieved chunk has the number without the unit, and the model either hallucinates or refuses to answer.
This was our starting point. Not a strawman — this is what most "intro to RAG" tutorials produce.
Strategy 2: fixed-size with overlap
Chunks of 512 tokens with 128-token overlap (25%).
- Groundedness: 0.54
- Recall@10: 73%
- Cost per 200 questions: $1.45 (25% overlap → roughly a third more chunks indexed → more tokens in play at retrieval time)
The overlap fixes the "answer cut in half" problem most of the time. Still not great because some answers span more than 128 tokens, and the chunks remain semantically arbitrary — they don't respect document structure.
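Both fixed-size variants reduce to the same few lines. A minimal sketch, assuming tiktoken's cl100k_base encoding as the token counter (the tokenizer is a placeholder; use whatever matches your embedding model). `overlap=0` reproduces Strategy 1, `overlap=128` reproduces Strategy 2:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer; swap for your model's

def fixed_size_chunks(text: str, size: int = 512, overlap: int = 0) -> list[str]:
    """overlap=0 gives Strategy 1; overlap=128 gives Strategy 2."""
    assert 0 <= overlap < size
    tokens = enc.encode(text)
    stride = size - overlap  # e.g. 512 - 128 = 384 tokens between chunk starts
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), stride)]
```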
We thought we'd be done after this step. We were wrong.
Strategy 3: paragraph-aware splitting
Chunks aligned on paragraph boundaries with a 256-token minimum and 768-token maximum. If a paragraph is too long, split on sentence boundaries instead. If a paragraph is too short, merge with adjacent paragraphs until reaching the minimum.
- Groundedness: 0.69
- Recall@10: 81%
- Cost per 200 questions: $1.30
This was a noticeable jump. The reason was simple: paragraphs are usually self-contained semantic units. A paragraph about "configuring batch size" stays intact, with all its qualifiers ("the maximum is 1,024 records, but throughput is best at 256-512 for our typical workload"). Fixed-size chunking had been splitting on raw token count and losing that context.
The implementation is unglamorous: a Python script that walks each document, builds paragraphs, applies the size constraints, and emits chunks. Maybe 80 lines including edge cases.
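A condensed sketch of the idea, not the production script: blank-line paragraph detection and the naive sentence splitter are simplifying assumptions, and the real version handles more edge cases.

```python
import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer
tok_len = lambda s: len(enc.encode(s))

def split_on_sentences(para: str, max_tokens: int) -> list[str]:
    """Fallback for oversized paragraphs: greedily pack sentences up to the cap."""
    pieces, cur = [], ""
    for sent in re.split(r"(?<=[.!?])\s+", para):
        merged = f"{cur} {sent}".strip()
        if cur and tok_len(merged) > max_tokens:
            pieces.append(cur)
            cur = sent  # a single sentence over the cap still gets emitted whole
        else:
            cur = merged
    return pieces + [cur] if cur else pieces

def paragraph_chunks(text: str, min_tokens: int = 256, max_tokens: int = 768) -> list[str]:
    """Align chunks on paragraph boundaries, merging short ones and splitting long ones."""
    chunks, buf = [], ""
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        merged = f"{buf}\n\n{para}" if buf else para
        if tok_len(merged) <= max_tokens:
            buf = merged
            if tok_len(buf) >= min_tokens:  # reached the minimum: emit and reset
                chunks.append(buf)
                buf = ""
        else:
            if buf:
                chunks.append(buf)  # flush the merge buffer (may run short; fine for a sketch)
            if tok_len(para) <= max_tokens:
                buf = para
            else:
                chunks.extend(split_on_sentences(para, max_tokens))
                buf = ""
    if buf:
        chunks.append(buf)
    return chunks
```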
Strategy 4: semantic chunking with a small embedding model
For each document: embed each sentence with a small, fast embedding model (text-embedding-3-small). Compute cosine similarity between adjacent sentences. Place a chunk boundary where similarity drops below a threshold (we tuned to 0.6 after experimentation).
This produces semantically coherent chunks: each chunk is a "topic." It's slower at index time (embedding every sentence is ~3x the cost of plain ingestion), but that cost is paid once per document, never again.
- Groundedness: 0.78
- Recall@10: 87%
- Cost per 200 questions: $1.25 (fewer, more relevant chunks retrieved → fewer wasted tokens in the generator's context)
This is the configuration we run in production today. The 0.78 groundedness is meaningfully higher than 0.69, and the 87% recall means the top-10 has the right chunk for nearly every question.
The threshold tuning matters. At 0.5, chunks were too long. At 0.7, chunks were too short. 0.6 was the sweet spot for our corpus. Re-tune for your domain.
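For anyone who wants to reproduce the boundary detection, a minimal sketch follows. The regex sentence splitter and the client wiring are assumptions (on Azure OpenAI, `model` takes your deployment name), and `threshold` is the knob discussed above:

```python
import re
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-KEY",
    api_version="2024-02-01",
)

def semantic_chunks(text: str, threshold: float = 0.6) -> list[str]:
    """Cut a chunk boundary wherever adjacent-sentence similarity drops below threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < 2:
        return [text]
    resp = client.embeddings.create(model="text-embedding-3-small", input=sentences)
    vecs = np.array([d.embedding for d in resp.data])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)            # cosine sim of adjacent pairs
    chunks, start = [], 0
    for i, sim in enumerate(sims):
        if sim < threshold:  # topic shift detected: close the current chunk
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks
```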
What we tried that didn't help
Hierarchical chunking (parent-document with smaller child chunks for retrieval). Sounded clever. Added 30% to indexing complexity. Groundedness improvement was within noise (0.78 → 0.79). Not worth the operational complexity.
Chunk summarization (each chunk has an LLM-generated summary attached as additional searchable text). Cost too much at index time, and the summaries occasionally drifted from the source — introducing potential hallucination paths into the retrieval layer itself. Pulled it back out.
Page-level chunking (one chunk per PDF page). Naïve and bad — pages are a print-layout artifact, not a semantic unit. Groundedness dropped to 0.51.
Why semantic chunking won
A document's semantic structure is its content's natural granularity. Sentences within a paragraph share a topic. Paragraphs within a section share a sub-topic. Sections within a document share a top-level theme.
Fixed-size chunking ignores all of that. It cuts where the token count says, regardless of meaning. The cost is that retrieved chunks frequently have partial answers that get the model in trouble.
Semantic chunking respects the structure that the document's author already encoded. The chunks it produces are the units the author intended to be read together. The model's grounding improves because the context it gets is coherent.
The production setup
The indexing pipeline runs as an Azure Function triggered by Blob Storage events. New document → semantic chunking → embedding → upsert into Azure AI Search index. About 14 seconds per average-sized document (40-page PDF).
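The trigger skeleton, for orientation, using the Functions v2 Python programming model; the container path, connection setting, and the helpers (`extract_text`, `upsert_chunks`) are hypothetical:

```python
import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="doc", path="incoming-docs/{name}",
                  connection="AzureWebJobsStorage")
def index_document(doc: func.InputStream):
    text = extract_text(doc.read())   # hypothetical: PDF/HTML -> plain text
    chunks = semantic_chunks(text)    # the chunker sketched earlier
    upsert_chunks(doc.name, chunks)   # hypothetical: embed + upsert to Azure AI Search
```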
The retrieval-and-generation side is unchanged. Azure AI Search vector query → top 10 chunks → re-rank with cross-encoder → top 4 to GPT-4o → response with citations.
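The vector-query step looks roughly like this with azure-search-documents (11.4 or later); index and field names are placeholders, and `embed` stands in for the query-embedding call:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search = SearchClient("https://YOUR-SEARCH.search.windows.net",
                      index_name="docs-index",  # placeholder
                      credential=AzureKeyCredential("YOUR-KEY"))

def top_chunks(question: str, k: int = 10) -> list[dict]:
    """Vector query for the top-k candidate chunks (before cross-encoder reranking)."""
    vq = VectorizedQuery(vector=embed(question),  # hypothetical embed()
                         k_nearest_neighbors=k,
                         fields="embedding")      # placeholder vector field name
    results = search.search(search_text=None, vector_queries=[vq], top=k)
    return list(results)
```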
The whole pipeline is about 600 lines of Python. The chunking module is the most complex component. Worth getting right.
What I'd test next
Adaptive chunk sizing per document type. Technical docs probably want different chunk-size minimums than transcript-style docs. Right now we use the same constraints for everything. I suspect we'd see another 3-5 points of groundedness by tuning per content type.
Chunk metadata on retrieval. Each chunk includes its document title, section heading, and page number as metadata. Currently we don't surface this in the prompt. Doing so might help the model cite more accurately.
I would NOT switch off semantic chunking for cost reasons. The +0.09 groundedness is worth far more than the $0.20 of indexing cost it adds per 1,000 documents. Hallucination is the expensive failure mode in RAG; chunking is the cheapest place to reduce it.
The portable lesson
Before tuning your retrieval algorithm, your re-ranker, your prompt, or your model — tune your chunks. The default 512-tokens-no-overlap setup that most tutorials show is the baseline you should beat, not the configuration you should ship.
Spend a week on chunking. Re-run your eval set. The numbers move.
