The first version of our Foundry-backed RAG pipeline used vector-only search against an Azure AI Search index. Recall@10 was 71% on the gold set; for a customer-support workload, that meant close to three in ten relevant passages never reached the model. The team's first instinct was "fine-tune embeddings"; the actual fix was simpler.
Hybrid search (BM25 keyword + vector) plus semantic ranking plus a custom scoring profile took recall from 71% to 89% on the same gold set. We did not change the embedding model, did not re-index, and did not touch the chunking. What we changed was how we queried — adding a keyword leg, layering Microsoft's semantic ranker on top, and weighting fresh content higher with a scoring profile.
This post is the entire build. By the end you have an Azure AI Search index with hybrid query capabilities, semantic ranking enabled and tuned, a custom scoring profile that boosts recent and authoritative content, an evaluation harness that compares retrieval strategies on the same gold set, and a deployment pattern that integrates with Microsoft Foundry's grounding contract. About 200 lines of Python plus the AI Search index schema, with a clear walkthrough of when to use each retrieval strategy.
Why hybrid + semantic + scoring profile, and not just vector
Brief context because the design choices here come from real comparisons.
Why hybrid (BM25 + vector), not vector-only. Vector search is good at semantic similarity but bad at exact matches for proper nouns, codes, and rare terms. A user asking "what does error E_TIMEOUT mean?" gets vector hits about timeouts but possibly misses the doc that defines E_TIMEOUT exactly because the embedding doesn't weight rare tokens highly. BM25 (the classic keyword scoring) catches these. Hybrid runs both and merges, getting the best of each.
Why semantic ranking on top. Hybrid search returns results in a merged order, but the merge (reciprocal rank fusion) only looks at each document's rank in the BM25 list and in the vector list, because the two underlying scores (BM25 vs cosine similarity) are not comparable. The fused order is therefore a heuristic. Microsoft's semantic ranker re-orders the top 50 results using a Microsoft-trained cross-encoder model that's specifically tuned for Q&A relevance. The semantic re-ranking pass typically lifts top-1 accuracy by 10 to 20 percentage points on top of hybrid.
Why a custom scoring profile. Even with hybrid + semantic, the ranking treats all content equally. A documentation page from 2024 and one from 2017 about the same topic both surface; the older one is often less useful but ranks similarly. A scoring profile lets you encode domain priors: prefer recent, prefer authoritative (e.g., docs over forum posts), prefer the customer's product version. Each prior is a small lift; together they compound.
Why this matters more than embedding fine-tuning. Most teams reach for "fine-tune the embedding model" first because that's the most impressive-sounding thing. The data: tuning embeddings on a representative corpus typically lifts recall by 3 to 8 percentage points, costs significant ops effort, and creates a maintenance burden. Hybrid + semantic + scoring profile lifts recall by 15 to 20 percentage points, with no fine-tuning, in a day. Fine-tuning is the right move after you've exhausted query-side improvements, not before.
What you'll have at the end
~/azure-ai-search-hybrid/
├── infra/
│ ├── search-service.bicep # AI Search instance + RBAC
│ └── index-schema.json # the index definition
├── ingest/
│ ├── chunk-and-embed.py # chunking + embeddings (covered briefly)
│ └── upload-to-index.py
├── retrieval/
│ ├── strategies.py # vector, hybrid, hybrid+semantic, hybrid+semantic+score
│ ├── scoring-profile.json # the custom scoring profile
│ └── foundry-binding.py # plug into Foundry agent
├── eval/
│ ├── gold-set.jsonl # 200 question-passage pairs
│ ├── compare-strategies.py
│ └── results-template.md
└── README.md
Prerequisites
- An Azure AI Search service, basic tier or higher (semantic ranking requires at least Basic) → Create an Azure AI Search service
- An Azure OpenAI deployment for embeddings (text-embedding-3-large recommended) → Quickstart: Embeddings with Azure OpenAI
- A document corpus to index. For this article we'll use a representative product-documentation set; the patterns generalise to any text corpus.
- Python 3.12+ with the azure-search-documents, openai, and azure-identity packages.
python -m venv .venv && source .venv/bin/activate
pip install azure-search-documents openai azure-identity tiktoken
az login
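export SEARCH_ENDPOINT="https://<your-search>.search.windows.net"
export SEARCH_KEY="<admin-key>" # use Entra ID auth in production
export AOAI_ENDPOINT="https://<your-aoai>.openai.azure.com" # read by the ingest script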
Step 1: The index schema
infra/index-schema.json:
{
  "name": "product-docs-hybrid",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true,
      "filterable": true,
      "retrievable": true
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true,
      "retrievable": true,
      "analyzer": "en.microsoft"
    },
    {
      "name": "content_vector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "retrievable": false,
      "dimensions": 3072,
      "vectorSearchProfile": "default-profile"
    },
    {
      "name": "title",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "analyzer": "en.microsoft"
    },
    {
      "name": "url",
      "type": "Edm.String",
      "filterable": true,
      "retrievable": true
    },
    {
      "name": "last_updated",
      "type": "Edm.DateTimeOffset",
      "filterable": true,
      "retrievable": true,
      "sortable": true
    },
    {
      "name": "source_type",
      "type": "Edm.String",
      "filterable": true,
      "retrievable": true,
      "facetable": true
    },
    {
      "name": "product_version",
      "type": "Edm.String",
      "filterable": true,
      "retrievable": true,
      "facetable": true
    }
  ],
  "vectorSearch": {
    "profiles": [
      {
        "name": "default-profile",
        "algorithm": "hnsw-default",
        "vectorizer": "openai-vectorizer"
      }
    ],
    "algorithms": [
      {
        "name": "hnsw-default",
        "kind": "hnsw",
        "hnswParameters": {
          "metric": "cosine",
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500
        }
      }
    ],
    "vectorizers": [
      {
        "name": "openai-vectorizer",
        "kind": "azureOpenAI",
        "azureOpenAIParameters": {
          "resourceUri": "https://<your-aoai>.openai.azure.com",
          "deploymentId": "text-embedding-3-large",
          "modelName": "text-embedding-3-large"
        }
      }
    ]
  },
  "semantic": {
    "configurations": [
      {
        "name": "default-semantic",
        "prioritizedFields": {
          "titleField": { "fieldName": "title" },
          "prioritizedContentFields": [{ "fieldName": "content" }],
          "prioritizedKeywordsFields": []
        }
      }
    ]
  },
  "scoringProfiles": [],
  "defaultScoringProfile": null
}
A few comments on the choices:
analyzer: "en.microsoft" is Microsoft's English-language analyzer. It handles stemming, stopwords, and case-folding correctly for English. If your content is in another language, switch to that language's analyzer. Skipping the analyzer falls back to the default, which is much weaker.
dimensions: 3072 matches text-embedding-3-large. If you use text-embedding-3-small, change to 1536. Mismatched dimensions cause indexing failures that go unnoticed unless you inspect the per-document upload results.
hnswParameters with m: 4, efConstruction: 400, efSearch: 500 are the service's default values. Higher m and efConstruction improve recall at the cost of indexing time and storage; higher efSearch improves recall at the cost of query time. These defaults work for corpora up to a few million documents.
semantic.configurations[0].prioritizedFields is what tells the semantic ranker which fields matter. Title is the headline; content is the body. If you have a summary field, add it to prioritizedContentFields; keyword-style fields such as tags belong in prioritizedKeywordsFields. Misconfiguring this is the most common reason "semantic ranker doesn't help."
scoringProfiles: [] is empty — we'll add one in Step 4. The shape is "create the index first, add the profile second" because the profile references fields that must exist.
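Because a dimension mismatch is easy to overlook, a cheap guard before upload can help. The sketch below is an illustration rather than part of the original pipeline: `check_vector_dimensions` is a hypothetical helper, and it assumes records shaped like the ones built later in this article (an `id` plus a `content_vector` list).

```python
"""Pre-upload guard: flag records whose vector length differs from the index."""

EXPECTED_DIMENSIONS = 3072  # must equal "dimensions" on content_vector in the schema


def check_vector_dimensions(
    records: list[dict],
    expected: int = EXPECTED_DIMENSIONS,
    vector_field: str = "content_vector",
) -> list[str]:
    """Return the ids of records whose vector is missing or has the wrong length."""
    bad_ids = []
    for record in records:
        vector = record.get(vector_field)
        if vector is None or len(vector) != expected:
            bad_ids.append(record["id"])
    return bad_ids


if __name__ == "__main__":
    sample = [
        {"id": "a", "content_vector": [0.0] * 3072},
        {"id": "b", "content_vector": [0.0] * 1536},  # e.g. embedded with the small model
    ]
    print(check_vector_dimensions(sample))
```

Running a check like this on each batch before calling the upload keeps a wrong-model embedding run from reaching the index at all.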
Create the index:
curl -X PUT "$SEARCH_ENDPOINT/indexes/product-docs-hybrid?api-version=2024-07-01" \
-H "api-key: $SEARCH_KEY" \
-H "Content-Type: application/json" \
-d @infra/index-schema.json
Step 2: Ingest, chunking, embeddings
Quickly, because chunking is covered in the chunking strategies article. The minimum viable pipeline:
ingest/chunk-and-embed.py:
"""Chunk a corpus and embed each chunk for upload to AI Search."""
import json
import os
from pathlib import Path

from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Embed via Azure OpenAI, authenticating with Entra ID instead of an API key
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
aoai = AzureOpenAI(
    azure_endpoint=os.environ["AOAI_ENDPOINT"],
    api_version="2024-10-21",
    azure_ad_token_provider=token_provider,
)


def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Naive chunking by characters. Replace with layout-aware chunking
    (Document Intelligence) for production."""
    chunks = []
    # Step by max_chars - overlap so consecutive chunks share `overlap` characters
    for i in range(0, len(text), max_chars - overlap):
        chunks.append(text[i:i + max_chars])
    return chunks


def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts. Batches of up to 100 are supported."""
    response = aoai.embeddings.create(
        model="text-embedding-3-large",  # the Azure OpenAI deployment name
        input=texts,
    )
    return [item.embedding for item in response.data]


def process_document(doc_path: Path) -> list[dict]:
    """Read, chunk, embed, return ready-to-index records."""
    # Metadata lives in a sidecar JSON file next to the document
    metadata = json.loads(doc_path.with_suffix(".json").read_text())
    text = doc_path.read_text()
    chunks = chunk_text(text)
    embeddings = embed(chunks)
    return [
        {
            "id": f"{metadata['id']}__chunk{i}",
            "content": chunk,
            "content_vector": emb,
            "title": metadata["title"],
            "url": metadata["url"],
            "last_updated": metadata["last_updated"],
            "source_type": metadata["source_type"],
            "product_version": metadata["product_version"],
        }
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
    ]
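The ingest script sends every chunk of a document to the embedding endpoint in one call, so a long document can exceed the batch size its docstring mentions. A minimal sketch of batching, assuming any embedding function with the same list-in, list-out shape; `batched` and `embed_in_batches` are hypothetical names, not part of the original script:

```python
"""Split a long list of chunks into fixed-size batches before embedding."""
from typing import Callable, Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,  # the batch size quoted in the ingest script's docstring
) -> list[list[float]]:
    """Call embed_fn once per batch and return the vectors in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

Passing the embedding function in as an argument keeps the helper testable without a live Azure OpenAI deployment.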
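ingest/upload-to-index.py:
"""Upload chunked + embedded documents to the AI Search index."""
import os

from azure.search.documents import SearchClient
from azure.identity import DefaultAzureCredential

client = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    index_name="product-docs-hybrid",
    credential=DefaultAzureCredential(),
)

# process_corpus is not shown in this article; it is assumed to call
# process_document() on every file under the directory and return one flat list.
docs = process_corpus("./corpus")

# Batch upload. 1000 documents is the per-request cap; the service also caps
# request size, so lower BATCH_SIZE if large vectors push a batch over the limit.
BATCH_SIZE = 1000
for i in range(0, len(docs), BATCH_SIZE):
    batch = docs[i:i + BATCH_SIZE]
    result = client.upload_documents(documents=batch)
    # Each item reports success or failure for one document
    failures = [r for r in result if not r.succeeded]
    if failures:
        print(f"{len(failures)} upload failures in batch {i}")
        for f in failures[:3]:
            print(f" {f.key}: {f.error_message}")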
Step 3: The four retrieval strategies
This is the comparison shape. We define four query strategies, each one progressively layered on top of the last. Then we evaluate them all on the same gold set.
retrieval/strategies.py:
"""Four retrieval strategies, applied to the same query, for comparison."""
from typing import Optional

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery, QueryType

# embed() is the helper defined in ingest/chunk-and-embed.py; import or copy it
# so that it is in scope here.


# Strategy 1: vector-only
def search_vector_only(client: SearchClient, query: str, top: int = 10) -> list[dict]:
    """Embed the query, do nearest-neighbour search on content_vector."""
    query_vector = embed([query])[0]
    results = client.search(
        search_text=None,  # no keyword leg
        vector_queries=[
            VectorizedQuery(vector=query_vector, k_nearest_neighbors=top, fields="content_vector")
        ],
        top=top,
    )
    return list(results)


# Strategy 2: hybrid (BM25 + vector)
def search_hybrid(client: SearchClient, query: str, top: int = 10) -> list[dict]:
    """Both keyword (BM25) and vector legs, combined with reciprocal rank fusion."""
    query_vector = embed([query])[0]
    results = client.search(
        search_text=query,  # the keyword leg
        vector_queries=[
            # Over-fetch on the vector leg so fusion has more candidates to merge
            VectorizedQuery(vector=query_vector, k_nearest_neighbors=top * 5, fields="content_vector")
        ],
        top=top,
    )
    return list(results)


# Strategy 3: hybrid + semantic ranking
def search_hybrid_semantic(client: SearchClient, query: str, top: int = 10) -> list[dict]:
    """Hybrid retrieval, then semantic re-ranking on top 50."""
    query_vector = embed([query])[0]
    results = client.search(
        search_text=query,
        vector_queries=[
            VectorizedQuery(vector=query_vector, k_nearest_neighbors=top * 5, fields="content_vector")
        ],
        query_type=QueryType.SEMANTIC,  # semantic re-ranking
        semantic_configuration_name="default-semantic",
        query_caption="extractive",
        top=top,
    )
    return list(results)


# Strategy 4: hybrid + semantic + custom scoring profile
def search_hybrid_semantic_scored(
    client: SearchClient, query: str, product_version: Optional[str] = None, top: int = 10
) -> list[dict]:
    """Hybrid + semantic + scoring profile that boosts recent, authoritative content."""
    query_vector = embed([query])[0]
    # Scoring parameters use the form "<name>-<value1,value2>"; <name> must match
    # a tagsParameter declared in the scoring profile.
    scoring_params = []
    if product_version:
        scoring_params.append(f"version-{product_version}")
    results = client.search(
        search_text=query,
        vector_queries=[
            VectorizedQuery(vector=query_vector, k_nearest_neighbors=top * 5, fields="content_vector")
        ],
        query_type=QueryType.SEMANTIC,
        semantic_configuration_name="default-semantic",
        scoring_profile="freshness-and-authority",
        scoring_parameters=scoring_params,
        top=top,
    )
    return list(results)
A note on what's happening conceptually:
- Strategy 1 (vector-only): the query is embedded; the index returns the top-k most similar vectors. Fast, simple, missing exact-match cases.
- Strategy 2 (hybrid): AI Search runs both BM25 and vector legs in parallel, then merges via reciprocal rank fusion (RRF), a well-studied method that combines the two result lists by rank position rather than by raw score. The merged ranking captures both lexical and semantic relevance.
- Strategy 3 (hybrid + semantic): the top 50 from hybrid are passed to Microsoft's semantic re-ranker, which is a Microsoft-trained cross-encoder model that scores each (query, document) pair for direct Q&A relevance. The output is the top 10 after re-ranking.
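- Strategy 4 (+ scoring profile): the scoring profile adds custom boosts: recent docs, docs matching the user's product version, and so on. In the API version used here the profile is applied during initial scoring, before semantic re-ranking, so it mainly shapes which documents reach the re-ranker. The final ranking reflects both semantic relevance and domain priors.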
query_caption="extractive" on strategy 3 returns short snippets that highlight the relevant part of each result, which is what the agent uses for grounding.
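To make the fusion step in strategy 2 concrete, here is a minimal sketch of reciprocal rank fusion in plain Python. It illustrates the general technique, not the service's internal code; the document ids are made up, and `k=60` is the constant commonly used with RRF, taken here as an assumption.

```python
"""Reciprocal rank fusion over two or more ranked lists of document ids."""
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    bm25_ranking = ["error-codes", "timeouts-guide", "retries"]
    vector_ranking = ["timeouts-guide", "networking", "error-codes"]
    for doc_id, score in reciprocal_rank_fusion([bm25_ranking, vector_ranking]):
        print(f"{doc_id:<16} {score:.5f}")
```

Only rank positions enter the formula, which is why BM25 scores and cosine similarities never need to be put on a common scale. A document that appears in both lists, even mid-list, tends to outrank one that tops a single list.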
Step 4: The custom scoring profile
A scoring profile lives on the index (you add it after creating the index, by re-issuing the schema with the profile). Add this block to the index schema:
{
  "scoringProfiles": [
    {
      "name": "freshness-and-authority",
      "text": {
        "weights": {
          "title": 5.0,
          "content": 1.0
        }
      },
      "functions": [
        {
          "type": "freshness",
          "fieldName": "last_updated",
          "boost": 2.5,
          "interpolation": "logarithmic",
          "freshness": { "boostingDuration": "P365D" }
        },
        {
          "type": "tag",
          "fieldName": "source_type",
          "boost": 3.0,
          "interpolation": "constant",
          "tag": { "tagsParameter": "preferred-types" }
        },
        {
          "type": "tag",
          "fieldName": "product_version",
          "boost": 4.0,
          "interpolation": "constant",
          "tag": { "tagsParameter": "version" }
        }
      ],
      "functionAggregation": "sum"
    }
  ]
}
What each block does:
- text.weights boosts matches in the title field 5x more than matches in content. This isn't strictly necessary because the semantic ranker already weighs titles highly, but it gives a small extra lift for cases where semantic re-ranking doesn't fire (e.g., short queries that don't survive the rank-fusion).
- The freshness function on last_updated boosts documents that were updated recently. boostingDuration: P365D means a doc updated today gets the full 2.5 boost, the boost tapers off logarithmically as the doc ages, and docs older than one year get no freshness boost.
- The tag function on source_type lets the caller name preferred source types at query time, through the parameter declared in tagsParameter. If the caller passes official-docs and release-notes, docs with those source types get the 3.0 boost.
- The tag function on product_version boosts docs matching the product version the user is asking about. The caller sends the version as a scoring parameter of the form <parameter name>-<value>, where the parameter name must match the function's tagsParameter.
The functionAggregation: "sum" means all the boosts add up. firstMatching would apply only the first function that matches. Sum is right when boosts compound (a recent, authoritative, version-matched doc should be substantially boosted).
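Tag functions only take effect when the query supplies their parameters. A minimal sketch of building those strings, assuming the `name-value1,value2` format that scoring parameters use; `build_scoring_parameters` and the parameter names in the example are hypothetical, so substitute the names your profile declares in `tagsParameter`:

```python
"""Build scoring-parameter strings for a query that uses a scoring profile."""


def build_scoring_parameters(params: dict[str, list[str]]) -> list[str]:
    """Turn {"name": ["v1", "v2"]} into ["name-v1,v2"], skipping empty lists.

    Values containing commas would need quoting; this sketch does not handle them.
    """
    return [f"{name}-{','.join(values)}" for name, values in params.items() if values]


if __name__ == "__main__":
    print(build_scoring_parameters({
        "version": ["2026"],
        "sources": ["official-docs", "release-notes"],
    }))
```

The resulting list can be passed as `scoring_parameters` to `SearchClient.search`, alongside `scoring_profile`.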
Save the merged schema as infra/index-schema-with-profile.json, then re-issue the index with this profile via the REST API:
curl -X PUT "$SEARCH_ENDPOINT/indexes/product-docs-hybrid?api-version=2024-07-01" \
-H "api-key: $SEARCH_KEY" \
-H "Content-Type: application/json" \
-d @infra/index-schema-with-profile.json
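The curl command expects a schema file that already contains the profile. One way to produce it, sketched under the assumption that retrieval/scoring-profile.json holds the `{"scoringProfiles": [...]}` block shown in this step; the function names are hypothetical:

```python
"""Merge a scoring-profile block into the base index schema."""
import json
from pathlib import Path


def merge_scoring_profiles(schema: dict, profile_block: dict) -> dict:
    """Return a copy of schema with the block's profiles added or replaced by name."""
    merged = dict(schema)
    by_name = {p["name"]: p for p in schema.get("scoringProfiles", [])}
    for profile in profile_block.get("scoringProfiles", []):
        by_name[profile["name"]] = profile
    merged["scoringProfiles"] = list(by_name.values())
    return merged


def write_merged_schema(schema_path: Path, profile_path: Path, out_path: Path) -> None:
    """Read both JSON files, merge them, and write the combined schema."""
    schema = json.loads(schema_path.read_text())
    profile_block = json.loads(profile_path.read_text())
    merged = merge_scoring_profiles(schema, profile_block)
    out_path.write_text(json.dumps(merged, indent=2))
```

Calling `write_merged_schema(Path("infra/index-schema.json"), Path("retrieval/scoring-profile.json"), Path("infra/index-schema-with-profile.json"))` generates the file. Keeping the base schema and the profile in separate files means either can change without hand-editing the other.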
Step 5: The evaluation harness
The whole point of the article is comparing strategies; the harness is what makes that comparison rigorous.
eval/gold-set.jsonl (200 entries; abbreviated):
{"query": "How do I rotate API keys?", "expected_doc_ids": ["docs__api-keys__chunk2", "docs__api-keys__chunk3"]}
{"query": "What does error E_TIMEOUT mean?", "expected_doc_ids": ["docs__error-codes__chunk7"]}
{"query": "Latest pricing for the Pro plan?", "expected_doc_ids": ["pricing__plans__chunk0"], "version": "2026"}
eval/compare-strategies.py:
"""Run all four retrieval strategies on the gold set, compute recall@10."""
import json
import os
from pathlib import Path
from collections import defaultdict

from azure.search.documents import SearchClient
from azure.identity import DefaultAzureCredential

from retrieval.strategies import (
    search_vector_only, search_hybrid,
    search_hybrid_semantic, search_hybrid_semantic_scored,
)

STRATEGIES = {
    "vector-only": search_vector_only,
    "hybrid": search_hybrid,
    "hybrid+semantic": search_hybrid_semantic,
    "hybrid+semantic+score": search_hybrid_semantic_scored,
}

client = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    index_name="product-docs-hybrid",
    credential=DefaultAzureCredential(),
)

results = defaultdict(list)
# One JSON object per line; skip blank lines so a trailing newline doesn't break parsing
gold_set = [
    json.loads(line)
    for line in Path("eval/gold-set.jsonl").read_text().splitlines()
    if line.strip()
]

for entry in gold_set:
    expected = set(entry["expected_doc_ids"])
    for name, fn in STRATEGIES.items():
        if name == "hybrid+semantic+score":
            # Only the scored strategy accepts a product version
            retrieved = fn(client, entry["query"], product_version=entry.get("version"))
        else:
            retrieved = fn(client, entry["query"])
        retrieved_ids = {r["id"] for r in retrieved}
        # Per-query recall: share of expected chunks found in the top 10
        recall = len(expected & retrieved_ids) / len(expected) if expected else 1.0
        results[name].append(recall)

# Aggregate
print(f"{'Strategy':<28} Recall@10")
print("-" * 50)
for name, recalls in results.items():
    avg = sum(recalls) / len(recalls)
    print(f"{name:<28} {avg:.3f}")
Run it:
python eval/compare-strategies.py
A representative output from a real corpus:
Strategy Recall@10
--------------------------------------------------
vector-only 0.712
hybrid 0.821
hybrid+semantic 0.864
hybrid+semantic+score 0.893
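The numbers will vary by corpus and gold set. The pattern is consistent across every workload I've evaluated: the keyword leg gives the largest single lift (about 11 points in the run above), semantic ranking and the scoring profile each add a few points more, and combined they lift recall by 15 to 20 points over vector-only.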
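Average recall hides two things worth tracking separately: how high the first relevant chunk ranks, and how many queries get nothing useful at all. The functions below are a hedged sketch of those metrics in plain Python. The names are hypothetical and not part of the harness; they assume ranked lists of ids and the `expected_doc_ids` sets from the gold set.

```python
"""Retrieval metrics computed from ranked ids and the expected ids for a query."""


def recall_at_k(retrieved: list[str], expected: set[str], k: int = 10) -> float:
    """Share of expected ids found in the top k (1.0 when nothing is expected)."""
    if not expected:
        return 1.0
    return len(set(retrieved[:k]) & expected) / len(expected)


def reciprocal_rank(retrieved: list[str], expected: set[str]) -> float:
    """1 / rank of the first expected id, or 0.0 when none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


def no_match_rate(runs: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """Fraction of queries where none of the expected ids appear in the top k."""
    if not runs:
        return 0.0
    misses = sum(
        1 for retrieved, expected in runs
        if expected and not (set(retrieved[:k]) & expected)
    )
    return misses / len(runs)
```

Averaging `reciprocal_rank` over the gold set gives mean reciprocal rank, which moves when re-ranking improves ordering even if recall@10 stays flat.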
Step 6: Plug it into Microsoft Foundry
Foundry agents call AI Search via the agent's grounding configuration. The key is to point the grounding at your hybrid+semantic+scored strategy by configuring the search at the index level rather than in the agent.
retrieval/foundry-binding.py:
"""Attach the AI Search index as a grounding source on a Foundry agent."""
import os

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project = AIProjectClient(
    endpoint=os.environ["PROJECT_ENDPOINT"],  # your Foundry project endpoint
    credential=DefaultAzureCredential(),
)
agent = project.agents.get_agent("knowledge-agent")

# Check the tool payload shape against the SDK version you have installed.
project.agents.update_agent(
    agent_id=agent.id,
    tools=[
        {
            "type": "azure_ai_search",
            "azure_ai_search": {
                "indexes": [
                    {
                        "index_connection_id": "<connection id pointing to your AI Search>",
                        "index_name": "product-docs-hybrid",
                        "query_type": "vector_semantic_hybrid",  # hybrid retrieval plus semantic ranking
                        "top_k": 10,
                        "filter": None,
                        "fields_mapping": {
                            "url_field_name": "url",
                            "title_field_name": "title",
                            "content_fields_names": ["content"],
                            "filepath_field_name": None,
                        }
                    }
                ]
            }
        }
    ]
)
query_type: "vector_semantic_hybrid" is the value that asks Foundry to use hybrid + semantic ranking when retrieving. The custom scoring profile is applied at the index level (because that's where it lives) — Foundry's binding doesn't expose scoring profile selection directly as of early 2026, so to get the scored ranking, set the profile as the index's default in the schema:
"defaultScoringProfile": "freshness-and-authority"
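With the profile as the index default, every query against the index uses it, including queries from Foundry's grounding. Those queries won't carry scoring parameters, so verify how the parameterised tag functions behave in that case; the text weights and the freshness function need no parameters.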
Production checklist
Maintain the gold set. A 50-question minimum, 200 ideal, refreshed quarterly. Without the gold set, "did this change improve retrieval" becomes a feeling rather than a measurement.
Re-index when you change the embedding model. Embeddings from text-embedding-3-large are not interchangeable with text-embedding-3-small or older models. Re-embed the corpus on every model change.
Watch the semantic ranker quota. Semantic ranking has a separate quota from index queries. At high RPS you can hit it; the quota is documented at Semantic ranking quotas.
Monitor relevance drift. A pipeline that re-runs the gold-set comparison weekly catches drift (corpus changes, embedding-model updates, scoring-profile mistakes) before users notice.
Add a last_evaluated tag to gold-set entries. Some entries become stale (the answer changes; the doc is moved). Set a quarterly review cadence for entries older than six months.
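The weekly drift check can be as small as a comparison between the stored baseline and the latest run. A minimal sketch, assuming recall is stored as a fraction between 0 and 1; `recall_regressed` is a hypothetical helper, and the 2-point default is a tolerance to tune per workload:

```python
"""Flag a retrieval regression when recall drops past a tolerance."""


def recall_regressed(baseline: float, current: float, max_drop_points: float = 2.0) -> bool:
    """True when recall fell by more than max_drop_points percentage points."""
    # Round away floating-point noise so a drop of exactly the tolerance does not alert
    drop_points = round((baseline - current) * 100, 6)
    return drop_points > max_drop_points


if __name__ == "__main__":
    for current in (0.90, 0.89, 0.88):
        print(current, recall_regressed(baseline=0.91, current=current))
```

Wiring the boolean into whatever alerting the team already uses is enough. The point is that the threshold is written down rather than argued about.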
Troubleshooting
Semantic ranking returns the same order as hybrid. Likely the prioritizedFields configuration is wrong. The semantic ranker needs explicit field priority; without it, semantic ranking is a no-op. Check semantic.configurations[0].prioritizedFields.
Vector search returns nothing despite documents existing. Embedding dimension mismatch. The index field's dimensions must match the model's output exactly. text-embedding-3-large is 3072; text-embedding-3-small is 1536. Mismatched documents fail to index, which goes unnoticed if the upload script ignores per-document results.
Hybrid search returns vector hits but no keyword hits. The analyzer on the content field is missing or wrong. en.microsoft for English; switch to the right language for non-English content.
Scoring profile gives no measurable lift. The profile's boost values are too low. Try doubling them and re-evaluating. Or the gold set doesn't expose the kind of bias the profile encodes (e.g., the gold set has no recency-sensitive queries).
Foundry agent does not appear to use my scoring profile. Set the profile as the index's defaultScoringProfile. Foundry's binding doesn't expose per-query scoring profile selection in current SDK versions.
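When semantic ranking looks like a no-op, it helps to confirm whether the re-ranker ran at all before touching the configuration. The sketch below assumes results are dict-like, as returned by `SearchClient.search`, with semantic queries carrying an `@search.reranker_score` entry. Both function names are hypothetical.

```python
"""Quick checks for whether semantic re-ranking had any effect."""


def semantic_ranker_fired(results: list[dict]) -> bool:
    """True when at least one result carries a re-ranker score."""
    return any(r.get("@search.reranker_score") is not None for r in results)


def order_changed(hybrid_results: list[dict], semantic_results: list[dict]) -> bool:
    """True when the two result lists differ in document order."""
    hybrid_ids = [r["id"] for r in hybrid_results]
    semantic_ids = [r["id"] for r in semantic_results]
    return hybrid_ids != semantic_ids
```

If `semantic_ranker_fired` is false, the query never reached the re-ranker, which points at the query parameters or the service tier. If it is true but `order_changed` is false across many queries, the prioritizedFields configuration is the more likely culprit.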
Real-world references
- Microsoft Learn, Hybrid search in Azure AI Search, the canonical reference.
- Microsoft Learn, Semantic ranking, the documentation for the cross-encoder re-ranker.
- Microsoft Learn, Add scoring profiles to a search index, the canonical scoring-profile reference.
- Microsoft Learn, Connect Azure AI Search to your Foundry agent, the Foundry-side wiring.
- GitHub, Azure-Samples/azure-search-vector-samples, Microsoft-published reference implementations including hybrid + semantic.
The hybrid-search and semantic-ranking pages from Microsoft Learn are the foundational references; everything in this article builds on those primitives.
What this gives you, beyond a recall improvement
The obvious win is the recall lift, which directly translates to "the agent answers more questions correctly." On the team I shipped this for, the lift from 71% to 89% recall@10 turned a roughly 18% no-match rate into a 6% no-match rate. Customers stopped asking the same question three times because the first answer was useless.
The less obvious win is what changes about how the team iterates on retrieval. The four-strategy comparison harness is a tool that turns retrieval tweaks into measurable experiments. Want to try a new chunking strategy? Plug it in, re-evaluate, see the recall change. Want to try fine-tuned embeddings? Same shape. The team stops debating whether changes help and starts measuring whether they help.
The far-out win is what becomes possible with the gold set as a fixture. A gold set is a reusable artefact that grounds every retrieval discussion. New team members onboard against it. Vendor evaluations (e.g., "should we switch to Pinecone?") run against it. Continuous-evaluation pipelines (covered in the next article in this series) run against it. The 200-question gold set is the foundation everything else stands on.
A year into running this pipeline, our team's RAG quality is measured weekly against a 312-question gold set (it grew). Recall@10 is at 91% on the latest run, with a per-week regression alert if it drops more than 2 points. The team has not had a "RAG quality" debate in eleven months; we've had measurements. That's what the harness gives you.
