damionas
No. 54 · Azure AI Foundry · Mar 30, 2026 · 26 min read

Continuous Evaluation for Foundry Agents With Prompt Flow and GitHub Actions

The Foundry agent we shipped to a customer-success team passed every hand-tested scenario before launch. Six weeks later, the team's manager pulled me aside: "the agent's getting worse at billing questions, but we can't tell when it changed." Nobody had touched the agent code. The model deployment hadn't been bumped. Yet the gut feeling was real: agent quality had drifted, and we had no way to see when or by how much.

The fix is the pattern Microsoft formalised as Prompt Flow evaluation pipelines wired into GitHub Actions: a frozen evaluation set, an evaluation flow that scores agent responses against the set, scheduled runs that produce score-over-time charts, and PR-time runs that catch regressions before they ship. After the rebuild, the same drift would have shown up as a 9-point drop on the daily Slack alert; the team would have known about it the next morning.

This post is the entire build. By the end you have a Prompt Flow evaluation flow that scores Foundry agent responses on three dimensions (groundedness, task completion, tone), a 50-conversation gold-set fixture, a GitHub Actions workflow that runs nightly and on every PR touching the agent, an Application Insights dashboard that tracks scores over time, and a Slack alert that fires when any dimension drops by more than 0.5 points on the 1-5 scale. About 300 lines of Python plus the YAML, with a clear separation between the evaluators (reusable across agents) and the gold set (per-agent).

Why Prompt Flow + GitHub Actions, and not just "test the agent"

Brief context because the question of "why this whole pipeline instead of unit tests" deserves a real answer.

Why agent quality needs more than unit tests. Agent outputs are stochastic. The same input produces slightly different outputs across runs. Testing for exact-match output is the wrong thing; testing whether the output meets quality thresholds is the right thing. That requires evaluators that can grade outputs subjectively (groundedness, tone) and aggregate across many runs.

Why Prompt Flow specifically. Microsoft's Prompt Flow gives you a directed graph for evaluation: input → run agent → score on dimension A, dimension B, dimension C → emit metrics. The graph is reusable across agents (the evaluators don't change; only the input dataset and the agent under test do). The flow runs both locally during development and as a step in any CI system. Prompt Flow also has a managed runtime in Foundry, so the same flow can be triggered from the Foundry UI for ad-hoc evaluation.

Why GitHub Actions, not the CI system you happen to have. Most teams I've shipped this for use GitHub Actions because that's where their agent code already lives. The pattern is identical for Azure DevOps, GitLab, etc. The trigger that matters is "run the eval flow on a schedule and on every PR that touches the agent."

Why gold set per agent. Reusable evaluators are great. Reusable gold sets are not. The gold set encodes what this agent is supposed to do; an agent for billing questions has a different gold set from an agent for technical support. Per-agent gold sets keep the evaluation honest.

What you'll have at the end

~/foundry-eval-pipeline/
├── flow/
│   ├── flow.dag.yaml                       # the Prompt Flow graph
│   ├── run-agent.py                        # invokes the Foundry agent
│   ├── evaluators/
│   │   ├── groundedness.py
│   │   ├── task-completion.py
│   │   └── tone.py
│   └── requirements.txt
├── data/
│   └── gold-set.jsonl                      # 50 conversation fixtures
├── .github/
│   └── workflows/
│       ├── eval-pr.yml                     # PR-time evaluation
│       └── eval-nightly.yml                # scheduled
├── alerts/
│   └── slack-on-regression.kql             # KQL alert
└── README.md

Prerequisites

  • A Microsoft Foundry project with a deployed agent. The build assumes you've already shipped one and want to add evaluation. → What is Microsoft Foundry?
  • A judge model deployment in Foundry — a gpt-4o deployment dedicated to evaluation. The judge runs the LLM-as-judge evaluations; keep it on a separate deployment from the production agent so eval traffic doesn't fight with user traffic for quota.
  • A GitHub repository for the agent code, where the workflows live. → Quickstart: Create a repository
  • An Application Insights workspace for the score telemetry → Create an Application Insights resource
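  • Python 3.12+ with the promptflow, promptflow-tools, azure-ai-projects, azure-identity, openai, azure-monitor-opentelemetry, and pytest packages.

python -m venv .venv && source .venv/bin/activate
pip install promptflow promptflow-tools azure-ai-projects azure-identity openai azure-monitor-opentelemetry pytest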

az login

Step 1: The gold-set fixture

data/gold-set.jsonl:

{"id": "billing-001", "user_message": "What was the $40 charge on my last invoice?", "context": {"customer_id": "cust-12345"}, "expected_behavior": {"intent": "billing_lookup", "must_call_tool": ["query_orders"], "must_not_invent_amounts": true, "tone": "warm-direct"}}
{"id": "billing-002", "user_message": "How do I dispute a charge?", "context": {"customer_id": "cust-12345"}, "expected_behavior": {"intent": "process_question", "must_cite_doc": true, "tone": "warm-direct"}}
{"id": "product-001", "user_message": "Does the Pro plan support SSO?", "context": {"customer_id": "cust-12345"}, "expected_behavior": {"intent": "product_question", "must_cite_doc": true, "must_say_yes_or_no": true, "tone": "warm-direct"}}

Fifty entries spanning the agent's expected workload. Each entry has:

  • user_message: the customer's input
  • context: thread metadata the agent has access to
  • expected_behavior: what the response should look like, in dimensions the evaluators check

The shape of expected_behavior is what each evaluator reads. Keep it declarative; reasoning lives in the evaluator, not the gold-set entry.
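Because every evaluator trusts the shape of these entries, it can help to validate the file before a run. The following is a minimal sketch, not part of the original build: the function name validate_gold_set and the duplicate-id rule are assumptions, and the required keys are taken from the three sample entries above.

```python
import json

# Keys every entry carries in the sample fixture above.
REQUIRED_KEYS = {"id", "user_message", "context", "expected_behavior"}


def validate_gold_set(lines):
    """Parse JSONL lines and return (valid_entries, error_messages).

    Hypothetical helper: rejects invalid JSON, non-object lines,
    entries with missing keys, and duplicate ids. Blank lines are skipped.
    """
    entries, errors, seen_ids = [], [], set()
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(entry, dict):
            errors.append(f"line {lineno}: entry is not a JSON object")
            continue
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            errors.append(f"line {lineno}: missing keys {sorted(missing)}")
            continue
        if entry["id"] in seen_ids:
            errors.append(f"line {lineno}: duplicate id {entry['id']!r}")
            continue
        seen_ids.add(entry["id"])
        entries.append(entry)
    return entries, errors
```

Calling it as validate_gold_set(open("data/gold-set.jsonl")) and failing the job when the error list is non-empty keeps a malformed fixture from silently skewing the scores.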

Step 2: The Prompt Flow graph

flow/flow.dag.yaml:

$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json

inputs:
  user_message:
    type: string
  context:
    type: object
  expected_behavior:
    type: object

outputs:
  groundedness_score:
    type: number
    reference: ${groundedness.output.score}
  task_completion_score:
    type: number
    reference: ${task_completion.output.score}
  tone_score:
    type: number
    reference: ${tone.output.score}
  agent_response:
    type: string
    reference: ${run_agent.output.response}

nodes:
  - name: run_agent
    type: python
    source:
      type: code
      path: run-agent.py
    inputs:
      user_message: ${inputs.user_message}
      context: ${inputs.context}

  - name: groundedness
    type: python
    source:
      type: code
      path: evaluators/groundedness.py
    inputs:
      response: ${run_agent.output.response}
      expected_behavior: ${inputs.expected_behavior}
      user_message: ${inputs.user_message}

  - name: task_completion
    type: python
    source:
      type: code
      path: evaluators/task-completion.py
    inputs:
      response: ${run_agent.output.response}
      expected_behavior: ${inputs.expected_behavior}
      tools_called: ${run_agent.output.tools_called}
      user_message: ${inputs.user_message}

  - name: tone
    type: python
    source:
      type: code
      path: evaluators/tone.py
    inputs:
      response: ${run_agent.output.response}
      expected_behavior: ${inputs.expected_behavior}

What this graph does:

  • run_agent node invokes the agent under test on a single input from the gold set. Returns the agent's response and which tools were called.
  • Three evaluator nodes run in parallel against the agent's output, each scoring on one dimension.
  • outputs block emits four values (one score per evaluator plus the raw response) for downstream aggregation.

The graph is the contract. New evaluator? Add a node. New input? Add to the inputs block. The flow file is small and human-readable.

Step 3: The agent runner node

flow/run-agent.py:

"""Invoke the Foundry agent on a single gold-set entry. Return its response
and the list of tools it called."""
from promptflow import tool
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
import os

PROJECT_ENDPOINT = os.environ["FOUNDRY_PROJECT_ENDPOINT"]
AGENT_ID = os.environ["AGENT_ID_UNDER_TEST"]


@tool
def run_agent(user_message: str, context: dict) -> dict:
    """Run the agent against one user message. Return response + tools called."""
    project = AIProjectClient(endpoint=PROJECT_ENDPOINT, credential=DefaultAzureCredential())

    # Create an isolated thread per evaluation so runs don't bleed into each other
    thread = project.agents.threads.create(metadata=context)
    project.agents.messages.create(thread_id=thread.id, role="user", content=user_message)

    run = project.agents.runs.create_and_process(thread_id=thread.id, agent_id=AGENT_ID)

    # Pull the final assistant message
    msgs = list(project.agents.messages.list(thread_id=thread.id))
    response = next(m for m in msgs if m.role == "assistant").content[0].text.value

    # Inspect run steps to see which tools the agent invoked
    steps = list(project.agents.run_steps.list(thread_id=thread.id, run_id=run.id))
    tools_called = []
    for step in steps:
        if step.type == "tool_calls":
            for call in step.step_details.tool_calls:
                # Function tools expose .function.name; built-in tools only carry a type
                tools_called.append(call.function.name if hasattr(call, 'function') else call.type)

    return {"response": response, "tools_called": tools_called}

A few non-obvious details:

  • @tool decorator marks this as a Prompt Flow node. The runtime injects inputs and reads outputs from the function signature.
  • Per-evaluation isolated thread — every gold-set entry gets a fresh thread, so cross-contamination between evaluations is impossible. Don't try to reuse threads to "save tokens"; the cleanliness is worth the cost.
  • Run steps inspection: the agent's tool-call history lives in the run steps (run_steps), not in messages. This is the only way to verify the agent called the right tools, which is what task_completion evaluates.
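The tool-call extraction is the part of this node most worth testing on its own, because a missed call turns into a false hard-fail later. Here is a minimal sketch that pulls the loop into a pure function; extract_tool_names is a hypothetical name, and the SimpleNamespace objects in the usage note are stand-ins that only mimic the attributes the loop reads, not real SDK types.

```python
from types import SimpleNamespace


def extract_tool_names(steps):
    """Collect tool names from run steps, in the order they were called.

    Function tools report their function name; other tool calls fall back
    to their type string, mirroring the loop in run-agent.py.
    """
    names = []
    for step in steps:
        if getattr(step, "type", None) != "tool_calls":
            continue
        details = getattr(step, "step_details", None)
        for call in getattr(details, "tool_calls", None) or []:
            function = getattr(call, "function", None)
            names.append(function.name if function is not None else call.type)
    return names


# Stand-in objects for a quick local check (assumed shapes, not SDK classes).
fake_steps = [
    SimpleNamespace(type="message_creation", step_details=SimpleNamespace()),
    SimpleNamespace(
        type="tool_calls",
        step_details=SimpleNamespace(
            tool_calls=[
                SimpleNamespace(type="function", function=SimpleNamespace(name="query_orders")),
                SimpleNamespace(type="file_search"),
            ]
        ),
    ),
]
```

With the stand-ins above, extract_tool_names(fake_steps) yields the function name for the first call and the type string for the second, which is the behaviour the task-completion evaluator depends on.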

Step 4: The groundedness evaluator

flow/evaluators/groundedness.py:

"""Score the agent's response for groundedness: does it stick to retrieved
content, or invent facts?"""
from promptflow import tool
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
import json
import os

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
judge = AzureOpenAI(
    azure_endpoint=os.environ["JUDGE_AOAI_ENDPOINT"],
    api_version="2024-10-21",
    azure_ad_token_provider=token_provider,
)
JUDGE_DEPLOYMENT = os.environ.get("JUDGE_DEPLOYMENT", "gpt-4o-judge")

JUDGE_PROMPT = """You are a strict evaluator. Score the assistant's response for GROUNDEDNESS.

GROUNDEDNESS means:
- The response cites or paraphrases information that exists in the conversation.
- The response does NOT invent specific facts (numbers, names, dates) that weren't in the inputs.
- If the response says "I don't have that information," that's a perfect groundedness score (no invented facts).

Return JSON only:
{{
  "score": <integer 1 to 5; 5 = fully grounded, 1 = mostly invented>,
  "reasoning": "<one sentence>"
}}

User message: {user_message}
Assistant response: {response}
Expected behavior: {expected_behavior}
"""


@tool
def evaluate_groundedness(response: str, expected_behavior: dict, user_message: str = "") -> dict:
    completion = judge.chat.completions.create(
        model=JUDGE_DEPLOYMENT,
        messages=[
            {
                "role": "system",
                "content": JUDGE_PROMPT.format(
                    user_message=user_message,
                    response=response,
                    expected_behavior=json.dumps(expected_behavior),
                ),
            }
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    parsed = json.loads(completion.choices[0].message.content)
    return {"score": int(parsed["score"]), "reasoning": parsed["reasoning"]}

The pattern repeats for the other two evaluators (task-completion.py, tone.py) with different prompts.

A few choices worth dwelling on:

LLM-as-judge for these dimensions. Groundedness, task completion, and tone are subjective enough that programmatic checks would be too brittle. An LLM judge with a clear rubric and temperature=0.0 produces stable enough scores at the population level (run-to-run variance under 0.2 on a 1-5 scale) to be usable.

Strict 1-5 scale with anchored definitions. Five is "perfect"; one is "almost entirely wrong." The judge has a well-defined rubric, not a vibe. Without the rubric, the judge drifts toward giving everything a 4 or 5.

Separate judge model from production agent. The judge needs a gpt-4o (or better) deployment because evaluation tasks are harder than the agent's own work. The production agent might be on gpt-4o-mini for cost; the judge stays on gpt-4o for quality. Don't share quota — eval bursts during nightly runs would starve the production agent.

response_format={"type": "json_object"} forces the model to return valid JSON. Without this, the judge occasionally returns prose with the JSON in a markdown code block, which the parser then chokes on.
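Even with JSON mode on, I find it worth parsing the judge's reply defensively so one odd response fails loudly instead of corrupting a nightly average. This is a minimal sketch under assumptions: parse_judge_output is a hypothetical helper, and the fallback for JSON wrapped in a markdown code block only covers the failure described above.

```python
import json
import re

# Three backticks, built programmatically so this snippet nests cleanly in docs.
FENCE = "`" * 3
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(\{.*\})\s*" + FENCE, re.DOTALL)


def parse_judge_output(raw: str) -> dict:
    """Parse a judge reply into {"score": int, "reasoning": str}.

    Accepts bare JSON or JSON wrapped in a markdown code block, and rejects
    scores outside the 1-5 rubric rather than silently recording them.
    """
    text = raw.strip()
    fenced = FENCED_JSON.search(text)
    if fenced:
        text = fenced.group(1)
    parsed = json.loads(text)
    score = int(parsed["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score {score} is outside the 1-5 rubric")
    return {"score": score, "reasoning": str(parsed.get("reasoning", ""))}
```

Swapping this in for the json.loads call in each evaluator is optional; the point is that a malformed or out-of-range score should stop the run, not lower the mean.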

Step 5: The task-completion evaluator (the most complex one)

flow/evaluators/task-completion.py:

"""Score the agent's response for task completion: did it fulfil the user's
intent, and did it call the right tools?"""
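from promptflow import tool
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
import json
import os

# Same judge setup as groundedness.py
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
judge = AzureOpenAI(
    azure_endpoint=os.environ["JUDGE_AOAI_ENDPOINT"],
    api_version="2024-10-21",
    azure_ad_token_provider=token_provider,
)
JUDGE_DEPLOYMENT = os.environ.get("JUDGE_DEPLOYMENT", "gpt-4o-judge")
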
JUDGE_PROMPT = """Score the assistant's TASK COMPLETION.

TASK COMPLETION means:
- The response addresses what the user asked.
- If the expected_behavior specifies must_call_tool, the agent did call those tools.
- The response is not deflective ("I can't help with that") unless the request was genuinely unsupported.

Tools the agent called: {tools_called}
Expected tools: {must_call_tool}

Return JSON: {{ "score": 1-5, "reasoning": "..." }}

User: {user_message}
Response: {response}
Expected: {expected_behavior}
"""


@tool
def evaluate_task_completion(response: str, expected_behavior: dict, tools_called: list, user_message: str = "") -> dict:
    must_call = expected_behavior.get("must_call_tool", [])

    # Hard-fail on tools-called mismatch before LLM judging (cheaper signal first)
    if must_call and not all(t in tools_called for t in must_call):
        return {
            "score": 1,
            "reasoning": f"Did not call required tools. Expected {must_call}, got {tools_called}.",
        }

    completion = judge.chat.completions.create(
        model=JUDGE_DEPLOYMENT,
        messages=[{
            "role": "system",
            "content": JUDGE_PROMPT.format(
                user_message=user_message, response=response,
                tools_called=tools_called, must_call_tool=must_call,
                expected_behavior=json.dumps(expected_behavior),
            ),
        }],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    parsed = json.loads(completion.choices[0].message.content)
    return {"score": int(parsed["score"]), "reasoning": parsed["reasoning"]}

The hard-fail check before the LLM judge is the pattern that catches the 80% case for free. If the agent didn't call the required tools, no LLM judge is going to rescue the score. Filter cheaply first; LLM-judge is the expensive step that only runs when the deterministic checks pass.
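The same cheap-first idea extends to any declarative flag in expected_behavior that can be checked without a model. A minimal sketch, assuming a hypothetical helper named deterministic_failures; the must_say_yes_or_no check is my own illustration of how that gold-set flag could be enforced, not something the original evaluators are confirmed to do.

```python
import re


def deterministic_failures(response, expected_behavior, tools_called):
    """Return a list of rule violations that need no LLM judge to detect.

    An empty list means the response passed the cheap checks and is worth
    sending to the judge.
    """
    failures = []

    # Required tools: every name in must_call_tool has to appear in tools_called.
    required = expected_behavior.get("must_call_tool", [])
    missing = [t for t in required if t not in tools_called]
    if missing:
        failures.append(f"missing required tools: {missing}")

    # Yes/no questions: the response must contain "yes" or "no" as a whole word.
    if expected_behavior.get("must_say_yes_or_no"):
        if not re.search(r"\b(yes|no)\b", response, re.IGNORECASE):
            failures.append("response never says yes or no")

    return failures
```

An evaluator could return score 1 with the joined failure messages as reasoning whenever this list is non-empty, and only call the judge otherwise.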

Step 6: The Application Insights metric emit

After Prompt Flow runs, the scores need to land somewhere queryable. Push them to Application Insights as custom metrics:

# flow/aggregate.py — runs after the flow batch completes
import os
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics

configure_azure_monitor(connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"])
meter = metrics.get_meter(__name__)

groundedness_hist = meter.create_histogram("eval.groundedness")
task_hist = meter.create_histogram("eval.task_completion")
tone_hist = meter.create_histogram("eval.tone")


def emit(results: list[dict], agent_id: str, run_id: str):
    """Push every per-entry score as a histogram observation."""
    for r in results:
        attrs = {"agent_id": agent_id, "run_id": run_id, "entry_id": r["id"]}
        groundedness_hist.record(r["groundedness_score"], attrs)
        task_hist.record(r["task_completion_score"], attrs)
        tone_hist.record(r["tone_score"], attrs)

The histogram metric type lets you query distributions later (p50, p95, mean) rather than only averages. For agent quality this matters — a high-mean score with a long tail of 1s and 2s is a different signal than a uniform 4.
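To make the mean-versus-tail point concrete, here is a small standalone illustration using made-up score lists (they are not results from the pipeline). The summarize helper is hypothetical and uses only the standard library.

```python
from statistics import mean, median


def summarize(scores):
    """Return the mean, median, minimum, and share of scores at or below 2."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
        "low_share": sum(1 for s in scores if s <= 2) / len(scores),
    }


# Illustrative data: both lists average 4, but only one hides a tail of 1s.
uniform = [4] * 20
long_tail = [5] * 15 + [1] * 5

uniform_summary = summarize(uniform)
long_tail_summary = summarize(long_tail)
```

Both summaries report the same mean, yet a quarter of the long_tail entries score 1. An alert built on the mean alone would treat the two runs as identical, which is why the per-entry observations are worth keeping.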

Step 7: The PR-time GitHub Actions workflow

.github/workflows/eval-pr.yml:

name: agent-eval-pr
on:
  pull_request:
    paths:
      - 'agent/**'
      - 'flow/**'
      - 'data/gold-set.jsonl'

permissions:
  id-token: write
  contents: read
  pull-requests: write

jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }

      - run: pip install -r flow/requirements.txt

      - uses: azure/login@v2
        with:
          client-id:       ${{ vars.AZURE_CLIENT_ID }}
          tenant-id:       ${{ vars.AZURE_TENANT_ID }}
          subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}

      - name: Run eval flow
        env:
          FOUNDRY_PROJECT_ENDPOINT: ${{ vars.FOUNDRY_PROJECT_ENDPOINT }}
          AGENT_ID_UNDER_TEST: ${{ vars.AGENT_ID }}
          JUDGE_AOAI_ENDPOINT: ${{ vars.JUDGE_AOAI_ENDPOINT }}
          APPLICATIONINSIGHTS_CONNECTION_STRING: ${{ vars.APPINSIGHTS_CS }}
        run: |
          pf run create \
            --flow flow/ \
            --data data/gold-set.jsonl \
            --column-mapping user_message='${data.user_message}' context='${data.context}' expected_behavior='${data.expected_behavior}' \
            --name "pr-${{ github.event.pull_request.number }}"

          # Compare to baseline (the score on main)
          python flow/compare-to-baseline.py \
            --pr-run "pr-${{ github.event.pull_request.number }}" \
            --baseline-tag main \
            --output report.md

      - name: Comment on PR
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: report.md

The PR comment shows the score deltas vs main: which dimensions improved, which regressed, with the per-entry differences. Reviewers see the eval impact of the change inline.
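The workflow calls flow/compare-to-baseline.py without showing it, so here is a rough sketch of the core comparison only. Everything in it is an assumption: the function names, the input shape (one dict per gold-set entry keyed by the flow's output names), and the 0.1-point noise floor. Loading the two runs from Prompt Flow and writing report.md are left out.

```python
from statistics import mean

# Output names from flow.dag.yaml.
DIMENSIONS = ("groundedness_score", "task_completion_score", "tone_score")


def dimension_deltas(pr_results, baseline_results):
    """Compute the mean score per dimension for both runs and the difference."""
    deltas = {}
    for dim in DIMENSIONS:
        pr_mean = mean(r[dim] for r in pr_results)
        base_mean = mean(r[dim] for r in baseline_results)
        deltas[dim] = {"pr": pr_mean, "baseline": base_mean, "delta": pr_mean - base_mean}
    return deltas


def render_report(deltas, noise_floor=0.1):
    """Render the deltas as a markdown table suitable for a PR comment."""
    lines = [
        "| Dimension | main | PR | Delta | Verdict |",
        "|---|---|---|---|---|",
    ]
    for dim, d in deltas.items():
        if d["delta"] < -noise_floor:
            verdict = "regressed"
        elif d["delta"] > noise_floor:
            verdict = "improved"
        else:
            verdict = "within noise"
        lines.append(
            f"| {dim} | {d['baseline']:.2f} | {d['pr']:.2f} | {d['delta']:+.2f} | {verdict} |"
        )
    return "\n".join(lines)
```

The "within noise" verdict is there so reviewers are not asked to react to deltas smaller than the judge's own run-to-run variance.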

Step 8: The nightly scheduled run + alert

.github/workflows/eval-nightly.yml:

name: agent-eval-nightly
on:
  schedule:
    - cron: '17 4 * * *'   # 04:17 UTC
  workflow_dispatch:

permissions:
  id-token: write
  contents: read

jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 45
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r flow/requirements.txt
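      - uses: azure/login@v2
        with:
          client-id:       ${{ vars.AZURE_CLIENT_ID }}
          tenant-id:       ${{ vars.AZURE_TENANT_ID }}
          subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}

      - name: Run nightly eval
        env:   # same variables as the PR workflow
          FOUNDRY_PROJECT_ENDPOINT: ${{ vars.FOUNDRY_PROJECT_ENDPOINT }}
          AGENT_ID_UNDER_TEST: ${{ vars.AGENT_ID }}
          JUDGE_AOAI_ENDPOINT: ${{ vars.JUDGE_AOAI_ENDPOINT }}
          APPLICATIONINSIGHTS_CONNECTION_STRING: ${{ vars.APPINSIGHTS_CS }}
        run: |
          pf run create \
            --flow flow/ \
            --data data/gold-set.jsonl \
            --name "nightly-$(date +%Y%m%d)"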

      # The KQL alert below will fire if scores regressed

The actual alert lives in App Insights as a saved KQL query plus an Action Group:

alerts/slack-on-regression.kql:

// Compare last night's scores to the trailing 7-day average; alert on a drop of more than 0.5 points
let recent = customMetrics
  | where timestamp > ago(1d)
  | where name in ('eval.groundedness', 'eval.task_completion', 'eval.tone')
  | summarize current_avg = avg(value) by name;
let baseline = customMetrics
  | where timestamp between (ago(8d) .. ago(1d))
  | where name in ('eval.groundedness', 'eval.task_completion', 'eval.tone')
  | summarize baseline_avg = avg(value) by name;

recent
| join kind=inner baseline on name
| extend delta = current_avg - baseline_avg
| where delta < -0.5  // 0.5 points on the 1-5 scale (10% of the maximum score)
| project name, current = current_avg, baseline = baseline_avg, delta

Wired to an Azure Monitor Action Group that posts to Slack, the alert message looks like:

Agent eval regression
eval.groundedness: current 3.6 (baseline 4.2) - dropped 0.6
Last 7 days: see https://portal.azure.com/...

The team's Slack channel sees this within 30 minutes of the nightly run completing. By morning standup, someone owns the investigation.
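If you want to unit-test the alert logic without waiting for a nightly run, the query's comparison is easy to mirror locally. This is a sketch of the same rule in Python, not the alert itself: find_regressions and format_alert are hypothetical names, and the inputs are plain dicts of metric name to average score.

```python
def find_regressions(recent, baseline, threshold=0.5):
    """Mirror the KQL rule: flag metrics whose recent average fell by more
    than `threshold` below the baseline average.

    Metrics missing from the baseline are skipped, like the inner join.
    """
    rows = []
    for name, current in recent.items():
        if name not in baseline:
            continue
        delta = current - baseline[name]
        if delta < -threshold:
            rows.append(
                {"name": name, "current": current, "baseline": baseline[name], "delta": delta}
            )
    return rows


def format_alert(row):
    """Render one regression row in the same shape as the Slack message."""
    return (
        f"{row['name']}: current {row['current']:.1f} "
        f"(baseline {row['baseline']:.1f}) - dropped {abs(row['delta']):.1f}"
    )
```

Feeding it a recent average of 3.6 against a baseline of 4.2 for eval.groundedness reproduces the message line shown above.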

Step 9: A real regression that this caught

About four months into running this pipeline, the nightly run flagged a 0.7-point drop on task_completion. The investigation found that a teammate had merged a "small" prompt update three days earlier that subtly changed how the orchestrator picked between the billing tool and the knowledge tool. The change improved one type of question (5% recall lift on FAQ-style billing) but broke another type (15% drop on account-specific billing) because the orchestrator now over-routed to the knowledge tool.

The fix took an hour. The detection took zero, because the pipeline did it for us. Without the pipeline, the team would have heard about this from customer-success at the end of the week.

That's the whole product, in one example.

Production checklist

  1. Pin the judge model. The judge is part of your evaluation infrastructure. Treat its model deployment like a versioned API; pinning the model ensures cross-quarter score comparisons are valid.

  2. Re-validate the gold set quarterly. Some entries become stale (the right answer changes; the doc is moved). Tag every entry with last_validated; review entries older than 6 months.

  3. Run nightly eval on a fixed time that doesn't compete with production traffic. Weeknights at 04:00 UTC works for most teams.

  4. Cache the agent's response per gold-set entry within a single eval run. If three evaluators all score the same response, only call the agent once. Saves significant token cost on every nightly run.

  5. Alert thresholds should be tuned to actual variance. Your gold set's natural run-to-run variance might be 0.3 points on the 1-5 scale; a 0.5-point alert is right. Lower it to 0.3 and you'll get false alarms; raise it to 1.0 and you'll miss real regressions. Calibrate against historical data.

  6. Don't block PRs on small score deltas. PR-time eval is a signal, not a gate. A 0.1-point drop on one dimension might be within noise; it shouldn't block a merge. Reserve hard fails for dramatic drops (e.g., >1 point on any dimension).
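For item 5, one way to calibrate is to derive the threshold from the spread of past nightly averages. This is a sketch of one possible rule, not the method the team used: suggest_threshold, the multiplier k=2.0, and the 0.3 floor are all assumptions to tune against your own history.

```python
from statistics import stdev


def suggest_threshold(nightly_means, k=2.0, floor=0.3):
    """Suggest an alert threshold as k standard deviations of past nightly
    averages for one dimension, never going below `floor`.

    `nightly_means` should come from a period with no known regressions.
    """
    if len(nightly_means) < 2:
        raise ValueError("need at least two nightly averages to estimate spread")
    return max(floor, k * stdev(nightly_means))
```

Run it per dimension over a few quiet weeks of data; if the suggested value is well above 0.5, the gold set or the judge is noisier than the alert assumes.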

Troubleshooting

Eval flow runs locally but fails in CI with "no module named promptflow." Almost always a missing or misdirected pip install step. pip install -r flow/requirements.txt can't find the file if an earlier step changed the working directory.

Judge returns inconsistent scores across runs. The judge model is at temperature > 0. Fix to 0.0. Even at 0.0 there's some variance (~5% run-to-run); that's the floor.

Hard-fail on must_call_tool fires when the agent did call the right tool. The tools_called extraction missed it. Common cause: the agent called the tool through a connected agent, which appears in steps differently than direct calls. Update the run-agent.py extractor to recurse into connected-agent step details.

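Nightly run takes 90 minutes for 50 gold-set entries. The entries are being processed one at a time; parallelise. Prompt Flow batch runs read the PF_WORKER_COUNT environment variable to decide how many lines to process concurrently; setting it to 5 (for example with --environment-variables PF_WORKER_COUNT=5 on pf run create) gives up to a 5x speedup at the cost of more concurrent agent quota.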

KQL alert never fires despite obvious regressions in the dashboard. The KQL query's time windows may not align with the run schedule. If your nightly runs at 04:17 UTC and the query uses ago(1d), edge cases at 04:00 might miss. Use ago(25h) for a small buffer.

Real-world references

The Microsoft Learn evaluation pages and the promptflow GitHub samples are the foundational references; everything in this article builds on those.

What this gives you, beyond the regression alert

The obvious win is regressions surface within hours instead of weeks. Agent quality stops being a feeling and starts being a metric. The team's confidence in shipping prompt changes goes up because there's a check that catches the obvious mistakes.

The less obvious win is what changes about how the team makes prompt-engineering decisions. Without the pipeline, "let's try this prompt change" is a debate with no resolution. With the pipeline, "let's try this prompt change" is an experiment with a measurable outcome. The team stops debating and starts measuring, which is the highest-leverage cultural change you can make on an AI workload.

The far-out win is what becomes possible when quality is observable. You can A/B-test prompt variants by routing some traffic through each and comparing scores. You can measure the impact of switching judge models. You can detect when a base model update degrades your agent (it happens). You can build a quality SLA for the agent that the org takes seriously.

A year into running this pipeline, the team I shipped this for has caught 19 regressions before they reached customers. Median time from regression-introduced to alert-fired is 14 hours. Average time to fix is 3 hours. That's the bill the 300 lines of Python paid for.
