Contract-Testing an MCP Server: Fixtures, Golden Files, and the Harness That Catches Most Regressions

The MCP server we run for cost queries had a regression last quarter that nobody caught for nine days. The Cost Management API changed the shape of the properties.rows array (a fourth column appeared), our parser silently mapped the wrong fields, and the model started reporting cost figures off by a factor of 100. The numbers still looked plausible, just wrong.

We rebuilt the test harness around two ideas: fixture-driven contract tests for the upstream Azure APIs, and golden-file tests for the markdown the server returns to the model. Total runtime: 6 seconds. Total bugs caught in the last quarter: 14. This is the harness.

What's hard about testing MCP servers

Three problems no normal test framework solves out of the box:

The server's "API" is a JSON-RPC envelope over stdio. Calling listTools() and callTool() from a test needs the same SDK wiring the client uses.
Real upstream calls are flaky, slow, and require credentials. Mocking them is the obvious answer, but mocks drift from reality.
Tool outputs are markdown. Diffing markdown means diffing whitespace and table alignment, which is noisy.

The harness solves each.

The shape

tests/
  fixtures/
    cost_mgmt_query_2024_08_01__by_service.json   ← captured upstream response
    cost_mgmt_query_2024_08_01__empty.json
    cost_mgmt_query_2024_08_01__rate_limited.txt
  golden/
    cost_by_service__one_week.md                   ← expected tool output
    cost_by_service__empty.md
  contract/
    cost_management.spec.ts                        ← runs against real Azure (nightly)
  unit/
    cost_by_service.spec.ts                        ← runs against fixtures (PR)

Two tiers of tests: unit runs against captured fixtures on every PR, contract runs against real Azure on a nightly schedule. Mocks only earn their keep when the contract suite proves they still match.

Recording fixtures (one-time per upstream)

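import { writeFileSync } from "node:fs";

// Recording mode: wrap fetch so every upstream response is also written to disk.
if (process.env.RECORD_FIXTURES === "1") {
  const original = global.fetch;
  global.fetch = async (input, init) => {
    const res = await original(input, init);
    // clone() so the caller can still read the body of the returned response
    const body = await res.clone().text();
    // input may be a string, a URL, or a Request
    const url =
      typeof input === "string" ? input : input instanceof URL ? input.href : input.url;
    const slug = url
      .replace(/[^a-z0-9]+/gi, "_")
      .toLowerCase()
      .slice(0, 80);
    writeFileSync(`tests/fixtures/${slug}.json`, body);
    return res;
  };
}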

The wrapper records whatever fetch returns. Run the tests once with RECORD_FIXTURES=1 against a real subscription and the files appear in tests/fixtures/. Commit them; future runs read from disk.
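One thing a URL-only slug cannot distinguish is two requests to the same endpoint with different bodies, and the Cost Management query endpoint is a POST. A minimal sketch of one way around that, assuming the request body is available as a string (for example from init.body in the wrapper). fixtureSlug and fnv1a are hypothetical names, not part of the harness:

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units. Not cryptographic;
// it only needs to tell one request body from another.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  // Left-pad to a fixed eight hex digits.
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Hypothetical helper: same slug rule as the recorder, plus a short
// body hash when a body is present.
function fixtureSlug(url: string, body: string = ""): string {
  const base = url
    .replace(/[^a-z0-9]+/gi, "_")
    .toLowerCase()
    .slice(0, 80);
  return body === "" ? base : `${base}__${fnv1a(body)}`;
}
```

Recorded names still need renaming to something readable (by_service, empty) before they are committed; the hash only keeps the raw captures from overwriting each other.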

Replaying fixtures in the unit tests

import { readFileSync } from "node:fs";
import { describe, it, expect, beforeEach, vi } from "vitest";
import { handleCostByService } from "../src/tools/cost.js";

beforeEach(() => {
  global.fetch = vi.fn(async (input) => {
    const url = typeof input === "string" ? input : input instanceof URL ? input.href : input.url;
    if (url.includes("/CostManagement/query")) {
      return new Response(
        readFileSync("tests/fixtures/cost_mgmt_query_2024_08_01__by_service.json"),
        { status: 200, headers: { "content-type": "application/json" } }
      );
    }
    throw new Error(`unexpected fetch: ${url}`);
  });
});

describe("cost_by_service", () => {
  it("returns a markdown table with per-service rows", async () => {
    const result = await handleCostByService({
      subscriptionId: "00000000-0000-0000-0000-000000000000",
      from: "2026-04-01",
      to:   "2026-04-08",
    });
    expect(result.content[0].text).toMatchSnapshot();
  });

  it("handles a 429 with a retry-after hint", async () => {
    global.fetch = vi.fn(async () =>
      new Response("rate limited", {
        status: 429,
        headers: { "retry-after": "30" },
      })
    );
    const result = await handleCostByService(/* ... */);
    expect(result.isError).toBe(true);
    expect(result.content[0].text).toMatch(/retry in 30s/);
  });
});

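expect(...).toMatchSnapshot() writes the snapshot on the first run and asserts equality on later runs. By default Vitest stores it in a __snapshots__/ directory beside the spec; toMatchFileSnapshot(path) is the variant that writes a standalone file such as tests/golden/cost_by_service__one_week.md. Either way the diff is the rendered markdown, which is easier to review than a structured-equality assertion.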
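For the 429 test above, the handler has to turn the upstream response into a tool error that carries the retry hint. A sketch of what that mapping could look like; rateLimitResult, the message wording, and the two interfaces are assumptions for illustration, not the server's actual code:

```typescript
// Minimal structural types so the sketch does not depend on a fetch implementation.
interface HttpResponseLike {
  status: number;
  headers: { get(name: string): string | null };
}

interface ToolResult {
  isError?: boolean;
  content: { type: "text"; text: string }[];
}

// Hypothetical helper: returns a tool error for a 429, or null for anything else.
function rateLimitResult(res: HttpResponseLike): ToolResult | null {
  if (res.status !== 429) return null;
  const raw = res.headers.get("retry-after");
  // Retry-After may also be an HTTP date; this sketch only handles the
  // delta-seconds form and falls back to a generic hint otherwise.
  const seconds = raw === null ? NaN : Number.parseInt(raw, 10);
  const hint =
    Number.isFinite(seconds) && seconds >= 0 ? `retry in ${seconds}s` : "retry later";
  return {
    isError: true,
    content: [{ type: "text", text: `Upstream rate limit hit; ${hint}.` }],
  };
}
```

Returning the hint in the text rather than throwing lets the model see the wait time and decide what to do with it.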

The golden-file pattern

Golden files live next to the tests; PR review is reviewing the output diff. When a markdown output changes:

- | Storage           | 1241.50 | USD |
+ | Storage           | 1241.45 | USD |

…the reviewer can decide whether the new value is correct. The same diff surfaces both an accidental change (we re-classified a resource and the cost moved) and an intentional one (we changed the rounding logic on purpose).
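The whitespace and table-alignment noise mentioned earlier can be reduced by normalising the markdown before it is compared, so that only value changes like the one above show up in the diff. A rough sketch, assuming simple pipe tables with no escaped pipes inside cells; normalizeMarkdown is a hypothetical helper, not something the harness is known to contain:

```typescript
// Hypothetical normaliser: strips trailing whitespace and collapses cell
// padding in pipe tables. Limitation: a literal "\|" inside a cell would be
// split as if it were a column boundary.
function normalizeMarkdown(md: string): string {
  return md
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => {
      const trimmed = line.replace(/\s+$/, "");
      // Leave anything that is not a table row untouched.
      if (!/^\s*\|.*\|$/.test(trimmed)) return trimmed;
      const cells = trimmed
        .trim()
        .slice(1, -1)
        .split("|")
        .map((cell) => cell.trim());
      // Separator cells such as ------ or :----: collapse to three dashes,
      // keeping any alignment colons.
      const canonical = cells.map((cell) =>
        /^:?-+:?$/.test(cell) ? cell.replace(/-+/, "---") : cell
      );
      return `| ${canonical.join(" | ")} |`;
    })
    .join("\n");
}
```

The trade-off is that the golden file no longer matches the exact bytes sent to the model, so this fits best when padding is cosmetic and not something the model depends on.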

The contract suite (real Azure, nightly)

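import { describe, it, expect } from "vitest";

// Contract tests only run when CONTRACT=1 is set (the nightly job); otherwise they are skipped.
const RUN_CONTRACT = process.env.CONTRACT === "1";
const contractIt = RUN_CONTRACT ? it : it.skip;

describe("cost_management contract", () => {
  contractIt("returns the row shape we parse against", async () => {
    // callRealCostMgmt is the project's wrapper around the live query call (import not shown).
    const res = await callRealCostMgmt(/* small known subscription, narrow window */);
    // Column names and order are exactly what the parser depends on.
    expect(res.properties.columns.map((c: { name: string }) => c.name)).toEqual([
      "PreTaxCost",
      "ServiceName",
      "Currency",
    ]);
    expect(Array.isArray(res.properties.rows)).toBe(true);
    expect(res.properties.rows[0]).toHaveLength(3);
  });
});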

This is the test that catches an upstream schema change. A nightly Container Apps job runs vitest run --reporter=junit with CONTRACT=1 and sends the results to App Insights; an alert fires when the suite goes red.

The first time it caught something was eleven weeks in: Cost Mgmt added a Cost column alongside PreTaxCost and the previous rows-of-three contract assertion failed instantly. We had a fix in before the bad fixture made it into a regular release.
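The contract test detects the change; the parser can also be made less fragile by looking columns up by name instead of by position, so an extra column is ignored and a missing one fails loudly. A sketch under the assumption that the response carries properties.columns and properties.rows as in the test above; parseCostRows and both interfaces are hypothetical:

```typescript
interface QueryResult {
  properties: {
    columns: { name: string; type: string }[];
    rows: (string | number)[][];
  };
}

interface CostRow {
  service: string;
  cost: number;
  currency: string;
}

// Hypothetical parser: resolves each field's index from the column names,
// so a new upstream column does not shift the mapping.
function parseCostRows(result: QueryResult): CostRow[] {
  const { columns, rows } = result.properties;
  const indexOf = (name: string): number => {
    const i = columns.findIndex((c) => c.name === name);
    if (i === -1) throw new Error(`missing column: ${name}`);
    return i;
  };
  const cost = indexOf("PreTaxCost");
  const service = indexOf("ServiceName");
  const currency = indexOf("Currency");
  return rows.map((row) => ({
    service: String(row[service]),
    cost: Number(row[cost]),
    currency: String(row[currency]),
  }));
}
```

This does not replace the contract assertion: a renamed column or a change in units would still need the nightly check to notice.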

What broke first

Snapshot churn from non-deterministic ordering. The first version of cost_by_service returned services in API-natural order, which was sometimes by PreTaxCost desc, sometimes alphabetical, depending on the upstream system's mood. Snapshot diffs were enormous. Fix: sort deterministically before formatting (cost desc, service asc). Snapshots stabilised immediately.
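A deterministic sort of that kind might look like the following, assuming the primary key is the cost figure and the service name breaks ties. ServiceCost and sortForOutput are illustrative names:

```typescript
interface ServiceCost {
  service: string;
  cost: number;
}

// Cost descending, then service name ascending. Names are compared by code
// unit and not with localeCompare, so the order does not depend on the
// machine's locale.
function sortForOutput(rows: readonly ServiceCost[]): ServiceCost[] {
  return [...rows].sort((a, b) => {
    if (a.cost !== b.cost) return b.cost - a.cost;
    if (a.service < b.service) return -1;
    if (a.service > b.service) return 1;
    return 0;
  });
}
```

The tie-breaker matters more than it looks: without it, two services with equal cost can still swap places between runs.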

new Date() in the formatter. It was used to render the "as of" date in the output, which broke replay tests because the date moved every day. Fix: inject a now() clock into the formatter and freeze it in tests:

function formatCost(rows, opts: { now?: () => Date } = {}) {
  // Date.now returns a number, so the default has to construct a Date
  const now = (opts.now ?? (() => new Date()))();
  // ...
}
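Freezing the clock from a test is then a matter of passing a function that always returns the same Date. A small sketch, with formatAsOf as a hypothetical stand-in for the part of the formatter that prints the date:

```typescript
interface FormatOptions {
  now?: () => Date;
}

// Hypothetical stand-in for the date line of the real formatter.
function formatAsOf(opts: FormatOptions = {}): string {
  const now = (opts.now ?? (() => new Date()))();
  // toISOString is always UTC, so the output does not depend on the
  // time zone of the machine running the tests.
  return `as of ${now.toISOString().slice(0, 10)}`;
}

// In a test: a fixed clock makes the output identical on every run.
const frozenNow = () => new Date("2024-08-01T00:00:00Z");
const header = formatAsOf({ now: frozenNow });
```

Vitest's fake timers (vi.useFakeTimers with vi.setSystemTime) are an alternative that needs no parameter, at the cost of patching a global.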

Coverage on the wrong layer. Initial tests covered the SDK request handlers, which are mostly glue: switch statements and arg validation. The logic that actually breaks is in the upstream parser. I now require unit-test coverage on tools/*.ts (the parsers and formatters), not on server.ts (the wiring). The coverage report excludes server.ts from thresholds.

Authentication in tests. DefaultAzureCredential slows test boot by ~1.5s while it walks its credential chain in order (environment variables, managed identity via IMDS, the Azure CLI, and so on). In tests, replace it with a stub that returns a fixed token:

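vi.mock("@azure/identity", () => ({
  // Fixed token with a future expiry, so callers that check expiry do not try to refresh.
  DefaultAzureCredential: class {
    async getToken() { return { token: "test", expiresOnTimestamp: Date.now() + 3_600_000 }; }
  },
}));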

Test boot dropped from 1.8s to 0.2s.

What I'd cut

The RUN_CONTRACT ? it : it.skip gating. It works, but mixing contract and unit tests in the same file is the wrong factoring: they have different signals, different cadences, and different failure remediation. I'd rather have two top-level test trees, two CI jobs, and two destinations for the failure alerts. Cleaner.

I would NOT mock DefaultAzureCredential in any production-adjacent code path. Mocks for unit tests are fine; mocks for staging integration tests have shipped subtle auth bugs more than once. Run those against real Azure, even if slower.
