End-to-End Observability for Azure AI Foundry Agents With OpenTelemetry and Application Insights

A production Foundry agent we ran for an internal customer started returning slow responses for one tenant on a Tuesday afternoon. Median latency for everyone else was steady at 1.4 seconds; for tenant acme-corp it was 11 seconds. We had no agent traces, no per-tool timing, no token-budget telemetry — just chat completions floating up to Application Insights as request logs that said "POST /chat 200 in 11s."

I spent four hours that afternoon working backwards from those 11-second logs. The slow tenant was hitting an Azure AI Search retrieval that pulled six times more candidate documents than other tenants because their content collection was unexpectedly large. Once we knew that, the fix was thirty minutes. The four hours was the cost of having no observability.

I rebuilt the agent host's instrumentation from scratch the next day. By the end of the week, every agent span was visible in App Insights, every tool call had p50/p95 timing, and every token consumed was attributed to a tenant and a model deployment. The next time something went weird, the investigation took eight minutes from page to root cause.

This post is the entire build. By the end you have an Azure AI Foundry agent host instrumented with OpenTelemetry, with custom spans for tool calls, token usage as a metric, trace correlation across agent-to-tool-to-Foundry hops, a sampling strategy that holds up at scale, Application Insights as the OTel backend, and KQL queries that turn the raw spans into the dashboards your on-call actually opens at 3am.

Why OpenTelemetry, and not the Foundry SDK's built-in trace export

Quick context because this is the choice every team running Foundry agents has to make on day one.

The Foundry SDK ships its own tracing module that exports to Application Insights through the Azure Monitor exporter. It works. It's also locked into a span shape that prioritises Foundry-internal concerns (model version, prompt template ID, content-filter result) over the questions you'll actually ask in production (which tenant's call was slow, which tool took the time, did this trace cross a service boundary). It's the right starting point and the wrong stopping point.

OpenTelemetry on top of the Foundry SDK gets you both. The Foundry-native spans are still emitted; you add your own custom spans alongside, all under the same trace ID, all flowing to the same backend. The investment is roughly a day of work and the payoff compounds for as long as the agent is in production.

Why Application Insights as the backend instead of Jaeger, Honeycomb, Datadog, or another OTel-compliant store: you almost certainly have an Application Insights workspace already, the OTel-to-AppInsights exporter is mature, and your existing on-call rotation already opens AppInsights when something goes wrong. The cost of standing up a separate observability stack is not paid back by an agent workload. If your org has standardised on Honeycomb or Datadog for everything, use that; the OTel side of this build is identical.

What you'll have at the end

~/foundry-agent-otel/
├── agent-host/
│   ├── src/
│   │   ├── server.ts
│   │   ├── agent.ts
│   │   ├── tools/
│   │   │   ├── search.ts
│   │   │   └── lookup.ts
│   │   ├── telemetry/
│   │   │   ├── tracer.ts
│   │   │   ├── metrics.ts
│   │   │   └── attributes.ts
│   │   └── middleware/
│   │       ├── tenant.ts
│   │       └── trace-context.ts
│   ├── tests/
│   │   └── span-assertions.test.ts
│   ├── Dockerfile
│   └── package.json
├── infra/
│   ├── app-insights.bicep
│   └── kql/
│       ├── tenant-latency.kql
│       ├── tool-budget.kql
│       └── token-cost-by-tenant.kql
├── dashboards/
│   └── foundry-agent-workbook.json
└── README.md

Prerequisites

node --version          # v22+
az --version            # 2.65+

You also need:

An Azure AI Foundry project with at least one agent deployed (this post uses a single agent; the pattern generalises)
An Application Insights workspace (we'll provision one if you don't have one)
Sufficient quota for the Foundry deployment to handle a couple of synthetic test calls

az login
az account set --subscription "<your-subscription-id>"

A note on Foundry SDK versions: the @azure/ai-foundry package's tracing surface stabilised in late 2025; if you're on an older preview, the span attributes have different names and the schema in the KQL queries below won't match. Pin to @azure/ai-foundry@1.x and check the span attribute reference at the top of telemetry/attributes.ts against your installed version.

Step 1: Provision Application Insights with workspace mode

infra/app-insights.bicep:

param location string = resourceGroup().location
param name string = 'ai-foundry-obs-${uniqueString(resourceGroup().id)}'
param retentionDays int = 90

resource workspace 'Microsoft.OperationalInsights/workspaces@2023-09-01' = {
  name: 'law-${name}'
  location: location
  properties: {
    sku: { name: 'PerGB2018' }
    retentionInDays: retentionDays
    features: {
      enableDataExport: true
    }
  }
}

resource appInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: name
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: workspace.id
    SamplingPercentage: 100   // sampling happens in the SDK, not at AI ingest
    DisableIpMasking: false
  }
}

output connectionString string = appInsights.properties.ConnectionString
output workspaceId string = workspace.id
output workspaceCustomerId string = workspace.properties.customerId

Two non-obvious choices: workspace mode (instead of classic AI) because workspace mode lets you write KQL across both AI telemetry and other Log Analytics tables, which you'll want when correlating agent traces with the AKS or Container Apps logs underneath. And SamplingPercentage: 100 because we're doing all sampling in the SDK; AI ingest sampling is opaque and prone to dropping the spans you wanted to keep.

90 days retention is the sweet spot for an agent workload. Long enough to investigate "this happened last sprint"; short enough that it doesn't blow the Log Analytics retention cost.

Step 2: The tracer initialisation

agent-host/src/telemetry/tracer.ts:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";
import { AzureMonitorTraceExporter } from "@azure/monitor-opentelemetry-exporter";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

const SERVICE_NAME = "foundry-agent-host";
const SERVICE_VERSION = process.env.SERVICE_VERSION ?? "0.1.0";
const SAMPLE_RATE = Number(process.env.OTEL_SAMPLE_RATE ?? "0.1");

export function startTracing() {
  const sdk = new NodeSDK({
    resource: new Resource({
      [SemanticResourceAttributes.SERVICE_NAME]: SERVICE_NAME,
      [SemanticResourceAttributes.SERVICE_VERSION]: SERVICE_VERSION,
      [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.ENV ?? "dev",
    }),

    // Head sampler: one ratio decision at the root, inherited by every child.
    // This does NOT keep errors on its own; see the production checklist.
    sampler: new ParentBasedSampler({
      root: new TraceIdRatioBasedSampler(SAMPLE_RATE),
    }),

    traceExporter: new AzureMonitorTraceExporter({
      connectionString: process.env.APPLICATIONINSIGHTS_CONNECTION_STRING!,
    }),
  });

  sdk.start();

  process.on("SIGTERM", () => {
    sdk.shutdown().finally(() => process.exit(0));
  });
}

The ParentBasedSampler choice matters more than the head ratio does. In a multi-service trace (client → agent host → Foundry → tool → search index), you want a single decision per trace, made at the root, propagated downstream. Without ParentBasedSampler you'd get inconsistent fragments where some hops were sampled and others weren't, which makes the trace useless for the latency questions you actually have.

The 10% head ratio is the right starting point for production traffic. Drop it to 1% if your agent does more than ~50 RPS and the AppInsights ingest cost starts mattering. Bump it to 100% during the first week of a new deployment when you genuinely want every trace for debugging.
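
To see why a ratio decision keyed on the trace ID stays consistent from hop to hop, here is a simplified sketch. It is not the algorithm TraceIdRatioBasedSampler uses internally; the function name shouldSampleByTraceId and the choice of the last 8 hex digits are assumptions made for illustration.

```typescript
// Simplified illustration of a deterministic, trace-ID-keyed ratio decision.
// NOT the OpenTelemetry SDK's internal algorithm; the name is hypothetical.
function shouldSampleByTraceId(traceId: string, ratio: number): boolean {
  if (!/^[0-9a-f]{32}$/.test(traceId)) {
    throw new Error(`not a 32-hex-digit trace ID: ${traceId}`);
  }
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the last 8 hex digits as an unsigned 32-bit number in [0, 2^32).
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < ratio * 2 ** 32;
}

const exampleTraceId = "4bf92f3577b34da6a3ce929d0e0e4736";
// Any service that sees this trace ID and uses the same ratio agrees.
const atAgentHost = shouldSampleByTraceId(exampleTraceId, 0.1);
const atToolService = shouldSampleByTraceId(exampleTraceId, 0.1);
console.log(atAgentHost === atToolService); // true
```

Because the decision depends only on the trace ID and the ratio, services that share both agree without talking to each other. ParentBasedSampler goes one step further and reuses the caller's decision from the propagated context.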

Step 3: Custom span attributes for the questions you'll ask

agent-host/src/telemetry/attributes.ts:

/**
 * Span attributes we set on every agent and tool span. Names are stable;
 * KQL queries in /infra/kql depend on them.
 */
export const ATTRS = {
  // Tenant identity (must be set on every root span)
  TENANT_ID: "ai.tenant.id",
  TENANT_TIER: "ai.tenant.tier",   // free | starter | pro | enterprise
  USER_OID: "ai.user.oid",

  // Agent identity
  AGENT_ID: "ai.agent.id",
  AGENT_VERSION: "ai.agent.version",

  // Model deployment (must be set on every Foundry call)
  MODEL_DEPLOYMENT: "ai.model.deployment",
  MODEL_NAME: "ai.model.name",       // gpt-4o | gpt-4o-mini | o1-mini

  // Token usage
  TOKENS_PROMPT: "ai.tokens.prompt",
  TOKENS_COMPLETION: "ai.tokens.completion",
  TOKENS_TOTAL: "ai.tokens.total",

  // Tool calls
  TOOL_NAME: "ai.tool.name",
  TOOL_OUTCOME: "ai.tool.outcome",   // success | error | timeout

  // Cost attribution
  COST_USD: "ai.cost.usd",

  // Streaming
  IS_STREAMING: "ai.streaming",
  TIME_TO_FIRST_TOKEN_MS: "ai.streaming.ttft_ms",
} as const;

Treat this file like a contract. Every attribute set on a span, in the agent or in any tool implementation, takes its key from this file; the KQL queries depend on these names; the dashboards depend on the KQL. If you rename ai.tenant.id later, you break every downstream query and the on-call dashboard goes blank.

That's not theoretical. I have shipped a "small refactor" that renamed two attribute keys and silently broke six dashboards. The KQL queries failed open (returned no rows) instead of closed, which means nobody noticed for three weeks. Now there's a CI check that verifies the names in this file match the names in the KQL files, which is the kind of test you only think to write after you've been bitten.
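
A minimal sketch of what such a CI check could look like, assuming the KQL files reference attributes as customDimensions["..."]. The names keysUsedInKql and unknownKeys are hypothetical, and the contract object is a shortened stand-in for ATTRS.

```typescript
// Hypothetical CI helper: report customDimensions keys used in KQL that are
// not part of the attribute contract. Shortened stand-in for ATTRS.
const ATTR_CONTRACT: Record<string, string> = {
  TENANT_ID: "ai.tenant.id",
  TOOL_NAME: "ai.tool.name",
  COST_USD: "ai.cost.usd",
};

// Collect every key referenced as customDimensions["key"] or ['key'].
function keysUsedInKql(kql: string): string[] {
  const pattern = /customDimensions\[\s*["']([^"']+)["']\s*\]/g;
  const keys: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(kql)) !== null) {
    keys.push(match[1]);
  }
  return keys;
}

// Keys the query uses that the contract does not define, de-duplicated.
function unknownKeys(kql: string, contract: Record<string, string>): string[] {
  const known = new Set(Object.values(contract));
  const unique = Array.from(new Set(keysUsedInKql(kql)));
  return unique.filter((key) => !known.has(key));
}

const sampleKql = `
dependencies
| extend tenant = tostring(customDimensions["ai.tenant.id"])
| extend tool = tostring(customDimensions["ai.tool_name"])
`;

const drift = unknownKeys(sampleKql, ATTR_CONTRACT);
console.log(drift); // the misspelled key ai.tool_name is reported
```

In a pipeline you would read each file under infra/kql/, run the same check against the real ATTRS object, and exit non-zero when the list is not empty.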

Step 4: The agent span wrapper

agent-host/src/agent.ts:

import { trace, SpanStatusCode } from "@opentelemetry/api";
import { ATTRS } from "./telemetry/attributes.js";
import { searchTool } from "./tools/search.js";
import { lookupTool } from "./tools/lookup.js";

const tracer = trace.getTracer("foundry-agent-host");

type Tenant = { id: string; tier: string };
type AgentRequest = {
  tenant: Tenant;
  userOid: string;
  agentId: string;
  agentVersion: string;
  prompt: string;
};

export async function runAgent(req: AgentRequest, foundryClient: any) {
  return tracer.startActiveSpan("agent.run", async (span) => {
    span.setAttributes({
      [ATTRS.TENANT_ID]: req.tenant.id,
      [ATTRS.TENANT_TIER]: req.tenant.tier,
      [ATTRS.USER_OID]: req.userOid,
      [ATTRS.AGENT_ID]: req.agentId,
      [ATTRS.AGENT_VERSION]: req.agentVersion,
    });

    try {
      const response = await foundryClient.runs.create({
        agent_id: req.agentId,
        thread: { messages: [{ role: "user", content: req.prompt }] },
        // Tools the agent can call. We wrap each so we can emit a child span.
        tool_handlers: {
          search: wrappedTool("search", searchTool, req.tenant),
          lookup: wrappedTool("lookup", lookupTool, req.tenant),
        },
      });

      // Foundry returns usage on the run object. response.model is used for
      // both attributes here; if your deployment name differs from the model
      // name, set MODEL_DEPLOYMENT from your own configuration instead.
      span.setAttributes({
        [ATTRS.MODEL_DEPLOYMENT]: response.model,
        [ATTRS.MODEL_NAME]: response.model,
        [ATTRS.TOKENS_PROMPT]: response.usage.prompt_tokens,
        [ATTRS.TOKENS_COMPLETION]: response.usage.completion_tokens,
        [ATTRS.TOKENS_TOTAL]: response.usage.total_tokens,
        [ATTRS.COST_USD]: estimateCostUsd(response.model, response.usage),
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (err: any) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

function wrappedTool<T>(name: string, fn: (args: any) => Promise<T>, tenant: Tenant) {
  return async (args: any) => {
    return tracer.startActiveSpan(`tool.${name}`, async (span) => {
      // Attributes are not inherited from agent.run, so tenant identity is set
      // here as well; tool-budget.kql groups tool spans by ai.tenant.id.
      span.setAttributes({
        [ATTRS.TOOL_NAME]: name,
        [ATTRS.TENANT_ID]: tenant.id,
        [ATTRS.TENANT_TIER]: tenant.tier,
      });
      try {
        const result = await fn(args);
        span.setAttribute(ATTRS.TOOL_OUTCOME, "success");
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (err: any) {
        span.setAttribute(ATTRS.TOOL_OUTCOME, "error");
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
        throw err;
      } finally {
        span.end();
      }
    });
  };
}

function estimateCostUsd(model: string, usage: any): number {
  const rates: Record<string, [number, number]> = {
    "gpt-4o":      [2.5, 10.0],
    "gpt-4o-mini": [0.15, 0.6],
    "o1-mini":     [3.0, 12.0],
  };
  const [inUsd, outUsd] = rates[model] ?? [0, 0];
  return (usage.prompt_tokens * inUsd + usage.completion_tokens * outUsd) / 1_000_000;
}

The wrappedTool pattern is the single most-leveraged piece of this build. Every tool the agent can call gets a child span, with name and outcome attributes that show up in the per-tool dashboards. Without this wrap, tool latency is invisible; with it, "which tool is slow for which tenant" is one KQL query away.
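
The attribute contract lists timeout as a possible ai.tool.outcome, but the wrapper above only ever records success or error. One way to fill that gap is sketched below, assuming a per-tool deadline is acceptable. ToolTimeoutError, withTimeout, and outcomeForError are hypothetical names, not part of any SDK.

```typescript
type ToolOutcome = "success" | "error" | "timeout";

// Hypothetical error type so a timeout can be told apart from a tool failure.
class ToolTimeoutError extends Error {
  constructor(toolName: string, ms: number) {
    super(`tool ${toolName} timed out after ${ms} ms`);
    this.name = "ToolTimeoutError";
  }
}

// Reject with ToolTimeoutError if `work` has not settled within `ms`.
// This only stops waiting; it does not cancel the underlying work.
function withTimeout<T>(toolName: string, ms: number, work: Promise<T>): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new ToolTimeoutError(toolName, ms)), ms);
  });
  return Promise.race([work, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Map a caught error to the value for the ai.tool.outcome attribute.
function outcomeForError(err: unknown): ToolOutcome {
  return err instanceof Error && err.name === "ToolTimeoutError" ? "timeout" : "error";
}
```

Inside wrappedTool this would mean awaiting withTimeout(name, deadlineMs, fn(args)) instead of fn(args), and setting ATTRS.TOOL_OUTCOME to outcomeForError(err) in the catch block. The deadline value is yours to pick per tool.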

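The cost-per-call attribute (ai.cost.usd) is what makes the FinOps story work. The same trace that shows latency also carries dollar figures, so the question "how much did tenant X spend on agent calls today" is answered by summing one attribute. With head sampling below 100% that sum covers only the sampled traces, so treat it as an estimate and scale it by the sample rate. The rates are USD per one million tokens (input, output) and are pinned in code; bump them when Microsoft changes pricing (the rates here match the late-2025 published list). A model name that is missing from the table falls back to a cost of 0.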
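
A small variant of the estimator, sketched under the assumption that you would rather skip the cost attribute than record a misleading 0 for an unknown model. The name estimateCostUsdOrUndefined is hypothetical; the rates are copied from the table above.

```typescript
type Usage = { prompt_tokens: number; completion_tokens: number };

// USD per 1M tokens as [input, output]; copied from the rates table above.
const RATES_PER_MILLION: Record<string, [number, number]> = {
  "gpt-4o": [2.5, 10.0],
  "gpt-4o-mini": [0.15, 0.6],
  "o1-mini": [3.0, 12.0],
};

// Returns undefined for models missing from the table instead of 0,
// so the caller can leave ai.cost.usd unset and alert on the gap.
function estimateCostUsdOrUndefined(model: string, usage: Usage): number | undefined {
  const rate = RATES_PER_MILLION[model];
  if (rate === undefined) return undefined;
  const [inUsd, outUsd] = rate;
  return (usage.prompt_tokens * inUsd + usage.completion_tokens * outUsd) / 1_000_000;
}

const sampleUsage: Usage = { prompt_tokens: 120, completion_tokens: 80 };
// Worked example: (120 * 0.15 + 80 * 0.6) / 1e6 = 66 / 1e6 USD.
const knownCost = estimateCostUsdOrUndefined("gpt-4o-mini", sampleUsage);
const unknownCost = estimateCostUsdOrUndefined("my-custom-deployment", sampleUsage);
console.log(knownCost, unknownCost);
```

The call site would then set ATTRS.COST_USD only when the result is defined, which turns a silent zero into a missing attribute you can query for.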

Step 5: Tenant identity middleware

agent-host/src/middleware/tenant.ts:

import type { Request, Response, NextFunction } from "express";
import { trace } from "@opentelemetry/api";
import { ATTRS } from "../telemetry/attributes.js";

export function tenantContextMiddleware(req: Request, res: Response, next: NextFunction) {
  // Tenant ID is sourced from the validated JWT claim earlier in the chain.
  const tenantId = (req as any).claims?.tenantId;
  if (!tenantId) {
    return res.status(401).json({ error: "missing_tenant" });
  }
  const tenantTier = (req as any).claims?.tenantTier ?? "unknown";

  // Tag the active span (typically the HTTP request span). Child spans do
  // not inherit these attributes, so runAgent sets them again on agent.run.
  const span = trace.getActiveSpan();
  if (span) {
    span.setAttributes({
      [ATTRS.TENANT_ID]: tenantId,
      [ATTRS.TENANT_TIER]: tenantTier,
    });
  }

  next();
}

The middleware tags the active HTTP request span, which is the parent of every agent span that follows. That puts tenant identity on the entry-point span for queries that group by request; the agent root span sets the same attributes again in runAgent.

The re-setting is deliberate: span attributes are not inherited automatically in OpenTelemetry. If you want tenant ID on a child span, you set it explicitly. Don't rely on parent-attribute lookup at query time; joining child spans back to their parents across a long trace gets expensive in KQL at scale.

Step 6: KQL queries for the questions you'll ask

infra/kql/tenant-latency.kql:

// p50 and p95 latency by tenant for the last hour
dependencies
| where timestamp > ago(1h)
| where name == "agent.run"
| extend tenant = tostring(customDimensions["ai.tenant.id"])
| extend tenantTier = tostring(customDimensions["ai.tenant.tier"])
| where isnotempty(tenant)
| summarize
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    n = count()
  by tenant, tenantTier
| order by p95 desc

infra/kql/tool-budget.kql:

// Tool latency budget breakdown per agent run
dependencies
| where timestamp > ago(1h)
| where name startswith "tool."
| extend tool = tostring(customDimensions["ai.tool.name"])
| extend tenant = tostring(customDimensions["ai.tenant.id"])
| summarize
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99),
    n = count()
  by tool, tenant
| order by p95 desc

infra/kql/token-cost-by-tenant.kql:

// Daily cost by tenant, deployment, and model
dependencies
| where timestamp > ago(7d)
| where name == "agent.run"
| extend tenant = tostring(customDimensions["ai.tenant.id"])
| extend deployment = tostring(customDimensions["ai.model.deployment"])
| extend model = tostring(customDimensions["ai.model.name"])
| extend cost = todouble(customDimensions["ai.cost.usd"])
| summarize
    totalCost = sum(cost),
    callCount = count(),
    totalTokens = sum(toint(customDimensions["ai.tokens.total"]))
  by tenant, deployment, model, bin(timestamp, 1d)
| order by tenant asc, timestamp desc

These three queries answer 90% of the on-call questions for an agent workload: "is this tenant slow", "which tool ate the latency budget", and "what did we spend." Save them as workbook queries, build a dashboard with three charts, give the on-call rotation the workbook URL.

Step 7: A regression test for the span shape

agent-host/tests/span-assertions.test.ts:

import { describe, it, expect, beforeAll, afterAll } from "vitest";
import { InMemorySpanExporter, SimpleSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { trace } from "@opentelemetry/api";
import { runAgent } from "../src/agent.js";

const exporter = new InMemorySpanExporter();
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));

beforeAll(() => provider.register());
afterAll(() => provider.shutdown());

describe("agent span shape", () => {
  it("emits a span named agent.run with required tenant attributes", async () => {
    const fakeFoundry = {
      runs: {
        create: async () => ({
          model: "gpt-4o-mini",
          usage: { prompt_tokens: 120, completion_tokens: 80, total_tokens: 200 },
        }),
      },
    };

    await runAgent(
      {
        tenant: { id: "acme", tier: "pro" },
        userOid: "user-123",
        agentId: "agent-001",
        agentVersion: "1.4.0",
        prompt: "hi",
      },
      fakeFoundry,
    );

    const spans = exporter.getFinishedSpans();
    const root = spans.find((s) => s.name === "agent.run");
    expect(root).toBeDefined();
    expect(root!.attributes["ai.tenant.id"]).toBe("acme");
    expect(root!.attributes["ai.tokens.total"]).toBe(200);
    expect(root!.attributes["ai.cost.usd"]).toBeGreaterThan(0);
  });
});

The InMemorySpanExporter makes span shape testable without touching App Insights. Run this in CI on every PR; if the agent code changes in a way that breaks span attributes, the test catches it before the dashboards go blank.
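
To extend the assertion beyond three hand-picked keys, a small helper can list every required attribute a span lacks. This is a sketch; missingAttributes and the REQUIRED_ON_AGENT_RUN list are hypothetical, and SpanSnapshot only mirrors the name and attributes fields the test above already reads.

```typescript
// Only the fields of a finished span that this helper needs.
type SpanSnapshot = { name: string; attributes: Record<string, unknown> };

// Hypothetical list; in practice derive it from the ATTRS contract.
const REQUIRED_ON_AGENT_RUN = [
  "ai.tenant.id",
  "ai.tenant.tier",
  "ai.agent.id",
  "ai.tokens.total",
];

// Return the required keys that are absent from the span's attributes.
function missingAttributes(span: SpanSnapshot, required: string[]): string[] {
  return required.filter((key) => span.attributes[key] === undefined);
}

const incomplete: SpanSnapshot = {
  name: "agent.run",
  attributes: { "ai.tenant.id": "acme", "ai.tokens.total": 200 },
};
console.log(missingAttributes(incomplete, REQUIRED_ON_AGENT_RUN));
// lists ai.tenant.tier and ai.agent.id
```

In the vitest file the equivalent assertion would be expect(missingAttributes(root!, REQUIRED_ON_AGENT_RUN)).toEqual([]), which names the missing keys in the failure output.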

This is the test I should have had when I shipped the rename that broke six dashboards. Now I always have it.

Step 8: The trace correlation across service boundaries

agent-host/src/middleware/trace-context.ts:

import type { Request, Response, NextFunction } from "express";
import { context, propagation } from "@opentelemetry/api";

/**
 * Extract incoming W3C traceparent and continue the trace.
 * Required for end-to-end correlation when the agent is called by another service.
 */
export function traceContextMiddleware(req: Request, res: Response, next: NextFunction) {
  const ctx = propagation.extract(context.active(), req.headers);
  context.with(ctx, () => {
    next();
  });
}

The traceparent HTTP header is the W3C standard that lets a trace span service boundaries. With this middleware, a client that calls the agent host with traceparent: 00-<traceid>-<spanid>-01 keeps its trace ID through the agent run, all the way down to Foundry's outbound calls (Foundry honours W3C trace context too, as of late 2025).
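
When a caller's traces are not joining up, it helps to check the header by hand. The sketch below parses the version 00 layout (version, trace ID, parent ID, flags). parseTraceParent is a hypothetical helper for debugging and tests; in the middleware itself propagation.extract does this work.

```typescript
type TraceParent = {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
};

// version-traceid-parentid-flags, all lowercase hex (version 00 layout).
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff is forbidden, and all-zero IDs are invalid.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const parsed = parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(parsed?.traceId, parsed?.sampled); // 4bf92f3577b34da6a3ce929d0e0e4736 true
```

A null result for a header your gateway forwards is a strong hint that something upstream is rewriting or truncating it.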

The result: a single trace ID that goes from the user's browser, through the API gateway, through the agent host, through every tool call, into Foundry, and back out. In App Insights you click the trace and see the whole journey on one transaction-search view. Cross-service investigation drops from "guess the request from logs" to "click the trace, see the path."

Without this header propagation, every service has its own trace island, and you reconstruct the journey by joining timestamps across log streams. That works at small scale and falls apart fast.

Production checklist

Pin the OpenTelemetry SDK version in package.json and audit deltas on upgrades. The semantic conventions for AI workloads are stabilising in 2026 but still moving; an upgrade can shift attribute names if you're not careful.
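Set a minimum sampling floor for error spans. The default ParentBasedSampler(TraceIdRatioBasedSampler(0.1)) will drop 90% of error spans too, because the head decision is made when the root span starts, before anyone knows the call will fail. Keeping every error means recording spans first and deciding afterwards: use a tail-based sampler (for example in an OpenTelemetry Collector) or a custom processor that keeps any span where status.code === ERROR, fed by an SDK that records everything.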
Don't log prompts or completions as span attributes. Span attributes are stored in App Insights' customDimensions field, which is queryable, exportable, and visible to anyone with Reader on the workspace. Customer prompt content does not belong there. Log span IDs and let the prompt itself flow to a separate, access-controlled store.
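Watch the App Insights ingest cost. Each span at this attribute density costs roughly $0.000005 to ingest at default rates. At 100 RPS and 10% sampling that is 10 sampled requests per second, about 26 million a month, so roughly $130/month per agent host at one span per request, and more once tool spans are counted. Sample at a higher ratio at low scale and a lower ratio at high scale.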
Document the attribute contract. The attributes.ts file is the source of truth; a CI step that diffs it against KQL files in /infra/kql/ will catch the rename-breaks-dashboards class of bug.
Build the workbook once, reuse forever. Save the KQL queries from Step 6 as a Workbook JSON, commit it to the repo, and provision the workbook via Bicep. Workbooks-as-code means a new Foundry agent project comes with a pre-built dashboard on day one.
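
The ingest estimate in the checklist is simple enough to keep as a function, so the numbers can be rechecked whenever traffic or the sample rate changes. A sketch follows; monthlyIngestUsd and its input names are hypothetical, and the per-span price is the rough figure quoted above rather than a published rate.

```typescript
type IngestInputs = {
  requestsPerSecond: number;
  sampleRate: number;    // head ratio, 0..1
  spansPerTrace: number; // agent.run plus tool spans and so on
  usdPerSpan: number;    // rough per-span ingest figure
};

const SECONDS_PER_30_DAYS = 30 * 24 * 60 * 60; // 2,592,000

// Spans kept per month multiplied by the per-span ingest price.
function monthlyIngestUsd(inputs: IngestInputs): number {
  const spansPerMonth =
    inputs.requestsPerSecond * inputs.sampleRate * inputs.spansPerTrace * SECONDS_PER_30_DAYS;
  return spansPerMonth * inputs.usdPerSpan;
}

const oneSpanPerRequest = monthlyIngestUsd({
  requestsPerSecond: 100,
  sampleRate: 0.1,
  spansPerTrace: 1,
  usdPerSpan: 0.000005,
});
console.log(oneSpanPerRequest.toFixed(2)); // 129.60
```

The spansPerTrace input matters more than it looks: a run that calls two tools emits at least three spans, which triples the figure.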

Troubleshooting

No spans appearing in App Insights is almost always a connection-string problem. Confirm that the variable is set at startup, for example with console.log(Boolean(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)), rather than printing the value itself, since it contains the instrumentation key. If the env var is set but spans still don't appear, check that the exporter is initialised before the first span is created; race conditions between sdk.start() and your first request will silently drop spans.

Spans appear but customDimensions are empty happens when you set attributes on the span after it ends. The wrappedTool pattern above avoids this by using startActiveSpan and setting attributes inside the callback. Pull-request reviewers should look for any setAttribute outside an active-span block.

Trace is fragmented across services means trace context propagation is missing. Confirm that the traceContextMiddleware runs before any other span is created in the request, and that the calling service is sending a valid W3C traceparent header (00-<32 hex>-<16 hex>-01).

Sampling drops more than expected is a parent-based sampler with a low head ratio working as designed. Every child of a not-sampled root inherits the not-sampled decision. A head sampler cannot rescue error spans, because shouldSample runs when the span starts and the error has not happened yet. To always keep errors, record at 100% in the SDK and make the keep-or-drop decision after spans end, in a tail-based sampler or a filtering span processor.
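
The keep-or-drop logic of a tail decision can be sketched in a few lines, assuming something upstream has buffered the finished spans of each trace. FinishedSpan and tracesToKeep are hypothetical names; this illustrates the decision only, not a working sampler or processor.

```typescript
type FinishedSpan = { traceId: string; name: string; isError: boolean };

// Decide per trace once its spans have finished: keep every trace that
// contains an error, plus whatever share of healthy traces the caller allows.
function tracesToKeep(
  spans: FinishedSpan[],
  keepHealthy: (traceId: string) => boolean,
): Set<string> {
  const hasError = new Map<string, boolean>();
  for (const span of spans) {
    hasError.set(span.traceId, (hasError.get(span.traceId) ?? false) || span.isError);
  }
  const keep = new Set<string>();
  hasError.forEach((errored, traceId) => {
    if (errored || keepHealthy(traceId)) keep.add(traceId);
  });
  return keep;
}

const buffered: FinishedSpan[] = [
  { traceId: "t1", name: "agent.run", isError: false },
  { traceId: "t1", name: "tool.search", isError: true },
  { traceId: "t2", name: "agent.run", isError: false },
];
// Drop all healthy traces here to make the effect obvious.
const kept = tracesToKeep(buffered, () => false);
console.log(Array.from(kept)); // [ 't1' ]
```

The whole of t1 is kept even though only its tool span failed, which is the point: the error decides the fate of the trace, not of a single span.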

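Cost dimension shows zero or wildly wrong values has two usual causes. Zero means the model string on the response is not a key in the estimateCostUsd rates table, so the lookup falls back to [0, 0]; this happens when response.model carries a deployment name rather than a model name. Wrong values mean the rates are out of date. Microsoft changes Foundry pricing periodically; pin a version of the rates table to a known reference and update intentionally.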

KQL query returns no rows but spans exist means the customDimensions key names in the query don't match what your code is emitting. Check the actual key names with dependencies | where timestamp > ago(1h) | take 1 | project customDimensions.

What this gives you, and the part that matters at 3am

The obvious win is the dashboards. Three charts, three queries, every operational question about agent behaviour answered without a JIRA ticket to the platform team. p50 and p95 by tenant, tool latency budget, daily cost by tenant. That's the foundation.

The less obvious win is what changes about how the team investigates incidents. Before this build, agent issues were ten-minute Slack threads that turned into hour-long debugging sessions. After, they're trace clicks. Someone pages the on-call, the on-call opens the App Insights workbook, identifies the slow tenant, expands one trace, sees the offending tool span, knows which tool to look at, and either fixes it or pages the right team. Median time-to-first-action drops from twenty-something minutes to under five.

The further-along win is what becomes possible once observability is sharp. Continuous evaluation of agent quality requires per-call latency and per-call cost; without those, the eval harness can't tell you which prompt change made the agent slower or more expensive. Capacity planning for Foundry deployments requires per-tenant cost trends; without those, you guess at TPM caps. SLAs require per-tenant p95 with confidence; without those, you're picking numbers from intuition.

For a team running one Foundry agent in production, this build is a day of work and a week of dashboard polish. For a team running ten, it's the foundation of how the platform operates. The 600 lines of TypeScript and 80 lines of KQL aren't the deliverable; the cultural shift to "we know what's happening" is the deliverable, and it's the difference between a Foundry team that scales and a Foundry team that gets paged on Tuesdays.

A year into running this stack, our team has resolved 47 production agent incidents. Median investigation time at the start was 23 minutes; at six months it was 9; at the year mark it's 4. The latency-spike investigation that started this whole project would have been a four-minute trace click instead of a four-hour postmortem. That's the bill the four hours paid for. Worth it.
