No. 19 · Azure AI · Nov 8, 2025 · 9 min read

Per-Tenant Cost Attribution for Azure OpenAI Traffic Using APIM emit-token-metric-policy

Six months ago I couldn't tell you which tenant of our SaaS application was costing us the most in Azure OpenAI tokens. The number on the bill was real but unattributable. Today I can tell you, per tenant per day, prompt tokens, completion tokens, total cost, and which features each tenant uses most.

The mechanism is Azure API Management's azure-openai-emit-token-metric policy, a small Postgres table, and a Grafana dashboard. About 200 lines of config and code total. Here's the setup and what it's surfaced.

Why attribution matters

Until you have it:

  • "Our AI costs are increasing" is a sentence with no actionable follow-up.
  • A noisy tenant can monopolize your AOAI quota without anyone noticing.
  • Pricing decisions are made on average cost, which is wrong for any tenant who isn't average.
  • Your sales team can't price-quote new customers because nobody knows the marginal cost.

We had all four of these problems. Now we don't.

The architecture

Customer requests
       |
       v
   APIM gateway (per-tenant API key or JWT)
       |
       +-- emit-token-metric-policy: writes to App Insights
       |
       v
   Azure OpenAI deployment

The APIM policy fires after every Azure OpenAI response, parses the token usage from the response, and emits a custom metric to Application Insights tagged with the tenant ID. From there it's just App Insights queries to attribute costs.

The APIM policy

<inbound>
  <base />
  <!-- tenantId is populated from a JWT claim by a validate-jwt step earlier
       in the policy chain (omitted here); fall back to "unknown" if missing -->
  <set-variable name="tenantId"
                value="@((string)context.Variables.GetValueOrDefault("tenantId", "unknown"))" />
</inbound>

<backend>
  <base />
</backend>

<outbound>
  <base />
  <azure-openai-emit-token-metric>
    <dimension name="tenantId" value="@((string)context.Variables["tenantId"])" />
    <dimension name="featureArea" value="@((string)context.Variables.GetValueOrDefault("featureArea", "default"))" />
    <dimension name="model" value="@((string)context.Variables.GetValueOrDefault("model", "unknown"))" />
  </azure-openai-emit-token-metric>
</outbound>

The azure-openai-emit-token-metric policy is the magic. It introspects the Azure OpenAI response, pulls the usage block (prompt_tokens, completion_tokens, total_tokens), and emits an App Insights custom metric with whatever dimensions you specify.

We tag each request with three dimensions:

  • tenantId — set in the inbound section by extracting from the JWT
  • featureArea — set by the upstream client based on which product surface is making the call
  • model — which AOAI deployment was hit

That's enough to slice the cost data three ways.

The cost calculation

The metric only gives us tokens. We have to multiply by price per token to get cost. Different deployments have different prices. We hard-coded the multipliers in a Logic App that runs hourly:

# USD per token, keyed by AOAI deployment name; update when pricing changes
PRICING = {
    "gpt-4o": {"prompt": 5.0e-6, "completion": 15.0e-6},
    "gpt-4o-mini": {"prompt": 0.15e-6, "completion": 0.60e-6},
    "text-embedding-3-large": {"prompt": 0.13e-6, "completion": 0},
    "gpt-4-turbo": {"prompt": 10.0e-6, "completion": 30.0e-6},
}

def cost_for_event(event) -> float:
    # Unknown models cost out at zero rather than raising
    p = PRICING.get(event.model, {"prompt": 0, "completion": 0})
    return (
        event.prompt_tokens * p["prompt"]
        + event.completion_tokens * p["completion"]
    )

The Logic App reads the App Insights metrics, applies the pricing, and writes per-tenant per-day rollups to a Postgres table. Grafana reads from Postgres.

The pricing table has to be kept current — we update it whenever AOAI pricing changes (about quarterly).
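The rollup step itself is simple enough to sketch. Here's a minimal Python version, assuming usage events shaped like the metric dimensions above — the `UsageEvent` fields and `daily_rollup` helper are illustrative, not our actual Logic App code:

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class UsageEvent:
    # Mirrors the metric dimensions plus the usage-block fields
    tenant_id: str
    day: date
    model: str
    prompt_tokens: int
    completion_tokens: int

def daily_rollup(events, pricing):
    """Sum USD cost per (tenant_id, day) — one row per Postgres upsert."""
    totals = defaultdict(float)
    for e in events:
        p = pricing.get(e.model, {"prompt": 0.0, "completion": 0.0})
        totals[(e.tenant_id, e.day)] += (
            e.prompt_tokens * p["prompt"]
            + e.completion_tokens * p["completion"]
        )
    return dict(totals)
```

Pricing is passed in as a parameter so the rollup doesn't have to change when the table does.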

What it surfaced

One tenant was consuming 38% of our AOAI tokens despite being 4% of our revenue. They had a feature usage pattern that triggered AI calls on every page load. We weren't charging for AI usage, so the cost was wiping out that account's margin entirely. A pricing conversation followed.

One feature was 12x more expensive than its neighbors. A "summarize this thread" feature used gpt-4o for summaries that gpt-4o-mini would have handled fine. Switched the model, cost dropped 95% on that feature, no measurable quality regression on our eval set.

A specific tenant integration was caching badly. Their integration sent the same prompt repeatedly because they weren't caching at their end. We added a deterministic-cache layer in our gateway with a 5-minute TTL for that specific tenant's traffic. Their cost dropped 60%, our infra cost dropped slightly, everyone happy.
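The gateway cache is the same idea as any TTL cache keyed on a hash of the request payload. A hedged Python sketch of the concept (our real implementation lives in the gateway, and the class and method names here are illustrative):

```python
import hashlib
import time

class DeterministicCache:
    """TTL cache keyed on a hash of the prompt payload.

    Only safe for deterministic requests (e.g. temperature=0),
    where the same payload should yield the same response.
    """

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable for testing
        self._store = {}    # key -> (expires_at, response)

    @staticmethod
    def key_for(payload: str) -> str:
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, payload: str):
        entry = self._store.get(self.key_for(payload))
        if entry is None or entry[0] < self.clock():
            return None  # miss or expired
        return entry[1]

    def put(self, payload: str, response) -> None:
        self._store[self.key_for(payload)] = (self.clock() + self.ttl, response)
```

The 5-minute TTL matched that tenant's repeat-prompt window; a longer TTL would risk serving stale answers for prompts that embed time-sensitive context.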

Our internal teams were the second-most-expensive tenant. Engineers experimenting with the AOAI deployment for prototype work were generating real cost. Not a problem in itself — experimentation has value — but worth knowing. We added a separate "internal experimentation" deployment with a strict budget cap.

The dashboards

Three Grafana dashboards run from this data:

Tenant view. Daily cost per tenant, top 10 in a stack, plus a table with rank, daily cost, 7-day trend, and feature mix. Updated hourly.

Feature view. Daily cost per feature area, sorted. Used for product-level decisions about where to invest in optimization.

Anomaly view. Same data, but with a Z-score calculation that highlights tenants whose daily cost is more than 2σ above their 30-day baseline. Catches sudden spikes from integration bugs.

The anomaly view fires roughly once a week. About half of those firings are real issues (a customer integration in a loop, a bot scraping the API, etc.). The other half are legitimate increases — new feature adoption, customer growth — and get acknowledged-and-dismissed.
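The anomaly calculation is nothing exotic. A hedged Python sketch of the 2σ-over-baseline check (the function name and input shapes are illustrative; in practice this runs as a Grafana query over the Postgres rollups):

```python
import statistics

def flag_anomalies(baselines, today_costs, z_threshold=2.0):
    """Flag tenants whose cost today sits > z_threshold sigma above baseline.

    baselines:   {tenant_id: [daily cost over the 30-day window]}
    today_costs: {tenant_id: today's cost so far}
    """
    flagged = []
    for tenant, history in baselines.items():
        if len(history) < 2:
            continue  # need at least two points for a sample stdev
        mean = statistics.fmean(history)
        sigma = statistics.stdev(history)
        if sigma == 0:
            continue  # perfectly flat history; z-score undefined
        z = (today_costs.get(tenant, 0.0) - mean) / sigma
        if z > z_threshold:
            flagged.append((tenant, round(z, 2)))
    return flagged
```

One caveat with this shape: a tenant with a very stable baseline gets flagged on tiny absolute increases, which is part of why half the firings are benign.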

What broke during setup

Tenant ID extraction was harder than expected. Our JWTs include the tenant ID as a custom claim, but extracting it inside an APIM policy required some fiddly policy-expression work in the XML. Took an afternoon to get right.

App Insights metric ingestion has a small delay. Metrics show up in the queryable space about 90 seconds after emission. For real-time dashboards this matters; we use 5-minute aggregation windows to smooth over the lag.

The Logic App that does the pricing math hit a Postgres connection limit during a spike. We exceeded the connection-pool size during a Black Friday traffic surge from one of our customers. Putting PgBouncer in front of Postgres fixed it. Not a complex fix; we should have had it from day one.

What I'd add next

Per-tenant budget alerts. Right now we can see usage, but we don't enforce per-tenant limits. We've talked about adding a token-quota system at the APIM layer (the azure-openai-token-limit policy, keyed per tenant). The hard part is the policy question: at what point do you cut a tenant off, versus rate-limit them, versus alert and let them through?
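For reference, a per-tenant quota in APIM would look something like this in the inbound section — a sketch assuming the tenantId variable from the metric policy above, with placeholder limit numbers:

```xml
<inbound>
  <base />
  <!-- One token bucket per tenant; the limit here is an illustrative placeholder -->
  <azure-openai-token-limit
      counter-key="@((string)context.Variables.GetValueOrDefault("tenantId", "unknown"))"
      tokens-per-minute="5000"
      estimate-prompt-tokens="true" />
</inbound>
```

Keying the counter on tenantId means each tenant gets their own bucket, which answers the rate-limit case but not the cut-off or alert-and-allow cases.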

Marginal-cost-per-feature reporting. The current dashboards show absolute cost per feature. We don't show "cost per active user of this feature." That second view would let product managers reason about feature ROI more cleanly.

I would NOT centralize all the per-tenant data in a single Grafana for cross-customer visibility. Show the customer-facing data in their own per-tenant dashboard; keep the cross-tenant view internal-only. Cost data is sensitive even within an organization.

The portable lesson

If you're running any kind of multi-tenant workload on AOAI, you need attribution. The azure-openai-emit-token-metric policy is the path. Once you can see who's costing what, conversations about pricing, optimization, and capacity get drastically easier.

The total setup is roughly an engineer-week. The first cost-saving conversation paid for it.

Cost · FinOps · Multi-tenant · OpenAI
