No. 41 · Azure AI · Oct 21, 2025 · 30 min read

Build a Production APIM Layer in Front of Azure OpenAI: Token Budgets, Streaming, Per-Tenant Cost

By Damilola Onadeinde

Senior DevOps Engineer

We hit a fourteen-thousand-dollar Azure OpenAI bill in three days because one tenant's misbehaving agent ran an unbounded chain. The bill stopped the moment we put APIM in front of AOAI. The setup that finally worked is the one I should have built on day one.

This post is the entire build. By the end you have an Azure API Management instance fronting an Azure OpenAI deployment, per-tenant subscription keys with token budgets (not just call rate limits), streaming pass-through that preserves Server-Sent Events without buffering, and per-tenant cost attribution flowing into Application Insights with a KQL dashboard that produces dollar figures by tenant. About 80 lines of Bicep, 60 lines of XML policy, and a smoke test that proves streaming actually streams.

The result: a fourteen-thousand-dollar incident becomes a seven-dollar throttled response, and the FinOps team has a per-tenant report by lunchtime.

Why a gateway, not "just call AOAI directly"

Brief context first, because this is the first question engineers ask, and the answer determines whether the rest of the build makes sense.

Direct AOAI works when there's one tenant. The moment there are two, you have problems. AOAI's native rate limit is per-deployment, not per-tenant. Tenant A's bursty workload affects tenant B's latency. Tenant A's misbehaving agent affects tenant B's bill (because they share the deployment). And there's no way to tell which tenant called what without instrumenting every call site, which requires changing every consumer.

A gateway centralises three things you can't do at the AOAI level:

Per-tenant identity and rate limiting. Each tenant gets a subscription key; the gateway knows which tenant made each call.
Per-tenant token budgets. Hard ceiling on tokens consumed per minute per tenant. This is the one that prevents the $14k bill.
Per-tenant cost attribution. Metric per call dimensioned by tenant, model, deployment. FinOps gets a real dashboard instead of "Azure OpenAI: $14,000".

You don't get any of these at the AOAI level. You get them at the gateway level. The gateway is what makes a multi-tenant AOAI workload tenable.

What you'll have at the end

~/apim-aoai-gateway/
├── infra/
│   ├── main.bicep
│   ├── modules/
│   │   ├── aoai.bicep
│   │   ├── apim.bicep
│   │   ├── api.bicep
│   │   └── monitoring.bicep
│   └── policies/
│       ├── inbound.xml
│       ├── outbound.xml
│       └── on-error.xml
├── kql/
│   └── per-tenant-cost.kql
├── scripts/
│   ├── add-tenant.sh
│   └── smoke-stream.sh
└── README.md

Prerequisites

az --version            # 2.65+
jq --version
curl --version

You'll need:

An Azure subscription where the AOAI-allowed model SKUs are available in your region (eastus, eastus2, swedencentral are safe defaults)
Permission to create APIM, Azure OpenAI, and Log Analytics
Access approval for Azure OpenAI in your subscription (form at aka.ms/oaiaccess, usually instant for existing customers)

az login
az account set --subscription "<your-subscription-id>"

A practical aside on regional choice: APIM Standard SKU's deployment time varies wildly by region. eastus2 and swedencentral are the fastest in my experience (around 25 minutes for a fresh provision). westeurope has been over 45 minutes more than once. If you're testing this for the first time, pick the fastest region; you'll iterate the policy fragments many times and don't want to wait an hour each time the env recreates.

Step 1: Provision Azure OpenAI

infra/modules/aoai.bicep:

param location string = resourceGroup().location
param name string = 'aoai-${uniqueString(resourceGroup().id)}'
param modelName string = 'gpt-4o-mini'
param modelVersion string = '2024-07-18'
param deploymentName string = 'chat'
param tpmCap int = 50  // thousand tokens per minute

resource aoai 'Microsoft.CognitiveServices/accounts@2024-10-01' = {
  name: name
  location: location
  kind: 'OpenAI'
  sku: { name: 'S0' }
  properties: {
    customSubDomainName: name
    publicNetworkAccess: 'Enabled'
    networkAcls: { defaultAction: 'Allow' }
    disableLocalAuth: true   // force Entra ID auth from APIM
  }
}

resource deployment 'Microsoft.CognitiveServices/accounts/deployments@2024-10-01' = {
  parent: aoai
  name: deploymentName
  sku: { name: 'GlobalStandard', capacity: tpmCap }
  properties: {
    model: { format: 'OpenAI', name: modelName, version: modelVersion }
    raiPolicyName: 'Microsoft.DefaultV2'
    versionUpgradeOption: 'OnceCurrentVersionExpired'
  }
}

output endpoint string = aoai.properties.endpoint
output deploymentName string = deployment.name
output aoaiId string = aoai.id
output aoaiResourceName string = aoai.name

disableLocalAuth: true is the choice that matters most in this template. APIM authenticates to AOAI via managed identity; no API keys are ever issued. Customers authenticate to APIM via subscription keys; APIM handles the AOAI auth on their behalf. This is the only configuration that gives you "no AOAI keys exist anywhere" as a property.

The temptation to enable local auth "just for now, for testing" is real. Don't. Once it's enabled, someone will find a key, paste it in a script, and you'll spend a week migrating callers off it. Disable from day one and force everyone to go through APIM, including yourself.

Step 2: Provision APIM Standard

infra/modules/apim.bicep:

param location string = resourceGroup().location
param name string
param publisherEmail string
param publisherName string
param skuName string = 'StandardV2'
param skuCapacity int = 1
param appInsightsConnectionString string

resource apim 'Microsoft.ApiManagement/service@2024-05-01' = {
  name: name
  location: location
  identity: { type: 'SystemAssigned' }
  sku: {
    name: skuName
    capacity: skuCapacity
  }
  properties: {
    publisherEmail: publisherEmail
    publisherName: publisherName
  }
}

resource appInsightsLogger 'Microsoft.ApiManagement/service/loggers@2024-05-01' = {
  parent: apim
  name: 'app-insights'
  properties: {
    loggerType: 'applicationInsights'
    description: 'App Insights for AOAI traffic'
    credentials: {
      connectionString: appInsightsConnectionString
    }
    isBuffered: true
  }
}

output apimId string = apim.id
output apimName string = apim.name
output apimGatewayUrl string = apim.properties.gatewayUrl
output apimPrincipalId string = apim.identity.principalId
output appInsightsLoggerId string = appInsightsLogger.id

Use Standard, not Consumption. This is the trap most teams fall into and the most expensive one to undo.

Consumption tier APIM is cheap (around $30/month at low traffic) but throttles policy execution enough that streaming plus the token-limit policy adds 200 to 400ms of overhead per request. For an interactive chat workload that latency is unacceptable. Standard tier is roughly $700/month and the overhead drops below 50ms. If your AOAI workload generates more than maybe 100 calls per day, the cost difference is rounding error compared to user-facing latency.

The upgrade path from Consumption to Standard exists but requires a re-deploy, re-key, and a customer-facing migration. Pick Standard upfront. The "we'll start cheap" instinct produces six months of pain that you'll never recover.

Step 3: RBAC: APIM to AOAI

APIM_PRINCIPAL=$(az apim show -g "$RG" -n "$APIM_NAME" --query identity.principalId -o tsv)
AOAI_ID=$(az cognitiveservices account show -g "$RG" -n "$AOAI_NAME" --query id -o tsv)

az role assignment create \
  --assignee-object-id "$APIM_PRINCIPAL" --assignee-principal-type ServicePrincipal \
  --role "Cognitive Services OpenAI User" \
  --scope "$AOAI_ID"

The role is Cognitive Services OpenAI User, which gives the calling identity exactly the right to invoke inference endpoints, nothing else. It can't read configuration, can't list models, can't change deployments. Scoped to the AOAI account, not subscription. If you have multiple AOAI accounts and only some should be reachable through this APIM, this scoping enforces the boundary.

Step 4: The inbound policy

infra/policies/inbound.xml:

<inbound>
  <base />

  <!-- 1. Identify the tenant from the APIM subscription -->
  <set-variable name="tenantId" value="@(context.Subscription.Name)" />
  <choose>
    <when condition="@(string.IsNullOrEmpty((string)context.Variables["tenantId"]))">
      <return-response>
        <set-status code="401" reason="Unauthorized" />
        <set-body>{"error":"missing_tenant"}</set-body>
      </return-response>
    </when>
  </choose>

  <!-- 2. Rate limit: 60 calls/minute/tenant -->
  <rate-limit-by-key
    calls="60"
    renewal-period="60"
    counter-key="@((string)context.Variables["tenantId"])"
    remaining-calls-variable-name="remainingCalls"
  />

  <!-- 3. Token budget: 20K tokens/minute/tenant -->
  <azure-openai-token-limit
    counter-key="@((string)context.Variables["tenantId"])"
    tokens-per-minute="20000"
    estimate-prompt-tokens="true"
    tokens-consumed-header-name="x-tokens-consumed"
    remaining-tokens-header-name="x-tokens-remaining"
  />

  <!-- 4. Authenticate to AOAI with managed identity -->
  <authentication-managed-identity resource="https://cognitiveservices.azure.com" />

  <!-- 5. Strip any inbound api-key the customer might be sending -->
  <set-header name="api-key" exists-action="delete" />
</inbound>

The four lines that matter most.

<set-variable name="tenantId" value="@(context.Subscription.Name)" /> uses the APIM subscription Name as the tenant identifier. APIM subscriptions are per-tenant, scoped to the AOAI product; their key is what the customer sends. Clean, stable, no JWT parsing. The subscription Name is what you set when you create the subscription (tenant-acme, tenant-widgets-co), so it's a value you control.

The rate-limit-by-key is the conventional rate limit. Sixty calls per minute per tenant is generous for chat workloads. Adjust based on your tenants' traffic patterns; the defaults here are starting points, not gospel.

azure-openai-token-limit is the one that closes the $14k gap. It does two things rate-limit-by-key can't: estimates prompt tokens before the AOAI call (so a 50K-token prompt is rejected up front), and reads the actual usage from the response and decrements per tenant. A tenant could send 60 requests of 60 tokens each (well under any call limit) that produced 60K output tokens each (a runaway agent). Without this policy, that's 3.6M tokens of cost. With it, the budget hits at 20K per minute and the rest are rejected.
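
To make that arithmetic concrete, here's a rough shell sketch of the runaway scenario. It is not how APIM implements the policy: simulate_budget is a hypothetical helper that models the budget as a plain per-minute counter, and it assumes a request is admitted whenever any budget remains and is charged its full usage only once the response comes back.

```shell
# Illustrative only: a per-minute token budget modelled as a simple counter.
# The real azure-openai-token-limit accounting is internal to APIM.

# simulate_budget <budget> <requests> <tokens_per_request>
# Prints "<admitted> <rejected> <tokens_billed>" for one minute of traffic.
simulate_budget() {
  local budget="$1" requests="$2" per_request="$3"
  local used=0 admitted=0 rejected=0 i=0
  while [ "$i" -lt "$requests" ]; do
    if [ "$used" -ge "$budget" ]; then
      rejected=$((rejected + 1))     # budget exhausted: the gateway answers 429
    else
      used=$((used + per_request))   # assumption: full usage is charged after the call
      admitted=$((admitted + 1))
    fi
    i=$((i + 1))
  done
  echo "$admitted $rejected $used"
}
```

Under those assumptions, `simulate_budget 20000 60 60000` prints `1 59 60000`: one call gets through and 59 are rejected. With an effectively unlimited budget (`simulate_budget 999999999 60 60000`) the same traffic bills all 3,600,000 tokens.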

<authentication-managed-identity resource="..." /> swaps the call's auth from "whatever the customer sent" to "the APIM identity's token for cognitiveservices.azure.com". Because we set disableLocalAuth: true on the AOAI account, this is the only auth path that works.

Step 5: The streaming gotcha

<backend>
  <forward-request buffer-response="false" timeout="120" />
</backend>

Azure OpenAI streaming uses Server-Sent Events. APIM, by default, buffers the response body until it's complete. That kills streaming: the client waits, sees nothing, then gets the whole response at once.

The fix is one attribute: buffer-response="false". That single flag is the difference between "streaming works" and "streaming is silently broken". I've seen teams ship a gateway, declare it production-ready, and discover months later that streaming has been disabled the whole time because nobody noticed the response was arriving in one chunk after the model finished. The smoke test in step 8 is what catches this; run it on every policy change.
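
One way to turn "does it trickle" into a pass/fail check is to compare curl's time to first byte with its total time. When the gateway buffers, the two are nearly equal; when it streams, the body keeps arriving long after the first byte. The sketch below assumes that reasoning holds for your workload. stream_gap and looks_streamed are hypothetical helpers, and the 0.5-second default threshold is a guess you should tune.

```shell
# stream_gap <time_starttransfer> <time_total>
# Seconds between the first byte and the end of the response.
stream_gap() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", b - a }'
}

# looks_streamed <time_starttransfer> <time_total> [min_gap_seconds]
# Exit 0 if the body kept arriving for at least min_gap seconds after the
# first byte; exit 1 if it landed (almost) all at once.
looks_streamed() {
  awk -v a="$1" -v b="$2" -v g="${3:-0.5}" 'BEGIN { exit !((b - a) >= g) }'
}

# Example capture (not run here); $GATEWAY and $KEY as in smoke-stream.sh:
#   timings=$(curl -N -sS -o /dev/null -w '%{time_starttransfer} %{time_total}' \
#     -X POST "$GATEWAY/openai/deployments/chat/chat/completions?api-version=2024-10-21" \
#     -H "Ocp-Apim-Subscription-Key: $KEY" -H "Content-Type: application/json" \
#     -d '{"stream":true,"messages":[{"role":"user","content":"Count to ten slowly, one per line."}]}')
#   looks_streamed $timings && echo "streaming OK" || echo "response was buffered"
```

A check like this can run in CI after every policy change, which is harder to forget than eyeballing the terminal.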

Step 6: The outbound policy and per-tenant cost attribution

infra/policies/outbound.xml:

<outbound>
  <base />

  <choose>
    <when condition="@(context.Response.Headers.GetValueOrDefault("Content-Type", "").Contains("text/event-stream"))">
      <set-variable name="isStreaming" value="@(true)" />
    </when>
    <otherwise>
      <set-variable name="responseBody" value="@(context.Response.Body.As<JObject>(preserveContent: true))" />
      <emit-metric
        name="aoai_tokens_consumed"
        value="@((double)((JObject)context.Variables["responseBody"]).SelectToken("usage.total_tokens"))"
      >
        <dimension name="tenant" value="@((string)context.Variables["tenantId"])" />
        <dimension name="deployment" value="@(context.Request.MatchedParameters["deployment-id"])" />
        <dimension name="model" value="gpt-4o-mini" />
        <dimension name="kind" value="single" />
      </emit-metric>
    </otherwise>
  </choose>

  <set-header name="X-RateLimit-Remaining" exists-action="override">
    <value>@(context.Variables["remainingCalls"]?.ToString() ?? "")</value>
  </set-header>
</outbound>

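The Content-Type check is the streaming-vs-not-streaming branch. Streaming responses don't have a JSON body to read at the outbound stage; the token-limit policy already decremented from the SSE chunks. Non-streaming responses have a usage block in the body, which is what emit-metric reads. Note that the streaming branch only sets isStreaming and emits no metric, so as written the aoai_tokens_consumed metric covers non-streaming calls only; streaming usage is enforced by the token-limit policy but not attributed here.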

Four dimensions on the metric: tenant, deployment, model, kind. These are the dimensions you'll group by in KQL. Fewer dimensions and the dashboard can't answer "spend by tenant by model"; more and the metric explodes cardinality and costs more to ingest. Four is the right number for AOAI in 2026.
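
The cardinality point is plain multiplication: the number of distinct metric series is bounded by the product of the value counts of each dimension. A tiny sketch, with series_count as a hypothetical helper and the tenant and user counts made up for illustration:

```shell
# series_count <values-per-dimension>...
# Upper bound on distinct dimension combinations for one metric.
series_count() {
  local total=1 n
  for n in "$@"; do
    total=$((total * n))
  done
  echo "$total"
}
```

With 50 tenants, 1 deployment, 1 model and 2 kinds, `series_count 50 1 1 2` prints 100. Add a fifth dimension with 1,000 values (a per-user id, say) and `series_count 50 1 1 2 1000` prints 100000.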

Step 7: The on-error policy

infra/policies/on-error.xml:

<on-error>
  <base />
  <choose>
    <when condition="@(context.Response.StatusCode == 429)">
      <return-response>
        <set-status code="429" reason="Quota exceeded" />
        <set-header name="Content-Type" exists-action="override">
          <value>application/json</value>
        </set-header>
        <set-body>@{
          var reason = context.Response.Headers.GetValueOrDefault("x-quota-reason", "rate_limit");
          return JsonConvert.SerializeObject(new {
            error      = reason,
            tenant     = context.Variables["tenantId"],
            retryAfter = context.Response.Headers.GetValueOrDefault("Retry-After", "60")
          });
        }</set-body>
      </return-response>
    </when>
  </choose>
</on-error>

The structured 429 body is the difference between a customer support ticket and a customer who handled it themselves. Without this, a 429 returns an empty body and the customer has no idea whether they hit a rate limit, a token budget, or a missing-tenant error. With this, the body says exactly which limit was breached and how long to wait.

The bare 429 from APIM in our team's first iteration generated about a third of the support tickets. After the structured body, the rate dropped to single digits. Same throttling, same rejection rate, but the customer can act on the error without involving us.
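
On the tenant side, a handler for that body could look roughly like this. It's a sketch, not part of the repo: json_field and handle_429 are hypothetical helpers, and json_field is a deliberately naive sed extraction that only copes with the flat object the policy emits (jq, already a prerequisite, is the sturdier choice in real scripts). The two error values are the ones named in the troubleshooting section.

```shell
# json_field <name>
# Extracts one top-level string or number field from a flat JSON object on
# stdin. Not a general JSON parser.
json_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\{0,1\}\([^",}]*\).*/\1/p'
}

# handle_429 <body>
# Prints what the caller should do next, based on the structured 429 body.
handle_429() {
  local body="$1" reason delay
  reason=$(printf '%s' "$body" | json_field error)
  delay=$(printf '%s' "$body" | json_field retryAfter)
  case "$reason" in
    token_budget_exceeded) echo "token budget hit; pause ${delay:-60}s and trim prompts" ;;
    rate_limit)            echo "call rate hit; retry in ${delay:-60}s" ;;
    *)                     echo "unrecognised 429; retry in ${delay:-60}s" ;;
  esac
}
```

Calling `handle_429 '{"error":"rate_limit","tenant":"tenant-acme","retryAfter":"60"}'` prints `call rate hit; retry in 60s`. The point is that the caller can branch on the reason without opening a ticket.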

Step 8: The streaming smoke test

scripts/smoke-stream.sh:

#!/usr/bin/env bash
set -euo pipefail

GATEWAY="${1:?usage: smoke-stream.sh <gateway-url> <key>}"
KEY="${2:?missing key}"

echo "=== non-streaming ==="
curl -fsS -X POST "$GATEWAY/openai/deployments/chat/chat/completions?api-version=2024-10-21" \
  -H "Ocp-Apim-Subscription-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hi in five words."}]}' \
  | jq '.choices[0].message.content, .usage'

echo
echo "=== streaming ==="
echo "(if you see tokens trickle in, streaming works)"
curl -N -fsS -X POST "$GATEWAY/openai/deployments/chat/chat/completions?api-version=2024-10-21" \
  -H "Ocp-Apim-Subscription-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages":[{"role":"user","content":"Count to ten slowly, one per line."}]
  }' \
  | head -50

GATEWAY=$(az apim show -g "$RG" -n "$APIM" --query gatewayUrl -o tsv)
./scripts/smoke-stream.sh "$GATEWAY" "<acme-key>"

You should see tokens trickle in over a couple of seconds. If you see the full response dumped at once, streaming isn't working, usually because buffer-response="false" is missing.

The -N on the streaming curl is critical. Without it, curl buffers stdout and you'll think APIM is buffering even when it isn't. This is the classic "blame the wrong layer" trap; verify the smoke test isn't lying to you before debugging the gateway.

Step 9: Add a tenant subscription

scripts/add-tenant.sh:

#!/usr/bin/env bash
set -euo pipefail

RG="${RG:?missing RG}"
APIM="${APIM:?missing APIM}"
TENANT="${1:?usage: add-tenant.sh <tenant-name>}"

RESPONSE=$(az apim subscription create \
  -g "$RG" -n "$APIM" \
  --sid "tenant-${TENANT}" \
  --display-name "$TENANT" \
  --product-id "aoai-tier-default" \
  --primary-key "$(openssl rand -hex 32)" \
  --secondary-key "$(openssl rand -hex 32)" \
  -o json)

KEY=$(echo "$RESPONSE" | jq -r '.primaryKey')
echo "Tenant: $TENANT"
echo "Subscription Key: $KEY"
echo "(handle this secret carefully, it doesn't print again)"

RG=rg-aoai-gateway APIM=apim-aoaigw-... ./scripts/add-tenant.sh acme
RG=rg-aoai-gateway APIM=apim-aoaigw-... ./scripts/add-tenant.sh widgets-co

Each subscription gets its own key, scoped to the AOAI tier product. The key is what the tenant uses to call APIM; the gateway turns that into "you are tenant acme" and applies the per-tenant policies.

Step 10: The KQL dashboard

kql/per-tenant-cost.kql:

customMetrics
| where name == "aoai_tokens_consumed"
| where timestamp > ago(7d)
| extend tenant     = tostring(customDimensions.tenant)
| extend deployment = tostring(customDimensions.deployment)
| extend model      = tostring(customDimensions.model)
| summarize tokens = sum(value) by tenant, deployment, model, bin(timestamp, 1d)
// Price by model: the deployment dimension holds the deployment name ("chat"), not the model.
| extend dollars = case(
    model == "gpt-4o",       tokens * 5.0  / 1000000,
    model == "gpt-4o-mini",  tokens * 0.15 / 1000000,
    model startswith "o1",   tokens * 15.0 / 1000000,
    0.0)
| order by tenant asc, timestamp desc

Save it as a workbook query. Add a chart: "tenant by daily $". The chart goes on the FinOps wall and is the alarm-source for "tenant X spent over $50/day".
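
If you want to sanity-check the dollar column by hand, the conversion is just tokens times a per-million rate. A small sketch, with tokens_to_dollars as a hypothetical helper and the rates copied from the query. The query applies a single rate to total tokens, so treat the figures as approximations and substitute your actual price sheet.

```shell
# tokens_to_dollars <model> <tokens>
# Mirrors the per-million-token rates used in the query's case() expression.
tokens_to_dollars() {
  local model="$1" tokens="$2" rate
  case "$model" in
    gpt-4o)      rate=5.0 ;;
    gpt-4o-mini) rate=0.15 ;;
    o1*)         rate=15.0 ;;
    *)           rate=0 ;;      # unknown model: priced at zero, as in the query
  esac
  awk -v t="$tokens" -v r="$rate" 'BEGIN { printf "%.2f\n", t * r / 1000000 }'
}
```

For example, `tokens_to_dollars gpt-4o-mini 3600000` prints 0.54 and `tokens_to_dollars gpt-4o 3600000` prints 18.00.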

A daily Slack alert on threshold:

customMetrics
| where name == "aoai_tokens_consumed" and timestamp > ago(1d)
| extend tenant = tostring(customDimensions.tenant)
| extend model  = tostring(customDimensions.model)
| summarize tokens=sum(value) by tenant, model
| extend dollars = case(
    model == "gpt-4o",       tokens * 5.0  / 1000000,
    model == "gpt-4o-mini",  tokens * 0.15 / 1000000,
    0.0)
| where dollars > 50
| project tenant, model, dollars

Wire that to an Azure Monitor Action Group that posts to Slack. The first time it fires is the moment your gateway starts paying for itself in incident-prevention terms.

Production checklist

APIM private network. For production, deploy APIM in internal mode in a VNet, with Application Gateway in front for public ingress. Internal-only is overkill for a public AOAI gateway; mixed is the production sweet spot.
Pin the AOAI deployment SKU and capacity. The tpmCap in aoai.bicep is your hard ceiling. A misconfigured tenant policy can't burn more than this per minute.
Tag every Bicep resource. env, cost-center, owner. Cost Management groupings depend on this.
Rotate APIM subscription keys. APIM supports primary plus secondary keys; rotate one at a time, give clients 30 days to switch.
Add Defender for Cloud on the AOAI account. Alerts on anomalous traffic, prompt-injection patterns, jailbreak attempts. Roughly $5/month per account, worth it.

Troubleshooting

401 Access denied due to invalid subscription key: The subscription key is wrong or the APIM subscription is in a cancelled state. Check az apim subscription show.
429 Token budget exceeded: Working as intended. The on-error policy returns the structured body; check the error field for token_budget_exceeded vs rate_limit.
Streaming dumps at the end instead of trickling: buffer-response="false" is missing or the on-error policy is buffering. Inspect with curl -v -N and look for Transfer-Encoding: chunked in the response headers.
emit-metric doesn't show up in App Insights: The logger isn't wired or the metric name is being rejected. Check az apim diagnostic show and confirm App Insights is the active logger for this API.
Token estimator is wrong: The azure-openai-token-limit policy uses cl100k_base for gpt-4o. Chat-format wrapper tokens (<|im_start|> etc.) add 4 to 7 tokens per message; if your prompts are tiny, this skews the estimate significantly. Either oversize your TPM budget by ~10% or set estimate-prompt-tokens="false" and accept after-the-fact decrement only.
AuthorizationFailed from APIM to AOAI: The managed identity role assignment hasn't propagated. Wait 5 minutes; if it persists, check az role assignment list --assignee $APIM_PRINCIPAL --scope $AOAI_ID.
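
For the token estimator item, the two numbers worth computing are how much headroom a ~10% pad buys and how large the wrapper overhead is relative to a small prompt. A quick sketch: padded_budget and wrapper_overhead_pct are hypothetical helpers, and the 7 tokens per message used in the example is simply the upper end of the range quoted above.

```shell
# padded_budget <tokens_per_minute> [pad_percent]
# Budget with headroom for estimator drift (default pad: 10%).
padded_budget() {
  echo $(( $1 * (100 + ${2:-10}) / 100 ))
}

# wrapper_overhead_pct <prompt_tokens> <messages> <wrapper_tokens_per_message>
# Wrapper tokens as a percentage of the visible prompt tokens.
wrapper_overhead_pct() {
  awk -v p="$1" -v m="$2" -v w="$3" 'BEGIN { printf "%.1f\n", 100 * m * w / p }'
}
```

`padded_budget 20000` prints 22000. `wrapper_overhead_pct 60 1 7` prints 11.7, while a 5,000-token prompt spread over three messages (`wrapper_overhead_pct 5000 3 7`) prints 0.4, which is why the skew only matters for tiny prompts.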

What this gives you, beyond the obvious

You've built the multi-tenant safety net that turns "AOAI is sometimes scary" into "AOAI is well-behaved". Per-tenant rate limits, per-tenant token budgets, per-tenant cost attribution, structured 429 responses, streaming pass-through. The $14k incident cannot recur in the same shape; the worst case is a tenant hits their token budget and gets a 429 with a clear explanation.

The harder thing to articulate is what changes about how the team thinks about Azure OpenAI usage. Before this build, AOAI was a thing the team feared collectively: the credit card bill at the end of the month was the only feedback loop, and incidents were how you found out about misuse. After, AOAI is a thing the team manages with the same instruments it uses for any other multi-tenant service: per-tenant SLAs, per-tenant dashboards, per-tenant alerts. The fear goes away because the levers exist.

The cultural side is that a tenant's misbehaving agent stops being your problem. Before, a runaway agent at customer A burned your bill, your on-call, your morning. After, the agent hits its tenant's budget, the tenant's customer support gets the structured 429, the tenant fixes their agent. Your inbox stays clean. The tenant has the information to act, your gateway has the information to enforce, and the boundary is sharp.

For an internal-only AOAI workload (one company, multiple business units), the same arguments hold with "tenant" replaced by "team". The token budget per business unit is what prevents the data-science team's experiment from eating the support team's quota. Same pattern, same shape, same gateway.

The build is roughly a day of work for the first instance. Each subsequent tenant is a single add-tenant.sh invocation. The long-tail return on the day's investment is multiple incidents avoided, multiple support tickets prevented, and a FinOps dashboard you can actually justify to the CFO.
