No. 25 | Azure AI | Dec 2, 2025 | 10 min read

Streaming Azure OpenAI Through APIM: Token Budgets, Per-Tenant Limits, and Not Breaking SSE

By Damilola Onadeinde

Senior DevOps Engineer

We hit a fourteen-thousand-dollar Azure OpenAI bill in three days because one tenant's misbehaving agent ran an unbounded chain. The bill stopped the moment we put APIM in front of AOAI. This is the policy chain that finally worked, the streaming gotcha that almost killed it, and the per-tenant cost attribution that prevents the next incident.

The shape that works

client ──▶ APIM (subscription key per tenant) ──▶ AOAI deployment (gpt-4o)
              │
              ├─ inbound:  identify tenant
              ├─ inbound:  rate-limit by tenant key
              ├─ inbound:  AOAI token-limit policy
              ├─ outbound: emit-metric (cost attribution)
              └─ on-error: structured 429 body

APIM does four things:

Tenant identification via Ocp-Apim-Subscription-Key
Per-tenant rate limiting (calls per minute)
Per-tenant token budget (tokens per minute), the policy that closed the gap
Cost attribution as App Insights metrics for FinOps

What it does NOT do: load-balancing across deployments, response caching, or stripping streaming. Those are separate concerns and belong in their own policy fragments.

Tenant identification

<inbound>
  <base />
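  <!-- Null-safe: context.Subscription is null when no subscription key was presented. -->
  <set-variable name="tenantId" value="@(context.Subscription?.Name ?? string.Empty)" />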
  <choose>
    <when condition="@(string.IsNullOrEmpty((string)context.Variables["tenantId"]))">
      <return-response>
        <set-status code="401" reason="Unauthorized" />
        <set-body>{"error":"missing_tenant"}</set-body>
      </return-response>
    </when>
  </choose>
</inbound>

I use the APIM subscription Name as the tenant id. APIM subscriptions are per-tenant and scoped to the AOAI product; the key the customer sends in Ocp-Apim-Subscription-Key is the only credential they ever see. No JWT parsing, no header conventions to maintain.

Rate limiting by tenant

<rate-limit-by-key
  calls="60"
  renewal-period="60"
  counter-key="@((string)context.Variables["tenantId"])"
  remaining-calls-variable-name="remainingCalls"
/>
<set-header name="X-RateLimit-Remaining" exists-action="override">
  <value>@(context.Variables["remainingCalls"].ToString())</value>
</set-header>

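Sixty calls per minute per tenant is generous for chat workloads. The remaining-calls-variable-name attribute captures the remaining count so we can emit X-RateLimit-Remaining, which the downstream UI uses to show "you're approaching your limit" warnings before the actual 429. The set-header only reaches the client when it runs in the outbound section; in inbound it modifies the request sent to the backend.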
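
On the client side, the warning logic can be as small as the sketch below. It assumes the UI knows the limit is 60 calls per minute and warns at 20% remaining; the function name and the threshold are illustrative, not part of our actual UI code.

```python
def rate_limit_warning(remaining_header, limit=60, warn_fraction=0.2):
    """Return a warning string when the remaining call budget is low, else None."""
    try:
        remaining = int(remaining_header)
    except (TypeError, ValueError):
        # Header missing or malformed: stay silent rather than guess.
        return None
    if remaining <= 0:
        return "Rate limit reached; the next call will likely return 429."
    if remaining <= limit * warn_fraction:
        return f"Approaching your limit: {remaining} of {limit} calls left this minute."
    return None


# Example: value taken from the X-RateLimit-Remaining response header.
print(rate_limit_warning("12"))
```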

The token budget, the policy that saved $14k

<azure-openai-token-limit
  counter-key="@((string)context.Variables["tenantId"])"
  tokens-per-minute="20000"
  estimate-prompt-tokens="true"
  tokens-consumed-header-name="x-tokens-consumed"
  remaining-tokens-header-name="x-tokens-remaining"
/>

The native azure-openai-token-limit policy is the one that closes the gap a request-rate limit can't. It does two things rate-limit-by-key doesn't:

Estimates prompt tokens before the call, so a single 50K-token prompt is rejected up front instead of after AOAI has already consumed and charged for it.
Reads the actual usage from the response and decrements the per-tenant counter accordingly.

Without this, a single tenant could send sixty 60-token requests in a minute (within the call limit) that each produced 60K tokens of output, and bill us for 3.6M tokens of work.
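
To make the arithmetic concrete, here is a toy model of a per-tenant token budget: a pre-call check against the estimated prompt tokens, then post-call accounting of actual usage. It is a simplified single-window sketch for intuition only; the class and function names are made up, and the real policy's windowing and counting are more involved.

```python
class TokenBudget:
    """Toy tokens-per-minute budget for one tenant-keyed, one-minute window."""

    def __init__(self, tokens_per_minute):
        self.limit = tokens_per_minute
        self.used = {}  # tenant id -> tokens consumed in the current window

    def admit(self, tenant, estimated_prompt_tokens):
        """Pre-call check: reject when the estimate does not fit the remaining budget."""
        remaining = self.limit - self.used.get(tenant, 0)
        return estimated_prompt_tokens <= remaining

    def record(self, tenant, total_tokens):
        """Post-call accounting: charge the usage the backend actually reported."""
        self.used[tenant] = self.used.get(tenant, 0) + total_tokens


def simulate(calls, prompt_tokens, output_tokens, tokens_per_minute):
    """Return (admitted_calls, billed_tokens) for identical calls within one window."""
    budget = TokenBudget(tokens_per_minute)
    admitted = billed = 0
    for _ in range(calls):
        if not budget.admit("tenant-a", prompt_tokens):
            continue  # rejected before the backend is ever called
        budget.record("tenant-a", prompt_tokens + output_tokens)
        admitted += 1
        billed += prompt_tokens + output_tokens
    return admitted, billed


# With only a call limit: 60 calls x 60,000 output tokens each.
unbounded_output = 60 * 60_000
print(unbounded_output)

# Same traffic against a 20,000 tokens-per-minute budget.
print(simulate(60, 60, 60_000, 20_000))
```

In this model the first call still gets through, because only the prompt can be estimated up front, but its actual usage exhausts the window and the remaining fifty-nine calls are rejected before they cost anything.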

The streaming gotcha

Azure OpenAI streaming uses Server-Sent Events. APIM, by default, buffers the response body until it's complete. That kills streaming: the client waits, sees nothing, then gets the whole response at once.

The fix is one attribute on <forward-request>:

<forward-request buffer-response="false" timeout="120" />

That's it. Buffering off, chunked transfer flows through, the client sees tokens as they're generated. Verify with curl:

curl -N \
  -X POST "https://apim.example.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
  -H "Ocp-Apim-Subscription-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [{"role":"user","content":"Count to ten slowly."}]
  }'

The -N is critical: curl buffers its output by default. Without it you get the same "wait then dump" experience and you'll think APIM is still buffering when it isn't.

There's a second buffering layer to watch: any reverse proxy or CDN in front of APIM. Azure Front Door does NOT buffer SSE responses, but a misconfigured nginx in your dev environment will. If streaming works against APIM directly but not through your edge, that's where to look.
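
When it is unclear which layer is buffering, timing the chunks helps more than eyeballing a terminal. The sketch below is a rough heuristic, not a definitive test: it timestamps each SSE data line and flags the stream as buffered when the wait for the first chunk takes up nearly the whole response time. The function names and the 0.9 threshold are assumptions of mine.

```python
import time


def record_chunk_times(lines, clock=time.monotonic):
    """Timestamp each SSE data line as it arrives from any iterable of text lines."""
    return [clock() for line in lines if line.startswith("data:")]


def looks_buffered(request_start, chunk_times, threshold=0.9):
    """Heuristic: buffered if time-to-first-chunk dominates the total duration."""
    if len(chunk_times) < 2:
        return False  # too few chunks to tell either way
    total = chunk_times[-1] - request_start
    if total <= 0:
        return False
    time_to_first = chunk_times[0] - request_start
    return time_to_first / total >= threshold


# Chunks spread over five seconds: streaming is working.
print(looks_buffered(0.0, [0.4, 1.2, 2.5, 5.0]))
# Everything lands in the last 100 ms: something is buffering.
print(looks_buffered(0.0, [4.9, 4.95, 5.0]))
```

Feed record_chunk_times the line iterator of whatever HTTP client you use, once against APIM directly and once through the edge, and compare the two results.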

Per-tenant cost attribution

<outbound>
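  <base />
  <!-- Assumes responseBody (a parsed JObject) was stored in a variable earlier in the policy; that step is not shown here. -->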
  <emit-metric name="aoai_tokens_consumed"
               value="@((double)((JObject)context.Variables["responseBody"]).SelectToken("usage.total_tokens"))">
    <dimension name="tenant"     value="@((string)context.Variables["tenantId"])" />
    <dimension name="deployment" value="@(context.Request.MatchedParameters["deployment-id"])" />
    <dimension name="model"      value="gpt-4o" />
    <dimension name="kind"       value="@(((bool)context.Variables.GetValueOrDefault("isStreaming", false)) ? "stream" : "single")" />
  </emit-metric>
</outbound>

The metric carries four dimensions (tenant, deployment, model, kind) and flows into Application Insights. From there, a saved KQL query produces the per-tenant cost report we put on a dashboard:

customMetrics
| where name == "aoai_tokens_consumed"
| extend tenant     = tostring(customDimensions.tenant)
| extend deployment = tostring(customDimensions.deployment)
| summarize tokens = sum(value) by tenant, deployment, bin(timestamp, 1d)
| extend dollars = case(
    deployment == "gpt-4o",       tokens * 5.0  / 1000000,
    deployment == "gpt-4o-mini",  tokens * 0.15 / 1000000,
    deployment startswith "o1",   tokens * 15.0 / 1000000,
    0.0)
| order by tokens desc

The pricing constants live in the case expression, so they need updating whenever rates change. We export this to Power BI and to a daily Slack report that pages on-call when any tenant exceeds $50/day. That alert is what stops the next $14k incident.
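
The same pricing logic and the $50/day threshold can be sketched outside KQL, for example in whatever job builds the Slack report. This is an illustration that mirrors the query's constants, not our actual alerting code; the function names and the row shape are assumptions.

```python
# Dollars per million tokens, copied from the KQL case expression.
PRICE_PER_MILLION = {"gpt-4o": 5.0, "gpt-4o-mini": 0.15}
O1_PRICE_PER_MILLION = 15.0
DAILY_ALERT_DOLLARS = 50.0


def dollars(deployment, tokens):
    """Convert a token count to dollars; unknown deployments cost 0.0, as in the query."""
    if deployment in PRICE_PER_MILLION:
        rate = PRICE_PER_MILLION[deployment]
    elif deployment.startswith("o1"):  # case-sensitive here, unlike KQL's startswith
        rate = O1_PRICE_PER_MILLION
    else:
        rate = 0.0
    return tokens * rate / 1_000_000


def tenants_over_budget(daily_rows, threshold=DAILY_ALERT_DOLLARS):
    """daily_rows: iterable of (tenant, deployment, tokens) tuples for a single day."""
    totals = {}
    for tenant, deployment, tokens in daily_rows:
        totals[tenant] = totals.get(tenant, 0.0) + dollars(deployment, tokens)
    return sorted(tenant for tenant, spend in totals.items() if spend > threshold)


rows = [
    ("acme", "gpt-4o", 12_000_000),
    ("acme", "o1-preview", 100_000),
    ("globex", "gpt-4o", 1_000_000),
]
print(tenants_over_budget(rows))
```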

What broke first (and the fix)

Streaming responses did not decrement the token budget. Early preview of azure-openai-token-limit only worked on non-streaming calls; the GA version handles both. If you're stuck on the preview, hand-roll the decrement by parsing the final SSE chunk's usage field, but use the GA policy if you can.
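
A hand-rolled version of that parsing might look like the sketch below. It assumes the stream actually carries a usage object in one of its last chunks, which as far as I know requires asking for it in the request (stream_options with include_usage set to true); treat that as an assumption to verify against your API version. The function name is illustrative.

```python
import json


def usage_from_sse(lines):
    """Return the last 'usage' object found in a stream of SSE lines, or None."""
    usage = None
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON chunk
        if isinstance(chunk, dict) and chunk.get("usage"):
            usage = chunk["usage"]
    return usage


sample = [
    'data: {"choices":[{"delta":{"content":"one"}}]}',
    "",
    'data: {"choices":[],"usage":{"prompt_tokens":12,"completion_tokens":30,"total_tokens":42}}',
    "data: [DONE]",
]
print(usage_from_sse(sample)["total_tokens"])
```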

The token estimator was off by ~30% for chat-format prompts. It uses cl100k_base for gpt-4o, which is right for raw text but doesn't account for the chat-format wrapper tokens (<|im_start|>system<|im_sep|>..., etc.). I added 4 tokens per message, 2 per name, and 3 priming, matching OpenAI's documented overhead. Estimator was within 2% after that.
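
The correction itself is only a few additions on top of whatever counts the raw text. In the sketch below the per-message, per-name, and priming constants are the ones quoted above; count_tokens is a placeholder you would swap for a real tokenizer, and the whitespace counter exists only so the example runs on its own.

```python
TOKENS_PER_MESSAGE = 4  # chat-format wrapper around each message
TOKENS_PER_NAME = 2     # extra cost when a message carries a name
PRIMING_TOKENS = 3      # priming for the assistant's reply


def estimate_chat_tokens(messages, count_tokens):
    """Estimate prompt tokens for a chat request, including format overhead."""
    total = PRIMING_TOKENS
    for message in messages:
        total += TOKENS_PER_MESSAGE
        total += count_tokens(message.get("content") or "")
        if message.get("name"):
            total += TOKENS_PER_NAME
    return total


def whitespace_count(text):
    """Stand-in tokenizer for the example only; real token counts will differ."""
    return len(text.split())


messages = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "name": "dami", "content": "Count to ten slowly."},
]
print(estimate_chat_tokens(messages, whitespace_count))
```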

The 429 response body was empty. Clients couldn't tell a rate-limit reject from a token-budget reject from a missing-tenant reject, all three came back as bare 429. We replaced the default with a structured body that points at the actual cause:

<on-error>
  <choose>
    <when condition="@(context.Response.StatusCode == 429)">
      <return-response>
        <set-status code="429" reason="Quota exceeded" />
        <set-header name="Content-Type" exists-action="override">
          <value>application/json</value>
        </set-header>
        <set-body>@{
          var reason = context.Response.Headers.GetValueOrDefault("x-quota-reason", "rate_limit");
          return JsonConvert.SerializeObject(new {
            error      = reason,
            tenant     = context.Variables["tenantId"],
            retryAfter = context.Response.Headers.GetValueOrDefault("Retry-After", "60")
          });
        }</set-body>
      </return-response>
    </when>
  </choose>
</on-error>

Customer support tickets dropped immediately. Half the tickets had been "your API is broken" when the actual answer was "you sent a 50K-token prompt and we rejected it before it cost anyone anything."
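
On the client, the structured body makes the handling straightforward. A minimal sketch, assuming the body has the error and retryAfter fields produced by the policy above; the function name and the fallback values are mine.

```python
import json


def describe_429(body_text, default_retry=60):
    """Return (reason, retry_after_seconds) from a 429 body, tolerating bare responses."""
    try:
        body = json.loads(body_text) if body_text else {}
    except json.JSONDecodeError:
        body = {}
    if not isinstance(body, dict):
        body = {}
    reason = body.get("error") or "unknown"
    try:
        retry_after = int(body.get("retryAfter", default_retry))
    except (TypeError, ValueError):
        retry_after = default_retry
    return reason, retry_after


print(describe_429('{"error":"rate_limit","tenant":"acme","retryAfter":"30"}'))
```

An empty or non-JSON body falls back to an unknown reason and the default retry delay, so older bare 429s do not crash the client.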

What I'd do differently

Skip the Azure API Management Consumption tier: its policy execution is throttled enough that a streaming response with the token-limit policy adds 200 to 400ms of overhead per request. Standard tier is roughly $700/month and the overhead drops below 50ms. For production AOAI workloads, that's not optional.

I would not put a vector store behind the same APIM instance. The latency profile is different, the limits should be different, and the cost dimension is different (storage + RU vs tokens). Two products, one APIM, separate policy fragments.

If you only do one thing from this post: turn on azure-openai-token-limit with a generous-but-finite budget on every tenant. The first time it rejects something you would have paid for, it's already paid for itself.

Tags: APIM, Azure OpenAI, Streaming, Multi-tenant
