damionas
No. 25 · Azure AI · Dec 2, 2025 · 10 min read

Streaming Azure OpenAI Through APIM: Token Budgets, Per-Tenant Limits, and Not Breaking SSE

We hit a fourteen-thousand-dollar Azure OpenAI bill in three days because one tenant's misbehaving agent ran an unbounded chain. The bill stopped the moment we put APIM in front of AOAI. This is the policy chain that finally worked, the streaming gotcha that almost killed it, and the per-tenant cost attribution that prevents the next incident.

The shape that works

client ──▶ APIM (subscription key per tenant) ──▶ AOAI deployment (gpt-4o)

              ├─ inbound:  identify tenant
              ├─ inbound:  rate-limit by tenant key
              ├─ inbound:  AOAI token-limit policy
              ├─ outbound: emit-metric (cost attribution)
              └─ on-error: structured 429 body

APIM does four things:

  1. Tenant identification via Ocp-Apim-Subscription-Key
  2. Per-tenant rate limiting (calls per minute)
  3. Per-tenant token budget (tokens per minute), the policy that closed the gap
  4. Cost attribution as App Insights metrics for FinOps

What it does NOT do: load-balancing across deployments, response caching, or stripping streaming. Those are separate concerns and belong in their own policy fragments.

Tenant identification

<inbound>
  <base />
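  <!-- Null-safe: context.Subscription is null when no subscription is attached, and .Name would throw before the 401 check below. -->
  <set-variable name="tenantId" value="@(context.Subscription?.Name ?? string.Empty)" />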
  <choose>
    <when condition="@(string.IsNullOrEmpty((string)context.Variables["tenantId"]))">
      <return-response>
        <set-status code="401" reason="Unauthorized" />
        <set-body>{"error":"missing_tenant"}</set-body>
      </return-response>
    </when>
  </choose>
</inbound>

I use the APIM subscription Name as the tenant id. APIM subscriptions are per-tenant and scoped to the AOAI product; the key the customer sends in Ocp-Apim-Subscription-Key is the only credential they ever see. No JWT parsing, no header conventions to maintain.

Rate limiting by tenant

<rate-limit-by-key
  calls="60"
  renewal-period="60"
  counter-key="@((string)context.Variables["tenantId"])"
  remaining-calls-variable-name="remainingCalls"
/>
<set-header name="X-RateLimit-Remaining" exists-action="override">
  <value>@(context.Variables["remainingCalls"].ToString())</value>
</set-header>

Sixty calls/minute/tenant is generous for chat workloads. The remaining-calls-variable-name lets us emit X-RateLimit-Remaining, which the downstream UI uses to show "you're approaching your limit" warnings before the actual 429.
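On the client side, the warning logic is small. This is a minimal Python sketch, not our UI code: the function name, the 20% threshold, and the message wording are all made up for illustration; only the X-RateLimit-Remaining header and the limit of 60 come from the policy above.

```python
def limit_warning(headers, limit=60, warn_fraction=0.2):
    """Return a warning string when few calls remain, else None.

    `headers` is a plain dict of response headers. Lookup here is
    case-sensitive, so normalise header names first if your HTTP client
    does not do it for you.
    """
    raw = headers.get("X-RateLimit-Remaining")
    if raw is None:
        return None  # header absent, nothing to report
    try:
        remaining = int(raw)
    except ValueError:
        return None  # malformed header, fail quietly
    if remaining <= limit * warn_fraction:
        return f"{remaining} of {limit} calls left this minute"
    return None


print(limit_warning({"X-RateLimit-Remaining": "5"}))
```

The threshold is a product decision; the point is only that the header lets the client react before the 429 arrives.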

The token budget, the policy that saved $14k

<azure-openai-token-limit
  counter-key="@((string)context.Variables["tenantId"])"
  tokens-per-minute="20000"
  estimate-prompt-tokens="true"
  tokens-consumed-header-name="x-tokens-consumed"
  remaining-tokens-header-name="x-tokens-remaining"
/>

The native azure-openai-token-limit policy is the one that closes the gap a request-rate limit can't. It does two things rate-limit-by-key doesn't:

  • Estimates prompt tokens before the call, so a single 50K-token prompt is rejected up front instead of after AOAI has already consumed and charged for it.
  • Reads the actual usage from the response and decrements the per-tenant counter accordingly.

Without this, a single tenant could send sixty 60-token requests in a minute (within the call limit, so none are rejected) that each produced 60K tokens of output, and bill us for 3.6M tokens of work.
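The arithmetic behind that worst case, as a small Python sketch. The constants are the figures from this post (60 calls/minute, 60K output tokens per call, a 20,000 tokens-per-minute budget); the function names are only for illustration.

```python
CALLS_PER_MINUTE = 60             # rate-limit-by-key calls="60"
OUTPUT_TOKENS_PER_CALL = 60_000   # the runaway scenario described above
TOKENS_PER_MINUTE = 20_000        # azure-openai-token-limit budget


def worst_case_tokens(calls=CALLS_PER_MINUTE, output=OUTPUT_TOKENS_PER_CALL):
    """Tokens per minute a tenant can burn when only calls are limited."""
    return calls * output


def overshoot_factor(budget=TOKENS_PER_MINUTE):
    """How many times larger the call-limited worst case is than the budget."""
    return worst_case_tokens() / budget


print(worst_case_tokens())   # 3600000
print(overshoot_factor())    # 180.0
```

A call limit alone leaves room for 180 times the token volume we were actually willing to pay for per tenant per minute.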

The streaming gotcha

Azure OpenAI streaming uses Server-Sent Events. APIM, by default, buffers the response body until it's complete. That kills streaming: the client waits, sees nothing, then gets the whole response at once.

The fix is one attribute on <forward-request>, in the backend section:

<forward-request buffer-response="false" timeout="120" />

That's it. Buffering off, chunked transfer flows through, the client sees tokens as they're generated. Verify with curl:

curl -N \
  -X POST "https://apim.example.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-10-21" \
  -H "Ocp-Apim-Subscription-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [{"role":"user","content":"Count to ten slowly."}]
  }'

The -N is critical: curl buffers its output by default. Without it you get the same "wait then dump" experience and you'll think APIM is still buffering when it isn't.

There's a second buffering layer to watch: any reverse proxy or CDN in front of APIM. Azure Front Door does NOT buffer SSE responses, but a misconfigured nginx in your dev environment will. If streaming works against APIM directly but not through your edge, that's where to look.
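To tell "streaming" from "buffered somewhere" without eyeballing a terminal, it helps to timestamp each chunk as it arrives. Below is a minimal Python sketch under a few assumptions: the stream is a sequence of `data: {...}` lines ending with `data: [DONE]`, each chunk carries text in `choices[].delta.content`, and the function names are mine, not part of any SDK.

```python
import json
import time


def iter_sse_data(lines):
    """Yield the payload of each `data:` line until the [DONE] sentinel."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield payload


def arrival_times(lines, clock=time.monotonic):
    """Return (seconds_since_start, text) for every chunk that carries text."""
    start = clock()
    seen = []
    for payload in iter_sse_data(lines):
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                seen.append((round(clock() - start, 3), text))
    return seen
```

Feed it any iterable of lines, for example the response object from `urllib.request.urlopen`, which yields lines as bytes. If the timestamps are spread out, streaming works; if they all cluster at the end, some layer between you and AOAI is still buffering.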

Per-tenant cost attribution

<outbound>
  <base />
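  <!-- Assumes "responseBody" (a parsed JObject) and "isStreaming" were stored in context.Variables earlier in the pipeline; that step is not shown in this post. -->
  <emit-metric name="aoai_tokens_consumed"
               value="@((double)((JObject)context.Variables["responseBody"]).SelectToken("usage.total_tokens"))">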
    <dimension name="tenant"     value="@((string)context.Variables["tenantId"])" />
    <dimension name="deployment" value="@(context.Request.MatchedParameters["deployment-id"])" />
    <dimension name="model"      value="gpt-4o" />
    <dimension name="kind"       value="@(((bool)context.Variables.GetValueOrDefault("isStreaming", false)) ? "stream" : "single")" />
  </emit-metric>
</outbound>

Four dimensions (tenant, deployment, model, kind), and the metric flows into Application Insights. From there, a saved KQL query produces the per-tenant cost report we put on a dashboard:

customMetrics
| where name == "aoai_tokens_consumed"
| extend tenant     = tostring(customDimensions.tenant)
| extend deployment = tostring(customDimensions.deployment)
| summarize tokens = sum(value) by tenant, deployment, bin(timestamp, 1d)
| extend dollars = case(
    deployment == "gpt-4o",       tokens * 5.0  / 1000000,
    deployment == "gpt-4o-mini",  tokens * 0.15 / 1000000,
    deployment startswith "o1",   tokens * 15.0 / 1000000,
    0.0)
| order by tokens desc

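The pricing constants (dollars per million tokens) live in the case expression, so a price change is a one-line edit to the query. We export this to Power BI and to a daily Slack report that pages on-call when any tenant exceeds $50/day. That alert is what stops the next $14k incident.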
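The same pricing logic and the $50/day check, restated as a Python sketch so the rules are easy to unit-test outside Application Insights. The rates are the constants from the query above; the function names and the row shape (tenant, deployment, tokens) are assumptions for illustration.

```python
PRICE_PER_MILLION = {"gpt-4o": 5.0, "gpt-4o-mini": 0.15}
O1_PRICE_PER_MILLION = 15.0
DAILY_ALERT_DOLLARS = 50.0


def dollars(deployment, tokens):
    """Mirror the KQL case(): exact names first, then the o1 prefix, else 0."""
    if deployment in PRICE_PER_MILLION:
        rate = PRICE_PER_MILLION[deployment]
    elif deployment.startswith("o1"):
        rate = O1_PRICE_PER_MILLION
    else:
        rate = 0.0  # unknown deployments are priced at zero, same as the query
    return tokens * rate / 1_000_000


def tenants_over_budget(rows, limit=DAILY_ALERT_DOLLARS):
    """rows: iterable of (tenant, deployment, tokens) for a single day."""
    totals = {}
    for tenant, deployment, tokens in rows:
        totals[tenant] = totals.get(tenant, 0.0) + dollars(deployment, tokens)
    return sorted(t for t, spent in totals.items() if spent > limit)
```

One thing the restatement makes obvious: a deployment name that matches none of the branches is reported as $0, so a newly added deployment stays invisible to the alert until its rate is added.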

What broke first (and the fix)

Streaming responses did not decrement the token budget. The early preview of azure-openai-token-limit only worked on non-streaming calls; the GA version handles both. If you're stuck on the preview, hand-roll the decrement by parsing the usage field of the final SSE chunk, but use the GA policy if you can.
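If you do end up hand-rolling it, the parsing side looks roughly like this. It is a Python sketch of the idea, not the policy code. It assumes the last data chunk before `[DONE]` carries a `usage` object, which as far as I know only happens when the request asks for it (`stream_options.include_usage`); verify that against your API version. The function name is hypothetical.

```python
import json


def usage_from_sse(lines):
    """Return the last `usage` object seen in an SSE stream, or None."""
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # keep the most recent non-empty usage
    return usage
```

Whatever comes back in `total_tokens` is what you would subtract from the tenant's counter; a `None` result means the stream never reported usage and you are back to estimating.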

The token estimator was off by ~30% for chat-format prompts. It uses cl100k_base, which is the GPT-4 and GPT-3.5 encoding (gpt-4o itself tokenizes with o200k_base), and it counts raw text without the chat-format wrapper tokens (<|im_start|>system<|im_sep|>..., etc.). I added 4 tokens per message, 2 per name, and 3 for reply priming, modeled on the per-message overhead in OpenAI's token-counting guidance. The estimate was within 2% after that.
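The correction itself is just addition on top of whatever raw tokenizer you trust. Here is a Python sketch using the overheads quoted above (4 per message, 2 per name, 3 priming). The `count_tokens` callable is a placeholder you supply; the whitespace splitter in the example is a crude stand-in for a real tokenizer, and message contents are assumed to be plain strings.

```python
def estimate_chat_tokens(messages, count_tokens,
                         per_message=4, per_name=2, priming=3):
    """Raw token counts plus the chat-format overhead per message and name."""
    total = priming  # tokens that prime the assistant's reply
    for message in messages:
        total += per_message  # wrapper tokens around each message
        for key in ("role", "content", "name"):
            value = message.get(key)
            if not value:
                continue
            total += count_tokens(value)
            if key == "name":
                total += per_name
    return total


def crude_count(text):
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


messages = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "name": "sam", "content": "Count to ten slowly."},
]
print(estimate_chat_tokens(messages, crude_count))  # 23
```

With the stand-in counter the raw text is 10 tokens and the wrapper adds 13, which shows why short, many-message chats are where a raw-text estimate drifts the most.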

The 429 response body was empty. Clients couldn't tell a rate-limit reject from a token-budget reject: both came back as a bare 429. We replaced the default with a structured body that points at the actual cause:

<on-error>
  <choose>
    <when condition="@(context.Response.StatusCode == 429)">
      <return-response>
        <set-status code="429" reason="Quota exceeded" />
        <set-header name="Content-Type" exists-action="override">
          <value>application/json</value>
        </set-header>
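        <set-body>@{
          // "x-quota-reason" is a custom header; if nothing upstream sets it, every reject is reported as "rate_limit".
          var reason = context.Response.Headers.GetValueOrDefault("x-quota-reason", "rate_limit");
          return JsonConvert.SerializeObject(new {
            error      = reason,
            // GetValueOrDefault avoids a KeyNotFoundException if the error fired before tenantId was set.
            tenant     = context.Variables.GetValueOrDefault<string>("tenantId", "unknown"),
            retryAfter = context.Response.Headers.GetValueOrDefault("Retry-After", "60")
          });
        }</set-body>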
      </return-response>
    </when>
  </choose>
</on-error>

Customer support tickets dropped immediately. Half the tickets had been "your API is broken" when the actual answer was "you sent a 50K-token prompt and we rejected it before it cost anyone anything."
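On the client, the structured body turns into a simple decision. This is a Python sketch of how a caller might read it; the field names (error, tenant, retryAfter) come from the policy above, while the function name, the "token_limit" reason value in the example, and the 60-second fallback are assumptions.

```python
import json


def backoff_plan(status, body, default_wait=60):
    """Return {'reason', 'wait_seconds'} for a 429, or None for anything else."""
    if status != 429:
        return None
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}  # old-style bare 429 or a non-JSON body
    if not isinstance(payload, dict):
        payload = {}
    try:
        wait = int(payload.get("retryAfter", default_wait))
    except (TypeError, ValueError):
        wait = default_wait
    return {"reason": payload.get("error", "unknown"), "wait_seconds": wait}


print(backoff_plan(429, '{"error":"token_limit","tenant":"acme","retryAfter":"30"}'))
```

A client that surfaces `reason` to the user is what turns "your API is broken" into a self-service fix.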

What I'd do differently

Skip the Azure API Management Consumption tier. Its policy execution is throttled enough that a streaming response with the token-limit policy adds 200 to 400ms of overhead per request. Standard tier is roughly $700/month and the overhead drops below 50ms. For production AOAI workloads, that's not optional.

I would not put a vector store behind the same APIM product and policy chain. The latency profile is different, the limits should be different, and the cost dimension is different (storage + RU vs tokens). Two products, one APIM, separate policy fragments.

If you only do one thing from this post: turn on azure-openai-token-limit with a generous-but-finite budget on every tenant. The first time it rejects something you would have paid for, it's already paid for itself.

Tags: APIM, Azure OpenAI, Streaming, Multi-tenant