No. 22 · Azure AI · Nov 29, 2025 · 9 min read

Killing API Keys in Azure OpenAI: A Managed Identity + Entra ID Migration Done Properly

Six months ago we had 14 services calling Azure OpenAI. All of them used API keys. The keys were stored in Azure Key Vault, fetched at startup, and rotated quarterly via a manual process that "everyone knew was fragile."

Today none of them use API keys. They authenticate via managed identity against Entra ID. The migration took roughly three weeks of careful work, and on the way through I learned that there's a "right way" and a "marketing way" to do this. Here's the right way.

Why this matters

API keys in environment variables, even via Key Vault references, are one process dump away from being copied. They're copied between environments by humans. They get pasted into Slack messages. They show up in logs. They live in CI/CD secret stores forever.

Managed identity removes the secret entirely. The Azure SDK on managed-identity-enabled compute (Container App, AKS pod with workload identity, App Service, Function, etc.) fetches a short-lived token from the platform's local identity endpoint: the Azure Instance Metadata Service on VMs, or the equivalent endpoint each compute service exposes. The token is bound to the compute identity and scoped to whatever permissions you've granted that identity.

No secret to leak. No rotation to manage. No "oops I committed this." Just identity-based authentication.
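The SDK also caches those short-lived tokens and refreshes them shortly before expiry, so most requests never touch the token endpoint. A minimal sketch of that refresh behavior (the `fetch` callable, `Token` shape, and refresh margin are illustrative, not the SDK's actual internals):

```python
import time
from typing import Callable, NamedTuple, Optional


class Token(NamedTuple):
    value: str
    expires_on: float  # Unix timestamp


class CachingTokenProvider:
    """Caches a token and re-fetches it when it is close to expiry."""

    def __init__(self, fetch: Callable[[], Token], refresh_margin: float = 300.0):
        self._fetch = fetch          # e.g. a call to the local identity endpoint
        self._margin = refresh_margin
        self._cached: Optional[Token] = None

    def get(self, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        # Refresh when no token is cached or the cached one expires soon.
        if self._cached is None or self._cached.expires_on - now < self._margin:
            self._cached = self._fetch()
        return self._cached.value
```

The point of the margin is that a token is never handed out moments before it expires mid-request.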

The migration plan

Three phases:

Phase 1: Enable managed identity on every service. Grant each identity Cognitive Services OpenAI User role on the relevant Azure OpenAI deployment.

Phase 2: Update each service's code to support both authentication methods, controlled by a feature flag. Roll the flag environment-by-environment.

Phase 3: Remove the API key code path entirely. Rotate the API keys to random unused values to confirm nothing's still using them.

Each phase had its own pitfalls.

Phase 1: enabling managed identity

For Container Apps:

az containerapp identity assign --system-assigned -n my-app -g rg-prod

# Get the principal ID
PRINCIPAL_ID=$(az containerapp identity show -n my-app -g rg-prod --query principalId -o tsv)

# Grant the role
az role assignment create \
  --role "Cognitive Services OpenAI User" \
  --assignee-object-id $PRINCIPAL_ID \
  --assignee-principal-type ServicePrincipal \
  --scope $AOAI_RESOURCE_ID

For AKS pods, the equivalent uses workload identity:

# Federate a managed identity with the pod's service account
az identity federated-credential create \
  --name my-app-fc \
  --identity-name id-my-app-prod \
  --resource-group rg-prod \
  --issuer $AKS_OIDC_ISSUER \
  --subject "system:serviceaccount:default:my-app-sa"

# Grant the same role on the AOAI resource
az role assignment create \
  --role "Cognitive Services OpenAI User" \
  --assignee-object-id $IDENTITY_PRINCIPAL_ID \
  --assignee-principal-type ServicePrincipal \
  --scope $AOAI_RESOURCE_ID

The pod needs:

spec:
  serviceAccountName: my-app-sa
  containers:
  - name: app
    env:
    - name: AZURE_CLIENT_ID
      value: <federated identity's client ID>

For Functions and App Services it's a similar pattern via az functionapp identity assign / az webapp identity assign.

Gotcha that bit us in Phase 1: role assignments take a few minutes to propagate. We initially set up a service, immediately tried to use it with managed identity, got 401s, and assumed the assignment was wrong. Wait 5 minutes. Try again. Works.
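If you're scripting the setup, you can absorb that propagation delay with a bounded retry around the first authenticated call instead of guessing. A hedged sketch; the exception type stands in for whatever auth error your client raises, and the timings are placeholders:

```python
import time


def retry_until_authorized(call, attempts=6, delay_s=30.0, sleep=time.sleep):
    """Retry a callable that raises PermissionError (stand-in for a 401/403
    from a not-yet-propagated role assignment), waiting a fixed delay
    between attempts. Re-raises the last error if all attempts fail."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return call()
        except PermissionError as exc:
            last_exc = exc
            if attempt < attempts - 1:
                sleep(delay_s)
    raise last_exc
```

Six attempts at 30 seconds comfortably covers the few-minute propagation window we saw.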

Phase 2: dual-mode code

The application code needs to support both API key (for the rollback path) and managed identity (the new path). Here's the pattern we used in Python:

import os
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

USE_MANAGED_IDENTITY = os.getenv("USE_MANAGED_IDENTITY", "false").lower() == "true"

def make_client():
    endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
    if USE_MANAGED_IDENTITY:
        # get_bearer_token_provider wraps the credential in the zero-argument
        # callable the SDK expects, scoped to Cognitive Services. Passing
        # DefaultAzureCredential().get_token directly fails: get_token
        # requires a scope argument the SDK won't supply.
        token_provider = get_bearer_token_provider(
            DefaultAzureCredential(),
            "https://cognitiveservices.azure.com/.default",
        )
        return AzureOpenAI(
            azure_endpoint=endpoint,
            azure_ad_token_provider=token_provider,
            api_version="2024-08-01-preview",
        )
    return AzureOpenAI(
        azure_endpoint=endpoint,
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-08-01-preview",
    )

Equivalents exist in the .NET, Java, JavaScript, and Go SDKs. The pattern is the same: feature flag, dual paths, default to API key for now.

We rolled the flag to true in dev for a week, then staging for a week, then production in a graduated rollout. No incidents during the rollout.

Gotcha that bit us in Phase 2: DefaultAzureCredential tries multiple auth methods in order. In our local dev environments, it would first try the developer's logged-in az login credentials, then fall through. Some developers had logged in to Azure with credentials that didn't have Cognitive Services OpenAI User on the dev AOAI resource. Their local runs failed with 403s while CI succeeded.

Fix: in dev environments, set AZURE_TOKEN_CREDENTIALS=AzureCliCredential to force the SDK to use the local CLI's credentials, and ensure those credentials have the right role assignment on dev resources.

Phase 3: removing API keys entirely

After two weeks of stable managed-identity-only running in production, we removed the API key code path. One commit per service, well-tested, deployed in business hours.

Then the validation step: rotate the actual API keys on the Azure OpenAI resource to random values. If anything is still using API keys, it'll start failing immediately and loudly.

az cognitiveservices account keys regenerate \
  --name aoai-prod-eus \
  --resource-group rg-prod-eus \
  --key-name key1

We waited 24 hours. Nothing failed. Rotated key2 the next day. Nothing failed.

Two services that we'd missed in the migration showed up at this point:

  • A scheduled report generator that ran weekly. Hadn't fired during our two-week observation window.
  • A backup-data-pipeline cleanup job that ran monthly.

Neither was on the migration tracker. Both crashed at their next scheduled run. Migrated them in 30 minutes each. Lesson: low-frequency cron jobs need explicit attention during a migration like this.
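One way to catch those before the rotation step is to compare each job's run interval against the planned observation window, assuming you can enumerate the jobs and their schedules. A hypothetical sketch:

```python
from datetime import timedelta


def jobs_not_covered(jobs: "dict[str, timedelta]", window: timedelta) -> "list[str]":
    """Return the names of jobs whose run interval is longer than the
    observation window, i.e. jobs that may never fire while you watch."""
    return sorted(name for name, interval in jobs.items() if interval > window)
```

Anything this flags either needs a manual test run during the window or a longer window before you declare the migration done.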

What I'd do differently

Inventory first, harder. We thought we had 14 services. We actually had 16. Two were obscure enough that nobody was tracking them. Building a proper inventory by querying every Azure OpenAI resource for "what identities have requested tokens recently" via Activity Logs would have caught these.
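That Activity Log check boils down to a set difference: identities observed calling the resource versus identities on the migration tracker. A sketch over a simplified record shape (the field names and IDs here are hypothetical, not the real Activity Log schema):

```python
def untracked_callers(log_records: "list[dict]", tracked: "set[str]") -> "set[str]":
    """Extract distinct caller identities from simplified Activity Log
    records touching Cognitive Services, and return the ones missing
    from the migration tracker: the services you forgot about."""
    observed = {
        rec["caller"]
        for rec in log_records
        if rec.get("operation", "").startswith("Microsoft.CognitiveServices")
    }
    return observed - tracked
```

Run it over a window longer than your slowest cron job's interval, or the weekly and monthly stragglers stay invisible.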

Rotate the keys sooner. If you wait until the very end of Phase 3 to rotate, you discover gaps the hard way. Better to rotate as soon as you think migration is complete, then roll back immediately if you find dependents. The 5-minute role-propagation delay still applies; budget for it.

Don't use system-assigned identity for shared resources. System-assigned identity is tied to a specific compute resource. If you tear it down and rebuild, the identity is new — role assignments have to be re-done. For AOAI consumers that get rebuilt occasionally (auto-scaling Container App revisions), use a user-assigned identity referenced by the resource. The identity persists across rebuilds.

The portable lesson

Killing API keys is a 3-week project for a typical team. The cost is real but bounded. The payoff is forever — every credential-rotation incident you don't have, every "my API key got into our git history" panic you don't have, every "rotate X by Friday" meeting you don't have.

Worth it. Do it.

Azure OpenAI · Managed Identity · Security
