Stand Up a Production-Ready Internal MCP Server on Azure Container Apps With Workload Identity

The first version of our internal MCP server ran on a developer's laptop. It worked beautifully. Then they took a Friday off and the FinOps Slack channel filled with "MCP server unavailable" complaints by 11am, because the laptop had auto-locked, the SSE connections had dropped, and nobody else on the team had the env vars to reproduce it.

This is, broadly, the entire history of internal tooling. A single engineer builds a thing on their laptop. The team likes it. Suddenly the laptop is a piece of production infrastructure. The day the laptop closes is the day the team learns about single points of failure the hard way.

We moved to Azure Container Apps with workload identity federation. Same protocol, same code, no keys, scale to zero when no one's asking, real liveness and readiness probes, real diagnostic telemetry, and an SLA the team can think about. This post is the full hosting walkthrough. By the end you have an MCP server you can hand off to someone else and they can keep it running without owning the laptop it was built on.

If you've already got the application code from the cost-MCP tutorial, this is the platform engineering deep dive on the hosting half.

Why Container Apps and not AKS, Functions, or App Service

Quick detour because tooling choices have long tails.

App Service would work. It's the boring option. Why not it: the cheaper SKUs don't scale to zero, which means an idle MCP server is paying for a B1 instance forever. The plus side is that App Service has the most mature managed identity story; if you already use it for everything else, the consistency might outweigh the cost story.

Functions would also work. It scales to zero on the consumption tier, costs essentially nothing idle. Why not it: cold starts on consumption tier are 2 to 8 seconds, which is fine for a webhook but distinctly not fine for a query tool engineers use interactively. The premium tier solves cold starts but loses the scale-to-zero argument.

AKS would work and is what a few teams reach for. It's overkill. You'd pay for the node pool, manage Kubernetes upgrades, write a Helm chart, set up ingress, and at the end you'd have something behaviourally identical to a Container App. AKS is the right answer if you have other things on the cluster that share its lifecycle. A single MCP server on its own cluster is a maintenance burden.

Container Apps wins on three axes for this exact workload. Scale-to-zero with HTTP-based scaling, workload identity baked in (identity: { type: 'SystemAssigned' } and you're done), no node pool to babysit. Cold starts are around 4 to 6 seconds from zero, fast enough that "I'll wait" is the natural reaction. The right tool for the job, and "the right tool" matters more in 2026 than it used to, because the menu of platforms has gotten long enough that "use whatever we already have" is no longer a defensible default.

What you'll have at the end

~/mcp-on-aca/
├── infra/
│   ├── main.bicep
│   ├── modules/
│   │   ├── env.bicep
│   │   ├── app.bicep
│   │   ├── private-link.bicep
│   │   └── rbac.bicep
│   └── main.bicepparam
├── server/
│   └── Dockerfile
├── scripts/
│   ├── smoke.sh
│   └── tail-logs.sh
└── README.md

About 350 lines of Bicep, all parameterised, ready to drop into a real platform monorepo. The application code is whatever your MCP server happens to be; this lab is purely about the hosting story.

Prerequisites

az --version            # 2.65 or newer
az bicep version        # 0.30 or newer (run `az bicep upgrade` if older)
docker --version
jq --version            # for the smoke script

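You'll need an Azure subscription where you can create resource groups, register resource providers, and assign roles at the subscription scope. Subscription Owner is overkill. The practical minimum is Contributor on the subscription (provider registration and resource group creation both happen there) plus Role Based Access Control Administrator at whatever scope the role assignments target, which for the Cost Management Reader assignment in this walkthrough is the subscription itself.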

az login
az account set --subscription "<your-subscription-id>"

for p in Microsoft.App Microsoft.OperationalInsights Microsoft.Network \
         Microsoft.ContainerRegistry Microsoft.ManagedIdentity; do
  az provider register -n "$p" --wait
done

The provider registration loop is something most tutorials skip because they assume the providers are already on. New subscriptions don't have them, and if you skip this the deployment fails with an opaque error (typically MissingSubscriptionRegistration). The loop is idempotent, so it's safe to re-run.
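If you'd rather confirm the registrations actually landed than trust the loop, a check along these lines works. It's a minimal sketch: unregistered_providers is a name made up for this post, and it relies only on the registrationState field that az provider show returns.

```shell
# Print every provider in the argument list that is not yet Registered.
# Returns non-zero if at least one provider is missing.
unregistered_providers() {
  local p state missing=0
  for p in "$@"; do
    state=$(az provider show -n "$p" --query registrationState -o tsv)
    if [ "$state" != "Registered" ]; then
      echo "$p: $state"
      missing=1
    fi
  done
  return "$missing"
}
```

Call it with the same list as the loop, for example unregistered_providers Microsoft.App Microsoft.Network || echo "re-run the registration loop". No output and a zero exit code means you're clear to deploy.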

Step 1: The Container Apps environment + Log Analytics

infra/modules/env.bicep:

param location string = resourceGroup().location
param name string
param vnetSubnetId string
param logRetentionDays int = 30

resource law 'Microsoft.OperationalInsights/workspaces@2023-09-01' = {
  name: 'law-${name}'
  location: location
  properties: {
    sku: { name: 'PerGB2018' }
    retentionInDays: logRetentionDays
  }
}

resource env 'Microsoft.App/managedEnvironments@2024-03-01' = {
  name: 'cae-${name}'
  location: location
  properties: {
    appLogsConfiguration: {
      destination: 'log-analytics'
      logAnalyticsConfiguration: {
        customerId: law.properties.customerId
        sharedKey: listKeys(law.id, '2023-09-01').primarySharedKey
      }
    }
    vnetConfiguration: {
      infrastructureSubnetId: vnetSubnetId
      internal: true
    }
    workloadProfiles: [
      { name: 'Consumption', workloadProfileType: 'Consumption' }
    ]
    zoneRedundant: false
  }
}

output environmentId string = env.id
output environmentName string = env.name
output staticIp string = env.properties.staticIp
output workspaceId string = law.id

Two non-obvious choices in this module, both worth understanding because they shape what the server can do.

internal: true gives the env a private IP only. No internet exposure, no Front Door wiring, reachable only from inside the corporate VNet, and only once a private DNS zone is linked to it (Step 4 covers that wiring; Azure does not set it up for you). For an internal MCP server, this is the right default. The temptation to expose it externally with "we'll add auth later" is the same temptation that produces internal services on the public internet behind a basic-auth header that nobody rotates. Internal-by-default is the contract you want with your future self.

Zone redundancy is off because for a small internal MCP server it doubles cost without buying anything an MCP client cares about. If a zone goes down, the worst case is engineers can't query cost data for a few minutes; this isn't tier-1 production traffic. Turn on zone redundancy when the workload genuinely needs it. Don't do it because the documentation suggests it as a default.

PerGB2018 for Log Analytics is the cheapest tier and the right one until you have enough log volume to justify a capacity reservation. For an MCP server emitting structured logs at maybe 50KB per query, you'll be on the free tier (5GB/month) for a long time.
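To put a number on "a long time", here is the back-of-envelope arithmetic as a sketch. It assumes decimal units (1 GB = 1,000,000 KB) and takes the 50KB-per-query figure at face value; queries_within_allowance is a made-up helper name.

```shell
# Queries per month that fit inside a log allowance.
# Arguments: allowance in GB, log volume per query in KB (whole numbers).
queries_within_allowance() {
  local allowance_gb=$1 kb_per_query=$2
  echo $(( allowance_gb * 1000000 / kb_per_query ))
}

queries_within_allowance 5 50   # prints 100000
```

That is roughly 3,300 queries a day before the first billable gigabyte, which is a lot of cost questions for one internal tool.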

Step 2: The Container App

infra/modules/app.bicep:

param location string = resourceGroup().location
param name string
param environmentId string
param image string
param targetPort int = 8080
param minReplicas int = 0
param maxReplicas int = 5
param workspaceId string

resource app 'Microsoft.App/containerApps@2024-03-01' = {
  name: 'ca-${name}'
  location: location
  identity: { type: 'SystemAssigned' }
  properties: {
    managedEnvironmentId: environmentId
    workloadProfileName: 'Consumption'
    configuration: {
      activeRevisionsMode: 'Single'
      ingress: {
        external: false
        targetPort: targetPort
        transport: 'auto'
        allowInsecure: false
        traffic: [{ latestRevision: true, weight: 100 }]
        stickySessions: { affinity: 'sticky' }
      }
    }
    template: {
      containers: [{
        name: 'mcp'
        image: image
        resources: { cpu: json('0.5'), memory: '1Gi' }
        env: [
          { name: 'NODE_ENV',      value: 'production' }
          { name: 'MCP_TRANSPORT', value: 'http' }
          { name: 'PORT',          value: string(targetPort) }
        ]
        probes: [
          {
            type: 'Liveness'
            httpGet: { path: '/healthz', port: targetPort }
            initialDelaySeconds: 5
            periodSeconds: 30
            failureThreshold: 3
          }
          {
            type: 'Readiness'
            httpGet: { path: '/healthz', port: targetPort }
            initialDelaySeconds: 2
            periodSeconds: 5
          }
        ]
      }]
      scale: {
        minReplicas: minReplicas
        maxReplicas: maxReplicas
        rules: [
          {
            name: 'http-rule'
            http: { metadata: { concurrentRequests: '20' } }
          }
        ]
      }
    }
  }
}

resource diag 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  scope: app
  name: 'baseline'
  properties: {
    workspaceId: workspaceId
    logs: [
      { category: 'ContainerAppConsoleLogs', enabled: true }
      { category: 'ContainerAppSystemLogs',  enabled: true }
    ]
  }
}

output principalId string = app.identity.principalId
output appName string = app.name
output fqdn string = app.properties.configuration.ingress.fqdn

Three choices that matter for production behaviour.

stickySessions.affinity: 'sticky' is the SSE survival flag. SSE clients hold a long-lived connection to one replica. If the load balancer routes the next message of the same session to a different replica, the new replica doesn't know about the session, the request 404s, the client experiences the server "dropping" them. Sticky affinity routes the same client back to the replica it started on. Without this, an MCP server scaled past one replica is a flaky MCP server.

activeRevisionsMode: 'Single' makes new revisions replace the old one wholesale. Multiple revisions enable blue/green at the cost of complexity (now you have two live versions, two log streams, two debugging contexts). Defer until you actually need it. For most internal tools, "Single" is right and "Multiple" is a footgun pretending to be a feature.

The HTTP scaler with concurrentRequests: '20' scales out when sustained concurrency exceeds 20 per replica. For most MCP workloads this lands at one or two replicas; bursts trigger more. The 20 is a tuning knob: lower it (say to 5) to verify that the scale rule fires, then raise it back up. The default scaler in older API versions was based on a different metric and behaved unpredictably; pinning to concurrentRequests makes the behaviour testable.
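A rough mental model for the replica count is ceiling division clamped to the min and max. The sketch below is a simplification under that assumption, not the scaler's actual algorithm (the real one samples over time and applies its own stabilisation), and desired_replicas is a hypothetical helper.

```shell
# Approximate replica count for a given number of concurrent requests.
# Arguments: concurrent requests, concurrentRequests target, minReplicas, maxReplicas.
desired_replicas() {
  local concurrent=$1 target=$2 min=$3 max=$4
  local n=$(( (concurrent + target - 1) / target ))  # ceiling division
  if [ "$n" -lt "$min" ]; then n=$min; fi
  if [ "$n" -gt "$max" ]; then n=$max; fi
  echo "$n"
}

desired_replicas 8 20 0 5     # prints 1
desired_replicas 8 5 0 5      # prints 2
desired_replicas 500 20 0 5   # prints 5 (capped by maxReplicas)
```

The first two lines show why dropping the target to 5 is a useful verification trick: the same eight concurrent requests that sit comfortably on one replica at 20 are enough to force a second one at 5.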

The diagnostic settings block at the bottom of the module is the one most teams forget because it lives in a different namespace (Microsoft.Insights) from the app. Without it, your logs go to a place you can't query. With it, console logs are queryable in Log Analytics within seconds and you can build the latency dashboard in step 11.

Step 3: RBAC for the system-assigned identity

infra/modules/rbac.bicep:

targetScope = 'subscription'

param principalId string
param roleAssignments array  // [{ scope: '...', roleDefinitionId: '...' }]

// The scope property of a resource cannot be built from a string, so each
// assignment is created at this module's deployment scope (the subscription).
// ra.scope only feeds the deterministic name; assignments at a narrower scope
// need a module that is deployed at that scope.
resource assignments 'Microsoft.Authorization/roleAssignments@2022-04-01' = [
  for (ra, i) in roleAssignments: {
    name: guid(principalId, ra.scope, ra.roleDefinitionId, string(i))
    properties: {
      principalId: principalId
      principalType: 'ServicePrincipal'
      roleDefinitionId: ra.roleDefinitionId
    }
  }
]

The module accepts a list of (scope, roleDefinitionId) pairs. Don't hardcode roles inside this module; each consumer module declares the roles its app needs. That keeps the blast radius for changes small and makes the security review one file at a time.

A small cultural observation: this is the kind of module that, in a six-month-old codebase, has accumulated forty hardcoded roles for fifteen apps because someone copy-pasted it once and nobody refactored. The array parameter is what prevents that. If you can't articulate, on the consumer side, "this app needs Cost Management Reader at subscription scope and AcrPull on the registry", you don't yet understand what the app does. The Bicep is doing some of the security review for you.

Step 4: Private endpoint into a peered VNet

infra/modules/private-link.bicep:

param location string = resourceGroup().location
param environmentName string
param subnetId string
param privateDnsZoneId string

resource env 'Microsoft.App/managedEnvironments@2024-03-01' existing = {
  name: environmentName
}

resource pe 'Microsoft.Network/privateEndpoints@2024-05-01' = {
  name: 'pe-${environmentName}'
  location: location
  properties: {
    subnet: { id: subnetId }
    privateLinkServiceConnections: [
      {
        name: 'plsc-${environmentName}'
        properties: {
          privateLinkServiceId: env.id
          groupIds: ['managedEnvironments']
        }
      }
    ]
  }
}

resource zoneGroup 'Microsoft.Network/privateEndpoints/privateDnsZoneGroups@2024-05-01' = {
  parent: pe
  name: 'default'
  properties: {
    privateDnsZoneConfigs: [
      {
        name: 'aca-zone'
        properties: { privateDnsZoneId: privateDnsZoneId }
      }
    ]
  }
}

The private endpoint needs a private DNS zone for *.privatelink.<region>.azurecontainerapps.io. Either reuse one your platform team already owns, or create one in this stack. If you create it, link it to every VNet whose workloads will resolve the MCP server's FQDN.

The DNS story is where most "internal-only Container Apps" projects get stuck for a day. The private endpoint resolves to a private IP only via the private DNS zone; if the zone isn't linked to the workload's VNet, the FQDN won't resolve from there and the workload silently fails to reach the server. Test resolution from inside the workload's VNet, not from your laptop, before declaring the deploy successful. nslookup ca-mcp-prod.internal... from a VM in the right VNet is the unambiguous test.
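A scripted version of that resolution test might look like the sketch below, run from a Linux VM inside the workload's VNet. It assumes getent is available on that VM; is_private_ipv4 and resolves_privately are names invented here, and the range check covers only the RFC 1918 blocks.

```shell
# True if the address falls in 10/8, 172.16/12 or 192.168/16.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Resolve the FQDN and report whether it came back as a private address.
resolves_privately() {
  local fqdn=$1 ip
  ip=$(getent hosts "$fqdn" | awk 'NR == 1 { print $1 }' || true)
  if [ -z "$ip" ]; then
    echo "FAIL: $fqdn did not resolve"
    return 1
  fi
  if is_private_ipv4 "$ip"; then
    echo "OK: $fqdn -> $ip"
  else
    echo "FAIL: $fqdn -> $ip (not a private address)"
    return 1
  fi
}
```

Run resolves_privately with the app's FQDN from each VNet that is supposed to reach the server. A "did not resolve" from one of them almost always means the zone link for that VNet is missing.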

Step 5: Wire the modules together

infra/main.bicep:

targetScope = 'resourceGroup'

param location string = resourceGroup().location
param name string
param image string
param vnetSubnetId string
param peSubnetId string
param privateDnsZoneId string

module envModule 'modules/env.bicep' = {
  name: 'env'
  params: {
    location: location
    name: name
    vnetSubnetId: vnetSubnetId
  }
}

module appModule 'modules/app.bicep' = {
  name: 'app'
  params: {
    location: location
    name: name
    environmentId: envModule.outputs.environmentId
    image: image
    workspaceId: envModule.outputs.workspaceId
  }
}

module peModule 'modules/private-link.bicep' = {
  name: 'pe'
  params: {
    location: location
    environmentName: envModule.outputs.environmentName
    subnetId: peSubnetId
    privateDnsZoneId: privateDnsZoneId
  }
}

module rbacModule 'modules/rbac.bicep' = {
  name: 'rbac'
  scope: subscription()
  params: {
    principalId: appModule.outputs.principalId
    roleAssignments: [
      {
        scope: subscription().id
        roleDefinitionId: subscriptionResourceId(
          'Microsoft.Authorization/roleDefinitions',
          '72fafb9e-0641-4937-9268-a91bfd8191a3') // Cost Management Reader
      }
    ]
  }
}

output appFqdn string = appModule.outputs.fqdn
output appName string = appModule.outputs.appName

infra/main.bicepparam:

using 'main.bicep'

param name = 'mcp-prod'
param image = '<your-registry>.azurecr.io/mcp-server:latest'
param vnetSubnetId = '/subscriptions/.../subnets/aca-infra-001'
param peSubnetId   = '/subscriptions/.../subnets/private-endpoints'
param privateDnsZoneId = '/subscriptions/.../privateDnsZones/privatelink.eastus.azurecontainerapps.io'

The four-module composition is deliberate. env gets the Log Analytics workspace and the Container Apps environment in one place because they share a lifecycle. app is the workload itself plus its diagnostics. private-link is the network plumbing. rbac is the security layer.

Each can be reused or replaced independently. If you decide to switch the workload to AKS later, app is the only module that changes; env (well, the LAW), private-link, and rbac come along. If you decide to add a second MCP server in the same env, you reuse env and add a second app. The boundaries match the rate of change, not the namespaces of Bicep.

Step 6: Deploy and verify

RG=rg-mcp-prod
az group create -n $RG -l eastus

az deployment group create \
  -g $RG \
  --template-file infra/main.bicep \
  --parameters infra/main.bicepparam \
  --query 'properties.outputs'

Once it finishes (roughly 3 to 4 minutes, Container Apps env creation is the slow step):

FQDN=$(az containerapp show -g $RG -n ca-mcp-prod \
  --query properties.configuration.ingress.fqdn -o tsv)
echo "FQDN=$FQDN"

The FQDN looks like ca-mcp-prod.internal.<random>.<region>.azurecontainerapps.io. From inside the VNet, this resolves via the private DNS zone to the env's static IP. From outside, nslookup returns NXDOMAIN, which is the right answer.

Step 7: The smoke test

scripts/smoke.sh:

#!/usr/bin/env bash
set -euo pipefail

FQDN="${1:-${FQDN:?missing FQDN}}"

echo "=== /healthz ==="
curl -fsS "https://$FQDN/healthz"
echo

echo "=== open SSE channel ==="
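TMP=$(mktemp)
# Read the stream for 3 seconds. curl exits 28 when --max-time cuts the stream
# off, which is expected for SSE, so the exit code is ignored. Running curl in
# the foreground also avoids leaving an orphaned curl behind a killed subshell.
curl -N -fsS --max-time 3 "https://$FQDN/sse" > "$TMP" 2>&1 || true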

if grep -q "endpoint" "$TMP"; then
  echo "OK: SSE channel emitted endpoint event"
  head -5 "$TMP"
else
  echo "FAIL: no SSE endpoint event"
  cat "$TMP"
  exit 1
fi
rm -f "$TMP"

echo "=== smoke test passed ==="

chmod +x scripts/smoke.sh
./scripts/smoke.sh "$FQDN"

The first run from a peered VNet typically takes 4 to 6 seconds (cold start, image pull, Node.js startup, MCP handshake). Subsequent runs are sub-second. If you see consistent 4+ second latencies on subsequent runs, the workload is scaling to zero between every call; either the traffic is naturally that bursty (acceptable) or your scaler's cooldown is too aggressive (worth tuning).
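To see the cold and warm numbers rather than guess at them, curl's time_total write-out variable is enough. The sketch below uses two hypothetical helpers, time_healthz and classify_latency, and treats the 4-second mark from the paragraph above as the cold-start threshold.

```shell
# Print the total request time, in seconds, for one /healthz call.
time_healthz() {
  curl -fsS -o /dev/null -w '%{time_total}\n' "https://$1/healthz"
}

# Label a latency as "cold" or "warm". The threshold defaults to 4 seconds.
classify_latency() {
  awk -v t="$1" -v limit="${2:-4}" \
    'BEGIN { print ((t + 0 >= limit + 0) ? "cold" : "warm") }'
}
```

A loop such as for i in 1 2 3; do s=$(time_healthz "$FQDN"); echo "run $i: ${s}s $(classify_latency "$s")"; done should print one cold line followed by warm ones. Three cold lines in a row, with a pause between runs shorter than the cooldown, is the signal to look at the scaler.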

Step 8: One-line live log tail

scripts/tail-logs.sh:

#!/usr/bin/env bash
set -euo pipefail

RG="${RG:?missing RG}"
APP="${APP:-ca-mcp-prod}"

az containerapp logs show \
  -g "$RG" -n "$APP" \
  --container mcp \
  --tail 50 --follow \
  --format text

Useful in a second terminal while the smoke test runs. If you see MCP HTTP transport on :8080 (or whatever your server logs at startup), the container booted and everything after is your application speaking. If you see nothing for 30+ seconds and then Container 'mcp' was terminated with exit code 1, the container started, hit something fatal, and crashed. Logs from that startup window are the only diagnostic.

Step 9: Auto-scale verification

Generate a tiny load:

for i in $(seq 1 100); do
  curl -fsS "https://$FQDN/healthz" >/dev/null &
  [ $((i % 10)) -eq 0 ] && wait
done
wait

az containerapp replica list -g $RG -n ca-mcp-prod --query 'length([])' -o tsv

You should see 1 to 2 replicas if the burst was sustained, then back to the configured minReplicas after a couple of minutes. Adjust concurrentRequests in the scale rule if you want the rule more sensitive (lower) or more relaxed (higher).

A subtle point: this test verifies the scaler. It does not verify what happens to in-flight SSE sessions when a second replica arrives. To test that, hold an SSE session open in one terminal, run a load burst from another, and then send a message on the held session to confirm it still gets an answer. This is what stickySessions.affinity: 'sticky' from Step 2 is doing. Without it, the open stream itself stays connected, but the session's next message can be routed to the new replica, which has no record of the session.

Step 10: Hook it up to a VS Code workspace

.vscode/mcp.json:

{
  "servers": {
    "internal-mcp": {
      "type": "sse",
      "url": "https://${input:fqdn}/sse"
    }
  },
  "inputs": [
    { "id": "fqdn", "type": "promptString", "description": "Container App FQDN" }
  ]
}

When VS Code prompts for the FQDN, paste the one from Step 6. Then open Copilot Chat in agent mode and check the tools list; the server's tools should show up there.

Step 11: Useful KQL queries

Once diagnostic settings are streaming, build the small set of queries that answer "is the server healthy" without opening the portal.

A tool-call latency view, useful when wiring up an Application Insights workbook:

ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(1h)
| where Log_s contains "tool="
| extend tool = extract(@'tool=(\S+)', 1, Log_s)
| extend ms   = toint(extract(@'duration_ms=(\d+)', 1, Log_s))
| summarize p50=percentile(ms, 50), p95=percentile(ms, 95), n=count() by tool
| order by p95 desc

This works only if your application logs lines like tool=cost_by_service duration_ms=312. Build that into your server's logging from the start; structured logs that you can extract() from KQL are the difference between "I have logs" and "I have observability".
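One cheap way to keep that log shape from drifting is to pipe sample server output through a check before it ever reaches Log Analytics. The sketch below is an assumption about how you might do that locally or in CI; check_log_shape is a made-up name, and its patterns only approximate the \S+ and \d+ used in the KQL.

```shell
# Read log lines on stdin. Succeeds only if every line carries both fields
# that the latency query extracts; prints the lines that do not.
check_log_shape() {
  awk '
    !/tool=[^ ]+/ || !/duration_ms=[0-9]+/ { print "unparseable: " $0; bad = 1 }
    END { exit bad + 0 }
  '
}
```

For example, printf 'tool=cost_by_service duration_ms=312\n' | check_log_shape passes silently, while a line like "cost_by_service took 312ms" is echoed back as unparseable and the check exits non-zero.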

A rate of 4xx-class responses, the alarm worth firing on:

ContainerAppConsoleLogs_CL
| where TimeGenerated > ago(15m)
| where Log_s contains "status_code="
| extend code = toint(extract(@'status_code=(\d+)', 1, Log_s))
| where code >= 400 and code < 500
| summarize n=count() by bin(TimeGenerated, 1m)

A spike here means clients are sending the wrong shape. That's either a client bug, a server-version drift, or someone running an old MCP client against the new server. All three are worth knowing about.

Production checklist

Before pointing real users at this:

Pin the image SHA, not the :latest tag. Predictable rollouts, predictable rollbacks. The day someone pushes a broken image with the same tag is the day you regret using :latest in production.
Set a CPU+memory floor that matches your cold-start budget. 0.5 CPU is fine for a typical MCP server; bump to 1.0 if your tools are CPU-bound (parsing big diffs, running KQL against gigabyte-scale workspaces).
Add a budget alarm on the resource group at $50/month. Container Apps scaled to zero is nearly free; anything above is unexpected. The alarm catches misconfiguration that left replicas pinned to one when you thought they'd scale to zero.
Confirm internal: true on the env. nslookup from outside the VNet should NOT resolve. If it does, your env is public and you're back to the start of this tutorial.
Document the role assignments. That's the audit answer for "who can this server impersonate?". The rbacModule.params.roleAssignments array is the list to paste into the audit report.
Pin ManagedIdentityCredential directly in production code paths. DefaultAzureCredential's walk-the-chain logic adds 1 to 2 seconds to the first request, which is unnecessary tax for a server that knows it's running with a managed identity.
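For the first item on that list, resolving a tag to its digest can be scripted. This is a sketch that assumes the image lives in Azure Container Registry and that az acr repository show reports a digest field for the tag; pinned_image_ref is a hypothetical helper name.

```shell
# Print a digest-pinned image reference for registry/repo:tag.
pinned_image_ref() {
  local registry=$1 repo=$2 tag=$3 digest
  digest=$(az acr repository show -n "$registry" \
    --image "$repo:$tag" --query digest -o tsv)
  echo "$registry.azurecr.io/$repo@$digest"
}
```

The output has the form <your-registry>.azurecr.io/mcp-server@sha256:... and goes into main.bicepparam as the image value in place of the :latest reference.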

Troubleshooting

Container 'mcp' was terminated with exit code 1: Container started, hit an unhandled exception, exited. Check logs (scripts/tail-logs.sh); usually a missing env var or network reach problem. Bump up the verbosity for the first few minutes after deploy until startup looks clean.

The image '<...>' could not be pulled: The Container App's identity doesn't have AcrPull on the registry, or the image tag doesn't exist. Verify with az acr repository show. If the registry is private, also check the firewall rules; the env's outbound IP needs to be allowed.
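If the missing piece is the AcrPull grant, the two CLI calls below cover it. Treat this as a sketch under assumptions: the modules in this post don't show any registry wiring, grant_acr_pull is a name invented here, and it presumes the app uses its system-assigned identity to pull.

```shell
# Grant the app's system-assigned identity AcrPull on the registry, then tell
# the app to authenticate to that registry with the identity.
# Arguments: resource group, container app name, registry name (no domain).
grant_acr_pull() {
  local rg=$1 app=$2 registry=$3 principal_id acr_id
  principal_id=$(az containerapp show -g "$rg" -n "$app" \
    --query identity.principalId -o tsv)
  acr_id=$(az acr show -n "$registry" --query id -o tsv)
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role AcrPull --scope "$acr_id" -o none
  az containerapp registry set -g "$rg" -n "$app" \
    --server "$registry.azurecr.io" --identity system -o none
}
```

Worth knowing: a system-assigned identity only exists once the app does, so on a first deploy the initial revision may fail to pull until something like this has run. Once it works by hand, move the grant into Bicep so it stops being tribal knowledge.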

SSE drops every 4 minutes: keepalive isn't being emitted. Check the SSE transport implementation (: keepalive\n\n every 25s). Container Apps closes idle HTTP at 240 seconds. Long-lived SSE sessions look idle to the load balancer between events.

502 Bad Gateway from the FQDN: App's listening port doesn't match ingress targetPort, or the app isn't listening on 0.0.0.0. Confirm the container logs listening on 0.0.0.0:8080, not 127.0.0.1:8080. The 0.0.0.0 binding is what makes the port reachable from the env's network namespace.

Private endpoint doesn't resolve from a peered VNet: Private DNS zone isn't linked to the peered VNet. Run az network private-dns link vnet create for each peered VNet. Easy to forget when you add a new VNet later; build it into your VNet provisioning Bicep so it can't drift.
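Until the links live in Bicep, a loop over the VNet IDs does the job. This is a sketch: link_zone_to_vnets and the link-<vnet name> naming are assumptions made for illustration, and auto-registration is switched off because the zone only needs to answer lookups.

```shell
# Link one private DNS zone to each VNet passed by resource ID.
# Arguments: zone resource group, zone name, then one or more VNet IDs.
link_zone_to_vnets() {
  local rg=$1 zone=$2 vnet_id
  shift 2
  for vnet_id in "$@"; do
    az network private-dns link vnet create \
      -g "$rg" -z "$zone" \
      -n "link-$(basename "$vnet_id")" \
      -v "$vnet_id" -e false -o none
  done
}
```

Pass the resource group that holds the zone, the zone name (for example privatelink.eastus.azurecontainerapps.io), and the full resource ID of every peered VNet that should be able to resolve the server.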

Cold start exceeds 10 seconds: Image is too large, or DefaultAzureCredential is being used. Slim the image (multi-stage build), pin to ManagedIdentityCredential directly. Setting minReplicas: 1 eliminates the cold start at the cost of always-on compute (around $7/month). Worth it for tools engineers use interactively.

Auto-scale never kicks in: The scale rule's concurrentRequests is too high for your traffic. Lower to 5 to verify the rule fires, then tune up. The scaler logs decisions to the env's diagnostic stream, query ContainerAppSystemLogs_CL for ScalerName and look for the rule firing.
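For a quick experiment you can change the threshold from the CLI instead of redeploying. A minimal sketch, assuming the rule is still named http-rule as in app.bicep; set_http_concurrency is a made-up wrapper.

```shell
# Temporarily change the HTTP scale rule's concurrency target.
# Arguments: resource group, container app name, new concurrency target.
set_http_concurrency() {
  local rg=$1 app=$2 concurrency=$3
  az containerapp update -g "$rg" -n "$app" \
    --scale-rule-name http-rule \
    --scale-rule-type http \
    --scale-rule-http-concurrency "$concurrency" -o none
}
```

Run set_http_concurrency "$RG" ca-mcp-prod 5, generate some load, and watch the replica count. The change creates a new revision that drifts from the Bicep, so redeploy the template afterwards to restore the 20.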

What you have at this point

A private-ingress MCP server that:

Survives the laptop closing. The original failure mode is gone.
Is reachable from any peered VNet without internet exposure. The private DNS zone makes the FQDN feel like a normal corporate hostname.
Scales to zero when nobody's asking, scales out when they are. Idle cost rounds to the price of a Log Analytics workspace, which is to say roughly five dollars a month.
Has structured logs flowing to Log Analytics with two pre-built KQL queries to answer the questions that come up first.
Can be handed to anyone with Contributor on the resource group; the operational story is fully captured in Bicep.

The shift in posture is the part that's hard to articulate in a step list. Before, the MCP server was a private artifact owned by one person. After, it's part of the platform. Other engineers can read the Bicep, understand what it does, propose changes, deploy them, and the MCP server's evolution becomes a normal team activity rather than a "go ask whoever runs that server" question.

That shift is the actual deliverable. The 350 lines of Bicep are the means; the cultural change is the end. You now have an internal tool you can hand off, scale up, deprecate, or replace, and none of those operations require a single specific human to be available. That's the property that makes a thing a piece of platform rather than a piece of someone's laptop.

The same module set works for the next MCP server, and the next one. Across our org we now have four internal MCP servers in the same shape, deployed by the same pipeline, sharing the same Log Analytics workspace, with their permissions visible in one Bicep file each. The operational cost of running four servers is barely higher than running one, because the differentiated work was the application logic, and the rest is template.
