I review Bicep PRs. A lot of them. The team had grown from four engineers writing infrastructure to fourteen, and the rate of "did you check that this resource has diagnostic settings wired up" comments I was leaving had passed the point of usefulness. The interesting comments, the ones that warranted a senior engineer's attention, were getting buried under repetitive ones that any junior with a checklist could write.
The boring comments are the ones to automate. Not all of them, just the ones that follow a fixed pattern: secret in cleartext, public network access still on, missing diagnostics, missing tags, RBAC scoped too broadly. Those are recognisable rules. The interesting comments, the ones about whether the architecture makes sense, those stay with humans.
This post is the entire build of a Bicep PR reviewer bot. By the end you have a GitHub Actions workflow that triggers on every PR touching **/*.bicep or **/*.bicepparam, calls Azure OpenAI with a domain-specific prompt, posts findings back as a single PR comment via the GitHub REST API, and runs on OIDC federated identity with a budget cap. Total provisioning plus code: about 250 lines. Total cost on a 40-engineer team running ~120 PRs a week: around $20 a month. Reproducible from an empty repo.
Why a bot, and not a linter, and not a checklist
Brief detour because the tooling choice matters and it's the question a senior engineer always asks before approving the work.
A static linter (like bicep build plus a custom rule set) covers about 60% of what's findable. The other 40% needs context the linter doesn't have, things like "this storage account is downstream of a workload that handles PII, so public network access being enabled is worse than it would be on a marketing-site bucket". Linters can't reason about which workload a resource belongs to. A model can, if the diff includes the workload name and the project structure makes the lineage obvious.
A checklist on the PR template covers some of it but relies on the author to read and apply. Six months of data on one team showed checklist compliance at 41%, which is to say slightly worse than coin-flip. The author skipping the checklist is normal human behaviour; the bot doesn't skip.
The bot is the third thing. It catches what the linter can't, doesn't depend on the author, and runs in twenty seconds at the end of CI. The bot's findings are correct about three quarters of the time on novel patterns and approaching ninety percent on the recognisable ones from the prompt. Reviewers spend the saved attention on architecture instead of on spotting typos in tag values.
Repo layout
~/bicep-pr-bot/
├── .github/
│   └── workflows/
│       └── bicep-review.yml
├── reviewer/
│   ├── __init__.py
│   ├── __main__.py
│   ├── diff.py
│   ├── prompt.py
│   ├── openai_client.py
│   └── github_client.py
├── infra/
│   ├── main.bicep
│   ├── main.bicepparam
│   └── federated-identity.bicep
├── tests/
│   ├── fixtures/
│   │   └── bicep_diff_sample.txt
│   └── test_prompt.py
├── pyproject.toml
└── README.md
The Python is split deliberately into four files. Each one is small enough to read on a phone. diff.py captures and filters; prompt.py is the system prompt and templating; openai_client.py is the AOAI wrapper; github_client.py is the comment-posting REST plumbing. Each can be unit-tested without the others. If any of these grows past 100 lines, the design is wrong and the file is doing two things.
Prerequisites
python3 --version # 3.12 or newer
az --version # 2.65 or newer
gh --version # 2.50 or newer
You also need:
- A subscription where you can create resource groups and Azure OpenAI deployments
- A model deployment in your region. gpt-4o works; gpt-4o-mini works and is roughly five times cheaper (we'll use the cheaper one)
- A GitHub repository to install the bot on
az login
az account set --subscription "<your-subscription-id>"
gh auth login
A note on model choice. I started with gpt-4o because the per-PR cost on a tiny team was negligible. As the team grew, the bill rose linearly with PR count and switching to gpt-4o-mini cut it 80% with maybe a 5% drop in catch rate. The mini model is the right default; reach for gpt-4o only if you measure a specific class of finding it misses.
Step 1: Provision Azure OpenAI with Bicep
infra/main.bicep:
param location string = resourceGroup().location
param aoaiName string = 'aoai-bicepbot-${uniqueString(resourceGroup().id)}'
param modelName string = 'gpt-4o-mini'
param modelVersion string = '2024-07-18'
param deploymentName string = 'reviewer'
param tpmCap int = 30 // thousand tokens per minute

resource aoai 'Microsoft.CognitiveServices/accounts@2024-10-01' = {
  name: aoaiName
  location: location
  kind: 'OpenAI'
  sku: { name: 'S0' }
  properties: {
    customSubDomainName: aoaiName
    publicNetworkAccess: 'Enabled' // tighten in production
    networkAcls: { defaultAction: 'Allow' }
    disableLocalAuth: false // we'll use Entra ID later; key auth is fine for dev
  }
}

resource deployment 'Microsoft.CognitiveServices/accounts/deployments@2024-10-01' = {
  parent: aoai
  name: deploymentName
  sku: { name: 'GlobalStandard', capacity: tpmCap }
  properties: {
    model: { format: 'OpenAI', name: modelName, version: modelVersion }
    raiPolicyName: 'Microsoft.DefaultV2'
    versionUpgradeOption: 'OnceCurrentVersionExpired'
  }
}

output endpoint string = aoai.properties.endpoint
output deploymentName string = deployment.name
output aoaiId string = aoai.id
A few small but consequential bits in this template.
tpmCap: 30 is your hard ceiling: 30 thousand tokens per minute. A misbehaving prompt or an attempt to review an enormous PR cannot consume more than this. The default capacity for new deployments has been generous in 2025, but generous is the wrong default for a bot; you want a measured cap that matches actual traffic plus a margin.
publicNetworkAccess: 'Enabled' is wrong for production. Production wants this disabled and the bot calling AOAI through a private endpoint. For this tutorial it stays enabled because the alternative is adding a vnet plus private endpoint plus DNS zone plus runner integration plus connectivity tests and we want a working bot in one sitting. The hardening checklist near the end calls this out as the first thing to fix on day two.
versionUpgradeOption: 'OnceCurrentVersionExpired' is the policy I recommend on every AOAI deployment. 'NoAutoUpgrade' pins you to a model version until it is retired, at which point the deployment stops serving requests; 'OnceNewDefaultVersionAvailable', the service default, upgrades as soon as a new default version ships and can surprise you mid-week. The middle option is the only one that's both predictable and not a slow security incident.
Deploy:
RG=rg-bicepbot
az group create -n $RG -l eastus
az deployment group create \
  -g $RG \
  --template-file infra/main.bicep \
  --parameters infra/main.bicepparam \
  --query 'properties.outputs'
infra/main.bicepparam:
using 'main.bicep'
param location = 'eastus'
param tpmCap = 30
Note the endpoint and deploymentName values from the output; you'll use them shortly.
Add a budget so a runaway can never bill you more than $50/month:
SUB=$(az account show --query id -o tsv)
az consumption budget create \
--amount 50 --time-grain Monthly \
--start-date 2026-05-01 --end-date 2027-05-01 \
--budget-name bicepbot-cap \
--resource-group $RG \
--notifications "[{
\"enabled\": true, \"threshold\": 80, \"contactEmails\": [\"you@example.com\"], \"operator\": \"GreaterThan\"
}]"
The budget is not a hard limit. Azure budgets alert; they don't enforce. Don't ship a bot that depends on a budget for cost safety; depend on the tpmCap. The budget's role is to tell you the bot is misbehaving, not to stop it.
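To make the "depend on the tpmCap" point concrete, here is a back-of-envelope sketch of the ceiling a TPM cap implies. It treats the cap as an exact throughput limit, which is a simplification, and the per-million-token prices are placeholder assumptions of mine, not figures from the deployment above. Swap in whatever your price sheet says.

```python
"""Back-of-envelope ceiling on monthly spend implied by a TPM cap."""

# ASSUMPTION: illustrative per-million-token prices; check current pricing.
PRICE_PER_M_INPUT_USD = 0.15
PRICE_PER_M_OUTPUT_USD = 0.60


def monthly_token_ceiling(tpm_cap_thousands: int, days: int = 30) -> int:
    """Tokens the deployment could process if saturated every minute."""
    minutes = days * 24 * 60
    return tpm_cap_thousands * 1_000 * minutes


def monthly_cost_ceiling(tpm_cap_thousands: int, price_per_m_usd: float,
                         days: int = 30) -> float:
    """Upper bound on spend if every token were billed at one price."""
    tokens = monthly_token_ceiling(tpm_cap_thousands, days)
    return tokens / 1_000_000 * price_per_m_usd


if __name__ == "__main__":
    cap = 30  # matches tpmCap in main.bicepparam
    print(f"token ceiling per month: {monthly_token_ceiling(cap):,}")
    print(f"all billed as input:  ${monthly_cost_ceiling(cap, PRICE_PER_M_INPUT_USD):,.2f}")
    print(f"all billed as output: ${monthly_cost_ceiling(cap, PRICE_PER_M_OUTPUT_USD):,.2f}")
```

The ceiling is a bound on the worst case, not a forecast. A bot reviewing a few dozen PRs a day sits far below it, and the budget alert is what tells you early that traffic has drifted towards the bound.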
Step 2: Set up OIDC federated identity
infra/federated-identity.bicep:
extension microsoftGraphV1

param appDisplayName string = 'github-bicepbot'
param repoSubject string = 'repo:dammyboss/bicep-pr-bot'

resource app 'Microsoft.Graph/applications@v1.0' = {
  uniqueName: appDisplayName
  displayName: appDisplayName
}

resource sp 'Microsoft.Graph/servicePrincipals@v1.0' = { appId: app.appId }

resource fedPr 'Microsoft.Graph/applications/federatedIdentityCredentials@v1.0' = {
  parent: app
  name: 'github-pr'
  properties: {
    issuer: 'https://token.actions.githubusercontent.com'
    subject: '${repoSubject}:pull_request'
    audiences: ['api://AzureADTokenExchange']
  }
}

output appId string = app.appId
output principalId string = sp.id
For convenience, the same provisioning via CLI:
APP_ID=$(az ad app create --display-name "github-bicepbot" --query appId -o tsv)
az ad sp create --id "$APP_ID"
az ad app federated-credential create --id "$APP_ID" --parameters '{
  "name": "github-pr",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:dammyboss/bicep-pr-bot:pull_request",
  "audiences": ["api://AzureADTokenExchange"]
}'
# Cognitive Services User on the AOAI account
SP_OBJ=$(az ad sp show --id "$APP_ID" --query id -o tsv)
AOAI_ID=$(az cognitiveservices account show -g $RG -n "<aoaiName from output>" --query id -o tsv)
az role assignment create \
  --assignee-object-id "$SP_OBJ" --assignee-principal-type ServicePrincipal \
  --role "Cognitive Services User" --scope "$AOAI_ID"
Two things worth pausing on.
The federated credential subject is repo:dammyboss/bicep-pr-bot:pull_request, scoped to the pull_request event. That subject does not match a token issued by a push event or a workflow_dispatch, which means even if someone changed the workflow file to run on push, the OIDC exchange would fail. The credential is tied to the type of event you intended. This is the security model and it's worth respecting; don't loosen the subject for convenience.
The role assignment is Cognitive Services User, scoped to the AOAI account, not subscription. The bot can call inference endpoints on this one resource and nothing else. If you reuse this app for a different bot, give it a different federated credential and a different RBAC scope. Apps are cheap; sprawl is a security finding.
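The subject-matching rule is easier to reason about when written down. The sketch below builds the default subject GitHub issues for a few common trigger contexts and compares it, by exact string match, with the credential subject. The function names are mine and purely illustrative, and it only covers GitHub's default subject formats; a repository that customises its subject claim template will produce something different.

```python
"""Default GitHub Actions OIDC subjects for a few trigger contexts."""
from __future__ import annotations

CREDENTIAL_SUBJECT = "repo:dammyboss/bicep-pr-bot:pull_request"


def default_subject(repo: str, *, event: str, ref: str | None = None,
                    environment: str | None = None) -> str:
    """Return the default `sub` claim for a job in this context."""
    if environment:
        # A job that references an environment gets an environment subject.
        return f"repo:{repo}:environment:{environment}"
    if event == "pull_request":
        return f"repo:{repo}:pull_request"
    if ref is None:
        raise ValueError("ref is required for non-pull_request events")
    # push, workflow_dispatch and similar events are keyed on the git ref.
    return f"repo:{repo}:ref:{ref}"


def credential_accepts(credential_subject: str, token_subject: str) -> bool:
    """Federated credentials match the subject exactly, with no wildcards."""
    return credential_subject == token_subject


if __name__ == "__main__":
    repo = "dammyboss/bicep-pr-bot"
    for event, ref in [("pull_request", None), ("push", "refs/heads/main")]:
        sub = default_subject(repo, event=event, ref=ref)
        print(f"{event:14} {sub:50} accepted={credential_accepts(CREDENTIAL_SUBJECT, sub)}")
```

Running it shows the push-triggered subject failing the match, which is the property the credential relies on.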
Set GitHub repo variables:
gh variable set AZURE_CLIENT_ID --body "$APP_ID"
gh variable set AZURE_TENANT_ID --body "$(az account show --query tenantId -o tsv)"
gh variable set AZURE_SUBSCRIPTION_ID --body "$(az account show --query id -o tsv)"
gh variable set AOAI_ENDPOINT --body "https://<aoaiName>.openai.azure.com"
gh variable set AOAI_DEPLOYMENT --body "reviewer"
vars, not secrets. Nothing here is confidential. Engineers reading the workflow can see the IDs and that's fine; the security comes from the federated trust relationship, not from hiding identifiers.
Step 3: Project setup
mkdir bicep-pr-bot && cd bicep-pr-bot
git init -b main
mkdir -p reviewer tests/fixtures infra .github/workflows
pyproject.toml:
[project]
name = "bicep-pr-reviewer"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
"openai>=1.50",
"azure-identity>=1.18",
"httpx>=0.27",
]
[project.optional-dependencies]
dev = ["pytest>=8.3", "pytest-asyncio>=0.24"]
[tool.pytest.ini_options]
testpaths = ["tests"]
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
The dependency list is short on purpose. openai is the official SDK. azure-identity is the standard Entra ID auth chain. httpx is a more honest HTTP client than requests (proper async, proper timeouts, a sensible default verify). No PyGithub, no LangChain, no semantic-kernel. The bot's contract is "read a diff, ask AOAI, post a comment". Anything beyond that is a refactor away from the contract and should be argued for, not adopted.
Step 4: Capture the diff
reviewer/diff.py:
"""Capture and filter a PR diff."""
from __future__ import annotations

import subprocess
from pathlib import Path

INTERESTING_SUFFIXES = {".bicep", ".bicepparam"}
MAX_DIFF_BYTES = 60_000  # ~12K tokens, safe for gpt-4o-mini


def capture_diff(base_ref: str, head_ref: str = "HEAD") -> str:
    """Return a unified diff filtered to Bicep files only."""
    subprocess.run(["git", "fetch", "--no-tags", "origin", base_ref], check=True)
    proc = subprocess.run(
        ["git", "diff", "--unified=3", f"origin/{base_ref}...{head_ref}"],
        check=True, capture_output=True, text=True,
    )
    diff = proc.stdout
    return filter_to_bicep(diff)


def filter_to_bicep(unified_diff: str) -> str:
    """Strip non-Bicep file sections from a unified diff."""
    out: list[str] = []
    keep = False
    for line in unified_diff.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # A new file section starts here; decide whether to keep it.
            parts = line.split()
            path = parts[2][2:] if parts[2].startswith("a/") else parts[2]
            keep = Path(path).suffix in INTERESTING_SUFFIXES
        if keep:
            out.append(line)
    result = "".join(out)
    if len(result.encode()) > MAX_DIFF_BYTES:
        return result.encode()[:MAX_DIFF_BYTES].decode(errors="ignore") + "\n... [truncated]"
    return result
The 60KB cap is the practical version of "don't blow up on enormous PRs". 60KB is roughly 12K tokens of Bicep, which leaves room for the system prompt and the model's response inside gpt-4o-mini's context. Bigger PRs get truncated with an explicit marker, which the model can then mention in the comment so reviewers know not to trust the bot's "all clear" signal on a half-reviewed diff.
The Bicep-only filter does double duty. It saves tokens, and it keeps the bot's findings on-topic. A diff that includes a 4KB README change alongside a Bicep edit produces, occasionally, a finding about the README. Filtering up front prevents that drift.
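One variation worth considering on the truncation step: the byte-offset cut in diff.py can leave a half-finished final line in front of the model. The sketch below is an alternative that keeps only whole lines inside the byte budget. It is not what the bot above ships with, and truncate_on_line_boundary is a name I've made up for illustration.

```python
"""Truncate a diff to a byte budget without cutting a line in half."""

TRUNCATION_MARKER = "... [truncated]\n"


def truncate_on_line_boundary(diff: str, max_bytes: int) -> str:
    """Keep whole lines until the UTF-8 byte budget would be exceeded."""
    if len(diff.encode()) <= max_bytes:
        return diff  # fits as-is, nothing to mark
    kept: list[str] = []
    used = 0
    for line in diff.splitlines(keepends=True):
        size = len(line.encode())
        if used + size > max_bytes:
            break  # this line would overflow the budget; stop before it
        kept.append(line)
        used += size
    return "".join(kept) + TRUNCATION_MARKER


if __name__ == "__main__":
    sample = "+ name: 'a'\n+ name: 'b'\n+ name: 'c'\n"
    print(truncate_on_line_boundary(sample, max_bytes=25), end="")
```

The marker stays at the end either way, so the prompt-side behaviour (the model mentioning the truncation) is unchanged.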
Step 5: The prompt
reviewer/prompt.py:
"""The system + user prompts. The system prompt is the actual product."""
SYSTEM = """\
You are a Bicep code reviewer for a platform engineering team.
REVIEW THIS PULL REQUEST DIFF for the following SPECIFIC issues. Do NOT include
generic advice or style commentary.
CATEGORIES:
1. Hard-coded secrets: any password, key, connection string, SAS token in clear text.
2. Public ingress on private resources: storageAccount.publicNetworkAccess = "Enabled" on
workloads that should be private; Cognitive Services with publicNetworkAccess Enabled.
3. Missing diagnostic settings: top-level resources without a corresponding diagnosticSettings
resource (or call to the diagnostic-baseline module).
4. Loose RBAC: role assignments at subscription scope when resource-group scope would suffice.
Role definition IDs of "Owner" or "Contributor" at subscription scope without justification.
5. Resource naming: resources without explicit names that depend on uniqueString(), these are
surprising on redeploy.
6. Missing tags: top-level resources missing the org-required tags (env, owner, costCenter).
OUTPUT FORMAT:
- If you find no issues: respond with EXACTLY the string `NO_ISSUES_FOUND`.
- Otherwise: respond with markdown bullet list, each bullet starting with the category number,
the file:line reference (use the @@ hunk headers in the diff), and a single-sentence finding.
DO NOT comment on style, naming preferences, or "could be improved" suggestions.
DO NOT speculate about behaviour you can't see in the diff.
"""
USER_TEMPLATE = """\
PR title: {title}
PR description: {description}
DIFF:
diff
{diff}
"""
def build_messages(title: str, description: str, diff: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": USER_TEMPLATE.format(
            title=title,
            description=(description or "(no description)").strip(),
            diff=diff,
        )},
    ]
The system prompt is the actual product, and it earns its lines through hard-won restraint.
The first version told the model to "review this Bicep code for issues". It produced exactly the kind of useless feedback you'd expect: "consider adding error handling to your role assignment", "this variable name could be clearer". Worthless output that buried the few real findings. Six revisions later, the prompt looks like the one above and the productive change at every step was removing permission rather than adding it. The model wants to be helpful; the prompt's job is to clamp the helpfulness to the categories you measured.
The NO_ISSUES_FOUND sentinel matters more than its size suggests. Without it, the model invents issues to fill the response space. With it, the comment-posting step can short-circuit when there's nothing to say, and reviewers stop seeing "everything looks fine" comments that train them to ignore the bot.
The DO NOT lines aren't redundant. Without them, the model regularly slips in style observations, even after the categories are listed. Negative instructions plus positive instructions outperform either alone. The prompt is small enough that the redundancy is cheap.
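Because the sentinel and the numbered categories are the only structure in the model's reply, it can help to have a small parser for them. What follows is a hedged sketch, not part of the bot: is_clean and reported_categories are hypothetical helpers, and the bullet shape they accept ("- 1. file:line ..." or "- **2.** ...") is my assumption about what the prompt's format instruction tends to produce.

```python
"""Tolerant parsing of the reviewer's reply (illustrative helpers)."""
from __future__ import annotations

import re

SENTINEL = "NO_ISSUES_FOUND"
# A bullet ("-" or "*"), optional bold markers, then a single category digit.
_BULLET = re.compile(r"^\s*[-*]\s*\**(\d)\b")


def is_clean(response: str) -> bool:
    """True if the reply is the sentinel, ignoring whitespace, backticks, or a final period."""
    return response.strip().strip("`").strip().rstrip(".") == SENTINEL


def reported_categories(response: str) -> set[int]:
    """Collect the category numbers (1 to 6) that lead each bullet."""
    found: set[int] = set()
    for line in response.splitlines():
        match = _BULLET.match(line)
        if match and 1 <= int(match.group(1)) <= 6:
            found.add(int(match.group(1)))
    return found


if __name__ == "__main__":
    reply = "- 1. main.bicep:13 hard-coded key\n- **2.** main.bicep:12 public access"
    print(is_clean(reply), sorted(reported_categories(reply)))
```

The category set is also a convenient thing to assert on when replaying old PRs against a changed prompt, since it is far more stable than the wording of the findings.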
Step 6: The Azure OpenAI client
reviewer/openai_client.py:
"""Wrapper around openai SDK with Entra ID auth and a budget guard."""
from __future__ import annotations

import os

from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

ENDPOINT = os.environ["AOAI_ENDPOINT"]
DEPLOYMENT = os.environ["AOAI_DEPLOYMENT"]
API_VERSION = "2024-10-21"
MAX_OUTPUT_TOKENS = 800
TEMPERATURE = 0.0

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

_client = AzureOpenAI(
    azure_endpoint=ENDPOINT,
    api_version=API_VERSION,
    azure_ad_token_provider=token_provider,
)


def review(messages: list[dict]) -> str:
    completion = _client.chat.completions.create(
        model=DEPLOYMENT,
        messages=messages,
        temperature=TEMPERATURE,
        max_tokens=MAX_OUTPUT_TOKENS,
    )
    return completion.choices[0].message.content or ""
Two guard rails baked in.
temperature=0.0 makes the output deterministic-ish. Same diff, same output, mostly. AOAI doesn't guarantee determinism even at zero, but in practice the variance drops far enough that PR review feels stable instead of flapping. Without this, two runs of the same workflow would occasionally produce different findings, which destroys reviewer trust faster than any false positive does.
max_tokens=800 caps the output cost at roughly $0.0005 per call on gpt-4o-mini, assuming a list price of about $0.60 per million output tokens. A misbehaving prompt that loops the model into running on for paragraphs cannot blow the budget. Combined with the tpmCap from earlier, the bot has both a per-call ceiling and a per-minute ceiling. Belt and braces.
Step 7: Post the comment back to the PR
reviewer/github_client.py:
"""Post a single PR-level comment via the GitHub REST API."""
from __future__ import annotations

import os

import httpx

GH_TOKEN = os.environ["GITHUB_TOKEN"]
GH_REPO = os.environ["GITHUB_REPOSITORY"]
PR_NUMBER = int(os.environ["GITHUB_PR_NUMBER"])
API = "https://api.github.com"
HEADERS = {
    "Authorization": f"Bearer {GH_TOKEN}",
    "Accept": "application/vnd.github+json",
    "X-GitHub-Api-Version": "2022-11-28",
}


def post_or_update(body: str) -> None:
    """Idempotent: edit the existing bot comment if found, else create."""
    marker = "<!-- bicep-pr-reviewer -->"
    body_with_marker = f"{marker}\n{body}"
    with httpx.Client(timeout=30.0, headers=HEADERS) as client:
        # The API returns 30 comments per page by default; ask for the maximum
        # so the marker comment is less likely to be missed on a busy PR.
        resp = client.get(
            f"{API}/repos/{GH_REPO}/issues/{PR_NUMBER}/comments",
            params={"per_page": 100},
        )
        resp.raise_for_status()
        for c in resp.json():
            if marker in (c.get("body") or ""):
                client.patch(
                    f"{API}/repos/{GH_REPO}/issues/comments/{c['id']}",
                    json={"body": body_with_marker},
                ).raise_for_status()
                return
        client.post(
            f"{API}/repos/{GH_REPO}/issues/{PR_NUMBER}/comments",
            json={"body": body_with_marker},
        ).raise_for_status()
The HTML marker comment is what makes this idempotent. Every workflow run searches for the previous bot comment and edits it instead of stacking. A reviewer looking at a 30-commit PR sees one bot comment that reflects the current state, not 30 bot comments scrolling past every actually useful conversation.
This is the single most-overlooked detail in PR-bot tutorials, and it's the thing that determines whether reviewers stop muting your bot's notifications. Stacking comments trains people to filter the bot out; idempotent comments train them to read it.
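The update-or-create decision can also be pulled out as a pure function, which makes the idempotency testable without touching the network. This is a sketch of that idea rather than a change to github_client.py; plan_upsert is a hypothetical name and the comment dictionaries only mimic the two fields of the REST payload that matter here.

```python
"""Decide whether to update or create the bot comment (no network)."""
from __future__ import annotations

from typing import Iterable, Optional

MARKER = "<!-- bicep-pr-reviewer -->"


def find_bot_comment_id(comments: Iterable[dict], marker: str = MARKER) -> Optional[int]:
    """Return the id of the first comment carrying the marker, if any."""
    for comment in comments:
        if marker in (comment.get("body") or ""):
            return comment["id"]
    return None


def plan_upsert(comments: Iterable[dict], body: str,
                marker: str = MARKER) -> tuple[str, Optional[int], str]:
    """Return (action, comment_id, body_with_marker) for the caller to execute."""
    body_with_marker = f"{marker}\n{body}"
    existing_id = find_bot_comment_id(comments, marker)
    if existing_id is not None:
        return ("update", existing_id, body_with_marker)
    return ("create", None, body_with_marker)


if __name__ == "__main__":
    thread = [
        {"id": 1, "body": "LGTM"},
        {"id": 2, "body": f"{MARKER}\nold findings"},
    ]
    print(plan_upsert(thread, "new findings")[:2])
```

The caller then maps "update" to a PATCH and "create" to a POST, exactly as the module above does inline.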
Step 8: Wire it together
reviewer/__main__.py:
"""Entry point. Run from the workflow."""
from __future__ import annotations

import os
import sys

from .diff import capture_diff
from .prompt import build_messages
from .openai_client import review
from .github_client import post_or_update

NO_ISSUES_SENTINEL = "NO_ISSUES_FOUND"


def main() -> int:
    base_ref = os.environ["GITHUB_BASE_REF"]
    title = os.environ.get("PR_TITLE", "")
    description = os.environ.get("PR_BODY", "")

    diff = capture_diff(base_ref).strip()
    if not diff:
        print("no Bicep changes; skipping")
        return 0

    messages = build_messages(title=title, description=description, diff=diff)
    response = review(messages).strip()
    if response == NO_ISSUES_SENTINEL or not response:
        print("clean review; skipping comment")
        return 0

    print(f"posting review to PR ({len(response)} chars)")
    post_or_update(format_comment(response))
    return 0


def format_comment(review_body: str) -> str:
    return (
        "### Bicep review\n\n"
        f"{review_body}\n\n"
        "---\n"
        "*Generated by `bicep-pr-reviewer`. Add `[skip review]` to the PR title or "
        "include the label `skip-bicep-review` to silence.*"
    )


if __name__ == "__main__":
    sys.exit(main())
The footer naming the escape hatch is non-obvious but important. Engineers will hit a case where the bot is wrong and they want to merge anyway. Without an escape hatch, they file a ticket against the bot, the bot owner has to respond, the PR sits while the conversation happens. With an escape hatch named in the bot's own comment, the engineer adds [skip review] to the title and ships. The bot has now demonstrated humility and the team's velocity isn't held hostage to the bot's correctness.
Step 9: The workflow
.github/workflows/bicep-review.yml:
name: bicep-review

on:
  pull_request:
    paths: ['**/*.bicep', '**/*.bicepparam']

permissions:
  id-token: write
  contents: read
  pull-requests: write  # allow posting comments

concurrency:
  group: bicep-review-${{ github.event.pull_request.number }}
  cancel-in-progress: true

jobs:
  review:
    if: |
      !contains(github.event.pull_request.title, '[skip review]') &&
      !contains(github.event.pull_request.labels.*.name, 'skip-bicep-review')
    runs-on: ubuntu-latest
    timeout-minutes: 8
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # need history for git diff against base ref
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -e .
      - uses: azure/login@v2
        with:
          client-id: ${{ vars.AZURE_CLIENT_ID }}
          tenant-id: ${{ vars.AZURE_TENANT_ID }}
          subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}
      - name: Run reviewer
        env:
          AOAI_ENDPOINT: ${{ vars.AOAI_ENDPOINT }}
          AOAI_DEPLOYMENT: ${{ vars.AOAI_DEPLOYMENT }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_PR_NUMBER: ${{ github.event.pull_request.number }}
          PR_TITLE: ${{ github.event.pull_request.title }}
          PR_BODY: ${{ github.event.pull_request.body }}
        run: python -m reviewer
fetch-depth: 0 is the line that makes git diff origin/main...HEAD actually return content. Without it, the runner has a shallow clone, the diff is empty, the bot says "no Bicep changes; skipping", and you spend an hour wondering why the bot won't run. Most CI tutorials don't mention this because most CI tutorials don't git diff. The bot does, and it needs the history.
The concurrency block with cancel-in-progress: true is small but earns its keep. A PR that receives 30 pushes in quick succession would otherwise spawn 30 parallel review runs, all racing to post the same comment. Cancellation means only the latest push gets reviewed, which matches what reviewers want to see anyway.
Step 10: Reproduce on a sample PR
Push the bot to your repo:
git add . && git commit -m "initial reviewer" && git push -u origin main
Open a test PR with a deliberately bad Bicep file:
git checkout -b test-reviewer
cat > test-bad.bicep <<'EOF'
resource sa 'Microsoft.Storage/storageAccounts@2024-01-01' = {
  name: 'badexample123'
  location: 'eastus'
  sku: { name: 'Standard_LRS' }
  kind: 'StorageV2'
  properties: {
    publicNetworkAccess: 'Enabled'
    accessKey: 'AKAI...REDACTED...HARDCODED-FOR-TEST' // intentional bad
  }
}
EOF
git add test-bad.bicep
git commit -m "test: bad bicep for reviewer"
git push -u origin test-reviewer
gh pr create --title "Test reviewer" --body "ignore, testing the bot"
Within ~90 seconds the PR will have a comment that flags both the public network access and the hard-coded access key. Push a fix and the workflow will overwrite the comment, because of the marker.
Verify the categories triggered:
gh pr view --json comments | jq '.comments[].body' | grep -E '1\.|2\.'
A small ritual that helps adoption: run this same test on every prompt change you make to the bot. Treat the system prompt like production code, with a regression suite of "PRs the bot used to flag correctly". I have nine of these stashed away; the prompt has changed five times and the regression set has caught two prompt regressions early.
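One way to turn that ritual into something runnable is a small replay harness. The sketch below is an assumption about how such a regression set could be laid out, not a description of mine: each case is a NAME.diff file next to a NAME.expected.json file holding the category numbers the bot used to flag, and review_fn is whatever callable turns a diff into the set of categories the current prompt reports.

```python
"""Replay stored diffs against the current prompt and report missed categories."""
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable

Case = tuple[str, str, set[int]]  # (name, diff text, expected categories)


def load_cases(root: Path) -> list[Case]:
    """Load every <name>.diff with its <name>.expected.json from a directory."""
    cases: list[Case] = []
    for diff_path in sorted(root.glob("*.diff")):
        expected_path = diff_path.with_suffix(".expected.json")
        expected = set(json.loads(expected_path.read_text()))
        cases.append((diff_path.stem, diff_path.read_text(), expected))
    return cases


def replay(cases: list[Case], review_fn: Callable[[str], set[int]]) -> dict[str, set[int]]:
    """Return {case name: categories that were expected but not reported}."""
    missed: dict[str, set[int]] = {}
    for name, diff, expected in cases:
        gap = expected - review_fn(diff)
        if gap:
            missed[name] = gap
    return missed
```

Only missing categories count as a regression here; extra findings are tolerated, since a stricter prompt is usually the thing being tested. An empty result from replay means the prompt change kept every known catch.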
Step 11: Test the prompt offline
tests/fixtures/bicep_diff_sample.txt:
diff --git a/main.bicep b/main.bicep
index 0000..1111 100644
--- a/main.bicep
+++ b/main.bicep
@@ -10,6 +10,7 @@ resource sa 'Microsoft.Storage/storageAccounts@2024-01-01' = {
   sku: { name: 'Standard_LRS' }
   properties: {
     publicNetworkAccess: 'Enabled'
+    accessKey: 'AKAI-HARDCODED'
   }
 }
tests/test_prompt.py:
from pathlib import Path

from reviewer.prompt import build_messages, SYSTEM


def test_messages_include_diff():
    diff = Path("tests/fixtures/bicep_diff_sample.txt").read_text()
    msgs = build_messages(title="t", description="d", diff=diff)
    assert msgs[0]["role"] == "system"
    assert "NO_ISSUES_FOUND" in msgs[0]["content"]
    assert diff in msgs[1]["content"]


def test_system_prompt_lists_six_categories():
    # Each category must start its own line, numbered 1 through 6.
    for n in range(1, 7):
        assert f"\n{n}." in SYSTEM
pytest -q
These tests are deliberately dumb. They check the prompt is structurally what the categories list expects, and that a sample diff makes it through templating. They don't assert behaviour because behaviour is non-deterministic and asserting on real model output makes the suite flaky. The behavioural test happens via the PR-replay pattern from the previous step; the unit tests just keep the prompt's shape from rotting silently.
Hardening checklist for production
Before pointing this at a real repository, the day-two work that the tutorial deliberately deferred:
- Disable public network access on the AOAI account. The tutorial leaves it on; production should turn it off and reach AOAI via a private endpoint from a dedicated GitHub Actions runner pool that lives inside your VNet. This is the biggest gap between the tutorial and a production setup.
- Tighten the federated credential subject. The example uses repo:owner/repo:pull_request, which lets every PR mint the token. If your bot only needs to run against PRs targeting main, narrow the subject to that base branch. GitHub's default pull_request subject has no branch component, so this means customising the repository's OIDC subject claim template to include base_ref and matching the resulting subject in the credential. Smaller surface, smaller blast radius if a malicious PR ever ships.
- Pin the AOAI deployment version. versionUpgradeOption: 'OnceCurrentVersionExpired' is in the Bicep already; verify it's still there. A model swap is the bug that lands in production with no PR attached.
- Add an exclusion path. If you have generated Bicep files (e.g. compiled from CDKTF, Terraform-imported, or auto-generated by a code-gen tool), they shouldn't trigger the reviewer. The workflow already uses paths, and GitHub Actions does not allow paths and paths-ignore on the same event, so add a negated pattern such as '!**/generated/**' to the end of the existing paths list.
- Cap concurrency. The concurrency block in the workflow stops a rapid series of pushes from each spawning its own review. Already set above.
- Watch the model choice. gpt-4o-mini is the default here for cost. For high-stakes Bicep changes (production deployment templates, anything in the platform monorepo), swap to gpt-4o. Worth A/B testing both against your historical findings before committing.
Numbers, after six months
These are the figures from the team I shipped this for. Sample of one organisation, not a benchmark:
- Review latency: 10 to 20 seconds, most of which is GitHub Actions runner spin-up. The AOAI call itself is 1 to 3 seconds on gpt-4o-mini.
- Cost: ~$15 to $20 a month at 800 max output tokens. Roughly five times that on gpt-4o.
- Catch rate: the bot flagged real issues on about 22% of Bicep PRs. Two-thirds of those flags were actionable; one-third were false positives, mostly "loose RBAC" complaints that had a justification the bot couldn't see.
- Adoption: the [skip review] escape hatch was used on roughly 4% of PRs. Most of those were either the bot being wrong or a legitimate "this PR is broken on purpose" exploratory branch. We never had a "bot is permanently muted by team X" complaint.
The catch rate isn't enormous, and that's fine. The bot's job isn't to find every issue; it's to consistently find the recognisable ones so the human reviewer doesn't have to. A 22% hit rate that's predictable and idempotent beats a 60% hit rate that flaps and gets ignored.
Troubleshooting
AADSTS70021: No matching federated identity record is the error you'll see most often during initial setup. The subject claim on the OIDC token doesn't match the federated credential. Dump the token claims and compare them to what the federated credential expects.
401 Unauthorized from AOAI usually means the SP doesn't have Cognitive Services User on the AOAI account. Re-run the role assignment from Step 2.
fatal: bad object origin/main is fetch-depth: 0 missing from the workflow checkout. Add it.
No Bicep diff captured when there clearly are Bicep changes typically means your PR target branch isn't main, it's something like develop. The script uses GITHUB_BASE_REF so this should work, but confirm in the workflow log that the right base ref is being passed.
Bot leaves a comment that just says NO_ISSUES_FOUND means the sentinel didn't get filtered out. Confirm __main__.py is comparing the raw response (stripped) against the constant, not against f"{NO_ISSUES_SENTINEL}\n" or similar.
Cost spike rarely comes from output tokens, since max_tokens=800 caps those tightly. Check the prompt-token count in the AOAI metrics; if it's high, your diffs are huge and the truncation cap should be lowered. At 30 PRs a day with tpmCap=30K, the budget should round to single-digit dollars per month. If it's higher, something else is calling the deployment.
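For the AADSTS70021 case, "dump the token claims" can be as small as the sketch below. It decodes the payload of a JWT without verifying the signature, which is fine for eyeballing the sub claim during setup and nothing else. Getting the raw token out of the runner (the ACTIONS_ID_TOKEN_REQUEST_URL and ACTIONS_ID_TOKEN_REQUEST_TOKEN environment variables) is left out, and explain_mismatch is a hypothetical helper of mine.

```python
"""Decode an OIDC token's claims for debugging. Does NOT verify the signature."""
from __future__ import annotations

import base64
import json


def decode_claims(jwt: str) -> dict:
    """Return the payload segment of a three-part JWT as a dict."""
    try:
        _header, payload, _signature = jwt.split(".")
    except ValueError as exc:
        raise ValueError("expected a three-segment JWT") from exc
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def explain_mismatch(claims: dict, expected_subject: str) -> str | None:
    """Return a human-readable diff of subjects, or None if they match exactly."""
    actual = claims.get("sub")
    if actual == expected_subject:
        return None
    return f"token sub is {actual!r} but the credential expects {expected_subject!r}"


if __name__ == "__main__":
    import sys

    claims = decode_claims(sys.stdin.read().strip())
    print(json.dumps({k: claims.get(k) for k in ("iss", "aud", "sub")}, indent=2))
    print(explain_mismatch(claims, "repo:dammyboss/bicep-pr-bot:pull_request") or "subject matches")
```

Pipe the token into it on stdin. The three claims printed (iss, aud, sub) are the ones the federated credential compares, and in my experience the mismatch is almost always in sub.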
What this gives you, and the part that's hard to measure
You have a bot that catches the boring class of Bicep issues consistently. Hardcoded secrets that would have made it into a commit, public network access that nobody noticed, role assignments at the wrong scope, missing diagnostics. None of these are individually expensive when caught at PR time. All of them are very expensive when caught later, because "caught later" means "by Defender for Cloud after the resource is in production" or worse.
The harder thing to measure, and the thing that makes the bot worth shipping, is what reviewers do with the time they get back. In the months after this bot landed for the team I built it on, the architecture comments on Bicep PRs got noticeably sharper. Senior reviewers stopped writing "did you wire diagnostic settings to the platform LAW" and started writing "is this resource in the right landing zone, given the data classification". The bot handled the predictable; humans were freed to handle the situated.
That shift is the actual product. A bot that catches publicNetworkAccess: Enabled is, on its face, a toy. A bot that lets your senior engineers think about architecture instead of typo-hunting is a quietly significant change to how a platform team operates.
Two hundred and fifty lines of code, one OIDC federation, one AOAI deployment, twenty dollars a month. The bill on the not-doing-this side is much higher than that, and you only see it in the form of repeat-finding incidents and senior engineers writing the same comment for the eighth time.
