The first time we tried this, the bot left a comment on every PR that just said "Looks good!" — including on a PR that introduced a hard-coded SAS token. That single failure forced us to rebuild the entire setup from scratch, and the version we're running now catches things our human reviewers miss.
This is the architecture, the prompt, and the three things that broke before we got it stable.
The setup we landed on
The pipeline is dead simple in shape: every PR triggers an Azure DevOps pipeline that diffs the changes, hands the diff to an Azure OpenAI deployment, and posts the response back as a PR comment via the Azure DevOps REST API.
trigger: none

pr:
  branches:
    include: ['*']

pool:
  vmImage: ubuntu-latest

steps:
  - bash: |
      # TargetBranch arrives as refs/heads/<name>; strip the prefix
      # so the origin/<name> ref below actually resolves.
      TARGET=$(echo "$(System.PullRequest.TargetBranch)" | sed 's|^refs/heads/||')
      git fetch origin "$TARGET"
      git diff --unified=3 "origin/$TARGET"...HEAD > /tmp/pr.diff
    displayName: Capture diff

  - bash: |
      python scripts/review.py /tmp/pr.diff
    env:
      AZURE_OPENAI_ENDPOINT: $(AOAI_ENDPOINT)
      AZURE_OPENAI_KEY: $(AOAI_KEY)
      ADO_PAT: $(ADO_PAT)
      PR_ID: $(System.PullRequest.PullRequestId)
    displayName: AI review
The Python script does three things: builds the prompt, calls Azure OpenAI with gpt-4o, and POSTs the result to /_apis/git/repositories/{repo}/pullRequests/{pr}/threads.
Total: about 140 lines of Python. The interesting work is the prompt.
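The comment-posting step is the only nontrivial REST plumbing. Here's a minimal sketch of it, stdlib only — the org/project names, function names, and `api-version` are illustrative, not lifted from our actual script, which reads its configuration from the pipeline variables shown above:

```python
import base64
import json
import urllib.request


def build_thread_url(org: str, project: str, repo: str, pr_id: str) -> str:
    # Pull Request Threads endpoint of the Azure DevOps REST API.
    return (
        f"https://dev.azure.com/{org}/{project}/_apis/git/repositories/"
        f"{repo}/pullRequests/{pr_id}/threads?api-version=7.1"
    )


def post_review_comment(org, project, repo, pr_id, pat, text):
    # A new thread with one text comment (commentType 1), left "active"
    # so it shows up as unresolved feedback on the PR.
    payload = {
        "comments": [{"parentCommentId": 0, "content": text, "commentType": 1}],
        "status": "active",
    }
    # PATs use Basic auth with an empty username.
    auth = base64.b64encode(f":{pat}".encode()).decode()
    req = urllib.request.Request(
        build_thread_url(org, project, repo, pr_id),
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {auth}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The empty-username Basic-auth dance is the one part everyone gets wrong on the first try — the PAT goes in the password slot, not the username.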
The prompt that actually catches things
We started with a generic "review this code" prompt. It produced exactly the kind of useless feedback you'd expect — "consider adding error handling", "this variable name could be clearer". Worthless.
What changed it was making the prompt adversarial and codebase-specific:
You are reviewing a pull request for a Python service that runs on AKS,
talks to PostgreSQL, and handles PII.
Find SPECIFIC issues in these categories — do NOT include generic advice:
1. Secrets: any token, key, password, or connection string in plain text.
2. PII handling: any log line that could capture user email, name, or ID.
3. SQL: any string concatenation into a query (we use SQLAlchemy params).
4. Async: any blocking I/O call inside an async function.
5. Test coverage: any new public function without a corresponding test.
If you find none, respond with exactly: NO_ISSUES_FOUND.
Do NOT comment on style, naming, or "could be improved" — only the five
categories above.
Diff to review:
{diff}
The NO_ISSUES_FOUND sentinel matters more than it sounds. Without it, the model invents issues to fill space. With it, the comment-posting step skips clean PRs entirely, which keeps reviewer trust.
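The skip logic is a one-liner, but the `strip()` is load-bearing — the model sometimes appends a trailing newline to the sentinel, and an exact string match would post a useless comment on every clean PR:

```python
SENTINEL = "NO_ISSUES_FOUND"


def should_post(model_reply: str) -> bool:
    # Skip the comment-posting step entirely when the model returns the
    # clean-PR sentinel; strip() tolerates stray whitespace around it.
    return model_reply.strip() != SENTINEL
```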
The three things that broke
1. The diff was too big. Anything over ~12K tokens started producing hallucinations — referring to functions that didn't exist, claiming files were missing that were right there. We added a pre-filter that drops generated files (*.lock, dist/, node_modules/) and chunks remaining hunks per-file with a 4K-token-per-call ceiling. If a diff exceeds the budget, the bot leaves a single comment saying "PR too large for AI review — please request human review" rather than producing partial output.
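A sketch of that pre-filter, under two stated assumptions: the ~4-chars-per-token estimate is a crude heuristic (the real script would use a proper tokenizer), and the function and pattern names are mine, not the script's:

```python
import fnmatch

SKIP_PATTERNS = ["*.lock", "dist/*", "node_modules/*"]  # generated files
TOKEN_CEILING = 4000  # per-call budget from the post


def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for code-heavy diffs.
    return len(text) // 4


def split_diff_by_file(diff: str) -> dict:
    # Split a unified diff into per-file hunks, keyed by the new path.
    files, current = {}, None
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git"):
            current = line.split(" b/")[-1].strip()
            files[current] = ""
        if current is not None:
            files[current] += line
    return files


def plan_review(diff: str):
    """Per-file chunks to review, or None if any file busts the budget."""
    chunks = []
    for path, hunk in split_diff_by_file(diff).items():
        if any(fnmatch.fnmatch(path, p) for p in SKIP_PATTERNS):
            continue  # drop generated files before they eat the budget
        if estimate_tokens(hunk) > TOKEN_CEILING:
            return None  # too large: post the "request human review" comment
        chunks.append((path, hunk))
    return chunks
```

Returning `None` instead of a partial list is deliberate: partial output is exactly the failure mode the ceiling exists to prevent.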
2. The PAT was too privileged. We started with a Personal Access Token scoped to "Code (read & write)" because we'd seen that in tutorials. That token can also force-push and delete branches. We swapped to a pipeline-scoped service principal with only vso.code_status and vso.threads permissions. Took an hour. Worth it the moment your bot's credentials show up in someone's compromised laptop.
3. The model was non-deterministic in a way that mattered. Same diff, two runs, sometimes different findings. We set temperature=0.0 and added a one-line system instruction: "Be deterministic. Same input must produce same output." That's not a real guarantee from the API, but it noticeably reduced flapping.
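What that looks like in the request — building the parameters in one place so every call (and retry) uses identical settings. This is a sketch against the standard Azure OpenAI chat-completions REST endpoint; the deployment name and `api-version` here are illustrative:

```python
import json
import os
import urllib.request

DETERMINISM_HINT = "Be deterministic. Same input must produce same output."


def build_request(prompt: str) -> dict:
    # One source of truth for call parameters, so two runs over the same
    # diff never differ because of the request itself.
    return {
        "temperature": 0.0,  # greedy decoding: the biggest single lever
        "messages": [
            {"role": "system", "content": DETERMINISM_HINT},
            {"role": "user", "content": prompt},
        ],
    }


def call_azure_openai(prompt: str, deployment: str = "gpt-4o") -> str:
    url = (f"{os.environ['AZURE_OPENAI_ENDPOINT']}/openai/deployments/"
           f"{deployment}/chat/completions?api-version=2024-06-01")
    req = urllib.request.Request(
        url,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json",
                 "api-key": os.environ["AZURE_OPENAI_KEY"]},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```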
What it actually catches
Over six months of running this on a 40-engineer team:
- 14 hard-coded credentials before they hit main
- 9 print() statements logging full request bodies (which contained user emails)
- 23 cases of f"... {user_input} ..." in a SQL string
- ~60 missing tests on new public functions
Cost: about $40/month in Azure OpenAI tokens for ~120 PRs/week. We're using gpt-4o; when we trialed gpt-4o-mini, it cut cost by ~80% while the catch rate dropped by maybe 5%. Probably worth it for high-volume teams.
What I'd do differently
If I were starting today, I'd build this on top of the Azure DevOps MCP Server with a GitHub Copilot agent rather than rolling the REST plumbing myself. The wiring is cleaner and you get the diffing + comment-posting for free. The only reason we still run the pipeline version is that we already have it and it works.
I would NOT add a "rate limit by author" feature, even though every team asks for it. The point of the bot is unblinking consistency. Letting individual contributors mute it defeats the purpose.
