No. 01 · DevOps · Jul 5, 2025 · 8 min read

Plugging Azure OpenAI Into Azure Pipelines for PR Review: A Real-World Setup

The first time we tried this, the bot left a comment on every PR that just said "Looks good!" — including on a PR that introduced a hard-coded SAS token. That single failure forced us to rebuild the entire setup from scratch, and the version we're running now catches things our human reviewers miss.

This is the architecture, the prompt, and the three things that broke before we got it stable.

The setup we landed on

The pipeline is dead simple in shape: every PR triggers an Azure DevOps pipeline that diffs the changes, hands the diff to an Azure OpenAI deployment, and posts the response back as a PR comment via the Azure DevOps REST API.

trigger: none
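# Note: for Azure Repos Git the pr: section below is ignored; the per-PR trigger
# actually comes from a branch policy (build validation) on the target branches.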
pr:
  branches:
    include: ['*']

pool:
  vmImage: ubuntu-latest

steps:
  - bash: |
      # System.PullRequest.TargetBranch arrives as refs/heads/<name>; strip the
      # prefix so the origin/<name> ref resolves after the fetch.
      TARGET='$(System.PullRequest.TargetBranch)'
      TARGET="${TARGET#refs/heads/}"
      git fetch origin "$TARGET"
      git diff --unified=3 "origin/${TARGET}...HEAD" > /tmp/pr.diff
    displayName: Capture diff

  - bash: |
      python scripts/review.py /tmp/pr.diff
    env:
      AZURE_OPENAI_ENDPOINT: $(AOAI_ENDPOINT)
      AZURE_OPENAI_KEY: $(AOAI_KEY)
      ADO_PAT: $(ADO_PAT)
      PR_ID: $(System.PullRequest.PullRequestId)
    displayName: AI review

The Python script does three things: builds the prompt, calls Azure OpenAI with gpt-4o, and POSTs the result to /_apis/git/repositories/{repo}/pullRequests/{pr}/threads.
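A trimmed-down sketch of that shape (not the real 140-line script: it assumes the openai Python package with key-based auth, a deployment named gpt-4o, and extra ADO_ORG / ADO_PROJECT / ADO_REPO variables that the pipeline snippet above doesn't show; chunking and the clean-PR check described below are left out):

# scripts/review.py, in sketch form
import os
import sys

import requests
from openai import AzureOpenAI

PROMPT_TEMPLATE = open("scripts/prompt.txt").read()  # the prompt shown in the next section


def review(diff: str) -> str:
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_KEY"],
        api_version="2024-06-01",
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # the Azure OpenAI deployment name
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(diff=diff)}],
    )
    return resp.choices[0].message.content


def post_comment(text: str) -> None:
    # ADO_ORG / ADO_PROJECT / ADO_REPO are assumed extra pipeline variables.
    org = os.environ["ADO_ORG"]
    project = os.environ["ADO_PROJECT"]
    repo = os.environ["ADO_REPO"]
    url = (
        f"https://dev.azure.com/{org}/{project}/_apis/git/repositories/"
        f"{repo}/pullRequests/{os.environ['PR_ID']}/threads?api-version=7.1"
    )
    body = {
        "comments": [{"parentCommentId": 0, "content": text, "commentType": 1}],
        "status": "active",
    }
    # The PAT goes in as the basic-auth password with an empty username.
    requests.post(url, json=body, auth=("", os.environ["ADO_PAT"])).raise_for_status()


if __name__ == "__main__":
    post_comment(review(open(sys.argv[1]).read()))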

The full script comes to 140 lines. The interesting work is the prompt.

The prompt that actually catches things

We started with a generic "review this code" prompt. It produced exactly the kind of useless feedback you'd expect — "consider adding error handling", "this variable name could be clearer". Worthless.

What changed it was making the prompt adversarial and codebase-specific:

You are reviewing a pull request for a Python service that runs on AKS,
talks to PostgreSQL, and handles PII.

Find SPECIFIC issues in these categories — do NOT include generic advice:

1. Secrets: any token, key, password, or connection string in plain text.
2. PII handling: any log line that could capture user email, name, or ID.
3. SQL: any string concatenation into a query (we use SQLAlchemy params).
4. Async: any blocking I/O call inside an async function.
5. Test coverage: any new public function without a corresponding test.

If you find none, respond with exactly: NO_ISSUES_FOUND.
Do NOT comment on style, naming, or "could be improved" — only the five
categories above.

Diff to review:
{diff}

The NO_ISSUES_FOUND sentinel matters more than it sounds. Without it, the model invents issues to fill space. With it, the comment-posting step skips clean PRs entirely, which keeps reviewer trust.
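On the posting side that's just a guard in front of the REST call (sketch, reusing the helper names from the snippet above):

reply = review(diff)
if reply.strip() == "NO_ISSUES_FOUND":
    print("clean PR, skipping comment")
else:
    post_comment(reply)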

The three things that broke

1. The diff was too big. Anything over ~12K tokens started producing hallucinations — referring to functions that didn't exist, claiming files were missing that were right there. We added a pre-filter that drops generated files (*.lock, dist/, node_modules/) and chunks remaining hunks per-file with a 4K-token-per-call ceiling. If a diff exceeds the budget, the bot leaves a single comment saying "PR too large for AI review — please request human review" rather than producing partial output.
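A minimal sketch of that pre-filter and per-file budget (the tiktoken-based token counting is an assumption about how the measuring is done, and this is one reading of the budget rule, not a transcript of our script):

import fnmatch

import tiktoken

SKIP_PATTERNS = ["*.lock", "dist/*", "node_modules/*"]  # generated files we never review
TOKEN_BUDGET_PER_CALL = 4000

enc = tiktoken.encoding_for_model("gpt-4o")


def split_by_file(diff: str) -> dict[str, str]:
    """Split a unified diff into {path: hunks} so each file gets its own model call."""
    chunks, path = {}, None
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git "):
            path = line.split(" b/")[-1].strip()
            chunks[path] = ""
        if path is not None:
            chunks[path] += line
    return chunks


def chunks_to_review(diff: str) -> list[str] | None:
    """Return per-file chunks, or None when the PR is too large for AI review."""
    keep = {
        p: h
        for p, h in split_by_file(diff).items()
        if not any(fnmatch.fnmatch(p, pat) for pat in SKIP_PATTERNS)
    }
    if any(len(enc.encode(h)) > TOKEN_BUDGET_PER_CALL for h in keep.values()):
        return None  # caller posts the "PR too large for AI review" comment
    return list(keep.values())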

2. The PAT was too privileged. We started with a Personal Access Token scoped to "Code (read & write)" because we'd seen that in tutorials. That token can also force-push and delete branches. We swapped to a pipeline-scoped service principal with only vso.code_status and vso.threads permissions. Took an hour. Worth it the moment your bot's credentials show up in someone's compromised laptop.

3. The model was non-deterministic in a way that mattered. Same diff, two runs, sometimes different findings. We set temperature=0.0 and added a one-line system instruction: "Be deterministic. Same input must produce same output." That's not a real guarantee from the API, but it noticeably reduced flapping.
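In the completion call from the sketch above, that amounts to:

resp = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.0,
    messages=[
        {"role": "system",
         "content": "Be deterministic. Same input must produce same output."},
        {"role": "user", "content": PROMPT_TEMPLATE.format(diff=diff)},
    ],
)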

What it actually catches

Over six months of running this on a 40-engineer team:

  • 14 hard-coded credentials before they hit main
  • 9 print() statements logging full request bodies (which contained user emails)
  • 23 cases of f"... {user_input} ..." in a SQL string
  • ~60 missing tests on new public functions

Cost: about $40/month in Azure OpenAI tokens for ~120 PRs/week. We're using gpt-4o; when we tried gpt-4o-mini, it cut cost by roughly 80% and dropped the catch rate by maybe 5%. Probably a worthwhile trade for high-volume teams.

What I'd do differently

If I were starting today, I'd build this on top of the Azure DevOps MCP Server with a GitHub Copilot agent rather than rolling the REST plumbing myself. The wiring is cleaner and you get the diffing + comment-posting for free. The only reason we still run the pipeline version is that we already have it and it works.

I would NOT add a "rate limit by author" feature, even though every team asks for it. The point of the bot is unblinking consistency. Letting individual contributors mute it defeats the purpose.

Azure OpenAI · Azure DevOps · AI Review
