
How We Rebuilt the Context Layer Behind AI Code Review
By the MatterAI engineering team
Code review is a context problem. A diff, by itself, almost never contains enough information to know whether a change is right. The function being modified is one piece of evidence; the callers, the type definitions, the surrounding patterns in the repo are the rest. A reviewer who doesn't go look at those is mostly guessing.
For a while, our AI code review system was guessing more than we wanted it to. This post is about the rebuild that fixed it — replacing a static "plan-then-search" pipeline with an agentic, LSP-style context loop powered by Axon 2.5 Mini and feeding into Axon 2.5 Pro for the final review.
We'll talk about the architecture, the harness that makes it safe to run an agent in production, and the guardrails we use to keep model output anchored to the repo. We'll keep the proprietary prompt internals to ourselves, but everything about the shape of the system is here.
The old pipeline: Plan + ast-grep
The previous design was simple, in retrospect maybe too simple:
- A planner model read the PR diff and emitted a static list of "things to look up" — typically a `variables_to_search` array of symbols it thought were relevant.
- A search executor walked that list once, ran ast-grep over the repo for each entry, and dumped whatever matched into the reviewer's prompt as additional context.
- The reviewer model produced the review.
This worked, but the failure modes piled up:
- One-shot context. The planner made all its decisions before seeing a single line of repo content. If its guess was wrong — wrong symbol name, wrong language assumption, wrong file — there was no second pass to correct course.
- No feedback loop. Even when a search returned something useful (e.g. a function definition with a different signature than what the diff was calling), the reviewer never saw the planner's next thought. The pipeline was a straight pipe, not a loop.
- Brittle search surface. ast-grep is excellent at structural matching, but pure pattern search across heterogeneous repos misses a lot. There was no way to say "read just this 30-line slice of this file" or "list the top-level symbols here," which is what a human reviewer actually does.
- Hallucination pressure. When the reviewer didn't see enough evidence, it would still produce comments. Some were correct. Some weren't. From a user-trust standpoint, the wrong ones cost us more than the right ones earned.
We were treating "go find the context" as a problem you could solve with one good guess. It isn't. It's a problem that needs iteration, and iteration needs an agent.
The new pipeline: Agentic LSP, two models
The new design splits the work across two Axon models and lets the cheaper one drive a tool-using loop:
┌──────────────┐ ┌────────────────────┐ ┌──────────────────┐
│ PR diff + │───────▶│ Axon 2.5 Mini │───────▶│ Context bundle │
│ cloned repo │ │ (context agent) │ │ (compact tool │
└──────────────┘ │ │ │ call results) │
│ • read_file │ └────────┬─────────┘
│ • grep │ │
│ • find_definition │ ▼
│ • find_references │ ┌──────────────────┐
│ • list_dir │ │ Axon 2.5 Pro │
│ • outline_symbols │ │ (reviewer) │
└────────────────────┘ └──────────────────┘
Two clean responsibilities:
- Axon 2.5 Mini is the context gatherer. It reads the diff, decides what else it needs to see, and emits batches of tool calls. It does not review code. It does not write comments. Its only job is to assemble evidence.
- Axon 2.5 Pro is the reviewer. It receives the diff plus the evidence bundle and produces the final code review.
This split is deliberate. Context gathering is a high-volume, latency-sensitive, "many small decisions" workload — exactly what Mini is good at. Reviewing is a single, careful, deeply reasoned pass — exactly what Pro is good at. Spending Pro tokens on `read_file` and `grep` round-trips was burning money on a task Mini does just as well.
The tool surface
We replaced the planner's `variables_to_search` array with a small set of LSP-style tools, all bound to the cloned repository root:
| Tool | Purpose |
|---|---|
| `read_file` | Read a file, optionally a specific line range. Truncates aggressively. |
| `grep` | Plain-text search across the repo (defaults to fixed-string, case-insensitive). |
| `find_definition` | AST-aware lookup of where a symbol is defined. |
| `find_references` | AST-aware lookup of where a symbol is used (definition sites filtered out). |
| `list_dir` | List immediate children of a directory. |
| `outline_symbols` | List the top-level symbols in a file. |
A few things to note about this list:
- ast-grep didn't go away — it got demoted. It's now the engine behind `find_definition`, `find_references`, and `outline_symbols`. What changed is that the agent decides when to use it, instead of a one-shot planner committing to a guess.
- The cheap tools come first. `grep` is faster and cheaper than `read_file`, which is cheaper than chasing references. The system prompt for Mini biases it that way.
- Every tool is repo-bound and sandboxed. Path arguments are resolved against the cloned repo root and rejected if they escape it. `node_modules` and `.git` are skipped. Every shell invocation runs through `execFile` (never `exec`) with timeouts and output caps. There's no path on which the agent can "wander out" of the repo (a sketch of the guard follows below).
The tools live in `src/cortex/tools/lspTools.ts` and the agent that drives them lives in `src/cortex/contextAgent.ts`. The reviewer entry point in `src/clients/aiclient.ts` clones the PR's repo, instantiates the tool set, runs the agent, and appends the rendered bundle to the reviewer prompt under a clearly marked `## Additional repository context` section.
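We won't reproduce the production code, but the repo-binding guard is simple to sketch. Here's a minimal TypeScript version, assuming a `repoRoot` captured at clone time; the helper names are illustrative, not the actual lspTools.ts internals:

```ts
import { execFile } from "node:child_process";
import { resolve, sep } from "node:path";

// Resolve a tool's path argument against the clone root; anything that
// escapes the root is rejected before the tool runs.
// repoRoot is assumed to be an absolute path (e.g. from fs.realpathSync
// at clone time).
function resolveInRepo(repoRoot: string, userPath: string): string {
  const abs = resolve(repoRoot, userPath);
  if (abs !== repoRoot && !abs.startsWith(repoRoot + sep)) {
    throw new Error(`path escapes repository root: ${userPath}`);
  }
  return abs;
}

// Shell invocations take an argv array (execFile, never exec, so there
// is no shell parsing), with a wall-clock timeout and an output cap.
function runBinary(bin: string, args: string[], cwd: string): Promise<string> {
  return new Promise((res, rej) => {
    execFile(
      bin,
      args,
      { cwd, timeout: 10_000, maxBuffer: 1024 * 1024 },
      (err, stdout) => (err ? rej(err) : res(stdout)),
    );
  });
}
```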
The harness: why it's safe to run an agent in production
The single most important thing about productionizing an agent isn't the prompt — it's the harness around it. Agents that "work" in a notebook fall over the first time a model decides to call read_file on the same path 40 times in a loop. Every hard constraint below exists because we hit a version of it during development.
1. Hard caps on everything
Every dimension of the loop is bounded:
- Max iterations (default 5) — how many turns Mini gets.
- Max tools per iteration (default 8, aim 3–5) — how many tool calls per turn.
- Max total tool calls (default 40) — the hard ceiling across the whole loop.
- Wall-clock budget (default 30 s) — the loop short-circuits if it runs long.
- Bundle byte budget (default 60 KB) — the rendered evidence passed to Pro is capped.
The loop checks the budget after every iteration. If we're over on tool count or over on time, it stops, even mid-plan. Better to ship a slightly thinner bundle than to time out a review entirely.
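A minimal sketch of that bounded loop, using the default caps above. The types and helpers (`askMiniForToolCalls`, `executeBatch`) are stand-ins, not our production interfaces:

```ts
type ToolCall = { tool: string; args: Record<string, unknown> };
type ToolResult = { call: ToolCall; output: string };
type Plan = { reasoning: string; tools: ToolCall[]; done: boolean };

// Stand-ins for the model turn and the executor, assumed to exist elsewhere.
declare function askMiniForToolCalls(diff: string, soFar: ToolResult[]): Promise<Plan>;
declare function executeBatch(batch: ToolCall[]): Promise<ToolResult[]>;

const MAX_ITERATIONS = 5;
const MAX_TOOLS_PER_ITERATION = 8;
const MAX_TOTAL_TOOL_CALLS = 40;
const WALL_CLOCK_MS = 30_000;

async function runContextLoop(diff: string): Promise<ToolResult[]> {
  const startedAt = Date.now();
  const results: ToolResult[] = [];
  let totalCalls = 0;

  for (let iter = 0; iter < MAX_ITERATIONS; iter++) {
    const plan = await askMiniForToolCalls(diff, results); // one Mini turn
    if (plan.done || plan.tools.length === 0) break;

    const batch = plan.tools.slice(0, MAX_TOOLS_PER_ITERATION);
    results.push(...(await executeBatch(batch)));
    totalCalls += batch.length;

    // Budget check after every iteration: stop even mid-plan.
    if (totalCalls >= MAX_TOTAL_TOOL_CALLS) break;
    if (Date.now() - startedAt >= WALL_CLOCK_MS) break;
  }
  return results;
}
```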
2. Deduplication via stable keys
Every tool call is canonicalized into a stable string key (`tool::sortedArgsJson`) and tracked in a `Set`. If the agent re-requests the same call — same tool, same arguments — we drop it before it ever reaches the executor. This is the single biggest defense against runaway loops, and it costs nothing.
It also implicitly bounds the useful iteration count: even if Mini wants 8 tools per turn for 5 turns (40 calls), repeated calls collapse, so in practice the agent converges or stops well before the ceiling.
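The canonicalization itself fits in a few lines. A sketch (sorting only top-level keys, which is enough when tool arguments are flat):

```ts
type ToolCall = { tool: string; args: Record<string, unknown> };

// Stable key: tool name plus arguments serialized with sorted keys,
// so the same call with reordered arguments still collides.
function callKey(call: ToolCall): string {
  const sorted = Object.fromEntries(
    Object.entries(call.args).sort(([a], [b]) => a.localeCompare(b)),
  );
  return `${call.tool}::${JSON.stringify(sorted)}`;
}

// Drop repeats before they ever reach the executor.
function dedupe(batch: ToolCall[], seen: Set<string>): ToolCall[] {
  return batch.filter((call) => {
    const key = callKey(call);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```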
3. Parallel batches, sequential turns
Inside a single turn, all tool calls run in parallel (`Promise.all`) — they're independent reads, so there's no reason to serialize them. Across turns, the loop is strictly sequential, because the next turn's prompt depends on the previous turn's results.
This gives us roughly N× speedup per turn for free, where N is the batch width. It also keeps the model's mental model of the loop simple: "here are the results of everything you asked for last turn."
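In code, per-turn execution is just a `Promise.all` over the deduped batch; a failed call degrades to an error string instead of sinking the whole turn. A sketch, with a stand-in `dispatch`:

```ts
type ToolCall = { tool: string; args: Record<string, unknown> };
type ToolResult = { call: ToolCall; output: string };

// Stand-in for routing a call to the right tool implementation.
declare function dispatch(call: ToolCall): Promise<string>;

// Independent reads, so the batch runs in parallel within a turn.
async function executeBatch(batch: ToolCall[]): Promise<ToolResult[]> {
  return Promise.all(
    batch.map(async (call) => {
      try {
        return { call, output: await dispatch(call) };
      } catch (err) {
        return { call, output: `tool error: ${String(err)}` };
      }
    }),
  );
}
```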
4. Compaction before re-ingestion
When tool results come back, they don't get fed back into the model verbatim. Each result is compacted down to the smallest representation that still carries the signal:
- `read_file` returns at most 6 KB of content per call.
- `grep` returns at most 30 hits, each truncated to 240 characters.
- `find_definition` / `find_references` return at most 20 matches per call.
- `list_dir` and `outline_symbols` are similarly bounded.
A naive "echo the raw tool output back into the prompt" loop balloons context exponentially. Compaction keeps Mini's working context small even at the max iteration × max-tools-per-iter limit.
5. Strict, repairable output format
Mini is required to respond with a single JSON object containing `reasoning`, `tools[]`, and `done`. We parse it through a JSON repair helper, then validate each tool entry against the allowed tool list. Anything we don't recognize is silently dropped — the agent never gets to invent a new tool by misspelling one.
When parsing fails entirely, the loop ends gracefully rather than crashing or retrying blindly. The downstream reviewer just sees a smaller (or empty) context bundle and proceeds.
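The parse-and-validate step looks roughly like this sketch; the fence-stripping stands in for the real repair helper, which handles more failure modes:

```ts
const ALLOWED_TOOLS = new Set([
  "read_file",
  "grep",
  "find_definition",
  "find_references",
  "list_dir",
  "outline_symbols",
]);

type ToolCall = { tool: string; args: Record<string, unknown> };
type Plan = { reasoning: string; tools: ToolCall[]; done: boolean };

// Parse Mini's turn. Any failure yields a terminal plan, so the loop
// ends gracefully instead of crashing or retrying blindly.
function parsePlan(raw: string): Plan {
  try {
    // Minimal "repair": drop markdown code-fence lines before parsing.
    const fence = "`".repeat(3);
    const text = raw
      .split("\n")
      .filter((line) => !line.trimStart().startsWith(fence))
      .join("\n")
      .trim();
    const obj = JSON.parse(text);
    const tools: ToolCall[] = Array.isArray(obj.tools)
      ? // Unknown or misspelled tools are silently dropped, never invented.
        obj.tools.filter((t: ToolCall) => ALLOWED_TOOLS.has(t.tool))
      : [];
    return { reasoning: String(obj.reasoning ?? ""), tools, done: Boolean(obj.done) };
  } catch {
    return { reasoning: "", tools: [], done: true };
  }
}
```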
6. A seed step before the first model call
The harness supports an "iteration 0" — a list of tool calls that run before Mini ever sees the diff. We rarely use it now, but it's there for cases where we know exactly what we want first (e.g. always outline the changed files). It consumes no model turn of its own, and the results land in Mini's context as if it had asked for them itself.
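As a hypothetical example of a seed we might configure (outline every file the diff touches, before the first turn):

```ts
type ToolCall = { tool: string; args: Record<string, unknown> };

// Hypothetical iteration-0 seed: outline the changed files before
// Mini's first turn. No model call is spent planning these.
function seedCalls(changedFiles: string[]): ToolCall[] {
  return changedFiles.map((path) => ({ tool: "outline_symbols", args: { path } }));
}
```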
How we keep the output honest
The harness keeps the agent well-behaved. The other half of the problem is keeping the reviewer well-behaved — and that's where most of the anti-hallucination work lives.
We're not going to publish the prompt internals, but here's the shape of what we do:
Anchor every claim to the diff, by line
The reviewer prompt is unusually explicit about diff geometry. Every line in the patch is tagged with both its old (`[O#]`) and new (`[L#]`) line numbers. The reviewer is required to derive `startLine` and `endLine` for every suggestion strictly from `[L#]` tags, and any comment whose lines aren't present in the patch is rejected before it ever reaches the user. There's no "we think this is around line 42" — either the line is in the diff or the comment is dropped.
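To make the geometry concrete, here's a sketch of the two halves: rendering a hunk with `[O#]`/`[L#]` tags, and rejecting comments whose lines aren't in the patch. The data shapes are illustrative, not our production types:

```ts
type PatchLine = { oldNo?: number; newNo?: number; text: string };

// Render a hunk with explicit old/new line tags, and collect the set
// of new-line numbers a comment is allowed to anchor to.
function renderTaggedHunk(lines: PatchLine[]): {
  rendered: string;
  validNewLines: Set<number>;
} {
  const validNewLines = new Set<number>();
  const rendered = lines
    .map(({ oldNo, newNo, text }) => {
      if (newNo !== undefined) validNewLines.add(newNo);
      const o = oldNo !== undefined ? `[O${oldNo}]` : "     ";
      const n = newNo !== undefined ? `[L${newNo}]` : "     ";
      return `${o} ${n} ${text}`;
    })
    .join("\n");
  return { rendered, validNewLines };
}

// A comment survives only if its whole range exists in the patch.
function isAnchored(
  comment: { startLine: number; endLine: number },
  validNewLines: Set<number>,
): boolean {
  for (let l = comment.startLine; l <= comment.endLine; l++) {
    if (!validNewLines.has(l)) return false;
  }
  return true;
}
```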
Use the context bundle as a constraint, not a hint
The additional repository context that Mini gathered isn't framed as "here's some background." It's framed as ground truth that the reviewer must reconcile with the diff. If the diff calls `functionA(x, y)` with `y: int`, and the context bundle shows `functionA` declared with `y: string`, that mismatch is the kind of thing the reviewer is expected to flag. The bundle is evidence, and the reviewer is expected to cite it.
Refuse to suggest when context is insufficient
The reviewer has explicit permission — and instruction — to say "insufficient context to provide a reliable suggestion" rather than guess. This sounds obvious but it's not the default behavior of an LLM under prompt pressure. We reinforce it across the prompt and validate downstream: identical-to-existing suggestions get demoted from suggestion blocks to code examples; suggestions whose line ranges don't exist get dropped; suggestions that alter string literals get rejected.
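The downstream validation reads roughly like this sketch (the string-literal check is omitted for brevity, and the names are illustrative):

```ts
type Suggestion = { startLine: number; endLine: number; replacement: string };

// Decide a suggestion's fate: drop it, demote it to a plain code
// example, or keep it as an applicable suggestion block.
function validateSuggestion(
  s: Suggestion,
  existingCode: string, // the lines the suggestion would replace
  validNewLines: Set<number>,
): "keep" | "demote" | "drop" {
  // Line ranges that don't exist in the patch are dropped outright.
  for (let l = s.startLine; l <= s.endLine; l++) {
    if (!validNewLines.has(l)) return "drop";
  }
  // A suggestion identical to the existing code isn't a suggestion.
  if (s.replacement.trim() === existingCode.trim()) return "demote";
  return "keep";
}
```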
Separate "what" from "where"
By splitting context gathering (Mini) from review (Pro), the reviewer never has to decide what to look up — that decision is already made, with evidence attached. This dramatically reduces the surface area for "the model imagined a function that doesn't exist," because Pro is only reasoning about code that's literally in its prompt.
Triage before review
Before any of this runs, a lightweight triage pass decides which review categories (bug, security, performance, documentation) actually apply to each file. Files that don't need a category never get prompted for one, which means the reviewer isn't being asked to find a bug in a file that doesn't have one. Fewer prompts to find issues → fewer fabricated issues.
Log everything, replay anything
Every Mini iteration is logged with its full prompt, raw response, and tool results, keyed by (org, owner, repo, prId, iter-N). When a reviewer comment looks suspicious in production, we can replay the exact sequence of tool calls Mini made, see the exact bundle Pro received, and identify whether the problem was a bad lookup, a bad compaction, or a bad reasoning step. Without this, debugging an agent is a guessing game. With it, it's just engineering.
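The trace record itself is mundane, which is the point. Shaped roughly like this (field names illustrative):

```ts
// One record per Mini iteration; enough to replay the turn exactly.
type IterationTrace = {
  key: string; // e.g. "org/owner/repo/1234/iter-2"
  prompt: string; // the full prompt Mini saw this turn
  rawResponse: string; // Mini's unparsed reply
  toolResults: { tool: string; args: unknown; output: string }[];
};

function traceKey(org: string, owner: string, repo: string, prId: number, iter: number): string {
  return `${org}/${owner}/${repo}/${prId}/iter-${iter}`;
}
```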
What we got back
A few things changed in measurable ways after the migration:
- Comments that cite the rest of the repo. Suggestions now routinely reference function signatures, callers, or surrounding patterns from files outside the diff. Pre-migration, the reviewer essentially had access only to the patch.
- Fewer "imagined" symbols. The class of comment that says "you should call
helperXhere" whenhelperXdoesn't exist in the repo dropped sharply. Mini either finds the symbol or it doesn't, and Pro is only reasoning about what's actually in the bundle. - Better default behavior on unfamiliar repos. The old planner needed the right symbol names guessed in advance. The new agent will
list_dir,outline_symbols, and orient itself in repos it's never seen. - Cleaner failure modes. When something does go wrong — clone failure, timeout, oversized repo — the reviewer falls back to a diff-only review rather than crashing. The harness's caps make this graceful, not catastrophic.
There's also a less-visible win: the system is now legible. A reader can follow the loop in `contextAgent.ts`, the tools in `lspTools.ts`, and the entry point in `aiclient.ts` and understand the entire context layer in well under an hour. The old plan-plus-search pipeline had grown a lot of conditional branches that were hard to reason about. Replacing them with an agent + a clean tool surface didn't just improve quality — it improved our ability to keep improving.
What's next
The current loop is intentionally conservative. Several extensions are queued:
- Real LSP integration for languages where we want call-hierarchy and type-resolution semantics that ast-grep can't give us cheaply.
- Cross-file impact analysis — the agent currently looks "out from the diff." A natural next step is to look "in toward the diff" by finding everything that depends on the changed symbols.
- Cached tool results across PRs in the same repo, so consecutive reviews against the same codebase don't re-pay the cost of orienting themselves.
- Adaptive iteration limits based on PR complexity — small PRs rarely need more than two turns; large PRs sometimes legitimately need more.
None of these change the shape of the harness. That's the point. The harness — caps, dedupe, compaction, structured output, full logging — is what makes the agent safe to extend. Once you have it, "what should the agent do?" becomes a tractable question. Until you have it, every new capability is a new way to time out a review at 2 a.m.
Takeaways for anyone building something similar
If you're building an LLM-driven context layer for code (or for anything that touches a large, structured corpus), the design choices that mattered most for us were:
- Split context gathering from the downstream reasoning. Use a cheap, fast model for the loop. Reserve your strongest model for the single careful pass at the end.
- Give the agent tools, not a plan. A static list of lookups can't recover from a wrong guess. An iterative loop can.
- Cap everything. Iterations, tools per turn, total tools, wall clock, bundle bytes. Defaults that look paranoid in dev are exactly right in prod.
- Dedupe before you execute. Stable keys on every tool call cost nothing and prevent the most common failure mode.
- Compact tool results before re-ingestion. The shape of the agent's working context matters more than the size of its model.
- Treat the output format as part of the contract. Strict JSON, repaired-then-validated, with unknown fields dropped silently.
- Anchor the downstream model to evidence. Make it cite what it saw. Make it refuse when it didn't see enough. Make it impossible to invent line numbers.
- Log the whole loop. You will need it.
The agentic LSP is doing the work the old planner was supposed to be doing, but actually doing it — iteratively, with evidence, and within a harness tight enough that we trust it to run on every PR our customers open. That last part is the only part that matters.
MatterAI builds AI engineering intelligence for modern software teams. If you'd like to see the context agent in action on your own pull requests, reach out.