Polarity

Omnigrep: state-of-the-art agentic code search

Alexandru Ungureanu, Shane Rohan Barakat · Polarity Labs · 12.09.25

Code search is the quiet bottleneck in agentic software engineering. Before an autonomous agent can fix a failure, answer a question, or implement a feature, it has to find the handful of files and lines that actually matter. Everything else, the planning, the editing, the verification, is gated on that first act of retrieval. By one analysis from Cognition, agent trajectories spend over 60% of their first turn on context retrieval alone.

When that retrieval is wrong, the cost compounds. Recall that is too low means the agent never sees the file it needs and confidently solves the wrong problem. Precision that is too low is just as damaging in a different way: every irrelevant file dumped into the context window crowds out the relevant ones, inflates token cost, and pushes the real signal past the model’s effective attention. Search quality sets the ceiling on everything downstream.

Two paradigms, one trade-off

The field has converged on two families of approach, and each sacrifices the thing the other gets right. Embedding retrieval is fast, a single vector lookup, but it ranks by semantic similarity rather than relevance, so it returns code that is about the right topic without being the right code. LLM agents are precise, because they can read and reason, but they walk the repository one expensive call at a time and are slow enough to break the interactive loop.

Neither resolves the underlying precision/recall tension; they just pick a corner of it. Omnigrep is an attempt to get the precision of an agent at a latency that stays usable, by changing the shape of the loop rather than the size of the model.

Our approach

Omnigrep is a multi-turn agentic loop: four turns, up to eight parallel tool calls per turn, with explicit chain-of-thought reasoning between every turn. The structure mirrors how a strong engineer actually searches. The first turn casts a wide net and gathers evidence from several angles at once. Each subsequent turn reads what came back, forms a hypothesis about where the answer lives, and spends its calls confirming or killing that hypothesis, narrowing toward exact line ranges.

It runs on three deliberately minimal primitives: ripgrep for regex matching, glob for file-system traversal, and read for content. Our ablations show these three are sufficient to cover the full space of code-search subtasks once reasoning orchestrates them; adding embeddings or a custom index did not move the score enough to justify the complexity.
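
To make the three primitives concrete, here is a minimal sketch of what such a tool surface could look like. It is not Omnigrep's actual tool layer: the function names (`glob_files`, `grep`, `read_lines`) and their signatures are hypothetical, and the regex search uses Python's `re` module as a stand-in for ripgrep so the example runs without an external binary.

```python
import re
from pathlib import Path


def glob_files(root: str, pattern: str) -> list[str]:
    """File-system traversal: relative paths of files under root matching a glob."""
    base = Path(root)
    return sorted(p.relative_to(base).as_posix() for p in base.glob(pattern) if p.is_file())


def grep(root: str, regex: str, pattern: str = "**/*") -> list[tuple[str, int, str]]:
    """Regex matching: (path, line_number, line) for every matching line.

    Stand-in for ripgrep (an assumption of this sketch, not the real tool).
    """
    compiled = re.compile(regex)
    hits = []
    for rel in glob_files(root, pattern):
        try:
            text = (Path(root) / rel).read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if compiled.search(line):
                hits.append((rel, number, line))
    return hits


def read_lines(root: str, rel_path: str, start: int, end: int) -> list[str]:
    """Content read: lines start..end of a file (1-indexed, inclusive)."""
    lines = (Path(root) / rel_path).read_text(encoding="utf-8").splitlines()
    return lines[start - 1 : end]
```

A discovery step might call `glob_files(repo, "**/*.py")` and `grep(repo, r"middleware")`, then use `read_lines` on the most promising hit to confirm an exact line range.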

How Omnigrep works

A single query runs through four sequential stages:

  1. Natural-language query. A developer or agent asks something like “where is the authentication middleware defined?”
  2. Discovery turn. Eight parallel calls probe the structure at once, globbing the tree and running ripgrep for terms like middleware and auth.
  3. Chain-of-thought. The model reads the combined results, forms a hypothesis, and plans the next batch of searches from what it now knows.
  4. Refinement turns (2–4). It validates candidates, reads file contents, and narrows to a precise location such as src/auth/middleware.py:42–58.
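
The four stages above can be sketched as a small control loop. This is an illustration under stated assumptions, not Omnigrep's implementation: the `plan` callable is a hypothetical stand-in for the model's chain-of-thought step, the `(tool_name, argument)` call format is invented for the example, and threads stand in for whatever concurrency mechanism the real system uses. Only the turn and call limits come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

MAX_TURNS = 4           # from the post: four turns
MAX_CALLS_PER_TURN = 8  # from the post: up to eight parallel tool calls per turn

ToolCall = tuple[str, str]               # (tool_name, argument), hypothetical format
Tools = dict[str, Callable[[str], str]]  # tool_name -> function returning a result
Evidence = list[tuple[ToolCall, str]]
# Hypothetical planner: (query, evidence) -> (reasoning_note, next_calls, answer).
# A non-None answer means the model has settled on a location.
Planner = Callable[[str, Evidence], tuple[str, list[ToolCall], Optional[str]]]


def search(query: str, tools: Tools, plan: Planner) -> tuple[Optional[str], list[str]]:
    evidence: Evidence = []
    notes: list[str] = []
    for _turn in range(MAX_TURNS):
        # Reasoning step: read everything gathered so far and plan the next batch.
        note, calls, answer = plan(query, evidence)
        notes.append(note)
        if answer is not None:
            return answer, notes
        # Tool step: run the capped batch in parallel, one round-trip per turn.
        batch = calls[:MAX_CALLS_PER_TURN]
        with ThreadPoolExecutor(max_workers=MAX_CALLS_PER_TURN) as pool:
            results = list(pool.map(lambda call: tools[call[0]](call[1]), batch))
        evidence.extend(zip(batch, results))
    # Turn budget exhausted: one last reasoning step over all evidence.
    note, _unused, answer = plan(query, evidence)
    notes.append(note)
    return answer, notes
```

Keeping the planner as a plain function makes the loop shape easy to test with a scripted fake before wiring in a real model.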

Why reasoning wins

The reasoning between tool calls is the whole ballgame. It is what lets the model refine its hypotheses across rounds, synthesize context that no single call could surface, plan later calls from accumulated evidence, and recover from a dead end by trying a genuinely different strategy. Remove it and the system collapses into keyword grep with extra steps.

We can measure exactly how much it contributes by adding reasoning back in stages. F0.5 climbs monotonically as the model is allowed to think across more of the loop, with the steepest gains coming from the refinement turns where hypotheses actually get tested.

[Figure 1: chart. y-axis: F0.5 (0.2 to 0.5); x-axis: reasoning enabled through turn (No reasoning, Turn 1, Turn 2, Turn 3, Full).]
Figure 1. F0.5 score as intermediate reasoning is added across the loop. Full reasoning adds 18.7 points of F0.5 over no reasoning (0.288 to 0.475); removing it entirely degrades F0.5 by 39.4% relative.

Results

On CodeSearchEval (128 tasks across 34 repositories), Omnigrep reaches an F0.5 of 0.475, a 33.1% relative improvement over Claude Sonnet 4.5 (0.357) and 14.9% over Cognition’s RL-specialized SWE-grep (0.413). It clears every general-purpose and embedding baseline we tested, by a wide margin in the cases that matter most.

System                  F0.5
Omnigrep (ours)         0.475
SWE-grep · Cognition    0.413
Claude Sonnet 4.5       0.357
Embedding · rerank      0.265
GPT-5 Codex             0.244
Figure 2. F0.5 on CodeSearchEval (128 tasks, 34 repositories). Higher is better.

The win is concentrated in precision. F0.5 weights precision twice as heavily as recall, by design, because for an agent a clean context window is worth more than an exhaustive one. Omnigrep is roughly 38% more precise than Claude Sonnet 4.5 (0.46 vs 0.33) while holding comparable recall.
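
For readers who want the metric spelled out, the standard F-beta definition is F_β = (1 + β²)·P·R / (β²·P + R), with β = 0.5 here. The sketch below uses made-up precision and recall pairs (not numbers from our evaluation) to show the asymmetry: swapping which of the two is high changes the score substantially.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing relevant was retrieved
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: the same two numbers, swapped.
precise = f_beta(precision=0.50, recall=0.25)     # clean but incomplete context
exhaustive = f_beta(precision=0.25, recall=0.50)  # complete but noisy context

print(f"precise:    {precise:.3f}")
print(f"exhaustive: {exhaustive:.3f}")
```

Under F0.5 the precise-but-incomplete result scores well above the exhaustive-but-noisy one, which is the behavior we want when the consumer is an agent with a limited context window.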

[Figure 3: grouped bar chart. Groups: Precision, Recall, F0.5; series: Omnigrep, SWE-grep, Sonnet 4.5; y-axis: 0.0 to 0.6.]
Figure 3. Precision, recall, and F0.5 by system. Omnigrep’s edge comes from precision at comparable recall.

Speed

Multi-turn search is slower than a single embedding lookup, and we do not pretend otherwise. But at a 17.5s mean it is about 2× faster than Claude Sonnet 4.5 and roughly 17× faster than GPT-5 (High), which keeps it firmly inside the interactive budget. The parallel calls are what buy this: eight searches that would otherwise serialize collapse into one round-trip.

System               Mean      Median
Embedding (top-5)    0.95s     0.82s
SWE-grep             2.79s     2.35s
Omnigrep             17.5s     15.8s
Claude Sonnet 4.5    35.9s     31.2s
GPT-5 (High)         290.6s    245.8s
Figure 4. End-to-end latency across systems. Omnigrep is ~2× faster than Claude Sonnet 4.5 and ~17× faster than GPT-5.

Ablations

To check that each design choice is load-bearing, we knocked them out one at a time. Reasoning is the single largest contributor; parallelism and the refinement turns matter too, but less.

Configuration                 F0.5     Δ (relative)
Full Omnigrep                 0.475
− parallel calls (serial)     0.39     −17.9%
− refinement (single turn)    0.34     −28.4%
− intermediate reasoning      0.288    −39.4%
Figure 5. F0.5 under ablations. Removing intermediate reasoning is the most damaging single change.
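
The Δ column is a relative change against the full system, which is easy to confuse with an absolute difference in F0.5. A short sketch of the arithmetic, using only the scores reported in Figure 5 (the dictionary labels and the helper name `relative_change` are just for this example):

```python
FULL = 0.475  # F0.5 of full Omnigrep, from Figure 5

# F0.5 under each ablation, from Figure 5.
ablations = {
    "serial calls": 0.39,
    "single turn": 0.34,
    "no intermediate reasoning": 0.288,
}


def relative_change(score: float, baseline: float) -> float:
    """Percentage change of score relative to baseline."""
    return (score - baseline) / baseline * 100


for name, score in ablations.items():
    absolute = score - FULL
    relative = relative_change(score, FULL)
    print(f"{name}: {absolute:+.3f} absolute, {relative:+.1f}% relative")
```

For the reasoning ablation this gives a drop of 0.187 in absolute F0.5, which is the same result as the 39.4% relative drop.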

Takeaways

The headline result is architectural, not about scale. A general-purpose LLM with the right reasoning loop beats an RL-specialized search model. Three plain primitives are enough when reasoning orchestrates them, and multi-turn reasoning can exceed specialized training when the architecture is designed well. Better search came from a better loop, not a bigger model, and that is a far cheaper thing to improve.