polargrep: 2,167× faster regex search with sparse n-gram indexing

Polarity Labs · Research Division04.06.26

ripgrep is a marvel of engineering: memory-mapped I/O, SIMD-accelerated matching through the regex crate, and a pile of carefully tuned heuristics. It is also, unavoidably, linear. To answer a query it opens every file, scans every byte, and checks for a match. On Chromium’s 489,684 files a single search takes 9.3 seconds. That is fine for a human who searches a few times an hour and ruinous for an agent firing dozens of queries per task.

The asymmetry is the point. A human search is a foreground action; an agent search is an inner-loop primitive. If retrieval costs seconds, the agent either waits or searches less, and searching less is how agents miss things. We wanted regex search that an agent could call as freely as a function.

Inverted indexes

The classic escape from linear scan is the inverted index. For every unique term in the corpus you precompute the list of documents that contain it, the term’s posting list. At query time you decompose the pattern into terms, intersect their posting lists, and only ever open the files that could possibly match, which is typically far fewer than the full corpus. The work moves from query time to index time, where it can be amortized.

Trigram decomposition

Regex is not a bag of words, so the unit we index is the trigram, every 3-byte sequence in the source. A pattern is broken into the trigrams its literal runs must contain; character classes and wildcards act as barriers, and each literal segment contributes its trigrams independently. Three bytes is the sweet spot. Bigrams are too permissive, they match nearly everything, while 4-grams demand longer literal runs than real queries reliably provide.

Probabilistic masks

Trigram intersection still leaves false positives: files that contain the right trigrams but never adjacent, in the order the pattern requires. Rather than widen the index, we attach two tiny bloom filters to each posting entry, a next_mask summarizing the byte that follows the trigram and a loc_mask capturing pairwise adjacency. That is just two extra bytes per entry, and it lets us reject most non-matches without ever touching the file.

Mask width is the lever. Going from 8-bit to 16-bit masks cuts the false-positive rate roughly 8.5× for rare trigrams, and the gains multiply across a query: combine several trigrams and the compound false-positive rate drops by more than 5,000×. The cost is two bytes.

Figure 1. False-positive rate vs. the number of trigrams in a query, by mask width. 16-bit masks drive the compound rate toward zero; the gap widens with every additional trigram.

The formula behind the curve is the standard bloom-filter false-positive rate, (1 − e^(−kn/m))^k, evaluated for each mask width and the empirical distribution of following bytes. Widening m from 8 to 16 bits moves every term in that expression in the right direction at once, which is why the improvement is so lopsided for rare trigrams.

Sparse n-grams and the on-disk index

Indexing every trigram is wasteful, since most are redundant with their neighbors. Instead we select variable-length n-grams (2–8 bytes) at frequency-weighted peaks, points where an n-gram’s weight exceeds its neighbors, which cuts index entries by 30–40% without losing query correctness, because queries still decompose to trigrams only.

The index itself is three flat files plus a fallback list: a sorted index.lookup of hashes and offsets, a packed index.postings array, an index.files table of path offsets, and an index.skipped catalog of oversized files we scan with brute force. That last file is what guarantees 100% recall: anything we choose not to index is still searched directly.

Making it fast

The query path is engineered to avoid work at every step:

Zero-copy posting reads straight off the mmap: no parsing, no copying, no heap allocation.
Smallest-first intersection ordering, so the rarest trigram prunes the candidate set first.
Parallel verification with rayon across the surviving candidates.
A memchr::memmem SIMD fast-path for pure literals, ~6.5× faster than spinning up the regex engine.
Daemon mode that keeps the index resident, dropping per-query overhead from ~8ms to under 5ms.

Scaling to Chromium

Chromium produces about 34 million unique n-grams, far more than fits in memory during a naive build. So indexing is batched: process the corpus in 20,000-file chunks, emit sorted shards, then stream a K-way merge into the final index. Build time scales close to linearly with the corpus, which keeps even a half-million-file repository tractable to index once and cheap to refresh.

Figure 2. Index build time vs. corpus size. Batched shard-and-merge keeps the build close to linear, so even Chromium indexes in about a minute.

Freshness is a Git commit-hash check: if the index was built against a different commit, we know it is stale and trigger an incremental rebuild. The agent never searches a silently outdated index.

Benchmarks

The result is regex search in single-digit milliseconds on real codebases, with recall identical to ripgrep. The headline is Chromium: kMaxFrameRate across 489,684 files returns in 4.3msversus ripgrep’s 9.3 seconds, a 2,167× speedup.

Query	ripgrep	polargrep	Speedup
kMaxFrameRate · Chromium	9.3s	4.3ms	2,167×
quantized_decomposed · PyTorch	557ms	1.6ms	348×
class\s+\w+Observer · Chromium	188ms	9.1ms	20×
torch · PyTorch (common)	41ms	18ms	2.2×

Figure 3. Search latency, polargrep vs. ripgrep, on representative queries. Recall is identical to ripgrep; only latency changes.

Speedup tracks pattern rarity. A rare query is three orders of magnitude faster because the index prunes the corpus to almost nothing; a common term like torch only sees ~2.2× because so many files genuinely match that verification dominates and there is little left to skip. On a log axis the relationship is clean and monotone.

Figure 4. Speedup over ripgrep by query (log scale). The rarer the pattern, the more the index saves.

Beyond Cursor’s Instant Grep

Several pieces go past what a straightforward trigram index (or Cursor’s Instant Grep) does:

16-bit probabilistic masks with dual hashing, cutting false positives by 3.9× to over 5,000×.
loc_mask pairwise-adjacency verification that rejects non-adjacent matches outright.
A memchr SIMD fast-path for pure literals, skipping the regex engine entirely.
Guaranteed 100% recall via skipped-file fallback scanning.
Trigram-only query decomposition, which fixes the context-dependence of sparse n-grams.
Daemon mode that amortizes index loading into sub-5ms per-query latency.

Conclusion

Sparse n-gram indexing with probabilistic masks reduces regex search latency by three orders of magnitude on large codebases, without giving up exact recall. The implementation is in Rust, with reproducible benchmarks, and it turns code search from a foreground action an agent rations into a primitive it can call as freely as any other.