April 6, 2026

polargrep: 2,167x Faster Regex Search with Sparse N-gram Indexing

Polarity Labs, Research Division

Regex search is one of the most common operations in software engineering. When an AI agent needs to understand a codebase, it runs regex searches: finding every call site for a function, tracing a type through a monorepo, locating a configuration constant. On small projects, ripgrep finishes in milliseconds. On Chromium's 489,684 files, it takes over 9 seconds. Every single time.

We built polargrep to fix this. It uses a sparse n-gram inverted index to find regex matches across half a million files in under 5 milliseconds. This post explains how it works, from the foundational data structures to the optimizations that got us to 2,167x faster than ripgrep, and where we diverge from previous work like Cursor's Instant Grep.

The problem with brute force

ripgrep is the fastest brute-force regex search tool available. It uses memory-mapped I/O, SIMD-accelerated matching via the regex crate, and smart heuristics to skip binary files. On a repo with a few thousand files, it returns results in under a second.

But the fundamental algorithm is still brute force: open every file, scan every byte, check for matches. The work scales linearly with codebase size. On PyTorch (20,931 files), ripgrep takes about 490ms per search. Acceptable for a human typing a query. On Chromium (489,684 files), it takes 9.3 seconds. For an AI agent that might run dozens of searches per task, that's minutes of wall-clock time spent waiting for grep.

The solution is to not read every file. Instead, build an index that tells us which files could possibly match a pattern, and only scan those.

Inverted indexes

The core data structure behind polargrep is an inverted index, the same structure that powers every search engine. For each unique term in your corpus, you store a list of documents that contain it. That list is called a posting list.

In a search engine, terms are words. In polargrep, terms are n-grams, short byte sequences extracted from source code. When you search for a pattern, we decompose it into n-grams, look up which files contain each one, and intersect the posting lists, keeping only files that appear in all of them. The result is a small set of candidate files that we verify with the full regex engine.

The intersection step is what makes this fast. If trigram A appears in 10,000 files and trigram B appears in 8,000 files, but only 47 files contain both, we only need to run the regex engine on 47 files instead of 489,684.
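
In code, the core operation is a merge-style intersection of sorted document ID lists. A minimal sketch (posting lists here are plain slices; polargrep reads them zero-copy from an mmap, as described later):

```rust
/// Intersect two sorted posting lists of document IDs.
/// Runs in O(|a| + |b|); the output can only shrink, so chaining
/// intersections rapidly narrows the candidate set.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```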

Trigram decomposition

The classic approach, described by Russ Cox in 2012 and originally from Zobel, Moffat, and Sacks-Davis in 1993, uses trigrams: 3-byte sequences. Every file is tokenized into overlapping trigrams, and each trigram gets a posting list in the index.

For example, the string kMaxFrameRate produces these trigrams: kMa, Max, axF, xFr, Fra, ram, ame, meR, eRa, Rat, ate. When you search for that string, we look up all 11 trigrams, intersect their posting lists, and get a small set of candidate files.
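
Extraction is just a sliding 3-byte window over the file's bytes; a minimal sketch:

```rust
/// Extract every overlapping 3-byte window from a byte string.
/// "kMaxFrameRate" (13 bytes) yields 13 - 3 + 1 = 11 trigrams.
fn trigrams(text: &[u8]) -> impl Iterator<Item = &[u8]> + '_ {
    text.windows(3)
}

fn main() {
    for t in trigrams(b"kMaxFrameRate") {
        print!("{} ", std::str::from_utf8(t).unwrap());
    }
    println!();
    // kMa Max axF xFr Fra ram ame meR eRa Rat ate
}
```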

Regex patterns get decomposed by walking the HIR (high-level intermediate representation) from the regex-syntax crate. Literal sequences produce trigrams directly. Alternations become OR operations over posting lists. Character classes and wildcards act as barriers, so we extract trigrams from the literal segments on either side.

For a pattern like class\s+\w+Observer, we extract trigrams from "class" (cla, las, ass) and from "Observer" (Obs, bse, ser, erv, rve, ver), skip the \s+\w+ in the middle, and intersect both sets of posting lists. The result is every file containing both literal fragments. Far fewer files than the full corpus.
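
A simplified sketch of that decomposition, using the regex-syntax crate's 0.8-style API (the real planner also handles alternations and builds a query plan rather than a flat list of runs):

```rust
use regex_syntax::hir::{Hir, HirKind};

/// Collect the literal byte runs in a pattern, treating classes,
/// repetitions, and other non-literal nodes as barriers between runs.
/// Each run of length >= 3 is then decomposed into trigrams for lookup.
fn literal_runs(hir: &Hir, runs: &mut Vec<Vec<u8>>) {
    match hir.kind() {
        HirKind::Literal(lit) => runs.push(lit.0.to_vec()),
        HirKind::Concat(parts) => {
            for part in parts {
                literal_runs(part, runs);
            }
        }
        // \s+, \w+, character classes, etc.: barriers that split runs.
        _ => {}
    }
}

fn main() {
    let hir = regex_syntax::parse(r"class\s+\w+Observer").unwrap();
    let mut runs = Vec::new();
    literal_runs(&hir, &mut runs);
    // runs == [b"class", b"Observer"]
}
```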

Why trigrams and not bigrams or 4-grams? Bigrams produce posting lists that are too large. Most bigrams appear in most files, so the intersection barely filters. 4-grams are more selective but require longer literal runs in the query, meaning more regex patterns can't be decomposed at all. Trigrams are the sweet spot.

Probabilistic masks

Trigram intersection gets you from 489,684 files down to maybe a few thousand. To filter further without adding more index entries, we borrow a technique from GitHub's Blackbird project: probabilistic masks.

Each posting list entry stores not just a document ID, but two additional bloom filters:

next_mask: An 8-bit bloom filter encoding which bytes can follow the trigram in this file. If the trigram is kMa and the next byte in the query is x, we hash x to a bit position in the mask. If that bit isn't set, this file can't contain kMax, so we skip it without ever opening the file.

loc_mask: A positional bloom filter encoding where in the file the trigram appears (position mod 8). When two adjacent trigrams in a query must appear at consecutive positions, we check that their loc_masks overlap in a compatible way.

These masks cost just 2 extra bytes per posting entry but eliminate a substantial fraction of false positive candidates. On Chromium, next_mask filtering alone reduced candidate files by 39% for representative patterns.
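
A sketch of the 8-bit next_mask check (hypothetical helper names; the 16-bit variant described next adds a second hash function):

```rust
/// 8-bit next_mask, as stored per posting entry.
/// At index time, set the bit for each byte observed after the trigram;
/// at query time, reject the file if the expected byte's bit is unset.
fn next_mask_bit(next_byte: u8) -> u8 {
    1 << (next_byte % 8)
}

fn may_contain(entry_next_mask: u8, expected_next: u8) -> bool {
    entry_next_mask & next_mask_bit(expected_next) != 0
}
```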

Beyond 8-bit: 16-bit masks

The standard approach, used by GitHub's Blackbird and described in Cursor's blog, uses 8-bit bloom filters. We extended both masks to 16 bits. The false positive rate improvement depends on how rare the trigram's following bytes are:

Following bytes per trigram  8-bit FPR  16-bit FPR  16-bit advantage
1 (very rare trigram)        11.8%      1.4%        8.5x
3                            31.3%      9.8%        3.2x
5                            46.5%      21.6%       2.2x
10                           71.3%      50.9%       1.4x
20+ (common)                 91.8%      84.3%       ~1x (both saturated)

The advantage is largest for rare trigrams, which are exactly the ones that matter most for selective queries. For common trigrams that appear everywhere, both mask widths saturate and neither helps much.

For a multi-trigram query like MAX_FILE_SIZE, the per-trigram false positive rates compound; the table below assumes four trigram checks per query:

Rarity            8-bit combined  16-bit combined  Improvement
1 item/trigram    0.02%           0.0004%          5,246x
5 items/trigram   4.66%           0.22%            21x
10 items/trigram  25.9%           6.7%             3.9x

16-bit masks are dramatically better for rare and selective trigrams (1-5 following bytes per file). The empirical data confirms this: TORCH_INTERNAL_ASSERT (rare trigrams) has only 0.5% FP rate with 16-bit, while common character class patterns saturate at 67% because the index can't filter them at all. The cost is 2 extra bytes per posting entry (6 to 8 bytes), increasing index size by about 33%.

How we calculate false positive rates

The next_mask is a bloom filter. When indexing a file, for each occurrence of a trigram like "MAX", we record which byte follows it by setting bit(s) in the mask. At query time, we check if the expected following byte's bit(s) are set. A false positive happens when the bits are set not because the expected byte actually follows the trigram, but because other bytes hashed to the same bit positions.

The standard bloom filter false positive rate is:

FPR = (1 − e^(−kn/m))^k

Where m is the total bits in the filter (8 or 16), k is the number of hash functions (1 for 8-bit, 2 for 16-bit), and n is the number of unique items inserted (unique following bytes).

The derivation: each hash function maps an item to one of m bit positions. After inserting n items with k hash functions, the probability a specific bit is still 0 is (1 - 1/m)^(kn), which approximates to e^(-kn/m). The probability a specific bit is set is therefore 1 - e^(-kn/m). A false positive requires all k hash positions for the query byte to be set, giving us the formula above.

8-bit, 1 hash (byte % 8)

m=8, k=1. If a file has 10 unique bytes following "MAX":

FPR = (1 − e^(−10/8))^1 = 71.3%

A random byte that does not follow "MAX" in this file has a 71.3% chance of passing the filter.

16-bit, 2 hashes (byte % 16 and (byte>>4 ^ byte&0x0F) % 16)

m=16, k=2. Same 10 unique following bytes:

FPR = (1 − e^(−2×10/16))^2 = 50.9%

Two bits must both be set, which is harder to satisfy by chance.
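
Both worked examples drop straight out of the formula; a quick sanity check in Rust:

```rust
/// Bloom filter false positive rate: (1 - e^(-k*n/m))^k,
/// with m bits, k hash functions, and n unique items inserted.
fn fpr(m: f64, k: f64, n: f64) -> f64 {
    (1.0 - (-k * n / m).exp()).powf(k)
}

fn main() {
    println!("{:.1}%", 100.0 * fpr(8.0, 1.0, 10.0));  // 71.3%
    println!("{:.1}%", 100.0 * fpr(16.0, 2.0, 10.0)); // 50.9%
}
```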

The theoretical formula assumes uniform hashing. In practice, our hash functions aren't perfectly random (byte % 8 has obvious collisions, e.g. 'A' (65) and 'I' (73) both map to bit 1), and the number of unique following bytes varies wildly per trigram per file: common trigrams like "the" might have 50+ unique followers (saturated), while rare trigrams like "ZX_" might have 1-2 (very selective).

The combined FPR for a multi-trigram query is the product of per-trigram FPRs, since a file must pass all trigram checks. The empirical FP rates we measured (0.5% to 74%) are total rates across all files, mixing rare and common trigrams together.

Sparse n-grams

Trigrams are a fixed-width window. Every 3-byte sequence in the file gets indexed, producing a lot of redundant entries. The string kMaxFrameRate yields 11 trigrams, but we don't need all of them to identify the files that contain it.

Sparse n-grams, inspired by GitHub's Blackbird and related to the VGRAM research (VLDB 2007), use a frequency-weighted selection function to choose a subset of variable-length n-grams. The idea: common n-grams like "the" or "int" are poor discriminators and appear in nearly every file. Rare n-grams like "xFr" or "ameR" are highly selective. We want to index the rare ones and skip the common ones.

During indexing, we slide a window over each byte sequence and compute a weight for each n-gram using an inverse-frequency table derived from the corpus. At local weight maxima (positions where the n-gram's weight exceeds its neighbors'), we emit the n-gram into the index. The length varies from 2 to 8 bytes depending on the local frequency landscape.

This produces roughly 30-40% fewer index entries than exhaustive trigram extraction, with better per-entry selectivity. However, we discovered that sparse n-gram queries are unreliable: the peak detection depends on surrounding byte context that may differ between a short query string and a full source file. At query time, we use trigrams only. They're context-independent and guaranteed to be in the index. Sparse n-grams improve index density during build; trigrams ensure query correctness at search time.
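
A sketch of the local-maxima selection under stated assumptions: the weight function is supplied externally, and the emitted n-gram length is fixed at 4 bytes here, where the real indexer varies it from 2 to 8:

```rust
/// Emit variable-length n-grams at strict local maxima of a
/// frequency-derived weight. `weight` scores the trigram starting
/// at each position (higher = rarer = more selective).
fn sparse_ngrams<'a>(text: &'a [u8], weight: impl Fn(&[u8]) -> f64) -> Vec<&'a [u8]> {
    let w: Vec<f64> = text.windows(3).map(|t| weight(t)).collect();
    let mut out = Vec::new();
    for i in 0..w.len() {
        let left = if i == 0 { f64::MIN } else { w[i - 1] };
        let right = if i + 1 == w.len() { f64::MIN } else { w[i + 1] };
        if w[i] > left && w[i] > right {
            // Peak found: emit an n-gram anchored here
            // (fixed 4 bytes in this sketch, 2..=8 in the real indexer).
            let end = (i + 4).min(text.len());
            out.push(&text[i..end]);
        }
    }
    out
}
```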

The on-disk index

The index uses three core files, designed for fast memory-mapped access with no parsing or deserialization:

index.lookup: a sorted array of (hash, offset, count) entries, 24 bytes per entry, mmap'd. Binary search finds any n-gram in O(log n).

index.postings: a flat array of (doc_id, next_mask, loc_mask) entries, 8 bytes per entry, read sequentially. A posting list is read directly at its offset with zero allocation.

index.files: an offset table mapping doc IDs to NUL-terminated file paths, packed contiguously; variable length.

A fourth file, index.skipped, lists files that exceeded the 10MB size limit during indexing. These are scanned with brute-force regex at query time to guarantee 100% recall. The index never silently drops results.

The access pattern is simple: hash the query trigram, binary search the lookup table, read the posting list slice directly from the mmap'd postings file. No parsing. No copying. No heap allocation.
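
A sketch of that path, assuming the little-endian fixed-width layouts above (24-byte lookup rows, 8-byte posting rows); the mmap'd files appear as plain byte slices:

```rust
/// Binary search index.lookup for a trigram hash, then return the
/// posting list as a zero-copy slice of index.postings.
fn find_postings<'a>(lookup: &[u8], postings: &'a [u8], hash: u64) -> Option<&'a [u8]> {
    const LOOKUP_ROW: usize = 24; // (hash: u64, offset: u64, count: u64)
    const POSTING_ROW: usize = 8; // (doc_id: u32, next_mask: u16, loc_mask: u16)
    let (mut lo, mut hi) = (0, lookup.len() / LOOKUP_ROW);
    while lo < hi {
        let mid = (lo + hi) / 2;
        let row = &lookup[mid * LOOKUP_ROW..];
        let h = u64::from_le_bytes(row[..8].try_into().unwrap());
        match h.cmp(&hash) {
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
            std::cmp::Ordering::Equal => {
                let off = u64::from_le_bytes(row[8..16].try_into().unwrap()) as usize;
                let cnt = u64::from_le_bytes(row[16..24].try_into().unwrap()) as usize;
                // Zero-copy: the posting list is just a slice of the mmap.
                return Some(&postings[off..off + cnt * POSTING_ROW]);
            }
        }
    }
    None
}
```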

Making it fast

The index narrows the search from O(n) files to O(k) candidates. But the constants matter. Here's what pushed query times below 10ms.

Zero-copy posting list reads. Early versions allocated a Vec<PostingEntry> for each lookup. We eliminated this entirely. Posting entries are read in-place from the mmap with accessor functions like posting_entry() and posting_doc_id(). Zero allocation during query execution.

Smallest-first intersection. Before intersecting, we estimate the size of each posting list and sort them smallest-first. If any intermediate result is empty, we bail immediately. This makes the common case (rare query terms) a fast path that touches almost no data.
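
A sketch of this ordering, reusing the intersect() helper from earlier:

```rust
/// Order posting lists smallest-first, then fold intersections,
/// bailing out the moment the running result goes empty.
fn intersect_all(mut lists: Vec<&[u32]>) -> Vec<u32> {
    lists.sort_by_key(|l| l.len());
    let mut acc: Vec<u32> = lists.first().map(|l| l.to_vec()).unwrap_or_default();
    for list in lists.iter().skip(1) {
        acc = intersect(&acc, list); // sorted-merge intersection from earlier
        if acc.is_empty() {
            break; // nothing can match; skip the remaining lists
        }
    }
    acc
}
```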

Parallel verification with rayon. Candidate files are verified against the actual regex using rayon's par_iter(). On 16 cores, a 160ms sequential verification pass becomes 10ms.
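
A sketch of the verification pass (the helper is illustrative; the real verifier streams matching lines back, not just paths):

```rust
use rayon::prelude::*;
use regex::bytes::Regex;
use std::path::PathBuf;

/// Verify candidate files against the real regex in parallel.
fn verify(candidates: &[PathBuf], re: &Regex) -> Vec<PathBuf> {
    candidates
        .par_iter() // rayon: one work-stealing task per file
        .filter(|path| {
            std::fs::read(path)
                .map(|bytes| re.is_match(&bytes))
                .unwrap_or(false)
        })
        .cloned()
        .collect()
}
```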

bytes::Regex. We use regex::bytes::Regex instead of regex::Regex, skipping UTF-8 validation entirely. Source code is overwhelmingly ASCII, and we're matching byte patterns.

memchr for line splitting. SIMD-accelerated newline detection from the memchr crate replaces str::lines(). This matters when scanning large candidate files that passed the index filter.

Daemon mode. Process startup (loading the binary, mapping the index, initializing the regex engine) costs about 5ms. For interactive use, polargrep runs as a long-lived daemon that keeps the index mmap'd and warm. Clients connect over a Unix socket and get results in under 5ms for selective patterns.

Scaling to Chromium

Chromium has 489,684 source files. Indexing produces 34 million unique n-grams, 7.3GB of postings data, and a 785MB lookup table. This doesn't fit in memory during a single indexing pass.

We solve this with batched indexing: process files in chunks of 20,000, producing sorted shard files on disk. Then merge all shards into the final index using a streaming K-way merge with a ShardReader abstraction. Each exhausted shard is deleted progressively, keeping disk usage bounded.
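
A sketch of the merge using a min-heap over shard iterators (entries simplified to bare u64 hashes; the real shards carry full posting tuples):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Streaming K-way merge of sorted shard iterators: pop the globally
/// smallest entry, then refill the heap from that entry's shard.
fn kway_merge<I: Iterator<Item = u64>>(mut shards: Vec<I>) -> Vec<u64> {
    let mut heap = BinaryHeap::new();
    for (idx, shard) in shards.iter_mut().enumerate() {
        if let Some(v) = shard.next() {
            heap.push(Reverse((v, idx)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, idx))) = heap.pop() {
        out.push(v); // in the real merge, this streams to index.postings
        if let Some(next) = shards[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    out
}
```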

The resulting index is large but the lookup table is mmap'd, so only the pages we touch get loaded into RAM. For selective queries, that's a handful of pages total. The OS page cache does the rest.

Index freshness. polargrep checks the Git commit hash stored at index time against the current HEAD. If they differ, the index is stale and we fall back to brute-force search while rebuilding in the background. This ensures results are always correct, even when the codebase is changing rapidly under an agent's edits.
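
A sketch of the freshness check; the indexed_commit field stored in the index metadata is an assumption about the layout:

```rust
use std::process::Command;

/// Compare the commit hash recorded at index time with current HEAD.
fn index_is_fresh(repo: &std::path::Path, indexed_commit: &str) -> bool {
    let out = Command::new("git")
        .arg("-C").arg(repo)
        .args(["rev-parse", "HEAD"])
        .output();
    match out {
        Ok(o) if o.status.success() => {
            String::from_utf8_lossy(&o.stdout).trim() == indexed_commit
        }
        _ => false, // can't verify: treat as stale, fall back to brute force
    }
}
```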

Benchmarks: PyTorch (20,931 files)

All benchmarks are daemon mode (Unix socket), minimum of 10 warm runs. Times shown are the minimum observed wall-clock time.

Sparse n-gram indexing dramatically improves regex search speed: a query for quantized_decomposed in PyTorch (20,931 files) on an Apple MacBook M4 takes 1.6ms with the polargrep daemon vs 557ms with ripgrep.

Pattern                polargrep (daemon)  ripgrep  Speedup
quantized_decomposed   1.6ms               557ms    348x
TORCH_INTERNAL_ASSERT  6.1ms               493ms    81x
class\s+\w+Module      10.7ms              497ms    46x
def\s+forward          25.9ms              493ms    19x
import.*torch          39.1ms              459ms    12x
Tensor|Variable        111ms               491ms    4.4x
torch                  221ms               497ms    2.2x

The pattern is clear: the rarer the query, the bigger the speedup. For very rare strings like quantized_decomposed, the posting list intersection narrows immediately to a handful of files: 348x faster. For common strings like torch that appear in thousands of files, the posting lists are large and verification dominates. Still 2.2x faster than ripgrep, but the index advantage shrinks.

Accuracy is 100% across all 12 tested patterns: zero missed files, zero false positives. Every query returns exactly the same results as ripgrep.

Benchmarks: Chromium (489,684 files)

Chromium is the ultimate stress test. These benchmarks use daemon mode (Unix socket) with the index kept warm in memory.

Client-side indexing dramatically improves grep speed: a query for kMaxFrameRate in Chromium (489,684 files) on an Apple MacBook M4 takes 4.3ms with the polargrep daemon and 33.9ms with the polargrep CLI, vs 9.3s with ripgrep.

The headline number: 4.3ms to find kMaxFrameRate across 489,684 files. That's 2,167x faster than ripgrep's 9.3 seconds.

Pattern                            Socket  CLI     ripgrep  Speedup
kMaxFrameRate (20 files)           4.3ms   33.9ms  9,318ms  2,167x
MAX_FILE_SIZE (15 files)           7.4ms   38.6ms  9,225ms  1,246x
void\s+OnError (463 files)         17.2ms  48.1ms  9,405ms  547x
class\s+\w+Observer (4,233 files)  451ms   501ms   9,213ms  20x

Even regex patterns like void\s+OnError complete in 17.2ms, 547x faster. The one case where the speedup is more modest is class\s+\w+Observer, which matches 4,233 files. The posting list intersection produces a large candidate set and verification dominates. But 451ms vs 9.2 seconds is still 20x.

Accuracy remains 100%: every pattern returns exactly the same files as ripgrep across the entire Chromium codebase.

What polargrep does beyond Cursor's Instant Grep

Cursor's blog describes the core architecture: trigram inverted index, sparse n-grams with frequency weights, 8-bit probabilistic masks, and a two-file mmap/disk layout. polargrep implements all of that, then pushes further in several areas that meaningfully improve either filtering quality, query latency, or operational flexibility.

16-bit probabilistic masks with dual hashing

Cursor uses 8-bit bloom filters for next_mask and loc_mask. polargrep expands these to 16 bits with two hash functions (h1 = byte % 16, h2 = ((byte >> 4) ^ (byte & 0x0F)) % 16). This matters because 8-bit masks saturate quickly: a trigram appearing in a source file with just 10 unique following bytes has a 71.3% false positive rate per trigram with 8-bit/1-hash, but only 50.9% with 16-bit/2-hash. For a 4-trigram query, the combined FPR drops from 25.9% to 6.7%, a 3.9x improvement. For rare trigrams (1-3 following bytes), the improvement is 8.5x per trigram and over 5,000x combined. The cost is 2 extra bytes per posting entry (from 6 to 8 bytes), which increases index size ~33% but dramatically reduces the number of candidate files that need full regex verification.

loc_mask pairwise adjacency verification

Cursor describes using loc_mask to verify that consecutive trigrams are adjacent (shift and AND). polargrep implements this as a LiteralSequence query plan node. During posting list intersection, loc_masks are collected for all trigrams per candidate document. After intersection, pairwise adjacency is checked between every consecutive trigram pair using rotate_left_16(loc_mask_A, stride) & loc_mask_B != 0, where stride is the byte offset between trigram positions in the query literal. This eliminates files that contain all the required trigrams but not adjacently, for example a file containing both "MAX" and "FILE" but not "MAX_FILE".
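
The check itself is one rotate and one AND per trigram pair; a sketch with 16-bit masks:

```rust
/// Pairwise adjacency check between consecutive trigrams A and B.
/// loc_mask is a 16-bit bitmap of (position % 16) occupancy: if A
/// occurs at position p, B must occur at p + stride, so rotating A's
/// mask by the stride must overlap B's mask somewhere.
fn adjacent(loc_mask_a: u16, loc_mask_b: u16, stride: u32) -> bool {
    loc_mask_a.rotate_left(stride % 16) & loc_mask_b != 0
}
```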

memchr::memmem SIMD literal fast-path

For pure literal patterns (no regex metacharacters), polargrep bypasses the regex engine entirely. An is_literal() check detects these patterns and uses memchr::memmem::Finder, which employs SIMD-accelerated substring search (Generic SIMD prefilter processing 16-32 bytes per iteration on AVX2/NEON). This is 6.5x faster than regex::bytes::Regex for known literals because it avoids regex compilation overhead, NFA/DFA construction, and the generality tax of the regex engine's matching loop.
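
A sketch of the fast path (memmem::Finder is the real memchr API; the surrounding harness is illustrative):

```rust
use memchr::memmem;

/// Literal fast path: build the SIMD-accelerated searcher once,
/// then reuse it across all candidate files.
fn literal_match_ids(files: &[Vec<u8>], needle: &[u8]) -> Vec<usize> {
    let finder = memmem::Finder::new(needle);
    files
        .iter()
        .enumerate()
        .filter(|(_, contents)| finder.find(contents.as_slice()).is_some())
        .map(|(i, _)| i)
        .collect()
}
```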

Guaranteed 100% recall via skipped-file fallback

A subtle accuracy problem with index-based search: files exceeding the max file size (10MB) are excluded from the index. Without mitigation, these files would be silently missed, causing recall to drop below ripgrep. polargrep solves this by recording skipped files during indexing in index.skipped. At query time, these files are scanned with brute-force regex in parallel alongside the indexed search. In daemon mode, the skipped list is loaded once at startup (no per-query disk read). This guarantees recall matches ripgrep exactly, verified across 12 patterns on PyTorch and 6 on Chromium with zero misses.

Trigram-only query decomposition (correctness fix)

Sparse n-grams are critical for indexing since they provide high selectivity for long literals. But using them at query time is unsafe. The sparse n-gram boundaries depend on the byte-pair weights of surrounding context, which differs between the isolated query literal and the same literal embedded in a source file. For example, querying for "Module" might generate sparse n-gram "Module" (6 bytes), but in a file containing "VFModule", the peak analysis produces different boundaries. If the query looks up a sparse n-gram hash that wasn't emitted during indexing, the file is incorrectly excluded. polargrep's build_covering() returns only trigrams at query time. Trigrams are context-independent: every overlapping 3-byte window is guaranteed to be in the index regardless of surrounding bytes. This fixes a class of recall bugs while sparse n-grams in the index continue to provide selectivity for the posting lists.

Daemon mode with pre-cached state

CLI process startup on macOS costs ~4.5ms (binary load + dynamic linker + clap init), plus ~3.5ms for fs::canonicalize and mmap'ing the index files. For an agent making dozens of searches per task, this adds up. polargrep's serve command runs a long-lived daemon that keeps the index mmap'd, the skipped-file list cached, and the lookup table pre-warmed via MADV_WILLNEED. Queries arrive over a Unix socket as single-line JSON and results stream back immediately. Measured query time via raw socket: 1.6ms for a 19-file result on PyTorch, 4.3ms for a 20-file result on Chromium (490K files).
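
A minimal client sketch; the socket path and JSON shape here are illustrative, not polargrep's actual wire format:

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;

/// Send a single-line JSON query to the daemon and collect the
/// result lines streamed back over the same socket.
fn query_daemon(pattern: &str) -> std::io::Result<Vec<String>> {
    let mut stream = UnixStream::connect("/tmp/polargrep.sock")?;
    writeln!(stream, r#"{{"pattern": {:?}}}"#, pattern)?;
    let reader = BufReader::new(stream);
    reader.lines().collect() // one result line per match
}
```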

Zero-allocation indexing pipeline

The original indexer allocated a Ngram(Vec<u8>) per trigram, roughly 100K heap allocations per 100KB source file. extract_index_entries() computes (hash, doc_id, next_mask, loc_mask) tuples directly via xxh3_64(&text[i..i+3]) with no intermediate allocations. Combined with par_sort_unstable (parallel sort via rayon) and build_parallel() (multi-threaded directory walk), indexing is 45% faster: 8.3s to 4.6s for PyTorch, producing a byte-identical index (verified via MD5).

Batched K-way merge for large repos

Chromium has 490K files producing 34M unique n-grams. Holding all posting entries in memory would require 10+ GB. polargrep processes files in batches of 50K, writes each batch as a sorted shard file, then performs a streaming K-way merge directly into the final postings and lookup files. Exhausted shards are deleted as they're consumed, keeping peak disk usage bounded. This allowed indexing Chromium on a machine with 31GB free disk.

Conclusion

polargrep demonstrates that sparse n-gram indexing with probabilistic masks can reduce regex search latency by three orders of magnitude on large codebases. The key techniques (trigram inverted indexes, bloom filter masks on posting entries, frequency-weighted sparse n-gram selection, batched K-way merge indexing, and daemon mode with mmap'd indexes) combine to deliver sub-5ms search across half a million files with 100% accuracy.

The code is written in Rust. All benchmarks are reproducible.

References

[1] Cursor: Fast regex search: indexing text for agent tools (March 2026)

[2] Zobel, Moffat, Sacks-Davis: Searching Large Lexicons for Partially Specified Terms (1993)

[3] Russ Cox: Regular Expression Matching with a Trigram Index (2012)

[4] VGRAM: Variable-length gram indexing (VLDB 2007)

[5] BitFunnel: Bit-sliced document signatures (SIGIR 2017 Best Paper)

[6] GitHub Blackbird: Sparse n-grams in Rust

[7] Zoekt / Sourcegraph: Positional trigrams

[8] Livegrep: Suffix array search (libdivsufsort)

[9] Tantivy: SIMD bitpacked posting lists in Rust

[10] Hyperscan: Teddy SIMD multi-pattern matching