Does TurboQuant Hurt Search Quality? We Compressed 20× to Find Out
When Google's TurboQuant made the rounds — "extreme vector compression with near-optimal distortion," a memory-chip-stock wobble, an academic spat with the RaBitQ authors — one nagging question stayed unanswered by the headlines:
If you actually run it on a real corpus with real relevance judgments, does search quality survive?
So our team built the experiment, end to end, on a single laptop. No cloud, no API keys, no hand-waving. Here's what we found, and the five things we learned along the way — including a fairness mistake we almost shipped.
The setup: everything local, nothing hidden
We wanted results we could trust and anyone could reproduce, so the whole stack runs on one machine:
Piece | Choice |
Vector DB | Milvus v2.6.13 (Docker) |
Embeddings | Ollama + qwen3-embedding:4b → 2560-dim vectors |
Corpora | BEIR — NFCorpus (3,633 docs) + SciFact (5,183 docs), with ground-truth qrels |
Hardware | Apple M5 Pro, 15 cores, 48 GB unified memory, Metal |
Two corpora on purpose: NFCorpus is densely judged (~38 relevant docs/query), which makes the quality metrics stable enough to detect small changes; SciFact is the clean canonical sanity check.
We measured quality two different ways, because they answer different questions:
Qrels-based quality (nDCG@10, Recall@100): did we still retrieve the humanly relevant documents?
ANN recall (vs. exact search): did quantization change which documents we retrieve, compared to brute-force full-precision search? This is the sensitive needle.
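To make the two kinds of measurement concrete, here is a minimal sketch of the three metrics. It assumes graded qrels stored as a `doc_id -> relevance` dict and linear gain in nDCG (the trec_eval-style convention); the function names are illustrative and not taken from the benchmark code.

```python
import math

def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k with linear gain; qrels maps doc_id -> graded relevance (0 = not relevant)."""
    dcg = sum(qrels.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, qrels, k=100):
    """Fraction of the judged-relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

def ann_recall_at_k(approx_ids, exact_ids, k=100):
    """Overlap between a quantized top-k and the exact full-precision top-k."""
    exact = set(exact_ids[:k])
    if not exact:
        return 0.0
    return len(exact & set(approx_ids[:k])) / len(exact)
```

The first two functions need human judgments; the third needs only the exact search results, which is why it can detect changes that the qrels-based metrics average away.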
What TurboQuant actually does
It's a data-oblivious quantizer: no training, no learned codebooks, no fitting pass over your data. It works in three steps:
(1) Randomly rotate each vector so its coordinates become near-Gaussian and independent; (2) apply a per-coordinate scalar quantizer that's near-optimal for that known distribution; (3) optionally add a 1-bit residual sketch to sharpen the inner-product estimate.
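The same rotation is applied to queries and documents, and a rotation preserves inner products, so the rotation itself leaves the ranking unchanged; only the quantization step introduces error. We implemented a faithful reference from the paper (fast Walsh–Hadamard rotation + Lloyd–Max quantizer + 1-bit residual) and swept 1–4 bits per coordinate.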
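The first two steps can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation used for the results: it covers only the 1-bit case, omits the optional residual stage, and assumes the rotation is a single random sign flip followed by a Walsh–Hadamard transform on a vector zero-padded to the next power of two. The class and method names are hypothetical.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis (length must be a power of two)."""
    x = np.array(x, dtype=np.float64)  # copy so the caller's array is untouched
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

class OneBitRotationQuantizer:
    """Hypothetical 1-bit sketch: random signs + Hadamard rotation, then sign quantization."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.n = 1 << (dim - 1).bit_length()  # next power of two (padding is an assumption)
        rng = np.random.default_rng(seed)
        self.signs = rng.choice([-1.0, 1.0], size=self.n)

    def rotate(self, x):
        x = np.atleast_2d(np.asarray(x, dtype=np.float64))
        padded = np.zeros((x.shape[0], self.n))
        padded[:, :self.dim] = x
        return fwht(padded * self.signs)

    def encode(self, docs):
        """Return packed sign bits (n/8 bytes per vector) and float32 norms."""
        rot = self.rotate(docs)
        norms = np.linalg.norm(rot, axis=1).astype(np.float32)
        bits = np.packbits(rot >= 0, axis=1)
        return bits, norms

    def score(self, query, bits, norms):
        """Estimate inner products between one full-precision query and the encoded docs."""
        q = self.rotate(query)[0]
        signs = np.unpackbits(bits, axis=1)[:, :self.n].astype(np.float64) * 2.0 - 1.0
        # 1-bit Lloyd-Max level for N(0, 1/n) coordinates of a unit vector
        level = np.sqrt(2.0 / np.pi) / np.sqrt(self.n)
        return (signs @ q) * level * norms
```

For 2560-dim input this layout gives 512 bytes of sign bits plus a 4-byte norm, which is consistent with the ~516 bytes quoted later for TurboQuant b1, though the post does not spell out its exact storage layout.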
The headline: recall barely moves
Here's full precision vs. TurboQuant across the bit sweep:
NFCorpus
Config | Compression | nDCG@10 | Recall@100 | ANN recall@100 |
Full precision | 1× | 0.4019 | 0.3687 | 1.000 |
TurboQuant b1 (mse) | 19.8× | 0.3987 | 0.3655 | 0.862 |
TurboQuant b1 (prod) | 9.9× | 0.4006 | 0.3703 | 0.925 |
TurboQuant b4 (prod) | 4.0× | 0.4012 | 0.3670 | 0.985 |
SciFact
Config | Compression | nDCG@10 | Recall@100 | ANN recall@100 |
Full precision | 1× | 0.7730 | 0.9733 | 1.000 |
TurboQuant b1 (mse) | 19.8× | 0.7662 | 0.9700 | 0.883 |
TurboQuant b3 (prod) | 5.0× | 0.7747 | 0.9733 | 0.978 |
TurboQuant b4 (prod) | 4.0× | 0.7731 | 0.9733 | 0.988 |
At ~20× compression, nDCG@10 moves by ~0.003–0.007. The quality curve is essentially flat:

That's the finding in one picture: you can throw away ~95% of the bytes and the ranking quality barely flinches. And because the quantizer is data-oblivious, encoding the entire corpus took about a second, with no training step at all.
Lesson 1 — nDCG looked too good, which is itself a warning
On SciFact, TurboQuant b3 scored nDCG@10 0.7747 — higher than full precision's 0.7730. Quantization doesn't add information; a quantized index beating the original is the tell that these differences are noise at 300 queries, not real gains.
The honest read isn't "quantization improved search" — it's "statistically indistinguishable from full precision." We report it that way, and flag that we ran no formal significance tests.
This is why we leaned on ANN recall as the real signal.
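For readers who want to go one step further than we did, one common way to check whether an nDCG gap is more than noise is a paired bootstrap over per-query scores. The sketch below is not part of our pipeline; it assumes you have two equal-length arrays of per-query nDCG@10 (one per configuration, same query order), and the function name is hypothetical.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the mean per-query difference (a - b)."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both systems must be scored on the same queries")
    diffs = a - b
    rng = np.random.default_rng(seed)
    # Resample queries with replacement, keeping each query's pair together
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lo), float(hi)
```

If the interval straddles zero, the data do not distinguish the two configurations, which is the reading we would expect for the 0.7747 vs 0.7730 gap.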
Lesson 2 — the 1-bit residual earns its keep
nDCG was too flat to separate the variants, but ANN recall (agreement with exact search) tells a clean story: it climbs with the bit budget, and the "prod" variant's 1-bit residual lifts recall at every budget.

One extra bit per coordinate buys a real, consistent jump on both datasets. The algorithm's second stage isn't decoration.
Lesson 3 — the fairness trap we almost fell into
We also benchmarked Milvus's native quantizers. Our first cut had IVF_PQ looking terrible — nDCG@10 collapsing to 0.29 on NFCorpus and 0.52 on SciFact. Easy headline: "PQ is bad, TurboQuant wins."
It would have been wrong. We'd left PQ at m=16: 16 bytes per vector, a ~640× squeeze, while the 1-bit methods sit around 320–516 bytes per vector (IVF_RABITQ at 320, TurboQuant b1 at ~516). We were comparing a method at 640× against methods at ~20–32×. Apples to anvils.
Re-running PQ at m=320 (320 bytes/vector, byte-comparable) changed the picture completely: nDCG@10 jumped to 0.3920 on NFCorpus and 0.7477 on SciFact. Now it's genuinely competitive.


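At a comparable budget of roughly 320–516 bytes per vector, the ANN-recall ranking is as follows (IVF_SQ8 and IVF_PQ at m=16 are included as off-budget reference points):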
Method | Bytes/vec | ANN recall@100 (NFCorpus / SciFact) |
TurboQuant b1 (mse) | ~516 | 0.862 / 0.883 |
Milvus IVF_RABITQ | 320 | 0.842 / 0.870 |
Milvus IVF_PQ (m=320) | 320 | 0.820 / 0.812 |
Milvus IVF_SQ8 | 2,560 | 0.984 / 0.986 (8× the bytes) |
Milvus IVF_PQ (m=16) | 16 | 0.490 / 0.507 (640× squeeze) |
Lesson: a compression benchmark without an equal-bytes axis is marketing, not measurement.
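The equal-bytes check is simple arithmetic, sketched below. It assumes full precision means float32 (4 bytes per dimension, so 10,240 bytes for 2560 dims) and that PQ uses 8 bits per sub-quantizer; codebook and index overheads are ignored. The byte counts for the other methods are copied from the table above.

```python
DIM = 2560
BYTES_PER_FLOAT32 = 4  # assumption: full-precision vectors are float32

def full_precision_bytes(dim=DIM):
    return dim * BYTES_PER_FLOAT32

def pq_code_bytes(m, nbits=8):
    """Per-vector PQ code size: m sub-quantizers at nbits each (codebooks not counted)."""
    return (m * nbits + 7) // 8

def compression_ratio(bytes_per_vector, dim=DIM):
    return full_precision_bytes(dim) / bytes_per_vector

budgets = {
    "IVF_PQ (m=16)": pq_code_bytes(16),
    "IVF_PQ (m=320)": pq_code_bytes(320),
    "IVF_RABITQ": 320,
    "TurboQuant b1 (mse)": 516,
    "IVF_SQ8": 2560,
}

for name, nbytes in budgets.items():
    print(f"{name:<22} {nbytes:>5} B  {compression_ratio(nbytes):6.1f}x")
```

Printing the ratio next to every method makes a mismatch like 640× versus 32× hard to miss before any quality number is compared.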
Lesson 4 — the bottleneck was the boring part
We expected quantization or indexing to dominate. They didn't:
Quantize the whole corpus: ~1 second (no training).
Milvus index build: 3–5 seconds.
Embedding each corpus: ~15–20 minutes at ~4 docs/sec on the 4B model.
In a real local RAG stack, your time and energy go into embedding, not compression. Quantization is essentially free.
Lesson 5 — the honest caveat, and what it means for you
TurboQuant's vector-search superiority is contested — there's an active priority dispute with the RaBitQ authors, and its strongest uncontested results are for LLM KV-cache, not ANN search. Our numbers agree with the skeptics: TurboQuant is strong, not a step-change, for retrieval.
And here's the practical punchline: Milvus's IVF_RABITQ — an independent, production-grade rotation-plus-1-bit quantizer — lands within a couple of points of our from-scratch TurboQuant (ANN@100 0.842/0.870 vs 0.862/0.883).
That's reassuring two ways: it says the reference isn't a fluke, and it says you don't need to wait for an official TurboQuant — the same ~32× memory win is one index parameter away in Milvus today (IVF_RABITQ, or IVF_SQ8 if you can spare the bytes for near-lossless recall).
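As a rough illustration of what "one index parameter away" looks like, the sketch below builds index specs for the three native quantizers and applies one through pymilvus. The field name `embedding`, the inner-product metric, and `nlist=128` are assumptions for illustration and are not taken from our benchmark configuration; check the Milvus documentation for the tuning parameters each index type accepts in your version.

```python
def quantized_index_spec(index_type, dim=2560, nlist=128, m=320, nbits=8,
                         field_name="embedding", metric_type="IP"):
    """Build keyword arguments for IndexParams.add_index (names and defaults are illustrative)."""
    if index_type == "IVF_PQ":
        if dim % m != 0:
            raise ValueError(f"PQ needs dim divisible by m, got dim={dim}, m={m}")
        params = {"nlist": nlist, "m": m, "nbits": nbits}
    elif index_type in ("IVF_RABITQ", "IVF_SQ8"):
        params = {"nlist": nlist}
    else:
        raise ValueError(f"unsupported index type in this sketch: {index_type}")
    return {
        "field_name": field_name,
        "index_type": index_type,
        "metric_type": metric_type,
        "params": params,
    }

def apply_index(client, collection_name, spec):
    """client is a pymilvus.MilvusClient connected to a running Milvus instance."""
    index_params = client.prepare_index_params()
    index_params.add_index(**spec)
    client.create_index(collection_name=collection_name, index_params=index_params)
```

Switching between quantizers then means changing only the `index_type` string passed to `quantized_index_spec`, which keeps the rest of the ingestion code identical across runs.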
The takeaway
For a local RAG / vector-search stack, TurboQuant-style scalar-plus-residual quantization cut vector memory ~10–20× on both corpora with negligible loss in retrieval quality, at essentially no compute cost and with no training. Whether you reach for a research implementation or just flip on Milvus's native IVF_RABITQ, the memory math is now firmly in your favor, as long as you remember to compare at equal bytes.
This is the kind of grounded, reproducible evaluation we do at Shorthills AI when we design retrieval and RAG systems for production — measuring what actually matters, at honest budgets, before it ships.
Reproduce it yourself
Full code, the 14-page PDF report, and all the logs are here:
uv venv --python 3.12
uv pip install pymilvus numpy scipy psutil pyyaml tabulate ollama "beir>=2.0" "markitdown[all]"
ollama pull qwen3-embedding:4b
docker compose up -d
./run_all.sh
python -m common.summary
Built and benchmarked on an Apple M5 Pro. Everything ran locally; no document or vector left the machine.


