Does TurboQuant Hurt Search Quality? We Compressed 20× to Find Out

Jun 17

When Google's TurboQuant made the rounds — "extreme vector compression with near-optimal distortion," a memory-chip-stock wobble, an academic spat with the RaBitQ authors — one nagging question stayed unanswered by the headlines:

If you actually run it on a real corpus with real relevance judgments, does search quality survive?

So our team built the experiment, end to end, on a single laptop. No cloud, no API keys, no hand-waving. Here's what we found, and the five things we learned along the way — including a fairness mistake we almost shipped.

The setup: everything local, nothing hidden

We wanted results we could trust and anyone could reproduce, so the whole stack runs on one machine:

Piece	Choice
Vector DB	Milvus v2.6.13 (Docker)
Embeddings	Ollama + qwen3-embedding:4b → 2560-dim vectors
Corpora	BEIR — NFCorpus (3,633 docs) + SciFact (5,183 docs), with ground-truth qrels
Hardware	Apple M5 Pro, 15 cores, 48 GB unified memory, Metal

Two corpora on purpose: NFCorpus is densely judged (~38 relevant docs/query), which makes the quality metrics stable enough to detect small changes; SciFact is the clean canonical sanity check.

We measured quality two different ways, because they answer different questions:

Qrels-based quality (nDCG@10, Recall@100): did we still retrieve the humanly relevant documents?
ANN recall (vs. exact search): did quantization change which documents we retrieve, compared to brute-force full-precision search? This is the sensitive needle.
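
To make the two measurements concrete, here is a minimal Python sketch of both. The function names are illustrative, and the nDCG variant (linear gain, log2 discount) is an assumption; the evaluation tooling used in the experiment may differ in details.

```python
import math

def ann_recall_at_k(approx_ids, exact_ids, k=100):
    """Fraction of the exact (brute-force) top-k that the approximate top-k also returns."""
    exact = set(exact_ids[:k])
    if not exact:
        return 0.0
    return len(exact & set(approx_ids[:k])) / len(exact)

def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query. qrels maps doc_id -> graded relevance (missing = 0)."""
    dcg = sum(qrels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Note that ANN recall needs no human judgments at all: it only compares two ranked lists, which is why it can register changes that qrels-based metrics smooth over.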

What TurboQuant actually does

It's a data-oblivious quantizer — no training, no learned codebooks, no pass over your data.

(1) Randomly rotate each vector so its coordinates become near-Gaussian and independent; (2) apply a per-coordinate scalar quantizer that's near-optimal for that known distribution; (3) optionally add a 1-bit residual sketch to sharpen the inner-product estimate.

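The same rotation is applied to queries and documents. Because a rotation is orthogonal, it leaves inner products unchanged, so any ranking change comes from the quantization step alone. We implemented a faithful reference from the paper (fast Walsh–Hadamard rotation + Lloyd–Max quantizer + 1-bit residual) and swept 1–4 bits per coordinate.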
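
As a rough illustration of steps (1) and (2) only, here is a minimal NumPy sketch. It is not the code from the experiment. Several choices are assumptions made for the sketch: the rotation is a single random sign flip followed by a Walsh–Hadamard transform, vectors are zero-padded to the next power of two, the norm is stored separately, and the codebook is fitted with Lloyd's algorithm on synthetic N(0,1) samples rather than on any corpus. The 1-bit residual stage is omitted.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform over the last axis (power-of-2 length)."""
    x = np.asarray(x, dtype=np.float64)
    n = x.shape[-1]
    y = x.reshape(-1, n)
    h = 1
    while h < n:
        y = y.reshape(-1, n // (2 * h), 2, h)
        y = np.stack([y[:, :, 0] + y[:, :, 1], y[:, :, 0] - y[:, :, 1]], axis=2)
        y = y.reshape(-1, n)
        h *= 2
    return (y / np.sqrt(n)).reshape(x.shape)

def make_signs(dim, seed=0):
    """Random +/-1 diagonal, sized to the next power of two (e.g. 2560 -> 4096)."""
    n = 1 << (dim - 1).bit_length()
    return np.random.default_rng(seed).choice([-1.0, 1.0], size=n)

def gaussian_codebook(bits, n_samples=100_000, iters=30, seed=0):
    """Approximate Lloyd-Max centroids for N(0,1); uses synthetic samples, never the corpus."""
    s = np.random.default_rng(seed).standard_normal(n_samples)
    levels = 2 ** bits
    cent = np.quantile(s, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (cent[:-1] + cent[1:]) / 2
        idx = np.searchsorted(edges, s)
        cent = np.array([s[idx == j].mean() for j in range(levels)])
    return cent

def encode(X, signs, cent):
    """Normalize, pad, rotate, then quantize each coordinate to its nearest centroid."""
    n = signs.shape[0]
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = np.zeros((X.shape[0], n))
    U[:, : X.shape[1]] = X / norms
    Z = fwht(U * signs) * np.sqrt(n)          # coordinates now have roughly unit variance
    edges = (cent[:-1] + cent[1:]) / 2
    return np.searchsorted(edges, Z).astype(np.uint8), norms

def decode(codes, norms, signs, cent, dim):
    """Look up centroids, undo the rotation, drop the padding, restore the norm."""
    n = signs.shape[0]
    U_hat = fwht(cent[codes] / np.sqrt(n)) * signs
    return U_hat[:, :dim] * norms
```

In this sketch a query would be scored against `decode(...)` output with a plain dot product. A real implementation would store the codes bit-packed and estimate inner products without reconstructing full vectors.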

The headline: recall barely moves

Here's full precision vs. TurboQuant across the bit sweep:

NFCorpus

Config	Compression	nDCG@10	Recall@100	ANN recall@100
Full precision	1×	0.4019	0.3687	1.000
TurboQuant b1 (mse)	19.8×	0.3987	0.3655	0.862
TurboQuant b1 (prod)	9.9×	0.4006	0.3703	0.925
TurboQuant b4 (prod)	4.0×	0.4012	0.3670	0.985

SciFact

Config	Compression	nDCG@10	Recall@100	ANN recall@100
Full precision	1×	0.7730	0.9733	1.000
TurboQuant b1 (mse)	19.8×	0.7662	0.9700	0.883
TurboQuant b3 (prod)	5.0×	0.7747	0.9733	0.978
TurboQuant b4 (prod)	4.0×	0.7731	0.9733	0.988

At ~20× compression (TurboQuant b1, mse), nDCG@10 drops by only ~0.003–0.007, so the quality curve is essentially flat.

That's the finding in short: you can throw away ~95% of the bytes and the ranking quality barely flinches. And because the quantizer is data-oblivious, encoding the entire corpus took about a second, with no training step at all.

Lesson 1 — nDCG looked too good, which is itself a warning

On SciFact, TurboQuant b3 scored nDCG@10 0.7747 — higher than full precision's 0.7730. Quantization doesn't add information; a quantized index beating the original is the tell that these differences are noise at 300 queries, not real gains.

The honest read isn't "quantization improved search" — it's "statistically indistinguishable from full precision." We report it that way, and flag that we ran no formal significance tests.
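
For readers who want to run the missing significance check themselves, one common option is a paired bootstrap over per-query scores. The sketch below is an illustration, not something we ran; the function name and defaults are assumptions, and the inputs would be per-query nDCG@10 values for two configurations on the same queries.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Return (mean difference, lower, upper) for mean(scores_b - scores_a).

    Queries are resampled with replacement; pairing is kept by resampling
    the per-query differences rather than each system separately.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both systems must be scored on the same queries")
    diff = b - a
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    means = diff[idx].mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lower), float(upper)
```

If the resulting interval contains zero, the data do not support calling either configuration better, which is the reading we give the nDCG numbers above.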

This is why we leaned on ANN recall as the real signal.

Lesson 2 — the 1-bit residual earns its keep

nDCG was too flat to separate the variants, but ANN recall (agreement with exact search) tells a clean story: it climbs with the bit budget, and the "prod" variant's 1-bit residual lifts recall at every budget.

One extra bit per coordinate buys a real, consistent jump on both datasets. The algorithm's second stage isn't decoration.

Lesson 3 — the fairness trap we almost fell into

We also benchmarked Milvus's native quantizers. Our first cut had IVF_PQ looking terrible — nDCG@10 collapsing to 0.29 on NFCorpus and 0.52 on SciFact. Easy headline: "PQ is bad, TurboQuant wins."

It would have been wrong. We'd left PQ at m=16: 16 bytes per vector, a ~640× squeeze, while TurboQuant b1 sits around 320–516 bytes. We were comparing a method at 640× against methods at ~20–32×. Apples to anvils.

Re-running PQ at m=320 (320 bytes/vector — byte-comparable) changed the picture completely: nDCG@10 jumped to 0.3920 / 0.7477. Now it's genuinely competitive.

At a roughly comparable budget (320 bytes for the Milvus indexes, ~516 for our TurboQuant b1), the ANN-recall ranking is:

Method	Bytes/vec	ANN recall@100 (NFCorpus / SciFact)
TurboQuant b1 (mse)	~516	0.862 / 0.883
Milvus IVF_RABITQ	320	0.842 / 0.870
Milvus IVF_PQ (m=320)	320	0.820 / 0.812
Milvus IVF_SQ8	2,560	0.984 / 0.986 (8× the bytes)
Milvus IVF_PQ (m=16)	16	0.490 / 0.507 (640× squeeze)

Lesson: a compression benchmark without an equal-bytes axis is marketing, not measurement.
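
The equal-bytes check is simple arithmetic, sketched here in Python. It assumes full precision means float32 (4 bytes per dimension, so 10,240 bytes for a 2560-dim vector), which matches the 640× figure quoted for 16-byte codes. The byte counts are the ones from the table above.

```python
DIM = 2560  # embedding dimension from the setup table

def compression_ratio(bytes_per_vector, dim=DIM):
    """Compression relative to float32 storage (4 bytes per dimension, an assumption)."""
    return dim * 4 / bytes_per_vector

budgets = {
    "IVF_PQ (m=16)": 16,
    "IVF_PQ (m=320)": 320,
    "IVF_RABITQ": 320,
    "TurboQuant b1 (mse)": 516,
    "IVF_SQ8": 2560,
}

for name, nbytes in budgets.items():
    print(f"{name:20s} {nbytes:5d} B/vec  {compression_ratio(nbytes):6.1f}x")
```

Putting every method on this one axis before comparing recall is what exposed the m=16 mismatch.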

Lesson 4 — the bottleneck was the boring part

We expected quantization or indexing to dominate. They didn't:

Quantize the whole corpus: ~1 second (no training).
Milvus index build: 3–5 seconds.
Embedding each corpus: ~15–20 minutes at ~4 docs/sec on the 4B model.

In a real local RAG stack, your time and energy go into embedding, not compression. Quantization is essentially free.

Lesson 5 — the honest caveat, and what it means for you

TurboQuant's vector-search superiority is contested — there's an active priority dispute with the RaBitQ authors, and its strongest uncontested results are for LLM KV-cache, not ANN search. Our numbers agree with the skeptics: TurboQuant is strong, not a step-change, for retrieval.

And here's the practical punchline: Milvus's IVF_RABITQ — an independent, production-grade rotation-plus-1-bit quantizer — lands within a couple of points of our from-scratch TurboQuant (ANN@100 0.842/0.870 vs 0.862/0.883).

That's reassuring in two ways: it says the reference isn't a fluke, and it says you don't need to wait for an official TurboQuant. A comparable ~32× memory win is one index parameter away in Milvus today (IVF_RABITQ, or IVF_SQ8 if you can spare the bytes for near-lossless recall).
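
As a sketch of what "one index parameter away" could look like, here are index settings written as plain Python dictionaries. These are illustrative, not the configuration from the experiment: `nlist=128` and the `IP` metric are assumed values, and the dictionary keys follow the shape that pymilvus index parameters usually take (`index_type`, `metric_type`, `params`). Check the Milvus documentation for your version before relying on any of them.

```python
DIM = 2560  # embedding dimension from the setup table

def pq_code_bytes(m, nbits=8):
    """Bytes per product-quantization code: m sub-vectors at nbits each."""
    return m * nbits // 8

# Hypothetical settings; nlist and metric_type are illustrative assumptions.
INDEX_CONFIGS = {
    "IVF_RABITQ": {"index_type": "IVF_RABITQ", "metric_type": "IP",
                   "params": {"nlist": 128}},
    "IVF_SQ8": {"index_type": "IVF_SQ8", "metric_type": "IP",
                "params": {"nlist": 128}},
    # PQ requires the dimension to be divisible by m.
    "IVF_PQ_m320": {"index_type": "IVF_PQ", "metric_type": "IP",
                    "params": {"nlist": 128, "m": 320, "nbits": 8}},
    "IVF_PQ_m16": {"index_type": "IVF_PQ", "metric_type": "IP",
                   "params": {"nlist": 128, "m": 16, "nbits": 8}},
}

# With pymilvus, an entry would typically be applied along these lines
# (the field name "embedding" is hypothetical):
#   index_params = client.prepare_index_params()
#   index_params.add_index(field_name="embedding", **INDEX_CONFIGS["IVF_RABITQ"])
```

Switching between the quantizers is then a matter of choosing a different entry, which is what makes an equal-bytes comparison cheap to set up.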

The takeaway

For a local RAG / vector-search stack, TurboQuant-style scalar-plus-residual quantization cuts vector memory ~10–20× with negligible loss in retrieval quality, for free, with no training. Whether you reach for a research implementation or just flip on Milvus's native IVF_RABITQ, the memory math is now firmly in your favor — as long as you remember to compare at equal bytes.

This is the kind of grounded, reproducible evaluation we do at Shorthills AI when we design retrieval and RAG systems for production — measuring what actually matters, at honest budgets, before it ships.

Reproduce it yourself

Full code, the 14-page PDF report, and all the logs are here:

→ github.com/shorthills-ai/turboquant-vector-search-experiment

uv venv --python 3.12
uv pip install pymilvus numpy scipy psutil pyyaml tabulate ollama "beir>=2.0" "markitdown[all]"

ollama pull qwen3-embedding:4b

docker compose up -d

./run_all.sh

python -m common.summary

Built and benchmarked on an Apple M5 Pro. Everything ran locally — no document or vector left the machine.