
Does TurboQuant Hurt Search Quality? We Compressed 20× to Find Out

  • 13 hours ago
  • 5 min read

When Google's TurboQuant made the rounds — "extreme vector compression with near-optimal distortion," a memory-chip-stock wobble, an academic spat with the RaBitQ authors — one nagging question stayed unanswered by the headlines:


If you actually run it on a real corpus with real relevance judgments, does search quality survive?


So our team built the experiment, end to end, on a single laptop. No cloud, no API keys, no hand-waving. Here's what we found, and the five things we learned along the way — including a fairness mistake we almost shipped.


The setup: everything local, nothing hidden


We wanted results we could trust and anyone could reproduce, so the whole stack runs on one machine:


| Piece | Choice |
| --- | --- |
| Vector DB | Milvus v2.6.13 (Docker) |
| Embeddings | Ollama + qwen3-embedding:4b → 2560-dim vectors |
| Corpora | BEIR: NFCorpus (3,633 docs) + SciFact (5,183 docs), with ground-truth qrels |
| Hardware | Apple M5 Pro, 15 cores, 48 GB unified memory, Metal |

Two corpora on purpose: NFCorpus is densely judged (~38 relevant docs/query), which makes the quality metrics stable enough to detect small changes; SciFact is the clean canonical sanity check.


We measured quality two different ways, because they answer different questions:


  1. Qrels-based quality (nDCG@10, Recall@100): did we still retrieve the humanly relevant documents?

  2. ANN recall (vs. exact search): did quantization change which documents we retrieve, compared to brute-force full-precision search? This is the sensitive needle.
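To make the two measurements concrete, here is a minimal per-query sketch. It is an illustration, not the benchmark's actual evaluation code: the function names are hypothetical, qrels are assumed to be a dict mapping document id to graded relevance, and nDCG uses linear gain (an assumption about the scoring convention).

```python
import math


def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps doc id -> graded relevance (0 = not relevant)."""
    # Discounted gain of the ranking we actually produced (linear gain: assumption).
    dcg = sum(qrels.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    # Best achievable DCG: the judged documents sorted by relevance.
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids, qrels, k=100):
    """Fraction of the judged-relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def ann_recall_at_k(approx_ids, exact_ids, k=100):
    """Overlap between the quantized top k and the exact full-precision top k."""
    exact = set(exact_ids[:k])
    if not exact:
        return 0.0
    return len(exact & set(approx_ids[:k])) / len(exact)
```

The first two functions need human judgments; the third needs only the exact-search result list, which is why it can register changes that the qrels never see.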


What TurboQuant actually does


It's a data-oblivious quantizer — no training, no learned codebooks, no pass over your data.


(1) Randomly rotate each vector so its coordinates become near-Gaussian and independent; (2) apply a per-coordinate scalar quantizer that's near-optimal for that known distribution; (3) optionally add a 1-bit residual sketch to sharpen the inner-product estimate.


The same rotation hits queries and documents, so ranking is preserved. We implemented a faithful reference from the paper (fast Walsh–Hadamard rotation + Lloyd–Max quantizer + 1-bit residual) and swept 1–4 bits per coordinate.
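The first two stages can be sketched in a few lines for the 1-bit case. This is a simplified illustration under stated assumptions, not the reference implementation from the experiment: the rotation is random sign flips followed by a normalized Walsh–Hadamard transform, the input length must be a power of two (how 2560-dim vectors are handled is not covered here), the 1-bit centroids are the standard Lloyd–Max values ±σ·sqrt(2/π) for Gaussian coordinates, and the residual stage is omitted. All function names are hypothetical.

```python
import math
import random


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthogonal
    return [v * scale for v in out]


def make_signs(dim, seed=0):
    """Random +/-1 signs, shared by queries and documents."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def rotate(vec, signs):
    """Sign flips followed by the Hadamard transform: an orthogonal rotation."""
    return fwht([v * s for v, s in zip(vec, signs)])


def encode_1bit(vec, signs):
    """Store one sign bit per rotated coordinate, plus the vector norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    bits = [1 if v >= 0 else 0 for v in rotate(vec, signs)]
    return bits, norm


def estimate_inner_product(query, bits, norm, signs):
    """Inner product of the rotated query with the dequantized document."""
    dim = len(query)
    # After rotation each coordinate is roughly N(0, norm^2 / dim); the 1-bit
    # Lloyd-Max centroids for that distribution are +/- sigma * sqrt(2 / pi).
    centroid = norm * math.sqrt(2.0 / (math.pi * dim))
    q_rot = rotate(query, signs)
    return sum(q * (centroid if b else -centroid) for q, b in zip(q_rot, bits))
```

Nothing here looks at the corpus: the signs come from a seed and the centroids from a closed-form constant, which is what "data-oblivious" means in practice.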


The headline: recall barely moves

Here's full precision vs. TurboQuant across the bit sweep:

NFCorpus

| Config | Compression | nDCG@10 | Recall@100 | ANN recall@100 |
| --- | --- | --- | --- | --- |
| Full precision | - | 0.4019 | 0.3687 | 1.000 |
| TurboQuant b1 (mse) | 19.8× | 0.3987 | 0.3655 | 0.862 |
| TurboQuant b1 (prod) | 9.9× | 0.4006 | 0.3703 | 0.925 |
| TurboQuant b4 (prod) | 4.0× | 0.4012 | 0.3670 | 0.985 |


SciFact

| Config | Compression | nDCG@10 | Recall@100 | ANN recall@100 |
| --- | --- | --- | --- | --- |
| Full precision | - | 0.7730 | 0.9733 | 1.000 |
| TurboQuant b1 (mse) | 19.8× | 0.7662 | 0.9700 | 0.883 |
| TurboQuant b3 (prod) | 5.0× | 0.7747 | 0.9733 | 0.978 |
| TurboQuant b4 (prod) | 4.0× | 0.7731 | 0.9733 | 0.988 |


At ~20× compression, nDCG@10 drops by only ~0.003 on NFCorpus and ~0.007 on SciFact. The quality curve is essentially flat.


That's the finding in one picture: you can throw away ~95% of the bytes and the ranking quality barely flinches. And because the quantizer is data-oblivious, encoding the entire corpus took under a second — no training step at all.


Lesson 1 — nDCG looked too good, which is itself a warning


On SciFact, TurboQuant b3 scored an nDCG@10 of 0.7747, higher than full precision's 0.7730. Quantization doesn't add information; a quantized index beating the original is the tell that these differences are noise at 300 queries, not real gains.


The honest read isn't "quantization improved search" — it's "statistically indistinguishable from full precision." We report it that way, and flag that we ran no formal significance tests.

This is why we leaned on ANN recall as the real signal.
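For readers who want to put a number on "indistinguishable," a paired bootstrap over per-query scores is one common option. The sketch below is not part of the reported experiment (no significance tests were run); it assumes two equal-length lists of per-query nDCG@10 values for the same queries, and the function name and defaults are hypothetical.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the mean per-query difference (a - b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the returned interval straddles zero, the data do not support claiming that either configuration is better.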


Lesson 2 — the 1-bit residual earns its keep


nDCG was too flat to separate the variants, but ANN recall (agreement with exact search) tells a clean story: it climbs with the bit budget, and the "prod" variant's 1-bit residual lifts recall at every budget.


One extra bit per coordinate buys a real, consistent jump on both datasets. The algorithm's second stage isn't decoration.


Lesson 3 — the fairness trap we almost fell into


We also benchmarked Milvus's native quantizers. Our first cut had IVF_PQ looking terrible — nDCG@10 collapsing to 0.29 on NFCorpus and 0.52 on SciFact. Easy headline: "PQ is bad, TurboQuant wins."


It would have been wrong. We'd left PQ at m=16, which is 16 bytes per vector and a ~640× squeeze, while TurboQuant b1 sits around 320–516 bytes. We were comparing a method at 640× against methods at ~20–32×. Apples to anvils.


Re-running PQ at m=320 (320 bytes/vector, byte-comparable) changed the picture completely: nDCG@10 jumped to 0.3920 on NFCorpus and 0.7477 on SciFact. Now it's genuinely competitive.
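The byte math behind that comparison is easy to check. A small sketch, assuming full-precision vectors are stored as float32 (4 bytes per dimension) and that PQ uses 8-bit codes, so m sub-quantizers cost m bytes; both are assumptions, and the function names are illustrative.

```python
import math


def full_precision_bytes(dim, bytes_per_float=4):
    """Size of one uncompressed vector (float32 storage is an assumption)."""
    return dim * bytes_per_float


def code_bytes_for_bits(dim, bits_per_coord):
    """Raw code size at a given bit budget, ignoring any per-vector overhead."""
    return math.ceil(dim * bits_per_coord / 8)


def compression_ratio(dim, code_bytes, bytes_per_float=4):
    """How many times smaller the code is than the full-precision vector."""
    return full_precision_bytes(dim, bytes_per_float) / code_bytes


DIM = 2560
budgets = [
    ("IVF_PQ m=16", 16),
    ("IVF_PQ m=320", 320),
    ("1 bit per coordinate", code_bytes_for_bits(DIM, 1)),
    ("IVF_SQ8", code_bytes_for_bits(DIM, 8)),
]
for label, code_bytes in budgets:
    print(f"{label}: {code_bytes} bytes, {compression_ratio(DIM, code_bytes):.0f}x")
# IVF_PQ m=16: 16 bytes, 640x
# IVF_PQ m=320: 320 bytes, 32x
# 1 bit per coordinate: 320 bytes, 32x
# IVF_SQ8: 2560 bytes, 4x
```

Putting every method on this one axis before comparing recall is the whole fix.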




At a comparable ~320-byte budget, the ANN-recall ranking is:

| Method | Bytes/vec | ANN recall@100 (NFCorpus / SciFact) |
| --- | --- | --- |
| TurboQuant b1 (mse) | ~516 | 0.862 / 0.883 |
| Milvus IVF_RABITQ | 320 | 0.842 / 0.870 |
| Milvus IVF_PQ (m=320) | 320 | 0.820 / 0.812 |
| Milvus IVF_SQ8 | 2,560 | 0.984 / 0.986 (8× the bytes) |
| Milvus IVF_PQ (m=16) | 16 | 0.490 / 0.507 (640× squeeze) |


Lesson: a compression benchmark without an equal-bytes axis is marketing, not measurement.


Lesson 4 — the bottleneck was the boring part


We expected quantization or indexing to dominate. They didn't:

  • Quantize the whole corpus: ~1 second (no training).

  • Milvus index build: 3–5 seconds.

  • Embedding the corpus: ~15–20 minutes at ~4 docs/sec on the 4B model.

In a real local RAG stack, your time and energy go into embedding, not compression. Quantization is essentially free.
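The embedding figure follows directly from corpus size and throughput. A back-of-the-envelope sketch, treating the ~4 docs/sec rate as constant (an assumption) and using a hypothetical helper name:

```python
def embedding_minutes(num_docs, docs_per_second):
    """Wall-clock minutes to embed a corpus at a steady throughput."""
    return num_docs / docs_per_second / 60.0


for name, num_docs in [("NFCorpus", 3633), ("SciFact", 5183)]:
    print(f"{name}: {embedding_minutes(num_docs, 4):.1f} min")
# NFCorpus: 15.1 min
# SciFact: 21.6 min
```

Against minutes of embedding per corpus, one second of quantization and a few seconds of index build are rounding errors.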


Lesson 5 — the honest caveat, and what it means for you


TurboQuant's vector-search superiority is contested — there's an active priority dispute with the RaBitQ authors, and its strongest uncontested results are for LLM KV-cache, not ANN search. Our numbers agree with the skeptics: TurboQuant is strong, not a step-change, for retrieval.


And here's the practical punchline: Milvus's IVF_RABITQ — an independent, production-grade rotation-plus-1-bit quantizer — lands within a couple of points of our from-scratch TurboQuant (ANN@100 0.842/0.870 vs 0.862/0.883).


That's reassuring two ways: it says the reference isn't a fluke, and it says you don't need to wait for an official TurboQuant — the same ~32× memory win is one index parameter away in Milvus today (IVF_RABITQ, or IVF_SQ8 if you can spare the bytes for near-lossless recall).
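As an illustration of what "one index parameter away" can look like, the sketch below builds the index specification as a plain dict. The field name, metric type, and nlist value are placeholders rather than the settings used in this benchmark, and the helper is hypothetical. The commented lines show how such a spec is typically passed to pymilvus, with a hypothetical collection name.

```python
def ivf_index_spec(index_type, field_name="embedding", metric_type="IP", nlist=128):
    """Build keyword arguments for a Milvus IVF-family index (values are placeholders)."""
    allowed = {"IVF_RABITQ", "IVF_SQ8"}
    if index_type not in allowed:
        raise ValueError(f"index_type must be one of {sorted(allowed)}")
    return {
        "field_name": field_name,    # hypothetical vector field name
        "index_type": index_type,    # the one parameter that changes
        "metric_type": metric_type,  # assumption: inner-product similarity
        "params": {"nlist": nlist},  # number of IVF clusters (placeholder)
    }


spec = ivf_index_spec("IVF_RABITQ")

# With a running Milvus instance, the spec would be applied roughly like this:
#   from pymilvus import MilvusClient
#   client = MilvusClient("http://localhost:19530")
#   index_params = client.prepare_index_params()
#   index_params.add_index(**spec)
#   client.create_index(collection_name="docs", index_params=index_params)
```

Swapping "IVF_RABITQ" for "IVF_SQ8" is the only change needed to trade bytes for recall in this sketch.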


The takeaway


For a local RAG / vector-search stack, TurboQuant-style scalar-plus-residual quantization cuts vector memory ~10–20× with negligible loss in retrieval quality, for free, with no training. Whether you reach for a research implementation or just flip on Milvus's native IVF_RABITQ, the memory math is now firmly in your favor — as long as you remember to compare at equal bytes.


This is the kind of grounded, reproducible evaluation we do at Shorthills AI when we design retrieval and RAG systems for production — measuring what actually matters, at honest budgets, before it ships.


Reproduce it yourself


Full code, the 14-page PDF report, and all the logs are here:


uv venv --python 3.12
uv pip install pymilvus numpy scipy psutil pyyaml tabulate ollama "beir>=2.0" "markitdown[all]"

ollama pull qwen3-embedding:4b

docker compose up -d

./run_all.sh

python -m common.summary

Built and benchmarked on an Apple M5 Pro. Everything ran locally — no document or vector left the machine.

 
 
 
