What is TurboVec and how does it differ from FAISS?

TurboVec is a Rust-based vector index built on Google Research's TurboQuant algorithm. Unlike FAISS's product-quantization indexes, which must be trained on your data, TurboVec uses data-oblivious quantization: it compresses vectors without ever looking at your corpus. This eliminates the training step, enables streaming updates, and shrinks float32 vectors by about 8× at 4 bits per dimension (the headline 31 GB → 4 GB for 10M vectors) or 16× at 2 bits, while running 12–20% faster than FAISS FastScan on ARM processors.

What is TurboQuant's key innovation?

TurboQuant's breakthrough is proving you can achieve near-optimal vector compression without training on your data. It normalizes vectors, applies a random rotation that makes coordinates follow a predictable Beta distribution, then uses precomputed Lloyd-Max quantization on those coordinates. The resulting distortion is within a factor of about 2.7 of the Shannon information-theoretic lower bound, so no quantizer at the same bit budget can improve on it by more than that factor.

When should I use TurboVec instead of a managed vector database?

Use TurboVec when you have 100K–50M vectors, need air-gapped or on-premise deployment, want to reduce infrastructure costs, or require streaming incremental updates without retraining. It's ideal for startups on a budget, healthcare/finance with data privacy requirements, or developers on Apple Silicon. Stick with managed databases if you need distributed search (100M+ vectors), replication, or managed SLAs.

Does TurboVec work with existing RAG frameworks?

Yes. TurboVec has one-line integrations with LangChain, LlamaIndex, and Haystack via pip extras (turbovec[langchain], turbovec[llama-index], turbovec[haystack]). It's designed as a drop-in FAISS replacement—same add/search API, but no training step required.

What are the memory savings with TurboVec?

For 10M vectors at 1,536 dimensions, the float32 baseline requires 61.4 GB. At 4 bits per dimension, FAISS PQ and TurboVec both need ~7.7 GB (768 bytes per vector), and TurboVec at 2 bits needs ~3.9 GB (384 bytes per vector). That's 8–16× compression from float32, meaning a 10M-document RAG index fits on a Mac mini with 16 GB RAM.

Google TurboVec: Compress 10M Vectors from 31GB to | explainx.ai Blog

AI has a memory problem nobody talks about enough.

You fine-tune the model, deploy the API, and ship the product — and then the vector database bills arrive. A RAG pipeline over 10 million documents needs 31 GB of RAM just for the index. That's before your embedding server, your API layer, your caches, or your LLM inference. At scale, vector memory becomes the largest single line item in your AI infrastructure budget.

An answer built on Google's research has just shipped: TurboVec.

Built on TurboQuant — Google Research's vector quantization algorithm presented at ICLR 2026 — TurboVec is an open-source vector index written in Rust with Python bindings that compresses that same 10-million-document corpus into 4 GB without sacrificing retrieval quality. And it does it while searching faster than FAISS.

This is not a marginal optimization. This is a fundamental rethinking of how vector search should work.

Part I: The Problem with Vector Search at Scale

Why Vector Databases Are Expensive

Every RAG system, semantic search engine, AI agent, and recommendation system ultimately depends on the same primitive: approximate nearest neighbor (ANN) search over high-dimensional embedding vectors.

The workflow is simple:

Embed your documents into float32 vectors (typically 1,536–3,072 dimensions for modern embedding models)
Store those vectors in an index
At query time, embed the question and find the most similar vectors in the index

The problem is step 2. A single float32 vector at 1,536 dimensions is 6,144 bytes, about 6 KB. Multiply that by 10 million documents and you're at 61.4 GB in raw storage. An exact index such as FAISS IndexFlatL2 does not reduce this, because it stores the full float32 representation: ~61 GB for 10M vectors at 1,536 dimensions, or ~31 GB at 768 dimensions.

This creates real-world constraints:

Cost: A dedicated machine with 32–64 GB RAM runs $500–$2,000/month on major cloud providers
Latency: Large indexes don't fit in L3 cache, killing search latency
Private deployment: Most organizations can't afford dedicated vector infrastructure for on-prem AI
Consumer hardware: Running local RAG on a MacBook with a 10M-document knowledge base is simply impossible

The industry response has been product quantization (PQ) — compressing vectors into smaller codes. FAISS ships PQ variants (IndexPQ, IndexPQFastScan) that work reasonably well. But they have a catch: they require training data.

Before you can index anything, PQ must analyze your corpus to build a "codebook" — a learned clustering of the vector space. New data can break the codebook. Changing your embedding model requires rebuilding everything from scratch.

The Status Quo: Product Quantization's Hidden Costs

Traditional PQ involves:

Training phase (offline): Run k-means clustering on a representative sample of your corpus to learn sub-space centroids
Encoding phase: Encode each vector by finding the nearest centroid in each sub-space
Search phase: Score compressed codes against a pre-computed lookup table
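The three phases above can be sketched in plain NumPy. This is a minimal illustration of textbook product quantization, not FAISS's implementation; the function names (`train_pq`, `encode_pq`, `search_pq`) and the toy sizes are assumptions made for this example.

```python
import numpy as np

def train_pq(train: np.ndarray, m: int, k: int = 16, iters: int = 10, seed: int = 0) -> np.ndarray:
    """Training phase: k-means per sub-space. Returns codebooks of shape (m, k, d_sub)."""
    rng = np.random.default_rng(seed)
    n, d = train.shape
    d_sub = d // m
    codebooks = np.empty((m, k, d_sub), dtype=np.float32)
    for j in range(m):
        sub = train[:, j * d_sub:(j + 1) * d_sub]
        centroids = sub[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(iters):
            dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(1)
            for c in range(k):
                members = sub[assign == c]
                if len(members):
                    centroids[c] = members.mean(0)
        codebooks[j] = centroids
    return codebooks

def encode_pq(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Encoding phase: index of the nearest centroid in each sub-space."""
    m, _, d_sub = codebooks.shape
    codes = np.empty((x.shape[0], m), dtype=np.uint8)
    for j in range(m):
        sub = x[:, j * d_sub:(j + 1) * d_sub]
        dists = ((sub[:, None, :] - codebooks[j][None]) ** 2).sum(-1)
        codes[:, j] = dists.argmin(1)
    return codes

def search_pq(query: np.ndarray, codes: np.ndarray, codebooks: np.ndarray, top_k: int) -> np.ndarray:
    """Search phase: build one lookup table per sub-space, then sum table entries per code."""
    m, _, d_sub = codebooks.shape
    lut = np.stack([
        ((codebooks[j] - query[j * d_sub:(j + 1) * d_sub]) ** 2).sum(-1)
        for j in range(m)
    ])                                          # shape (m, k)
    dists = lut[np.arange(m), codes].sum(1)     # lut[j, codes[i, j]] summed over j
    return np.argsort(dists)[:top_k]

rng = np.random.default_rng(1)
data = rng.standard_normal((2000, 32)).astype(np.float32)
codebooks = train_pq(data[:1000], m=8)          # the step TurboQuant removes
codes = encode_pq(data, codebooks)
hits = search_pq(data[42], codes, codebooks, top_k=5)
print(hits)
```

Note that nothing can be encoded until `train_pq` has run, and the codebooks are only as good as the sample they were trained on.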

The training step adds complexity and latency to every pipeline change. If your corpus is dynamic — news articles, user documents, live data streams — PQ codebooks age poorly. You're constantly balancing index freshness against rebuild cost.

What if you could skip training entirely?

Part II: TurboQuant — The Algorithm Behind TurboVec

TurboQuant is a data-oblivious quantizer developed by researchers at Google Research and New York University, published at ICLR 2026 (arXiv:2504.19874). The paper proves that you can achieve near-optimal compression without ever looking at your data.

The key insight is a mathematical property of high-dimensional geometry.

How TurboQuant Works

Step 1: Normalize

Strip the length (norm) from each vector and store it as a single float. Now every vector is a unit direction on the high-dimensional hypersphere. Norms are tiny — one float per vector is negligible overhead.

Step 2: Random rotation

Multiply every vector by the same randomly generated orthogonal matrix. This is the critical step. After rotation, each coordinate follows a known Beta-type distribution that converges to a Gaussian N(0, 1/d) as dimensionality increases, and distinct coordinates become nearly independent.

This distribution is the same regardless of your input data. It doesn't matter if you're indexing medical papers, e-commerce products, legal contracts, or code. Once you rotate, the coordinate distribution is predictable from math alone.
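A small experiment makes the rotation effect visible. The sketch below builds a random orthogonal matrix via QR decomposition (one common construction, assumed here; the library may generate its rotation differently) and applies it to deliberately spiky unit vectors. The helper name `random_rotation` and the sizes are illustrative.

```python
import numpy as np

def random_rotation(d: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix so the draw is uniform over rotations

d, n = 256, 500
rng = np.random.default_rng(1)

# Deliberately non-Gaussian inputs: each unit vector has at most 4 non-zero coordinates
raw = np.zeros((n, d))
rows = np.arange(n)[:, None]
raw[rows, rng.integers(0, d, size=(n, 4))] = rng.standard_normal((n, 4))
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

rot = random_rotation(d)
rotated = unit @ rot.T              # same matrix for every vector

sigma = 1 / np.sqrt(d)              # standard deviation predicted by N(0, 1/d)
frac_before = np.mean(np.abs(unit) < sigma)
frac_after = np.mean(np.abs(rotated) < sigma)
print(f"coordinates within one sigma: before={frac_before:.2f}, after={frac_after:.2f}")
```

Before rotation almost every coordinate is exactly zero; afterwards roughly two thirds of coordinates fall within one standard deviation, as a Gaussian would predict, even though the input looked nothing like a Gaussian.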

Step 3: Lloyd-Max scalar quantization

Since the coordinate distribution is known analytically, you can precompute the optimal way to bucket each coordinate — the bucket boundaries and centroids that minimize mean squared error. This is the Lloyd-Max algorithm, applied not to empirical data but to the Beta distribution's closed-form statistics.

For 2-bit quantization: 4 buckets per coordinate. For 4-bit quantization: 16 buckets per coordinate.

These are computed once, hardcoded into the library. Zero data passes. Zero training time.
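To show what "computed once" involves, here is a sketch of the Lloyd-Max iteration run against a density instead of data. It uses the Gaussian N(0, 1/d) limit as a stand-in for the exact Beta distribution, which is an assumption of this example, and the function name `lloyd_max_gaussian` is hypothetical.

```python
import numpy as np

def lloyd_max_gaussian(bits: int, sigma: float = 1.0, iters: int = 200):
    """Lloyd-Max centroids and bucket boundaries for N(0, sigma^2), evaluated on a dense grid."""
    levels = 2 ** bits
    x = np.linspace(-8 * sigma, 8 * sigma, 200_001)
    pdf = np.exp(-0.5 * (x / sigma) ** 2)           # unnormalized density is enough
    centroids = np.linspace(-2 * sigma, 2 * sigma, levels)
    for _ in range(iters):
        # Boundaries sit halfway between neighbouring centroids
        boundaries = (centroids[:-1] + centroids[1:]) / 2
        bucket = np.searchsorted(boundaries, x)
        # Each centroid moves to the conditional mean of its bucket
        mass = np.bincount(bucket, weights=pdf, minlength=levels)
        centroids = np.bincount(bucket, weights=pdf * x, minlength=levels) / mass
    boundaries = (centroids[:-1] + centroids[1:]) / 2
    return centroids, boundaries

d = 1536
centroids, boundaries = lloyd_max_gaussian(bits=2, sigma=1 / np.sqrt(d))
print(centroids * np.sqrt(d))   # centroids in units of the coordinate standard deviation
```

Because the density depends only on `d` and the bit-width, the resulting tables can be shipped as constants; no corpus is ever consulted.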

Step 4: Bit-pack

Each coordinate becomes a small integer. Pack them tightly into bytes.

A 1,536-dim vector goes from:

Float32: 6,144 bytes
2-bit TurboQuant: 384 bytes (16x compression)
4-bit TurboQuant: 768 bytes (8x compression)
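The byte counts above follow directly from packing several codes into each byte. A minimal NumPy sketch is shown below; the bit order (lowest bits first) and the helper names are assumptions of this example, not the library's on-disk layout.

```python
import numpy as np

def pack_codes(codes: np.ndarray, bits: int) -> np.ndarray:
    """Pack integer codes (each < 2**bits) into bytes, lowest bits first."""
    per_byte = 8 // bits
    grouped = codes.astype(np.uint8).reshape(-1, per_byte)
    shifts = (np.arange(per_byte) * bits).astype(np.uint8)
    return np.bitwise_or.reduce(grouped << shifts, axis=1)

def unpack_codes(packed: np.ndarray, bits: int) -> np.ndarray:
    """Inverse of pack_codes."""
    per_byte = 8 // bits
    shifts = (np.arange(per_byte) * bits).astype(np.uint8)
    mask = np.uint8(2 ** bits - 1)
    return ((packed[:, None] >> shifts) & mask).reshape(-1)

rng = np.random.default_rng(0)
codes_2bit = rng.integers(0, 4, size=1536)
codes_4bit = rng.integers(0, 16, size=1536)
packed_2bit = pack_codes(codes_2bit, bits=2)
packed_4bit = pack_codes(codes_4bit, bits=4)
print(packed_2bit.nbytes, packed_4bit.nbytes)
```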

Search

At query time, rotate the query vector once with the same rotation matrix. Then score it directly against the packed codes using lookup tables of centroid values and SIMD kernels, with no decompression required.

The paper proves TurboQuant achieves distortion within a factor of about 2.7 of the Shannon information-theoretic lower bound, so no quantizer can beat it by more than that factor at a given bit budget.
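Putting the steps together, a toy end-to-end version might look like the following. It is a sketch under several assumptions: 2-bit codes, the classic Lloyd-Max levels for a standard Gaussian rescaled by 1/sqrt(d), a QR-based rotation, and no residual correction stage. The function names are hypothetical and are not TurboVec's API.

```python
import numpy as np

# Classic 2-bit Lloyd-Max levels and thresholds for a standard Gaussian,
# rescaled by 1/sqrt(d) below (Gaussian approximation, an assumption of this sketch)
LEVELS = np.array([-1.510, -0.4528, 0.4528, 1.510])
BOUNDS = np.array([-0.9816, 0.0, 0.9816])

def random_rotation(d: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def encode(x: np.ndarray, rot: np.ndarray):
    """Normalize, rotate, and bucket each coordinate. Returns (codes, norms)."""
    d = rot.shape[0]
    norms = np.linalg.norm(x, axis=1)
    rotated = (x / norms[:, None]) @ rot.T
    codes = np.searchsorted(BOUNDS / np.sqrt(d), rotated).astype(np.uint8)
    return codes, norms

def score(query: np.ndarray, codes: np.ndarray, norms: np.ndarray, rot: np.ndarray) -> np.ndarray:
    """Approximate inner products without reconstructing the database vectors."""
    d = rot.shape[0]
    q_rot = rot @ query                              # rotate the query once
    lut = np.outer(q_rot, LEVELS / np.sqrt(d))       # (d, 4): contribution of each code value
    return norms * lut[np.arange(d), codes].sum(axis=1)

d = 256
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, d))
rot = random_rotation(d, seed=1)
codes, norms = encode(data, rot)

query = data[7] + 0.1 * rng.standard_normal(d)       # a noisy copy of vector 7
approx = score(query, codes, norms, rot)
exact = data @ query
best = int(np.argmax(approx))
print(best, np.corrcoef(approx, exact)[0, 1])
```

Even at 2 bits per coordinate the approximate scores track the exact inner products closely enough to recover the intended neighbour in this toy setting.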

Why Data-Oblivious Quantization Is a Big Deal

For practitioners, the implications are profound:

python

# Traditional PQ workflow
import faiss

index = faiss.IndexPQ(dim, M, nbits)
index.train(training_vectors)  # ← requires training data, minutes/hours
index.add(all_vectors)

# TurboVec workflow
from turbovec import TurboQuantIndex

index = TurboQuantIndex(dim=1536, bit_width=4)
index.add(vectors)  # ← no training step, just add

No training means:

Incremental updates: Add vectors one at a time without rebuilding
Cold start: New corpus? Zero warmup time
Model changes: Switch embedding models without retraining a codebook (you still re-embed and re-add the vectors)
Streaming pipelines: Index live data as it arrives

Part III: TurboVec — Rust Implementation with Python Bindings

TurboVec is the open-source implementation of TurboQuant as a production-grade vector index, written by Ryan Codrai. It ships as both a Rust crate and a Python package, and integrates natively with LangChain, LlamaIndex, and Haystack.

Installation

bash

# Python
pip install turbovec

# Rust
cargo add turbovec

Basic Python Usage

python

from turbovec import TurboQuantIndex
import numpy as np

# Create an index for 1536-dim vectors (e.g., OpenAI text-embedding-3-small)
index = TurboQuantIndex(dim=1536, bit_width=4)

# Add vectors (no training required).
# 100K random vectors keep this demo small; 10M x 1536 float32 would need ~61 GB of RAM.
vectors = np.random.randn(100_000, 1536).astype(np.float32)
index.add(vectors)

# Search
query = np.random.randn(1, 1536).astype(np.float32)
scores, indices = index.search(query, k=10)

# Persist to disk
index.write("my_index.tq")

# Load later
loaded = TurboQuantIndex.load("my_index.tq")

Stable IDs with Deletes

For production use cases where documents are updated or deleted, TurboVec provides IdMapIndex — a wrapper that maps your external IDs to internal indices and supports O(1) deletes:

python

from turbovec import IdMapIndex
import numpy as np

index = IdMapIndex(dim=1536, bit_width=4)

# Add with your external IDs (e.g., database primary keys)
vectors = np.random.randn(1000, 1536).astype(np.float32)
doc_ids = np.arange(1001, 2001, dtype=np.uint64)  # IDs 1001..2000, one per vector
index.add_with_ids(vectors, doc_ids)

# Search returns your external IDs
query = np.random.randn(1, 1536).astype(np.float32)
scores, ids = index.search(query, k=10)
print(ids)  # [1047, 1312, ...]

# Delete a document — no rebuild needed
index.remove(1312)

# Persist
index.write("my_index.tvim")
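For intuition about how an ID-mapping wrapper can delete in constant time, one common technique is swap-remove: move the last slot into the hole left by the deleted entry. The class below is a hypothetical sketch of that bookkeeping only; it is not TurboVec's implementation, and `TinyIdMap` is an invented name.

```python
class TinyIdMap:
    """Maps external IDs to dense internal slots with O(1) add and remove."""

    def __init__(self):
        self.slot_to_id = []   # internal slot -> external ID
        self.id_to_slot = {}   # external ID -> internal slot

    def add(self, ext_id: int) -> int:
        slot = len(self.slot_to_id)
        self.id_to_slot[ext_id] = slot
        self.slot_to_id.append(ext_id)
        return slot

    def remove(self, ext_id: int) -> int:
        """Remove ext_id and return the freed slot.

        The last entry is moved into the freed slot, so a real index would
        copy that entry's packed codes into the same position.
        """
        slot = self.id_to_slot.pop(ext_id)
        last_id = self.slot_to_id.pop()
        if last_id != ext_id:
            self.slot_to_id[slot] = last_id
            self.id_to_slot[last_id] = slot
        return slot

ids = TinyIdMap()
for ext_id in (10, 20, 30):
    ids.add(ext_id)
freed = ids.remove(10)
print(freed, ids.slot_to_id, ids.id_to_slot)
```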

Rust Usage

rust

use turbovec::TurboQuantIndex;

let mut index = TurboQuantIndex::new(1536, 4);
index.add(&vectors);
let results = index.search(&queries, 10);
index.write("index.tv").unwrap();

let loaded = TurboQuantIndex::load("index.tv").unwrap();

Framework Integrations

TurboVec plugs into the major RAG frameworks with one-line installs:

bash

# Quote the extras so shells like zsh don't treat the brackets as a glob
pip install "turbovec[langchain]"    # LangChain integration
pip install "turbovec[llama-index]"  # LlamaIndex integration
pip install "turbovec[haystack]"     # Haystack integration

python

# LangChain drop-in
from turbovec.langchain import TurboVecVectorStore
from langchain_openai import OpenAIEmbeddings

vectorstore = TurboVecVectorStore(
    embedding=OpenAIEmbeddings(),
    dim=1536,
    bit_width=4
)
vectorstore.add_documents(documents)
docs = vectorstore.similarity_search("your query", k=5)

Part IV: Benchmarks — Memory, Speed, and Recall

Memory Compression

The headline number is 31 GB → 4 GB for 10 million float32 vectors at 4-bit quantization, a reduction of about 8×. A 31 GB float32 baseline corresponds to 768-dimensional vectors; at 1,536 dimensions the same arithmetic gives:

Configuration	Memory (10M vectors, d=1536)	Compression
Float32 (raw)	61.4 GB	1×
FAISS IndexFlatL2	61.4 GB	1×
FAISS IndexPQFastScan (4-bit)	~7.7 GB	~8×
TurboVec (4-bit)	~7.7 GB	~8×
TurboVec (2-bit)	~3.9 GB	~16×

At equal bit-widths the packed codes are the same size as FAISS PQ codes. TurboVec's gains are the 2-bit mode, which halves the footprint again, and the absence of any learned codebook or training pass.
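The sizes in this section can be checked with a few lines of arithmetic. The sketch below assumes one float32 norm (4 bytes) stored per vector, as described in the normalization step, and ignores any file-format overhead; the function names are illustrative.

```python
def float32_bytes(n: int, d: int) -> int:
    """Raw float32 storage: 4 bytes per coordinate."""
    return n * d * 4

def quantized_bytes(n: int, d: int, bits: int, norm_bytes: int = 4) -> int:
    """Packed codes plus one stored norm per vector (norm size is an assumption)."""
    code_bytes = (d * bits + 7) // 8
    return n * (code_bytes + norm_bytes)

n = 10_000_000
for d in (768, 1536):
    raw = float32_bytes(n, d) / 1e9
    four = quantized_bytes(n, d, bits=4) / 1e9
    two = quantized_bytes(n, d, bits=2) / 1e9
    print(f"d={d}: float32 {raw:.1f} GB, 4-bit {four:.2f} GB, 2-bit {two:.2f} GB")
```

Plugging in your own corpus size and embedding dimension gives a quick lower bound on index memory before you benchmark anything.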

Search Speed — ARM (Apple M3 Max)

TurboVec uses hand-written NEON intrinsics for ARM processors, with a nibble-split lookup table approach for maximum throughput.

Benchmarks: 100K vectors, 1K queries, k=64, median of 5 runs.

Config	TurboVec (single-thread)	FAISS FastScan	Speedup
d=1536, 4-bit	faster	baseline	+12–20%
d=3072, 4-bit	faster	baseline	+12–20%
d=1536, 2-bit	faster	baseline	+12–20%

On ARM, TurboVec beats FAISS IndexPQFastScan across every configuration tested.

Search Speed — x86 (Intel Xeon Platinum 8481C / Sapphire Rapids)

TurboVec uses AVX-512BW kernels on modern x86 processors, with an AVX2 fallback for older hardware. Runtime feature detection via is_x86_feature_detected! — no recompilation needed.

Config	TurboVec vs FAISS
4-bit, single-thread	+1–6% (wins)
4-bit, multi-thread	+1–6% (wins)
2-bit, single-thread	within ~1% (ties)
2-bit, multi-thread	-2–4% (narrow loss)

The 2-bit multi-thread loss on x86 is a known limitation — the inner accumulate loop is too short for unrolling amortization to match FAISS's AVX-512 VBMI path. For most production workloads (4-bit is the recommended default), TurboVec is competitive or better across the board.

Recall Quality

TurboQuant vs FAISS IndexPQ (LUT256, nbits=8) — 100K vectors, k=64.

On OpenAI embeddings (d=1536, d=3072):

TurboQuant and FAISS are within 0–1 point at R@1
Both converge to 1.0 by k=4–8

On GloVe (d=200 — a harder, lower-dimensional regime):

TurboQuant trails FAISS by 3–6 points at R@1 at very low bit-widths
Closes by k≈16–32

Bottom line: For modern embedding models at 1,536+ dimensions, TurboQuant matches FAISS PQ recall with no training step, and it searches faster on ARM. For very low-dimensional embeddings, FAISS PQ maintains a quality edge.
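The figures above are read here as "1-recall@k": the fraction of queries whose exact nearest neighbour shows up somewhere in the first k approximate results. That reading is an assumption about the benchmark, and the helper below is a hypothetical sketch of the metric.

```python
import numpy as np

def recall_1_at_k(approx_ids: np.ndarray, true_nn: np.ndarray, k: int) -> float:
    """Fraction of queries whose exact nearest neighbour appears in the top-k approximate results.

    approx_ids: (n_queries, >=k) approximate result IDs, best first.
    true_nn:    (n_queries,) ID of each query's exact nearest neighbour.
    """
    hits = (approx_ids[:, :k] == true_nn[:, None]).any(axis=1)
    return float(hits.mean())

approx_ids = np.array([[3, 1, 2],
                       [5, 4, 0]])
true_nn = np.array([1, 0])
for k in (1, 2, 3):
    print(k, recall_1_at_k(approx_ids, true_nn, k))
```

Under this metric a quantizer that ranks the true neighbour second or third still reaches 1.0 once k grows, which is why re-ranking a short candidate list with full-precision vectors is a common follow-up step.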

Part V: Building a Local RAG Pipeline with TurboVec

Here's a complete example of a fully local, air-gapped RAG system using TurboVec — no managed services, no cloud APIs, no data leaving your machine.

Setup

bash

pip install turbovec sentence-transformers

Full Pipeline

python

import json
import numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer
from turbovec import TurboQuantIndex

class LocalRAG:
    def __init__(
        self,
        embedding_model: str = "all-MiniLM-L6-v2",
        bit_width: int = 4,
        index_path: str = "knowledge_base.tq"
    ):
        self.embedder = SentenceTransformer(embedding_model)
        self.dim = self.embedder.get_sentence_embedding_dimension()
        self.bit_width = bit_width
        self.index_path = index_path
        # The index stores only vectors, so keep the texts in a JSON sidecar file
        self.docs_path = Path(index_path).with_suffix(".docs.json")
        self.documents = []

        # Load existing index and texts, or create fresh
        if Path(index_path).exists() and self.docs_path.exists():
            self.index = TurboQuantIndex.load(index_path)
            self.documents = json.loads(self.docs_path.read_text(encoding="utf-8"))
            print(f"Loaded existing index ({len(self.documents)} docs)")
        else:
            self.index = TurboQuantIndex(dim=self.dim, bit_width=bit_width)
            print(f"Created new index (dim={self.dim}, {bit_width}-bit)")

    def ingest(self, texts: list[str], batch_size: int = 512):
        """Embed and index documents in batches."""
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            embeddings = self.embedder.encode(
                batch,
                normalize_embeddings=True,
                show_progress_bar=True
            ).astype(np.float32)

            # Row i of the index corresponds to self.documents[i]
            self.index.add(embeddings)
            self.documents.extend(batch)

            print(f"Indexed {min(i + batch_size, len(texts))}/{len(texts)}")

        # Persist vectors and texts together so a reload can map results back to text
        self.index.write(self.index_path)
        self.docs_path.write_text(json.dumps(self.documents), encoding="utf-8")
        print(f"Saved index to {self.index_path}")

    def search(self, query: str, k: int = 5) -> list[dict]:
        """Retrieve top-k most relevant documents."""
        query_vec = self.embedder.encode(
            [query],
            normalize_embeddings=True
        ).astype(np.float32)

        scores, indices = self.index.search(query_vec, k=k)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx < len(self.documents):
                results.append({
                    "text": self.documents[idx],
                    "score": float(score),
                    "index": int(idx)
                })

        return results


# Usage
rag = LocalRAG(embedding_model="all-MiniLM-L6-v2", bit_width=4)

# Ingest your corpus
with open("knowledge_base.txt") as f:
    documents = [line.strip() for line in f if line.strip()]

rag.ingest(documents)

# Query
results = rag.search("What is the capital of France?", k=5)
for r in results:
    print(f"[{r['score']:.4f}] {r['text'][:100]}")

Memory Footprint Comparison

For a 10M-document corpus with 1,536-dim embeddings:

snippet

Float32 baseline:  61.4 GB RAM
FAISS FlatL2:      61.4 GB RAM
FAISS PQ (4-bit):  ~7.7 GB RAM   ← requires training phase
TurboVec (4-bit):  ~7.7 GB RAM   ← zero training, instant add
TurboVec (2-bit):  ~3.9 GB RAM   ← fits on a Mac mini

A Mac mini M4 with 16 GB RAM can now serve a 10M-document RAG system entirely in memory, with room left over for the embedding model and LLM inference.

Part VI: The TurboQuant Paper — Technical Depth

For practitioners who want to understand the theory before trusting the library, here's the math at a workable depth.

The Information-Theoretic Bound

Any lossy compression scheme for vectors trades off distortion against rate (bits per dimension). Shannon's rate-distortion theory tells us the theoretical minimum distortion achievable at a given bit budget. No algorithm can beat it — but many algorithms fall far short.

PQ-family methods (including FAISS PQ) are data-dependent: they fit the quantizer to the distribution of your specific corpus. This adaptation helps, but it introduces training cost and corpus lock-in.

TurboQuant's insight: for unit vectors in high dimensions, the post-rotation coordinate distribution is universal. You don't need data to characterize it. The Beta distribution that emerges after random rotation is the same regardless of what corpus you embed.

This means the Lloyd-Max quantizer can be derived from first principles once, baked into the library, and applied forever without retraining.

The Two Stages

TurboQuant has a two-stage heritage:

PolarQuant (AISTATS 2026): The random rotation stage that induces the predictable Beta distribution on coordinates
QJL (Quantized Johnson-Lindenstrauss) (companion paper): A 1-bit residual correction that recovers inner-product accuracy after quantization

Together they achieve the near-optimal distortion bound. The ICLR 2026 paper proves TurboQuant operates within a factor of ≈2.7 of the Shannon limit across all bit-widths and dimensions — meaning you're not leaving meaningful quality on the table.
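A constant of about 2.7 also appears in a classical result, which gives some intuition for the size of the gap. The comparison below is for a scalar quantizer on a Gaussian source in the high-resolution limit; it is offered as background under that assumption and is not the paper's theorem statement, which should be taken from the paper itself.

```latex
% Shannon rate-distortion limit for a Gaussian source with variance \sigma^2
% at b bits per coordinate
D_{\text{Shannon}}(b) = \sigma^2 \, 4^{-b}

% High-resolution (Panter-Dite) distortion of the optimal scalar quantizer
% on the same source
D_{\text{scalar}}(b) \approx \frac{\sqrt{3}\,\pi}{2} \, \sigma^2 \, 4^{-b}

% Ratio of the two
\frac{D_{\text{scalar}}(b)}{D_{\text{Shannon}}(b)} \approx \frac{\sqrt{3}\,\pi}{2} \approx 2.72
```

In other words, quantizing each coordinate independently costs a bounded constant factor relative to the best possible joint code, and that factor does not grow with the bit-width.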

Why ARM > x86 for TurboVec

The scoring kernel is where TurboVec's performance advantage lives. The key operation is: given a compressed query and compressed database vectors, compute approximate inner products as fast as possible.

TurboVec uses nibble-split lookup tables: each packed byte is split into its low and high 4-bit nibbles, and each nibble indexes a precomputed 16-entry table of score contributions. This maps well onto NEON's byte table-lookup instructions on ARM (the vtbl/vqtbl family), which perform many lookups in parallel in a single instruction.

On x86, AVX-512BW provides a similar vpshufb instruction. TurboVec's AVX-512 kernel adapts FAISS FastScan's pack layout and u16 accumulator strategy, which is why x86 performance is closely competitive with FAISS rather than blowing past it.

The ARM advantage comes from M-series chips' high NEON throughput and the fact that FAISS's x86 path is extremely well-tuned while its ARM path has historically received less attention.
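The nibble-split idea can be emulated in NumPy to show the data flow, leaving the SIMD details aside. The layout below (two 4-bit codes per byte, even dimension in the low nibble) and the function name are assumptions of this sketch, not the library's kernel.

```python
import numpy as np

def nibble_split_scores(packed: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Score packed 4-bit codes against per-dimension lookup tables.

    packed: (n, d // 2) uint8, two codes per byte (low nibble = even dimension).
    lut:    (d, 16) score contribution of each code value for the current query.
    """
    d = lut.shape[0]
    lo = packed & 0x0F                        # codes for dimensions 0, 2, 4, ...
    hi = packed >> 4                          # codes for dimensions 1, 3, 5, ...
    even = lut[np.arange(0, d, 2), lo]        # table lookups, shape (n, d // 2)
    odd = lut[np.arange(1, d, 2), hi]
    return even.sum(axis=1) + odd.sum(axis=1)

rng = np.random.default_rng(0)
n, d = 100, 64
codes = rng.integers(0, 16, size=(n, d)).astype(np.uint8)
packed = codes[:, 0::2] | (codes[:, 1::2] << 4)
lut = rng.standard_normal((d, 16))

fast = nibble_split_scores(packed, lut)
reference = lut[np.arange(d), codes].sum(axis=1)   # same result from unpacked codes
print(np.allclose(fast, reference))
```

A SIMD kernel performs the same two mask-and-lookup steps, but keeps each 16-entry table in a vector register so that a single instruction resolves a whole register's worth of codes.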

Part VII: Production Deployment Patterns

Pattern 1: Drop-In FAISS Replacement

If you're already on FAISS, TurboVec is designed as a drop-in:

python

# Before — FAISS
import faiss
index = faiss.IndexPQFastScan(dim, M, nbits)
index.train(training_data)
index.add(vectors)
D, I = index.search(query, k)

# After — TurboVec (no training step, better memory)
from turbovec import TurboQuantIndex
index = TurboQuantIndex(dim=dim, bit_width=4)
index.add(vectors)  # same vectors, no training
scores, I = index.search(query, k=k)

Pattern 2: Air-Gapped Enterprise Deployment

For regulated industries (healthcare, finance, government), keeping all data on local hardware is often a hard requirement, and TurboVec's local-only architecture satisfies it:

python

# All processing stays on your hardware
from turbovec import TurboQuantIndex

# No API calls, no managed services
index = TurboQuantIndex(dim=1536, bit_width=4)
index.add(your_proprietary_embeddings)
results = index.search(query_embedding, k=10)

# Persist to encrypted volume
index.write("/mnt/encrypted/knowledge_base.tq")

Pattern 3: Streaming Ingestion

TurboVec's zero-training architecture enables true streaming updates:

python

import json

import numpy as np
from kafka import KafkaConsumer
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)

consumer = KafkaConsumer("document-embeddings")
for message in consumer:
    doc = json.loads(message.value)
    embedding = np.array(doc["embedding"], dtype=np.float32)
    doc_id = np.uint64(doc["id"])

    if doc.get("deleted"):
        index.remove(doc_id)  # O(1) delete
    else:
        index.add_with_ids(embedding.reshape(1, -1), np.array([doc_id]))
    
    # Checkpoint periodically
    if message.offset % 10_000 == 0:
        index.write("index.tvim")
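One practical detail for periodic checkpoints: if the process dies while the index file is being written, the previous checkpoint should survive. A common pattern is to write to a temporary file and rename it into place. The helper below is a generic sketch; `atomic_checkpoint` is a hypothetical name, and it assumes the index's write method accepts an arbitrary path.

```python
import os
import tempfile

def atomic_checkpoint(write_fn, path: str) -> None:
    """Call write_fn(tmp_path), then atomically replace `path` with the result."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, path)   # atomic when both paths are on the same filesystem
    finally:
        # Only still present if write_fn or the rename failed
        if os.path.exists(tmp_path):
            os.remove(tmp_path)

# Hypothetical usage inside the consumer loop:
# atomic_checkpoint(index.write, "index.tvim")
```

Creating the temporary file in the destination directory keeps the rename on one filesystem, which is what makes the replacement atomic.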

Pattern 4: Memory-Constrained Edge Deployment

For edge devices, IoT gateways, or Raspberry Pi deployments:

python

# 2-bit mode: 16x compression from float32
# At d=1536, 1M docs fit in ~400 MB; smaller embeddings shrink that further

from turbovec import TurboQuantIndex

index = TurboQuantIndex(dim=384, bit_width=2)  # smaller embedding model too
index.add(corpus_embeddings)

# 1M × 384-dim vectors = 1.5 GB float32 → ~100 MB at 2-bit (96 bytes per vector)
# Runs on a Raspberry Pi 5 (8GB model)

Filtering at Search Time

TurboVec supports search-time filtering to restrict results to a subset of documents:

python

import numpy as np
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)
doc_metadata = {}

# Add documents with metadata (tracked externally)
for doc_id, embedding, metadata in documents:
    index.add_with_ids(
        embedding.reshape(1, -1),
        np.array([doc_id], dtype=np.uint64)
    )
    doc_metadata[doc_id] = metadata

# Filter to only approved documents at search time
approved_ids = get_approved_document_ids(user_context)
scores, ids = index.search(query, k=100, filter_ids=approved_ids)

# Results come back per query; take the first query's rows,
# re-check the allow-list defensively, and keep the top 10
results = [(s, i) for s, i in zip(scores[0], ids[0]) if i in approved_ids][:10]
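Conceptually, search-time filtering on a flat index amounts to masking scores before the top-k selection, so disallowed documents can never displace allowed ones. The NumPy sketch below illustrates that idea only; `filtered_top_k` is a hypothetical helper and says nothing about how the library implements its filter.

```python
import numpy as np

def filtered_top_k(scores: np.ndarray, ids: np.ndarray, allowed, k: int):
    """Top-k by score among entries whose ID is in `allowed`."""
    mask = np.isin(ids, np.fromiter(allowed, dtype=ids.dtype))
    masked = np.where(mask, scores, -np.inf)   # disallowed entries can never win
    k = min(k, int(mask.sum()))
    order = np.argsort(-masked)[:k]
    return masked[order], ids[order]

scores = np.array([0.9, 0.8, 0.7, 0.6])
ids = np.array([10, 11, 12, 13], dtype=np.uint64)
top_scores, top_ids = filtered_top_k(scores, ids, allowed={11, 13}, k=5)
print(top_scores, top_ids)
```

Compared with over-fetching and post-filtering, masking first guarantees k results whenever at least k allowed documents exist.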

Part VIII: Implications for AI Infrastructure

The Efficiency Inflection Point

The AI industry has spent five years competing on scale: bigger models, larger context windows, more GPUs, bigger data centers. TurboVec represents a quieter but equally important trend: doing more with less.

The math matters here. An 87% memory reduction (31 GB → 4 GB) doesn't just make existing systems cheaper. It changes what's architecturally possible:

Before TurboVec	After TurboVec
10M docs requires dedicated 32GB server	10M docs fits on a MacBook Pro
Private RAG needs $500/month cloud VM	Private RAG runs on local hardware
Real-time index updates require careful PQ retraining	Streaming updates with zero rebuild cost
Filtering requires over-fetch + post-filter	Native search-time filtering
Air-gap deployment = small knowledge base	Air-gap deployment = production-scale knowledge base

What Changes for RAG Architectures

The standard RAG pattern has been:

Chunk documents
Embed chunks
Store in managed vector database (Pinecone, Weaviate, Qdrant, etc.)
Pay $50–500/month for the service

TurboVec makes a compelling case for self-hosted vector search at scale:

snippet

Managed vector DB (10M docs): ~$100–500/month
TurboVec on a VPS with 8GB RAM: ~$20–40/month
TurboVec on existing infrastructure: $0/month

This doesn't mean managed vector databases disappear. They offer replication, metadata-aware filtering, and operational simplicity that TurboVec doesn't provide out of the box. But for teams that already operate infrastructure and value data privacy, the calculus shifts.

Implications for LLM KV Cache

While TurboVec targets vector search, TurboQuant's paper covers a broader application: KV cache quantization for large language model inference.

Attention's KV cache is itself a matrix of vectors — one per token, per layer, per head. At 128K context windows on large models, the KV cache alone can consume 10–20 GB of GPU memory.

TurboQuant achieves:

Absolute quality neutrality at 3.5 bits per channel
Marginal quality degradation at 2.5 bits per channel
6× memory reduction with at least 6× faster attention on NVIDIA H100

This means TurboQuant could compress the KV cache of a model running at 128K context from ~16 GB to ~2.7 GB while maintaining generation quality — dramatically expanding what fits in a given GPU budget.
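The KV cache figures can be sanity-checked with a short calculation. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit baseline) is a hypothetical configuration chosen to land near the ~16 GB quoted above; it is not taken from the paper.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: float) -> float:
    """Size of the attention KV cache for one sequence."""
    values = 2 * layers * kv_heads * head_dim * seq_len   # factor 2: keys and values
    return values * bits_per_value / 8

GIB = 1024 ** 3
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=128 * 1024)  # hypothetical model

baseline = kv_cache_bytes(**shape, bits_per_value=16)
at_3_5_bits = kv_cache_bytes(**shape, bits_per_value=3.5)
at_2_5_bits = kv_cache_bytes(**shape, bits_per_value=2.5)
print(f"fp16: {baseline / GIB:.1f} GiB, 3.5-bit: {at_3_5_bits / GIB:.1f} GiB, "
      f"2.5-bit: {at_2_5_bits / GIB:.1f} GiB")
```

The calculation counts only the quantized values; any per-vector scales or norms a real implementation stores would add a small overhead on top.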

What It Means for the Democratization of AI

The most underappreciated consequence of TurboVec is what it does for accessibility.

Right now, the teams who can run serious RAG systems are the ones who can afford serious infrastructure. A 10M-document knowledge base over proprietary company data — legal documents, customer records, internal wikis — costs real money to query at scale.

With TurboVec, a two-person startup can run that same system on a $40/month VPS, with data never leaving their infrastructure. A research lab can run 100M-document corpus experiments on a single workstation. A healthcare startup can build an air-gapped clinical knowledge base without HIPAA-constrained cloud infrastructure.

Part IX: Limitations and Honest Tradeoffs

TurboVec is impressive, but it's not a universal replacement for all vector search infrastructure. Here's what to keep in mind:

Low-Dimensional Embeddings

The theoretical guarantees of TurboQuant rely on the Beta distribution approximation holding in high dimensions. For embeddings below ~256 dimensions (like GloVe d=200), the approximation is looser:

At R@1, TurboQuant trails FAISS PQ by 3–6 points for d=200
The gap closes by k≈16–32

If you're using older, smaller embedding models, test recall carefully before deploying TurboVec in production.

No Distributed Mode

TurboVec is a single-node library. It doesn't provide:

Replication
Sharding across multiple machines
High-availability failover
Multi-tenant isolation

For massive corpora (>100M vectors) or high-availability requirements, managed vector databases or distributed systems like Milvus remain necessary. TurboVec is best for single-machine workloads, which covers a lot of ground given how small the compressed index is.

No HNSW

TurboVec implements flat quantized search (exhaustive search over compressed vectors), not HNSW (Hierarchical Navigable Small World graphs). HNSW offers sub-linear search time and is better for very large corpora with strict latency SLAs.

At 10M vectors, flat quantized search is fast enough for most applications. At 100M+ vectors, you'd want HNSW or IVF indexing on top of TurboQuant — which may come in future releases.

2-bit Multi-Thread x86 Regression

As noted in the benchmarks, TurboVec is 2–4% slower than FAISS on 2-bit multi-threaded x86 workloads. If you're on x86 and need maximum multi-threaded throughput at 2-bit, benchmark carefully. The 4-bit configuration is recommended for x86 production deployments.

Part X: Getting Started — Practical Checklist

Here's a checklist for evaluating TurboVec for your use case:

Should You Switch to TurboVec?

Use TurboVec if:

✅ Your corpus is 100K–50M vectors
✅ You want to reduce infrastructure costs
✅ You need air-gapped or on-premise deployment
✅ Your corpus grows incrementally (streaming updates)
✅ You're on ARM hardware (Apple Silicon, AWS Graviton)
✅ You care about data privacy and local-first architecture
✅ You embed at 512+ dimensions (modern embedding models)

Stick with your current solution if:

❌ You need distributed/replicated vector search
❌ Your corpus exceeds 100M vectors
❌ You use low-dimensional embeddings (d < 256) where recall is critical
❌ You need HNSW for sub-linear search time
❌ You require managed SLAs and operations team support

Quick Performance Test

Before migrating, run this recall check against your own data:

python

import numpy as np
from turbovec import TurboQuantIndex
import faiss

dim = 1536
n = 100_000

# Generate or use your real embeddings
vectors = np.random.randn(n, dim).astype(np.float32)
faiss.normalize_L2(vectors)
queries = np.random.randn(100, dim).astype(np.float32)
faiss.normalize_L2(queries)

# Ground truth from exact search
flat_index = faiss.IndexFlatIP(dim)
flat_index.add(vectors)
_, gt = flat_index.search(queries, k=10)

# TurboVec at 4-bit
tv_index = TurboQuantIndex(dim=dim, bit_width=4)
tv_index.add(vectors)
_, tv_results = tv_index.search(queries, k=10)

# Compute recall@10
recall = np.mean([
    len(set(tv_results[i]) & set(gt[i])) / 10
    for i in range(len(queries))
])
print(f"TurboVec recall@10: {recall:.4f}")
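# Expected: 0.90+ for real d=1536 embeddings. Random Gaussian vectors like the ones
# above have no neighbourhood structure and will typically score lower, so run this
# on your own data before drawing conclusions.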

Conclusion: Efficiency as the New Frontier

The past five years of AI progress have been measured in parameters. The next five may be measured in efficiency.

TurboVec is a clear signal that the compression frontier is just as important as the capability frontier. Google Research proved that you can derive near-optimal vector quantization from mathematics alone, with no data, no training, and no learned codebooks. Ryan Codrai built that into a production-grade Rust library with Python bindings that you can deploy tomorrow.

The headline numbers are striking:

31 GB → 4 GB for 10 million vectors
Zero training required
12–20% faster than FAISS on ARM
Air-gap friendly — no data leaves your infrastructure

But the deeper implication is about access. The organizations that can now run serious vector search aren't just the ones with $10,000/month infrastructure budgets. They're the two-person startup on a VPS, the research lab on a workstation, the healthcare company that can't put patient data in the cloud.

Vector search at scale is no longer an enterprise-only capability.

Resources

Library:

TurboVec GitHub — source code, benchmarks, and docs
PyPI: turbovec — pip install turbovec
API Reference

AI has a memory problem nobody talks about enough.

Google just shipped an answer: TurboVec.

This is not a marginal optimization. This is a fundamental rethinking of how vector search should work.

Part I: The Problem with Vector Search at Scale

Why Vector Databases Are Expensive

The workflow is simple:

Embed your documents into float32 vectors (typically 1,536–3,072 dimensions for modern embedding models)
Store those vectors in an index
At query time, embed the question and find the most similar vectors in the index

This creates real-world constraints:

Cost: A dedicated machine with 32–64 GB RAM runs $500–$2,000/month on major cloud providers
Latency: Large indexes don't fit in L3 cache, killing search latency
Private deployment: Most organizations can't afford dedicated vector infrastructure for on-prem AI
Consumer hardware: Running local RAG on a MacBook with a 10M-document knowledge base is simply impossible

The Status Quo: Product Quantization's Hidden Costs

Traditional PQ involves:

Training phase (offline): Run k-means clustering on a representative sample of your corpus to learn sub-space centroids
Encoding phase: Encode each vector by finding the nearest centroid in each sub-space
Search phase: Score compressed codes against a pre-computed lookup table

What if you could skip training entirely?

Part II: TurboQuant — The Algorithm Behind TurboVec

The key insight is a mathematical property of high-dimensional geometry.

How TurboQuant Works

Step 1: Normalize

Step 2: Random rotation

Step 3: Lloyd-Max scalar quantization

For 2-bit quantization: 4 buckets per coordinate. For 4-bit quantization: 16 buckets per coordinate.

These are computed once, hardcoded into the library. Zero data passes. Zero training time.

Step 4: Bit-pack

Each coordinate becomes a small integer. Pack them tightly into bytes.

A 1,536-dim vector goes from:

Float32: 6,144 bytes
2-bit TurboQuant: 384 bytes (16x compression)
4-bit TurboQuant: 768 bytes (8x compression)

Search

At query time, rotate the query vector once into the same compressed domain. Score directly against codebook values using SIMD kernels — no decompression required.

Why Data-Oblivious Quantization Is a Big Deal

For practitioners, the implications are profound:

python

# Traditional PQ workflow
index = faiss.IndexPQ(dim, M, nbits)
index.train(training_vectors)  # ← requires training data, minutes/hours
index.add(all_vectors)

# TurboVec workflow
from turbovec import TurboQuantIndex

index = TurboQuantIndex(dim=1536, bit_width=4)
index.add(vectors)  # ← no training step, just add

No training means:

Incremental updates: Add vectors one at a time without rebuilding
Cold start: New corpus? Zero warmup time
Model changes: Switch embedding models without retraining the index
Streaming pipelines: Index live data as it arrives

Part III: TurboVec — Rust Implementation with Python Bindings

Installation

bash

# Python
pip install turbovec

# Rust
cargo add turbovec

Basic Python Usage

python

from turbovec import TurboQuantIndex
import numpy as np

# Create an index for 1536-dim vectors (e.g., OpenAI text-embedding-3-small)
index = TurboQuantIndex(dim=1536, bit_width=4)

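# Add vectors (no training required). 100K random vectors keep this demo small:
# np.random.randn allocates float64 first, so 10M rows would need over 120 GB.
vectors = np.random.randn(100_000, 1536).astype(np.float32)
index.add(vectors)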

# Search
query = np.random.randn(1, 1536).astype(np.float32)
scores, indices = index.search(query, k=10)

# Persist to disk
index.write("my_index.tq")

# Load later
loaded = TurboQuantIndex.load("my_index.tq")

Stable IDs with Deletes

For production use cases where documents are updated or deleted, TurboVec provides IdMapIndex — a wrapper that maps your external IDs to internal indices and supports O(1) deletes:

python

from turbovec import IdMapIndex
import numpy as np

index = IdMapIndex(dim=1536, bit_width=4)

# Add with your external IDs (e.g., database primary keys)
vectors = np.random.randn(1000, 1536).astype(np.float32)
doc_ids = np.arange(1001, 2001, dtype=np.uint64)  # IDs 1001..2000, one per vector
index.add_with_ids(vectors, doc_ids)

# Search returns your external IDs
query = np.random.randn(1, 1536).astype(np.float32)
scores, ids = index.search(query, k=10)
print(ids)  # e.g. [1047, 1312, ...]

# Delete a document — no rebuild needed
index.remove(1312)

# Persist
index.write("my_index.tvim")
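
One common way to get O(1) deletes over a dense array is swap-remove: move the last row into the deleted slot and update two small maps. Whether IdMapIndex works this way internally is not stated here; the class below is a hypothetical, pure-Python illustration of the idea.

```python
class SwapRemoveIdMap:
    """Maps external IDs to dense row slots; deletes by swapping in the last row."""

    def __init__(self):
        self.rows = []          # stand-in for the packed vector storage
        self.slot_to_id = []    # slot -> external ID
        self.id_to_slot = {}    # external ID -> slot

    def add(self, ext_id, row):
        if ext_id in self.id_to_slot:
            raise KeyError(f"duplicate id {ext_id}")
        self.id_to_slot[ext_id] = len(self.rows)
        self.slot_to_id.append(ext_id)
        self.rows.append(row)

    def remove(self, ext_id):
        slot = self.id_to_slot.pop(ext_id)
        last_row, last_id = self.rows.pop(), self.slot_to_id.pop()
        if slot < len(self.rows):       # deleted row was not the last one
            self.rows[slot] = last_row
            self.slot_to_id[slot] = last_id
            self.id_to_slot[last_id] = slot

index_map = SwapRemoveIdMap()
for ext_id in (1001, 1002, 1003):
    index_map.add(ext_id, row=f"codes-for-{ext_id}")
index_map.remove(1001)
print(index_map.slot_to_id)   # [1003, 1002]
```

The trade-off is that internal slot order changes on delete, which is exactly why search results should be reported as external IDs rather than row positions.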

Rust Usage

rust

use turbovec::TurboQuantIndex;

let mut index = TurboQuantIndex::new(1536, 4);
index.add(&vectors);
let results = index.search(&queries, 10);
index.write("index.tv").unwrap();

let loaded = TurboQuantIndex::load("index.tv").unwrap();

Framework Integrations

TurboVec plugs into the major RAG frameworks with one-line installs:

bash

# Quote the extras so shells like zsh don't treat the brackets as a glob
pip install "turbovec[langchain]"    # LangChain integration
pip install "turbovec[llama-index]"  # LlamaIndex integration
pip install "turbovec[haystack]"     # Haystack integration

python

# LangChain drop-in
from turbovec.langchain import TurboVecVectorStore
from langchain_openai import OpenAIEmbeddings

vectorstore = TurboVecVectorStore(
    embedding=OpenAIEmbeddings(),
    dim=1536,
    bit_width=4
)
vectorstore.add_documents(documents)
docs = vectorstore.similarity_search("your query", k=5)

Part IV: Benchmarks — Memory, Speed, and Recall

Memory Compression

The headline number: 31 GB → 4 GB for 10 million vectors, an 8× reduction. For 1,536-dim float32 vectors the raw footprint is 61.4 GB, and the per-vector sizes given earlier (768 bytes at 4-bit, 384 bytes at 2-bit) give the following code storage:

Configuration	Memory (10M vectors, d=1536)	Compression
Float32 (raw)	61.4 GB	1×
FAISS IndexFlatL2	61.4 GB	1×
FAISS IndexPQFastScan (4-bit)	~7.7 GB	~8×
TurboVec (4-bit)	~7.7 GB	~8×
TurboVec (2-bit)	~3.8 GB	~16×

At an equal bit budget the code storage matches FAISS PQ. TurboVec's advantage is that no trained codebook has to be learned, stored, or shipped with the index.

Search Speed — ARM (Apple M3 Max)

TurboVec uses hand-written NEON intrinsics for ARM processors, with a nibble-split lookup table approach for maximum throughput.

Benchmarks: 100K vectors, 1K queries, k=64, median of 5 runs.

Config	TurboVec (single-thread)	FAISS FastScan	Speedup
d=1536, 4-bit	faster	baseline	+12–20%
d=3072, 4-bit	faster	baseline	+12–20%
d=1536, 2-bit	faster	baseline	+12–20%

On ARM, TurboVec beats FAISS IndexPQFastScan across every configuration tested.

Search Speed — x86 (Intel Xeon Platinum 8481C / Sapphire Rapids)

TurboVec uses AVX-512BW kernels on modern x86 processors, with an AVX2 fallback for older hardware. Runtime feature detection via is_x86_feature_detected! — no recompilation needed.

Config	TurboVec vs FAISS
4-bit, single-thread	+1–6% (wins)
4-bit, multi-thread	+1–6% (wins)
2-bit, single-thread	within ~1% (ties)
2-bit, multi-thread	-2–4% (narrow loss)

Recall Quality

TurboQuant vs FAISS IndexPQ (LUT256, nbits=8) — 100K vectors, k=64.

On OpenAI embeddings (d=1536, d=3072):

TurboQuant and FAISS are within 0–1 point at R@1
Both converge to 1.0 by k=4–8

On GloVe (d=200 — a harder, lower-dimensional regime):

TurboQuant trails FAISS by 3–6 points at R@1 at very low bit-widths
Closes by k≈16–32
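
The recall numbers above read most naturally as 1-recall@k: the fraction of queries whose true nearest neighbour appears somewhere in the approximate top-k. That reading is an assumption; the function below is a small sketch of it, with made-up IDs.

```python
def one_recall_at_k(true_nearest, retrieved, k):
    """Fraction of queries whose exact top-1 ID is within the first k retrieved IDs."""
    hits = sum(1 for truth, ids in zip(true_nearest, retrieved) if truth in ids[:k])
    return hits / len(true_nearest)

# Three queries: the exact nearest neighbour, and the approximate ranking for each
true_nearest = [7, 3, 9]
retrieved = [[7, 1, 4, 2], [5, 3, 8, 0], [2, 6, 1, 9]]

for k in (1, 2, 4):
    print(k, round(one_recall_at_k(true_nearest, retrieved, k), 3))
# 1 0.333
# 2 0.667
# 4 1.0
```

Under this metric, a gap at k=1 that closes by k≈16–32 means the right answer is usually retrieved, just not always ranked first, so a re-ranking step over a slightly larger candidate set can recover it.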

Part V: Building a Local RAG Pipeline with TurboVec

Here's a complete example of a fully local, air-gapped RAG system using TurboVec — no managed services, no cloud APIs, no data leaving your machine.

Setup

bash

pip install "turbovec[langchain]" sentence-transformers langchain

Full Pipeline

python

import json
import numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer
from turbovec import TurboQuantIndex

class LocalRAG:
    def __init__(
        self,
        embedding_model: str = "all-MiniLM-L6-v2",
        bit_width: int = 4,
        index_path: str = "knowledge_base.tq"
    ):
        self.embedder = SentenceTransformer(embedding_model)
        self.dim = self.embedder.get_sentence_embedding_dimension()
        self.bit_width = bit_width
        self.index_path = index_path
        # This class keeps the texts outside the index, so persist them
        # in a JSON sidecar file next to it
        self.docs_path = Path(index_path).with_suffix(".docs.json")
        self.documents = []

        # Load existing index and texts, or create fresh
        if Path(index_path).exists() and self.docs_path.exists():
            self.index = TurboQuantIndex.load(index_path)
            self.documents = json.loads(self.docs_path.read_text(encoding="utf-8"))
            print(f"Loaded existing index ({len(self.documents)} docs)")
        else:
            self.index = TurboQuantIndex(dim=self.dim, bit_width=bit_width)
            print(f"Created new index (dim={self.dim}, {bit_width}-bit)")

    def ingest(self, texts: list[str], batch_size: int = 512):
        """Embed and index documents in batches."""
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            embeddings = self.embedder.encode(
                batch,
                normalize_embeddings=True,
                show_progress_bar=False
            ).astype(np.float32)

            # Row i of the index corresponds to self.documents[i]
            self.index.add(embeddings)
            self.documents.extend(batch)

            print(f"Indexed {min(i + batch_size, len(texts))}/{len(texts)}")

        # Save the index and the texts together so they stay in sync
        self.index.write(self.index_path)
        self.docs_path.write_text(json.dumps(self.documents), encoding="utf-8")
        print(f"Saved index to {self.index_path}")

    def search(self, query: str, k: int = 5) -> list[dict]:
        """Retrieve top-k most relevant documents."""
        query_vec = self.embedder.encode(
            [query],
            normalize_embeddings=True
        ).astype(np.float32)

        scores, indices = self.index.search(query_vec, k=k)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx < len(self.documents):
                results.append({
                    "text": self.documents[idx],
                    "score": float(score),
                    "index": int(idx)
                })

        return results


# Usage
rag = LocalRAG(embedding_model="all-MiniLM-L6-v2", bit_width=4)

# Ingest your corpus
with open("knowledge_base.txt") as f:
    documents = [line.strip() for line in f if line.strip()]

rag.ingest(documents)

# Query
results = rag.search("What is the capital of France?", k=5)
for r in results:
    print(f"[{r['score']:.4f}] {r['text'][:100]}")

Memory Footprint Comparison

For a 10M-document corpus with 1,536-dim embeddings:

snippet

Float32 baseline:  61.4 GB RAM
FAISS FlatL2:      61.4 GB RAM
FAISS PQ (4-bit):  ~7.7 GB RAM   ← requires training phase
TurboVec (4-bit):  ~7.7 GB RAM   ← zero training, instant add
TurboVec (2-bit):  ~3.8 GB RAM   ← fits on a Mac mini

A Mac mini M4 with 16 GB RAM can now serve a 10M-document RAG system entirely in memory, with room left over for the embedding model and LLM inference.

Part VI: The TurboQuant Paper — Technical Depth

For practitioners who want to understand the theory before trusting the library, here's the math at a workable depth.

The Information-Theoretic Bound

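Because the random rotation gives every coordinate the same known Beta distribution regardless of the corpus, the Lloyd-Max quantizer can be derived from first principles once, baked into the library, and applied forever without retraining. The result operates within 2.7× of the Shannon information-theoretic lower bound at the same bit budget.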

The Two Stages

TurboQuant has a two-stage heritage:

PolarQuant (AISTATS 2026): The random rotation stage that induces the predictable Beta distribution on coordinates
QJL (Quantized Johnson-Lindenstrauss) (companion paper): A 1-bit residual correction that recovers inner-product accuracy after quantization

Why ARM > x86 for TurboVec

The ARM advantage comes from M-series chips' high NEON throughput and the fact that FAISS's x86 path is extremely well-tuned while its ARM path has historically received less attention.

Part VII: Production Deployment Patterns

Pattern 1: Drop-In FAISS Replacement

If you're already on FAISS, TurboVec is designed as a drop-in:

python

# Before — FAISS
import faiss
index = faiss.IndexPQFastScan(dim, M, nbits)
index.train(training_data)
index.add(vectors)
D, I = index.search(query, k)

# After — TurboVec (no training step, better memory)
from turbovec import TurboQuantIndex
index = TurboQuantIndex(dim=dim, bit_width=4)
index.add(vectors)  # same vectors, no training
scores, I = index.search(query, k=k)

Pattern 2: Air-Gapped Enterprise Deployment

For regulated industries (healthcare, finance, government), keeping all processing on your own hardware is often a hard requirement, and TurboVec's local-only architecture meets it:

python

# All processing stays on your hardware
from turbovec import TurboQuantIndex

# No API calls, no managed services
index = TurboQuantIndex(dim=1536, bit_width=4)
index.add(your_proprietary_embeddings)
results = index.search(query_embedding, k=10)

# Persist to encrypted volume
index.write("/mnt/encrypted/knowledge_base.tq")

Pattern 3: Streaming Ingestion

TurboVec's zero-training architecture enables true streaming updates:

python

import json

import numpy as np
from kafka import KafkaConsumer
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)

consumer = KafkaConsumer("document-embeddings")
for message in consumer:
    doc = json.loads(message.value)
    embedding = np.array(doc["embedding"], dtype=np.float32)
    doc_id = np.uint64(doc["id"])

    if doc.get("deleted"):
        index.remove(doc_id)  # O(1) delete
    else:
        index.add_with_ids(embedding.reshape(1, -1), np.array([doc_id]))
    
    # Checkpoint periodically
    if message.offset % 10_000 == 0:
        index.write("index.tvim")
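
A periodic checkpoint like the one above can leave a truncated file if the process dies mid-write. A common remedy, sketched below with only the standard library, is to write to a temporary file and rename it into place. `atomic_checkpoint` is a hypothetical helper; it assumes the index exposes a `write(path)` method as in the examples above.

```python
import os
import tempfile

def atomic_checkpoint(write_fn, final_path):
    """Call write_fn(tmp_path), then rename tmp_path over final_path."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)  # write_fn opens the path itself
    try:
        write_fn(tmp_path)
        # Same directory, so the rename stays on one filesystem
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Stand-in writer so the sketch runs without an index; in the consumer loop
# this would be: atomic_checkpoint(index.write, "index.tvim")
def fake_write(path):
    with open(path, "wb") as f:
        f.write(b"index-bytes")

checkpoint_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(checkpoint_dir, "index.tvim")
atomic_checkpoint(fake_write, checkpoint_path)
print(os.path.getsize(checkpoint_path))  # 11
```

Readers of `index.tvim` then see either the previous complete checkpoint or the new one, never a partial file.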

Pattern 4: Memory-Constrained Edge Deployment

For edge devices, IoT gateways, or Raspberry Pi deployments:

python

# 2-bit mode: 16x compression from float32
# At d=1536, 1M docs fit in ~400 MB, which is viable for edge devices

from turbovec import TurboQuantIndex

index = TurboQuantIndex(dim=384, bit_width=2)  # smaller embedding model too
index.add(corpus_embeddings)

# 1M × 384-dim vectors = 1.5 GB float32 → ~96 MB at 2-bit
# Runs on a Raspberry Pi 5 (8GB model)

Filtering at Search Time

TurboVec supports search-time filtering to restrict results to a subset of documents:

python

import numpy as np
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)
doc_metadata = {}

# Add documents with metadata (tracked externally).
# `documents`, `query`, `user_context`, and get_approved_document_ids()
# are supplied by your application.
for doc_id, embedding, metadata in documents:
    index.add_with_ids(embedding.reshape(1, -1), np.array([doc_id], dtype=np.uint64))
    doc_metadata[doc_id] = metadata

# Filter to only approved documents at search time
approved_ids = get_approved_document_ids(user_context)
scores, ids = index.search(query, k=10, filter_ids=approved_ids)

# Results are already restricted to approved_ids, so no post-filter is needed
results = list(zip(scores[0], ids[0]))

Part VIII: Implications for AI Infrastructure

The Efficiency Inflection Point

The math matters here. An 8–16× memory reduction (roughly 87–94%) doesn't just make existing systems cheaper; it changes what's architecturally possible:

Before TurboVec	After TurboVec
10M docs requires dedicated 32GB server	10M docs fits on a MacBook Pro
Private RAG needs $500/month cloud VM	Private RAG runs on local hardware
Real-time index updates require careful PQ retraining	Streaming updates with zero rebuild cost
Filtering requires over-fetch + post-filter	Native search-time filtering
Air-gap deployment = small knowledge base	Air-gap deployment = production-scale knowledge base

What Changes for RAG Architectures

The standard RAG pattern has been:

Chunk documents
Embed chunks
Store in managed vector database (Pinecone, Weaviate, Qdrant, etc.)
Pay $50–500/month for the service

TurboVec makes a compelling case for self-hosted vector search at scale:

snippet

Managed vector DB (10M docs): ~$100–500/month
TurboVec on a VPS with 8GB RAM: ~$20–40/month
TurboVec on existing infrastructure: $0/month

Implications for LLM KV Cache

While TurboVec targets vector search, TurboQuant's paper covers a broader application: KV cache quantization for large language model inference.

Attention's KV cache is itself a matrix of vectors — one per token, per layer, per head. At 128K context windows on large models, the KV cache alone can consume 10–20 GB of GPU memory.
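
A quick calculation shows where a figure in that range comes from. The configuration below is illustrative and is an assumption, not something taken from the paper: 32 layers, 8 KV heads of width 128, a 128K-token context, and fp16 values.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value):
    """Keys and values: 2 tensors per layer, one head_dim vector per KV head per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

fp16 = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=128 * 1024, bytes_per_value=2)
print(round(fp16 / 1e9, 2))               # 17.18 GB at 16 bits per value
print(round(fp16 * 3.5 / 16 / 1e9, 2))    # 3.76 GB at 3.5 bits per value
```

The cache grows linearly with context length and with batch size, so the bits-per-value term is the main lever once the model architecture is fixed.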

TurboQuant achieves:

Absolute quality neutrality at 3.5 bits per channel
Marginal quality degradation at 2.5 bits per channel
6× memory reduction with at least 6× faster attention on NVIDIA H100

What It Means for the Democratization of AI

The most underappreciated consequence of TurboVec is what it does for accessibility.

Part IX: Limitations and Honest Tradeoffs

TurboVec is impressive, but it's not a universal replacement for all vector search infrastructure. Here's what to keep in mind:

Low-Dimensional Embeddings

The theoretical guarantees of TurboQuant rely on the Beta distribution approximation holding in high dimensions. For embeddings below ~256 dimensions (like GloVe d=200), the approximation is looser:

At R@1, TurboQuant trails FAISS PQ by 3–6 points for d=200
The gap closes by k≈16–32

If you're using older, smaller embedding models, test recall carefully before deploying TurboVec in production.

No Distributed Mode

TurboVec is a single-node library. It doesn't provide:

Replication
Sharding across multiple machines
High-availability failover
Multi-tenant isolation

No HNSW

At 10M vectors, flat quantized search is fast enough for most applications. At 100M+ vectors, you'd want HNSW or IVF indexing on top of TurboQuant — which may come in future releases.

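2-bit Multi-Thread x86 Regression

On x86, the 2-bit multi-threaded configuration is the one case in the benchmarks above where TurboVec loses to FAISS, by 2–4%. If that is your hot path, benchmark on your own hardware before switching.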

Part X: Getting Started — Practical Checklist

Here's a checklist for evaluating TurboVec for your use case:

Should You Switch to TurboVec?

Use TurboVec if:

✅ Your corpus is 100K–50M vectors
✅ You want to reduce infrastructure costs
✅ You need air-gapped or on-premise deployment
✅ Your corpus grows incrementally (streaming updates)
✅ You're on ARM hardware (Apple Silicon, AWS Graviton)
✅ You care about data privacy and local-first architecture
✅ You embed at 512+ dimensions (modern embedding models)

Stick with your current solution if:

❌ You need distributed/replicated vector search
❌ Your corpus exceeds 100M vectors
❌ You use low-dimensional embeddings (d < 256) where recall is critical
❌ You need HNSW for sub-linear search time
❌ You require managed SLAs and operations team support

Quick Performance Test

Before migrating, run this recall check against your own data:

python

import numpy as np
from turbovec import TurboQuantIndex
import faiss

dim = 1536
n = 100_000

# Placeholder data: replace these arrays with your real embeddings and queries
vectors = np.random.randn(n, dim).astype(np.float32)
faiss.normalize_L2(vectors)
queries = np.random.randn(100, dim).astype(np.float32)
faiss.normalize_L2(queries)

# Ground truth from exact search
flat_index = faiss.IndexFlatIP(dim)
flat_index.add(vectors)
_, gt = flat_index.search(queries, k=10)

# TurboVec at 4-bit
tv_index = TurboQuantIndex(dim=dim, bit_width=4)
tv_index.add(vectors)
_, tv_results = tv_index.search(queries, k=10)

# Compute recall@10
recall = np.mean([
    len(set(tv_results[i]) & set(gt[i])) / 10
    for i in range(len(queries))
])
print(f"TurboVec recall@10: {recall:.4f}")
# Expected: 0.90+ for real d=1536 embeddings; random Gaussian data has no
# neighbourhood structure and may score lower

Conclusion: Efficiency as the New Frontier

The past five years of AI progress have been measured in parameters. The next five may be measured in efficiency.

The headline numbers are striking:

31 GB → 4 GB for 10 million vectors
Zero training required
12–20% faster than FAISS on ARM
Air-gap friendly — no data leaves your infrastructure

Vector search at scale is no longer an enterprise-only capability.

Resources

Library:

TurboVec GitHub — source code, benchmarks, and docs
PyPI: turbovec — pip install turbovec
API Reference

Papers:

Framework Integrations: