
Choosing the Right Vector Precision

Time: 20–30 min · Level: Intermediate

Every vector database eventually forces you to make a decision nobody warns you about: what numeric precision should your embeddings be stored in? The default answer is usually “float32 — it’s safest.” But is it actually the best choice?

This tutorial runs a controlled benchmark across all five precision modes supported by Endee. Run it on your own corpus to find out which precision fits your speed and recall requirements.

Prerequisites: Endee running locally on http://127.0.0.1:8080


What We Measure

The benchmark holds everything constant except one thing — the storage precision of the index:

  • Corpus: your documents — plug in any text corpus
  • Embedding model: all-MiniLM-L6-v2 (384-dimensional dense vectors)
  • Space type: cosine similarity
  • Query: configurable via QUERY
  • Top-K: configurable via TOP_K
  • Timing: N_RUNS timed queries per precision; warm-up queries discarded to eliminate cold-start bias
  • Recall baseline: float32 top-K is treated as ground truth; every other precision is scored against it

Five separate dense indexes are created — one per precision — with identical vectors upserted into each. Same data, same query, different storage format.

The Five Precisions

| Precision | Bits/dim | Memory vs float32 | What it stores |
|-----------|----------|-------------------|----------------|
| float32 | 32 | 1× (baseline) | Full IEEE 754 single-precision float |
| float16 | 16 | 2× smaller | Half-precision IEEE 754 float |
| int16 | 16 | 2× smaller | 16-bit signed integer (linear quantization) |
| int8 | 8 | 4× smaller | 8-bit signed integer (aggressive quantization) |
| binary | 1 | 32× smaller | Single bit per dimension (Hamming distance) |

The memory savings compound fast. A 1-million-vector index at 384 dimensions costs ~1.5 GB in float32. The same index in binary fits in under 50 MB.
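The arithmetic behind those numbers is a one-liner. A quick sketch (raw vector storage only — real indexes add some overhead for graph links and metadata):

```python
# Back-of-envelope raw vector storage per precision.
N_VECTORS = 1_000_000
DIM = 384
BITS_PER_DIM = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}

for name, bits in BITS_PER_DIM.items():
    size_bytes = N_VECTORS * DIM * bits // 8
    print(f"{name:<8} {size_bytes / 1e6:>8.1f} MB")
# float32 → 1536 MB, binary → 48 MB
```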


Install and Import

endee.constants.Precision is an enum that selects the storage format for each index. SentenceTransformer encodes the corpus and query; time and statistics handle latency measurement.

pip install endee sentence-transformers
import time
import statistics

from sentence_transformers import SentenceTransformer
from endee import Endee
from endee.constants import Precision

print("Imports OK")

Configuration

TOP_K, N_RUNS, and QUERY are the three knobs that control the benchmark. Increase N_RUNS for more stable timing estimates. Change QUERY to test a different retrieval scenario on your workload.

MODEL_NAME = "all-MiniLM-L6-v2"  # 384-dim dense model
DENSE_DIM = 384
SPACE_TYPE = "cosine"
TOP_K = 10    # results to retrieve per query
N_RUNS = 50   # query repetitions for timing

# All five precisions to benchmark
PRECISIONS = [
    ("float32", Precision.FLOAT32),
    ("float16", Precision.FLOAT16),
    ("int16", Precision.INT16),
    ("int8", Precision.INT8),
    ("binary", Precision.BINARY2),
]

INDEX_PREFIX = "bench_precision"

# The single query used for every index
QUERY = "AI applications in healthcare and medicine"

Define Your Corpus

Replace the DOCUMENTS list with your own corpus. Each document needs an id, a text field, and a meta dict. The benchmark code is dataset-agnostic.

The same documents are upserted into every precision index so the comparison is always fair: only storage precision changes, not the source text or embedding model.

DOCUMENTS = [
    {"id": "doc_001",
     "text": "AI coding assistants boost developer productivity and reduce boilerplate writing",
     "meta": {"title": "AI Coding Assistants"}},
    {"id": "doc_002",
     "text": "Differential privacy adds calibrated noise to datasets to protect individuals",
     "meta": {"title": "Differential Privacy"}},
    # ... add your own documents here
]

assert len(DOCUMENTS) > 0, "DOCUMENTS is empty — add your corpus before running"
print(f"Corpus: {len(DOCUMENTS)} documents")

Load Embedding Model and Encode Corpus

All document vectors and the query vector are pre-computed once and cached in doc_vectors. This avoids re-encoding across precision types — the same float32 values are sent to every index, quantized differently by each.

print(f"Loading {MODEL_NAME} ...")
model = SentenceTransformer(MODEL_NAME)
print(f"Model loaded — dim={model.get_sentence_embedding_dimension()}\n")

# Encode every document once; reuse vectors for all precision indexes
doc_vectors = {
    doc["id"]: model.encode(doc["text"]).tolist()
    for doc in DOCUMENTS
}

# Encode the benchmark query once
query_vec = model.encode(QUERY).tolist()

print(f"Encoded {len(doc_vectors)} documents")
print(f'Query : "{QUERY}"')

Connect to Endee and Prepare Indexes

One dense index is created per precision. Existing indexes with the same names are deleted first for a clean slate. precision=prec_enum is the only parameter that differs across indexes.

client = Endee()
print("Connected to Endee\n")

indexes = {}  # precision_name -> Index object

for prec_name, prec_enum in PRECISIONS:
    index_name = f"{INDEX_PREFIX}_{prec_name}"

    # Delete if already exists (clean slate)
    try:
        client.delete_index(index_name)
        print(f"  Deleted existing index: {index_name}")
    except Exception:
        pass

    # Create fresh dense index
    client.create_index(
        name=index_name,
        dimension=DENSE_DIM,
        space_type=SPACE_TYPE,
        precision=prec_enum,
        sparse_dim=0,
    )
    print(f"  Created {index_name:35s} precision={prec_name}")
    indexes[prec_name] = client.get_index(index_name)

print(f"\n{len(indexes)} indexes ready.")

Upsert Documents into Every Index

The identical float32 payload goes to all five indexes. Endee quantizes the input vectors to the target precision at upsert time — you always send float32; the index handles conversion internally.

BATCH_SIZE = 1000  # Endee max vectors per upsert call

for prec_name, _ in PRECISIONS:
    index = indexes[prec_name]
    payload = [
        {
            "id": doc["id"],
            "vector": doc_vectors[doc["id"]],
            "meta": doc["meta"],
            "filter": {},
        }
        for doc in DOCUMENTS
    ]
    # Upsert in batches of ≤1000 (Endee hard limit)
    for i in range(0, len(payload), BATCH_SIZE):
        index.upsert(payload[i : i + BATCH_SIZE])
    print(f"  Upserted {len(payload)} docs → {index.name}")

print("\nAll indexes populated.")
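Endee's internal quantization scheme is not exposed, but a common approach for the integer precisions is symmetric linear quantization. A sketch for intuition only — `quantize_int8` is a hypothetical helper, not part of the Endee API:

```python
# Illustrative symmetric linear quantization to int8 (NOT Endee's exact scheme).
def quantize_int8(vec):
    """Map floats to integers in [-127, 127] via a per-vector scale factor."""
    scale = (max(abs(x) for x in vec) / 127) or 1.0  # guard all-zero vectors
    return [round(x / scale) for x in vec], scale

def dequantize(qvec, scale):
    """Approximate reconstruction — rounding error is the recall cost."""
    return [q * scale for q in qvec]

v = [0.12, -0.5, 0.33, 0.0]
q, s = quantize_int8(v)
print(q)                  # small integers, 1 byte each instead of 4
print(dequantize(q, s))   # close to v, with small rounding error
```

The rounding step is where recall is lost: two vectors that differ by less than one quantization step become indistinguishable at lower precision.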

Benchmark — Speed and Recall

  • Speed: run the same query N_RUNS times per index; record median latency in ms.
  • Recall@K: use float32 as ground truth. Recall = |returned ∩ ground_truth| / K.

Three warm-up queries are run before timing begins to eliminate cold-start bias.

results = {}  # prec_name -> {latencies, ids, hits, ...}

for prec_name, _ in PRECISIONS:
    index = indexes[prec_name]
    latencies = []

    # Warm-up: 3 un-timed queries to avoid cold-start bias
    for _ in range(3):
        index.query(vector=query_vec, top_k=TOP_K)

    # Timed runs
    for _ in range(N_RUNS):
        t0 = time.perf_counter()
        hits = index.query(vector=query_vec, top_k=TOP_K)
        t1 = time.perf_counter()
        latencies.append((t1 - t0) * 1000)  # ms

    results[prec_name] = {
        "latencies": latencies,
        "median_ms": statistics.median(latencies),
        "p95_ms": sorted(latencies)[int(0.95 * N_RUNS)],
        "ids": [r["id"] for r in hits],
        "hits": hits,
    }
    print(
        f"  {prec_name:<8} "
        f"median={results[prec_name]['median_ms']:6.2f} ms  "
        f"p95={results[prec_name]['p95_ms']:6.2f} ms  "
        f"top-1={hits[0]['id']}  sim={hits[0]['similarity']:.4f}"
    )

print("\nBenchmark complete.")

Speed order (fastest → slowest): binary → int8 → int16 → float16 → float32

  • binary — fastest. Hamming distance on packed bits uses CPU SIMD popcount — the cheapest distance operation available.
  • int8 — second fastest. 8-bit integer arithmetic has lower overhead than wider integer or floating-point paths.
  • int16 — third. Integer SIMD at 16-bit is faster than floating-point at the same bit width.
  • float16 — fourth. Floating-point overhead at half-precision is lower than float32 but higher than integer paths.
  • float32 — slowest. Full 32-bit IEEE 754 arithmetic is the baseline.

Compute Recall@K vs float32 Ground Truth

Recall@K = |returned ∩ ground_truth| / K. float32 results serve as ground truth. A recall of 1.0 means the quantized index returns the exact same top-K set.

ground_truth = set(results["float32"]["ids"])
print(f"Ground truth (float32) top-{TOP_K}: {sorted(ground_truth)}\n")

print(f"{'Precision':<10} {'Recall@'+str(TOP_K):<12} {'Median ms':<12} {'p95 ms':<10} Returned IDs")
print("─" * 90)

for prec_name, _ in PRECISIONS:
    returned = set(results[prec_name]["ids"])
    recall = len(returned & ground_truth) / len(ground_truth)
    results[prec_name]["recall"] = recall
    print(
        f"{prec_name:<10} {recall:<12.3f} "
        f"{results[prec_name]['median_ms']:<12.2f} "
        f"{results[prec_name]['p95_ms']:<10.2f} "
        f"{results[prec_name]['ids']}"
    )

Recall order (highest → lowest): float32 → float16 → int16 → int8 → binary

Summary Table

Consolidates bits per dimension, memory saving, median/p95 latency, speedup vs float32, and Recall@K.

BITS = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}
base_mem = BITS["float32"]
base_lat = results["float32"]["median_ms"]

print(f"{'Precision':<10} {'Bits':>5} {'Mem saving':>12} {'Median ms':>10} "
      f"{'p95 ms':>8} {'Speedup':>9} {'Recall@'+str(TOP_K):>10}")
print("─" * 72)

for prec_name, _ in PRECISIONS:
    bits = BITS[prec_name]
    mem_save = f"{base_mem / bits:.1f}×"
    med = results[prec_name]["median_ms"]
    p95 = results[prec_name]["p95_ms"]
    speedup = f"{base_lat / med:.2f}×"
    recall = results[prec_name]["recall"]
    marker = "  ← baseline" if prec_name == "float32" else ""
    print(f"{prec_name:<10} {bits:>5} {mem_save:>12} {med:>10.2f} "
          f"{p95:>8.2f} {speedup:>9} {recall:>10.3f}{marker}")

Auto-recommendation

Filters precisions that meet RECALL_THRESHOLD (default 0.9), then recommends the best-recall qualifier (ties broken by speed) and flags the fastest one separately.

RECALL_THRESHOLD = 0.9  # minimum acceptable recall

candidates = [
    (p, results[p]["median_ms"], results[p]["recall"])
    for p in [name for name, _ in PRECISIONS]
    if results[p]["recall"] >= RECALL_THRESHOLD
]

if candidates:
    best = sorted(candidates, key=lambda x: (-x[2], x[1]))[0]
    fastest = sorted(candidates, key=lambda x: x[1])[0]
    for name, med, rec in sorted(candidates, key=lambda x: x[1]):
        tag = []
        if name == best[0]:
            tag.append("best recall")
        if name == fastest[0]:
            tag.append("fastest")
        label_str = ("  ← " + " + ".join(tag)) if tag else ""
        print(f"  {name:<10} median={med:.2f} ms  recall={rec:.3f}{label_str}")
    print(f"\nRecommended: '{best[0]}'")
else:
    print(f"  No precision achieved recall ≥ {RECALL_THRESHOLD}.")
    print("  Consider raising ef_search or using float32.")

Cleanup

Deletes all five benchmark indexes to free storage.

for prec_name, _ in PRECISIONS:
    index_name = f"{INDEX_PREFIX}_{prec_name}"
    try:
        client.delete_index(index_name)
        print(f"  Deleted {index_name}")
    except Exception as e:
        print(f"  Could not delete {index_name}: {e}")

print("\nCleanup complete.")

The Speed–Recall Trade-off

| Precision | Speed rank | Recall rank |
|-----------|------------|-------------|
| float32 | Slowest (5th) | Highest — ground truth (1st) |
| float16 | 4th | 2nd |
| int16 | 3rd | 3rd |
| int8 | 2nd | 4th |
| binary | Fastest (1st) | Lowest (5th) |

Speed and recall move in opposite directions as you increase quantisation. There is no precision that wins on both axes — every step away from float32 trades some recall for some speed.

Practical Recommendations

| If you need… | Use |
|--------------|-----|
| Near-perfect recall + lower latency | float16 |
| Balance of speed and recall | int16 |
| High throughput, some recall loss OK | int8 |
| Maximum speed | binary |
| Binary speed + high final precision | binary → re-rank with float32 |
| Exact scores | float32 |

float16 or int16 — best default for most production workloads. Both are faster than float32, use half the memory, and return results very close to float32.

int8 — choose when speed matters more than recall. Good for high-throughput pipelines where a small recall loss is acceptable.

binary — pair with a re-ranking step: retrieve top-K×5 with binary, then re-score with float32 vectors to recover precision.
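A minimal sketch of that two-stage pattern. The `rerank` helper below is hypothetical (not part of the Endee API); it re-scores candidates with full-precision cosine similarity in pure Python:

```python
import math

def cosine(a, b):
    """Cosine similarity between two float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rerank(candidate_ids, vectors, query_vec, k):
    """Re-score binary candidates with full-precision cosine, keep top-k."""
    scored = [(cid, cosine(vectors[cid], query_vec)) for cid in candidate_ids]
    return sorted(scored, key=lambda t: -t[1])[:k]

# With the objects built in this tutorial, stage 1 would look like:
#   hits = indexes["binary"].query(vector=query_vec, top_k=TOP_K * 5)
#   final = rerank([h["id"] for h in hits], doc_vectors, query_vec, TOP_K)

# Toy demonstration with 2-d vectors:
vectors = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(rerank(["a", "b", "c"], vectors, [1.0, 0.2], k=2))  # "a" ranks first
```

The binary stage does the cheap bulk filtering over the whole corpus; the float32 re-score only touches K×5 candidates, so the extra cost is negligible.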

float32 — only when exact similarity scores are required downstream (score-threshold filtering, calibration, or audit logging).

Takeaways

Speed (fastest → slowest): binary → int8 → int16 → float16 → float32

Recall (highest → lowest): float32 → float16 → int16 → int8 → binary

Speed and recall are inversely ordered across all five precisions. Every gain in speed comes with a cost in recall. Choose the precision that sits at the right point on that curve for your workload.

Implementation uses Endee local mode and all-MiniLM-L6-v2 (384-dim).