Choosing the Right Vector Precision
Every vector database eventually forces you to make a decision nobody warns you about: what numeric precision should your embeddings be stored in? The default answer is usually “float32 — it’s safest.” But is it actually the best choice?
This tutorial runs a controlled benchmark across all five precision modes supported by Endee. Run it on your own corpus to find out which precision fits your speed and recall requirements.
Prerequisites: Endee running locally on http://127.0.0.1:8080
What We Measure
The benchmark holds everything constant except one thing — the storage precision of the index:
- Corpus: your documents — plug in any text corpus
- Embedding model: `all-MiniLM-L6-v2` (384-dimensional dense vectors)
- Space type: cosine similarity
- Query: configurable via `QUERY`
- Top-K: configurable via `TOP_K`
- Timing: `N_RUNS` timed queries per precision; warm-up queries discarded to eliminate cold-start bias
- Recall baseline: `float32` top-K is treated as ground truth; every other precision is scored against it
Five separate dense indexes are created — one per precision — with identical vectors upserted into each. Same data, same query, different storage format.
The Five Precisions
| Precision | Bits/dim | Memory vs float32 | What it stores |
|---|---|---|---|
| float32 | 32 | 1× (baseline) | Full IEEE 754 single-precision float |
| float16 | 16 | 2× smaller | Half-precision IEEE 754 float |
| int16 | 16 | 2× smaller | 16-bit signed integer (linear quantization) |
| int8 | 8 | 4× smaller | 8-bit signed integer (aggressive quantization) |
| binary | 1 | 32× smaller | Single bit per dimension (Hamming distance) |
The memory savings compound fast. A 1-million-vector index at 384 dimensions costs ~1.5 GB in float32. The same index in binary fits in under 50 MB.
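The arithmetic behind those numbers is easy to verify. A minimal sketch, assuming raw vector storage only (real indexes add graph links and metadata on top):

```python
# Raw storage footprint per precision for a 1M-vector, 384-dim index.
# Assumption: vectors only; index structures add further overhead.
N_VECTORS = 1_000_000
DIM = 384

BITS_PER_DIM = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}

for name, bits in BITS_PER_DIM.items():
    size_bytes = N_VECTORS * DIM * bits // 8
    print(f"{name:<8} {size_bytes / 1e6:>8.0f} MB")
# float32 → 1536 MB, binary → 48 MB: the 32× ratio from the table above
```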
Install and Import
endee.constants.Precision is an enum that selects the storage format for each index. SentenceTransformer encodes the corpus and query; time and statistics handle latency measurement.
```shell
pip install endee sentence-transformers
```

```python
import time
import statistics

from sentence_transformers import SentenceTransformer

from endee import Endee
from endee.constants import Precision

print("Imports OK")
```

Configuration
TOP_K, N_RUNS, and QUERY are the three knobs that control the benchmark. Increase N_RUNS for more stable timing estimates. Change QUERY to test a different retrieval scenario on your workload.
```python
MODEL_NAME = "all-MiniLM-L6-v2"  # 384-dim dense model
DENSE_DIM = 384
SPACE_TYPE = "cosine"

TOP_K = 10    # results to retrieve per query
N_RUNS = 50   # query repetitions for timing

# All five precisions to benchmark
PRECISIONS = [
    ("float32", Precision.FLOAT32),
    ("float16", Precision.FLOAT16),
    ("int16", Precision.INT16),
    ("int8", Precision.INT8),
    ("binary", Precision.BINARY2),
]

INDEX_PREFIX = "bench_precision"

# The single query used for every index
QUERY = "AI applications in healthcare and medicine"
```

Define Your Corpus
Replace the DOCUMENTS list with your own corpus. Each document needs an id, a text field, and a meta dict. The benchmark code is dataset-agnostic.
The same documents are upserted into every precision index so the comparison is always fair: only storage precision changes, not the source text or embedding model.
```python
DOCUMENTS = [
    {"id": "doc_001", "text": "AI coding assistants boost developer productivity and reduce boilerplate writing", "meta": {"title": "AI Coding Assistants"}},
    {"id": "doc_002", "text": "Differential privacy adds calibrated noise to datasets to protect individuals", "meta": {"title": "Differential Privacy"}},
    # ... add your own documents here
]

assert len(DOCUMENTS) > 0, "DOCUMENTS is empty — add your corpus before running"
print(f"Corpus: {len(DOCUMENTS)} documents")
```

Load Embedding Model and Encode Corpus
All document vectors and the query vector are pre-computed once and cached in doc_vectors. This avoids re-encoding across precision types — the same float32 values are sent to every index, quantized differently by each.
```python
print(f"Loading {MODEL_NAME} ...")
model = SentenceTransformer(MODEL_NAME)
print(f"Model loaded — dim={model.get_sentence_embedding_dimension()}\n")

# Encode every document once; reuse the vectors for all precision indexes
doc_vectors = {
    doc["id"]: model.encode(doc["text"]).tolist()
    for doc in DOCUMENTS
}

# Encode the benchmark query once
query_vec = model.encode(QUERY).tolist()

print(f"Encoded {len(doc_vectors)} documents")
print(f'Query : "{QUERY}"')
```

Connect to Endee and Prepare Indexes
One dense index is created per precision. Existing indexes with the same names are deleted first for a clean slate. precision=prec_enum is the only parameter that differs across indexes.
```python
client = Endee()
print("Connected to Endee\n")

indexes = {}  # precision_name -> Index object

for prec_name, prec_enum in PRECISIONS:
    index_name = f"{INDEX_PREFIX}_{prec_name}"

    # Delete if already exists (clean slate)
    try:
        client.delete_index(index_name)
        print(f"  Deleted existing index: {index_name}")
    except Exception:
        pass

    # Create a fresh dense index
    client.create_index(
        name=index_name,
        dimension=DENSE_DIM,
        space_type=SPACE_TYPE,
        precision=prec_enum,
        sparse_dim=0,
    )
    print(f"  Created {index_name:35s} precision={prec_name}")
    indexes[prec_name] = client.get_index(index_name)

print(f"\n{len(indexes)} indexes ready.")
```

Upsert Documents into Every Index
The identical float32 payload goes to all five indexes. Endee quantizes the input vectors to the target precision at upsert time — you always send float32; the index handles conversion internally.
```python
BATCH_SIZE = 1000  # Endee max vectors per upsert call

# The payload is identical for every index, so build it once
payload = [
    {
        "id": doc["id"],
        "vector": doc_vectors[doc["id"]],
        "meta": doc["meta"],
        "filter": {},
    }
    for doc in DOCUMENTS
]

for prec_name, _ in PRECISIONS:
    index = indexes[prec_name]
    # Upsert in batches of ≤1000 (Endee hard limit)
    for i in range(0, len(payload), BATCH_SIZE):
        index.upsert(payload[i : i + BATCH_SIZE])
    print(f"  Upserted {len(payload)} docs → {index.name}")

print("\nAll indexes populated.")
```

Benchmark — Speed and Recall
- Speed: run the same query `N_RUNS` times per index; record the median latency in milliseconds.
- Recall@K: use the `float32` result as ground truth, with Recall = |returned ∩ ground_truth| / K.

Three warm-up queries are run before timing begins to eliminate cold-start bias.
```python
results = {}  # prec_name -> {latencies, ids, hits, ...}

for prec_name, _ in PRECISIONS:
    index = indexes[prec_name]
    latencies = []

    # Warm-up: 3 un-timed queries to avoid cold-start bias
    for _ in range(3):
        index.query(vector=query_vec, top_k=TOP_K)

    # Timed runs
    for _ in range(N_RUNS):
        t0 = time.perf_counter()
        hits = index.query(vector=query_vec, top_k=TOP_K)
        t1 = time.perf_counter()
        latencies.append((t1 - t0) * 1000)  # ms

    results[prec_name] = {
        "latencies": latencies,
        "median_ms": statistics.median(latencies),
        "p95_ms": sorted(latencies)[int(0.95 * N_RUNS)],
        "ids": [r["id"] for r in hits],
        "hits": hits,
    }
    print(
        f"  {prec_name:<8} "
        f"median={results[prec_name]['median_ms']:6.2f} ms "
        f"p95={results[prec_name]['p95_ms']:6.2f} ms "
        f"top-1={hits[0]['id']} sim={hits[0]['similarity']:.4f}"
    )

print("\nBenchmark complete.")
```

Speed order (fastest → slowest): binary → int8 → int16 → float16 → float32

- binary — fastest. Hamming distance on packed bits uses the CPU SIMD popcount instruction, the cheapest distance operation available.
- int8 — second fastest. 8-bit integer arithmetic has lower overhead than wider integer or floating-point paths.
- int16 — third. Integer SIMD at 16 bits is faster than floating-point at the same bit width.
- float16 — fourth. Half-precision floating-point overhead is lower than float32 but higher than the integer paths.
- float32 — slowest. Full 32-bit IEEE 754 arithmetic is the baseline.
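To see why popcount is so cheap, here is a sketch of binary distance outside Endee: pack the sign bits of two vectors into bytes with NumPy, XOR, and count the set bits. This illustrates the mechanism only; Endee's internal bit layout may differ.

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Binarize: one sign bit per dimension, packed 8 to a byte (48 bytes total)
a_bits = np.packbits(a > 0)
b_bits = np.packbits(b > 0)

# Hamming distance = popcount of the XOR of the packed bit strings
hamming = int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())
print(f"Hamming distance: {hamming} / 384")
```

A single XOR plus popcount per 64-bit word replaces dozens of multiply-adds, which is where the speed advantage comes from.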
Compute Recall@K vs float32 Ground Truth
Recall@K = |returned ∩ ground_truth| / K. float32 results serve as ground truth. A recall of 1.0 means the quantized index returns the exact same top-K set.
```python
ground_truth = set(results["float32"]["ids"])
print(f"Ground truth (float32) top-{TOP_K}: {sorted(ground_truth)}\n")

print(f"{'Precision':<10} {'Recall@'+str(TOP_K):<12} {'Median ms':<12} {'p95 ms':<10} Returned IDs")
print("─" * 90)

for prec_name, _ in PRECISIONS:
    returned = set(results[prec_name]["ids"])
    recall = len(returned & ground_truth) / len(ground_truth)
    results[prec_name]["recall"] = recall
    print(
        f"{prec_name:<10} {recall:<12.3f} "
        f"{results[prec_name]['median_ms']:<12.2f} "
        f"{results[prec_name]['p95_ms']:<10.2f} "
        f"{results[prec_name]['ids']}"
    )
```

Recall order (highest → lowest): float32 → float16 → int16 → int8 → binary
Summary Table
Consolidates bits per dimension, memory saving, median/p95 latency, speedup vs float32, and Recall@K.
```python
BITS = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}
base_mem = BITS["float32"]
base_lat = results["float32"]["median_ms"]

print(f"{'Precision':<10} {'Bits':>5} {'Mem saving':>12} {'Median ms':>10} "
      f"{'p95 ms':>8} {'Speedup':>9} {'Recall@'+str(TOP_K):>10}")
print("─" * 72)

for prec_name, _ in PRECISIONS:
    bits = BITS[prec_name]
    mem_save = f"{base_mem / bits:.1f}×"
    med = results[prec_name]["median_ms"]
    p95 = results[prec_name]["p95_ms"]
    speedup = f"{base_lat / med:.2f}×"
    recall = results[prec_name]["recall"]
    marker = " ← baseline" if prec_name == "float32" else ""
    print(f"{prec_name:<10} {bits:>5} {mem_save:>12} {med:>10.2f} "
          f"{p95:>8.2f} {speedup:>9} {recall:>10.3f}{marker}")
```

Auto-recommendation
Filters precisions that meet RECALL_THRESHOLD (default 0.9), then recommends the qualifier with the highest recall, breaking ties by lower latency; the fastest qualifier is also flagged.
```python
RECALL_THRESHOLD = 0.9  # minimum acceptable recall

candidates = [
    (p, results[p]["median_ms"], results[p]["recall"])
    for p, _ in PRECISIONS
    if results[p]["recall"] >= RECALL_THRESHOLD
]

if candidates:
    # Highest recall (ties broken by lower latency) vs. outright fastest
    best = sorted(candidates, key=lambda x: (-x[2], x[1]))[0]
    fastest = sorted(candidates, key=lambda x: x[1])[0]
    for name, med, rec in sorted(candidates, key=lambda x: x[1]):
        tag = []
        if name == best[0]:
            tag.append("best recall")
        if name == fastest[0]:
            tag.append("fastest")
        label_str = (" ← " + " + ".join(tag)) if tag else ""
        print(f"  {name:<10} median={med:.2f} ms  recall={rec:.3f}{label_str}")
    print(f"\nRecommended: '{best[0]}'")
else:
    print(f"  No precision achieved recall ≥ {RECALL_THRESHOLD}.")
    print("  Consider raising ef_search or using float32.")
```

Cleanup
Deletes all five benchmark indexes to free storage.
```python
for prec_name, _ in PRECISIONS:
    index_name = f"{INDEX_PREFIX}_{prec_name}"
    try:
        client.delete_index(index_name)
        print(f"  Deleted {index_name}")
    except Exception as e:
        print(f"  Could not delete {index_name}: {e}")

print("\nCleanup complete.")
```

The Speed–Recall Trade-off
| Precision | Speed rank | Recall rank |
|---|---|---|
| float32 | Slowest (5th) | Highest — ground truth (1st) |
| float16 | 4th | 2nd |
| int16 | 3rd | 3rd |
| int8 | 2nd | 4th |
| binary | Fastest (1st) | Lowest (5th) |
Speed and recall move in opposite directions as quantization becomes more aggressive. No precision wins on both axes — every step away from float32 trades some recall for some speed.
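You can reproduce the shape of this trade-off without any database. A sketch on synthetic data (random Gaussian vectors, not Endee): compare a binary sign-bit top-K against the exact float32 cosine top-K. Real-corpus recall is typically higher, since embeddings are far from uniformly random.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 384)).astype(np.float32)
query = rng.normal(size=384).astype(np.float32)
K = 10

# float32 ground truth: exact cosine top-K
cos = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
ground_truth = set(np.argsort(-cos)[:K])

# binary: keep only the sign of each dimension;
# Hamming distance = number of disagreeing signs
hamming = ((docs > 0) != (query > 0)).sum(axis=1)
binary_top = set(np.argsort(hamming)[:K])

recall = len(ground_truth & binary_top) / K
print(f"Recall@{K} of binary vs float32: {recall:.2f}")
```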
Practical Recommendations
| If you need… | Use |
|---|---|
| Near-perfect recall + lower latency | float16 |
| Balance of speed and recall | int16 |
| High throughput, some recall loss ok | int8 |
| Maximum speed | binary |
| Binary + high final precision | binary → re-rank with float32 |
| Exact scores | float32 |
- float16 or int16 — best default for most production workloads. Both are faster than float32, use half the memory, and return results very close to float32.
- int8 — choose when speed matters more than recall. Good for high-throughput pipelines where a small recall loss is acceptable.
- binary — pair with a re-ranking step: retrieve top-K×5 with binary, then re-score with float32 vectors to recover precision.
- float32 — only when exact similarity scores are required downstream (score-threshold filtering, calibration, or audit logging).
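The binary-then-re-rank pattern above can be sketched in plain NumPy. This is a hypothetical standalone version; with Endee you would run the first stage against the binary index and keep float32 vectors on hand for re-scoring.

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(5000, 384)).astype(np.float32)
query = rng.normal(size=384).astype(np.float32)
K = 10

# Stage 1 (coarse): binary sign-bit Hamming retrieval of K*5 candidates
hamming = ((docs > 0) != (query > 0)).sum(axis=1)
candidates = np.argsort(hamming)[: K * 5]

# Stage 2 (fine): re-score only those candidates with exact float32 cosine
cand_vecs = docs[candidates]
cos = cand_vecs @ query / (np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(query))
final = candidates[np.argsort(-cos)[:K]]

print(f"Re-ranked top-{K}: {final.tolist()}")
```

The expensive cosine computation touches only K×5 vectors instead of the whole corpus, which is why this two-stage design recovers most of float32's precision at near-binary cost.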
Takeaways
Speed (fastest → slowest): binary → int8 → int16 → float16 → float32
Recall (highest → lowest): float32 → float16 → int16 → int8 → binary
Speed and recall are inversely ordered across all five precisions. Every gain in speed comes with a cost in recall. Choose the precision that sits at the right point on that curve for your workload.
Implementation uses Endee local mode and all-MiniLM-L6-v2 (384-dim).