Vector Quantization Benchmark: float32 vs float16 vs int16 vs int8 vs binary
When you store vectors in a database you have to choose a precision - the numeric format each dimension is saved in. Higher precision stores more detail but uses more memory and is slower to search. Lower precision is faster and smaller but may return slightly different results.
This notebook benchmarks all five precision modes supported by Endee and measures two things for each: query speed and recall (how closely results match the float32 ground truth).
int16 is the recommended default. It is 2x faster and uses half the memory of float32, while returning results nearly identical to it. The benchmark below will show this clearly.
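To make "lower precision may return slightly different results" concrete, here is a minimal round trip through symmetric int8 scalar quantization. This is a standard scheme shown for intuition only, not necessarily the conversion Endee applies internally:

```python
import numpy as np

vec = np.array([0.12, -0.53, 0.98, -0.07], dtype=np.float32)

# Symmetric int8 quantization: scale by the largest magnitude, round, clip.
scale = np.abs(vec).max() / 127.0
quantized = np.clip(np.round(vec / scale), -127, 127).astype(np.int8)

# Dequantize to see the rounding error each dimension picked up.
recovered = quantized.astype(np.float32) * scale
max_err = float(np.abs(vec - recovered).max())
print(f"int8 codes: {quantized.tolist()}, max error: {max_err:.4f}")
```

Each dimension is off by at most half a quantization step, which is why int8 recall is close to, but not identical with, float32.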
| Precision | Bits per dim | Memory vs float32 | Notes |
|---|---|---|---|
| float32 | 32 | 1x (baseline) | Full precision - slowest |
| float16 | 16 | 2x smaller | Half-precision float |
| int16 | 16 | 2x smaller | Best balance of speed and recall - recommended |
| int8 | 8 | 4x smaller | Faster but lower recall |
| binary | 1 | 32x smaller | Fastest but lowest recall |
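The memory column translates directly into bytes. As a quick back-of-the-envelope check (the corpus size below is hypothetical, and real indexes add graph and metadata overhead on top of the raw vectors):

```python
DIM = 384
N_VECTORS = 1_000_000  # hypothetical corpus size, for illustration only

BITS = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}
for name, bits in BITS.items():
    # Raw vector payload only - index structures add overhead on top.
    mb = N_VECTORS * DIM * bits / 8 / 2**20
    print(f"{name:<8} {mb:8.1f} MiB ({32 // bits}x vs float32)")
```

At a million 384-dim vectors, float32 costs about 1.4 GiB of raw vector storage while int16 halves that and binary needs under 50 MiB.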
Prerequisite: Endee running at http://127.0.0.1:8080.
Install
Required packages:
- `endee` - the vector database client
- `sentence-transformers` - the local embedding model
- `numpy==2.0.0` - pinned to avoid compatibility issues

pip install endee sentence-transformers
pip install numpy==2.0.0

Imports
Precision is an enum that sets the storage format when creating an index. time and statistics measure and summarize query latency.
import time
import statistics
from getpass import getpass
from sentence_transformers import SentenceTransformer
from endee import Endee
from endee.constants import Precision
All benchmark settings in one place. N_RUNS controls how many timed queries run per precision - higher values give more stable estimates. QUERY is the single search query used across all five indexes.
MODEL_NAME = "all-MiniLM-L6-v2"
DENSE_DIM = 384
SPACE_TYPE = "cosine"
TOP_K = 10
N_RUNS = 50
INDEX_PREFIX = "bench_precision"
QUERY = "AI applications in healthcare and medicine"
PRECISIONS = [
    ("float32", Precision.FLOAT32),
    ("float16", Precision.FLOAT16),
    ("int16", Precision.INT16),
    ("int8", Precision.INT8),
    ("binary", Precision.BINARY),
]

Connect to Endee
Choose your connection method: local server or serverless cloud.
Local Server: If your server has NDD_AUTH_TOKEN set, pass the same token when initializing:
client = Endee("ndd-auth-token")
client.set_base_url("http://127.0.0.1:8080/api/v1")

Endee Serverless: Go to https://app.endee.io, create a token, then pass it here:

client = Endee("your-serverless-token")

Reads your API token via hidden input and connects to Endee. Five indexes will be created in the next step.
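The hidden-input cell itself is not shown above; a minimal sketch of what it might look like, assuming the local-server setup (the `NDD_AUTH_TOKEN` environment-variable fallback is an added convenience, not part of the original):

```python
import os
from getpass import getpass

from endee import Endee

# Read the token from the environment if set, otherwise prompt without echoing.
token = os.environ.get("NDD_AUTH_TOKEN") or getpass("Endee API token: ")

client = Endee(token)
client.set_base_url("http://127.0.0.1:8080/api/v1")
```

For the serverless option, skip `set_base_url` and pass your https://app.endee.io token instead.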
Create One Index Per Precision
Five indexes are created - one for each precision. Every index is identical except for the precision parameter. Any existing indexes with the same names are deleted first so each run starts clean.
indexes = {}
for prec_name, prec_enum in PRECISIONS:
    index_name = f"{INDEX_PREFIX}_{prec_name}"
    try:
        client.delete_index(index_name)
    except Exception:
        pass
    client.create_index(
        name=index_name,
        dimension=DENSE_DIM,
        space_type=SPACE_TYPE,
        precision=prec_enum,
    )
    indexes[prec_name] = client.get_index(index_name)
    tag = " <-- recommended" if prec_name == "int16" else ""
    print(f"Created {index_name} precision={prec_name}{tag}")

print(f"\n{len(indexes)} indexes ready")

Prepare Corpus
Replace the DOCUMENTS list with your own data if you want to benchmark on real content. Each document needs an id, a text field, and a meta dict. The same documents go into every precision index so the only variable in the benchmark is the storage format.
DOCUMENTS = [
    {"id": "doc_001", "text": "Machine learning is a subset of artificial intelligence that focuses on building systems that learn from data.", "meta": {"title": "Machine Learning Basics"}},
    # ...documents
]
assert len(DOCUMENTS) > 0, "DOCUMENTS is empty - add your corpus before running"
print(f"{len(DOCUMENTS)} documents ready")

Encode Corpus and Query
All document vectors and the query vector are computed once with all-MiniLM-L6-v2 and held in memory, so encoding never has to be repeated per index.
model = SentenceTransformer(MODEL_NAME)
doc_vectors = {
    doc["id"]: model.encode(doc["text"]).tolist()
    for doc in DOCUMENTS
}
query_vec = model.encode(QUERY).tolist()
print(f"Encoded {len(doc_vectors)} documents")
print(f"Query: '{QUERY}'")

Upsert Documents Into Every Index
The same float32 vectors are sent to all five indexes. Endee converts them to the target precision at write time - you always send float32 and the index handles conversion internally. This keeps the benchmark fair: only the storage format changes.
for prec_name, _ in PRECISIONS:
    index = indexes[prec_name]
    payload = [
        {"id": doc["id"], "vector": doc_vectors[doc["id"]], "meta": doc["meta"], "filter": {}}
        for doc in DOCUMENTS
    ]
    index.upsert(payload)
    print(f"Upserted {len(payload)} docs into {index.name}")

Benchmark - Speed
Each index runs the same query N_RUNS times. Three warm-up queries run first and are discarded to remove cold-start bias. Median and p95 latency are recorded for each precision.
Speed order from fastest to slowest: binary → int8 → int16 → float16 → float32
int16 is the sweet spot - it is meaningfully faster than float32 and float16, while being far more accurate than int8 and binary.
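Median and p95 behave quite differently on skewed samples, which is why both are reported. A quick check on synthetic latencies (made-up numbers, using the same p95 indexing as the benchmark code):

```python
import statistics

# Synthetic latencies in ms, with one slow outlier mimicking a cold-cache run.
sample = [10.5, 10.7, 10.8, 10.9, 11.0, 11.1, 11.2, 11.3, 12.0, 30.0]

median_ms = statistics.median(sample)             # robust to the outlier
p95_ms = sorted(sample)[int(0.95 * len(sample))]  # index into the sorted tail
print(f"median={median_ms:.2f} ms  p95={p95_ms:.2f} ms")
```

The median barely moves while p95 jumps to the outlier; with only 10 samples the p95 index lands on the single most extreme value, which is one reason N_RUNS is set to 50 rather than a handful.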
results = {}
for prec_name, _ in PRECISIONS:
    index = indexes[prec_name]
    latencies = []
    # Warm-up queries, discarded to remove cold-start bias.
    for _ in range(3):
        index.query(vector=query_vec, top_k=TOP_K)
    for _ in range(N_RUNS):
        t0 = time.perf_counter()
        hits = index.query(vector=query_vec, top_k=TOP_K)
        t1 = time.perf_counter()
        latencies.append((t1 - t0) * 1000)
    results[prec_name] = {
        "latencies": latencies,
        "median_ms": statistics.median(latencies),
        "p95_ms": sorted(latencies)[int(0.95 * N_RUNS)],
        "ids": [r["id"] for r in hits],  # results from the final timed run
        "hits": hits,
    }
    tag = " <-- best balance" if prec_name == "int16" else ""
    print(
        f"  {prec_name:<8} "
        f"median={results[prec_name]['median_ms']:6.2f} ms "
        f"p95={results[prec_name]['p95_ms']:6.2f} ms "
        f"top-1={hits[0]['id']} sim={hits[0]['similarity']:.4f}{tag}"
    )

print("\nBenchmark complete")

Recall vs float32 Ground Truth
Recall@K measures how many of the top-K results from each precision match the float32 top-K. A score of 1.0 means the index returned the exact same set. float32 is the ground truth because it is the most precise.
int16 consistently achieves recall very close to float32 and float16, while being significantly faster. It gives you the speed of a quantized format without meaningful accuracy loss - which is why it is the recommended default.
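Recall@K boils down to set intersection. A toy example with hypothetical doc ids, using the same formula as the benchmark code:

```python
# Hypothetical top-5 id sets, for illustration only.
ground_truth = {"doc_001", "doc_002", "doc_003", "doc_004", "doc_005"}
quantized = {"doc_001", "doc_002", "doc_003", "doc_005", "doc_009"}

# |intersection| / |ground truth| - order within the top-K is ignored.
recall = len(quantized & ground_truth) / len(ground_truth)
print(f"Recall@5 = {recall:.2f}")
```

Here the quantized index swapped one of the five true results for another document, giving Recall@5 = 0.80.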
ground_truth = set(results["float32"]["ids"])
print(f"Ground truth (float32) top-{TOP_K}: {sorted(ground_truth)}\n")
print(f"{'Precision':<10} {'Recall@'+str(TOP_K):<12} {'Median ms':<12} {'p95 ms':<10} Notes")
print("─" * 82)
for prec_name, _ in PRECISIONS:
    returned = set(results[prec_name]["ids"])
    recall = len(returned & ground_truth) / len(ground_truth)
    results[prec_name]["recall"] = recall
    if prec_name == "float32":
        note = "ground truth - baseline"
    elif prec_name == "int16":
        note = "best balance of speed and recall <-- RECOMMENDED"
    else:
        note = ""
    print(
        f"{prec_name:<10} {recall:<12.3f} "
        f"{results[prec_name]['median_ms']:<12.2f} "
        f"{results[prec_name]['p95_ms']:<10.2f} "
        f"{note}"
    )

Summary Table
One row per precision showing bits per dimension, memory saving, median and p95 latency, speedup over float32, and Recall@K. The int16 row is marked as the recommended choice.
BITS = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}
base_lat = results["float32"]["median_ms"]
print(f"{'Precision':<10} {'Bits':>5} {'Mem saving':>12} {'Median ms':>10} {'p95 ms':>8} {'Speedup':>9} {'Recall@'+str(TOP_K):>10}")
print("─" * 78)
for prec_name, _ in PRECISIONS:
    bits = BITS[prec_name]
    mem_save = f"{32 / bits:.1f}x"
    med = results[prec_name]["median_ms"]
    p95 = results[prec_name]["p95_ms"]
    speedup = f"{base_lat / med:.2f}x"
    recall = results[prec_name]["recall"]
    if prec_name == "float32":
        tag = " <-- baseline"
    elif prec_name == "int16":
        tag = " <-- RECOMMENDED"
    else:
        tag = ""
    print(
        f"{prec_name:<10} {bits:>5} {mem_save:>12} {med:>10.2f} "
        f"{p95:>8.2f} {speedup:>9} {recall:>10.3f}{tag}"
    )

Auto-Recommendation
Filters precisions that meet RECALL_THRESHOLD (default 0.9) and picks the fastest one that qualifies. In most workloads this will recommend int16 - it passes the recall threshold comfortably and is faster than both float32 and float16. Raise RECALL_THRESHOLD toward 1.0 if your application requires near-exact results.
RECALL_THRESHOLD = 0.9
labels = [p[0] for p in PRECISIONS]
candidates = [
    (p, results[p]["median_ms"], results[p]["recall"])
    for p in labels
    if results[p]["recall"] >= RECALL_THRESHOLD
]
print(f"Precisions with Recall@{TOP_K} >= {RECALL_THRESHOLD}:\n")
if candidates:
    best = sorted(candidates, key=lambda x: (-x[2], x[1]))[0]
    fastest = sorted(candidates, key=lambda x: x[1])[0]
    for name, med, rec in sorted(candidates, key=lambda x: x[1]):
        tags = []
        if name == best[0]:
            tags.append("best recall")
        if name == fastest[0]:
            tags.append("fastest")
        label_str = " <-- " + " + ".join(tags) if tags else ""
        print(f"  {name:<10} median={med:.2f} ms recall={rec:.3f}{label_str}")
    # The recommendation is the fastest precision that clears the recall bar.
    print(f"\nRecommended: '{fastest[0]}'")
    print(f"  Recall {fastest[2]:.3f} at {fastest[1]:.2f} ms median latency")
    if fastest[0] != "float32":
        speedup = results["float32"]["median_ms"] / fastest[1]
        mem = 32 / BITS[fastest[0]]
        print(f"  {speedup:.2f}x faster and {mem:.1f}x less memory than float32")
else:
    print(f"No precision achieved recall >= {RECALL_THRESHOLD}.")
    print("Consider raising ef_search or using float32.")

Cleanup
Deletes all five benchmark indexes.
for prec_name, _ in PRECISIONS:
    index_name = f"{INDEX_PREFIX}_{prec_name}"
    try:
        client.delete_index(index_name)
        print(f"Deleted: {index_name}")
    except Exception as e:
        print(f"Could not delete {index_name}: {e}")

Key Takeaways
- float32 is the baseline but slower and uses more memory
- float16 is slightly faster but offers only a minimal improvement over float32
- int16 is the recommended default - 2x faster than float32 with nearly identical recall
- int8 is faster still but recall drops noticeably
- binary is fastest but has the lowest recall
- Use RECALL_THRESHOLD to automatically recommend the best precision for your accuracy requirements
- The benchmark is fair because all vectors start as float32 - only the storage format changes
- Always encode documents and query once, then reuse across all precision tests