Vector quantization benchmark
Compare float32, float16, int16, int8, and binary vector formats to find the best speed-accuracy trade-off for your workload.
Picking a vector format is a balancing act. Moving from float32 down to binary makes queries faster and cuts memory use, but at some point result quality starts to degrade. This notebook tests all five formats that Endee supports and measures query speed and recall (how closely the results match those of float32).
The trade-off at a glance
| Precision | Bits/dim | Memory vs float32 | Speed | Recall |
|---|---|---|---|---|
| float32 | 32 | baseline | slowest | ground truth |
| float16 | 16 | 2x smaller | slightly faster | near-identical |
| int16 | 16 | 2x smaller | significantly faster | near-identical |
| int8 | 8 | 4x smaller | fast | noticeably lower |
| binary | 1 | 32x smaller | fastest | lowest |
The curve bends at int16: same memory saving as float16, a much bigger speed gain, and results that stay almost the same as float32.
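To make the memory column concrete, the arithmetic can be sketched as follows. The sketch assumes a 384-dimensional vector (the size configured later in this notebook) and counts only raw vector storage; any index or metadata overhead inside Endee is not modeled, and the helper name `bytes_per_vector` is made up for illustration.

```python
# Raw storage per vector for each format, ignoring index and metadata overhead.
BITS_PER_DIM = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}


def bytes_per_vector(dim: int, bits_per_dim: int) -> float:
    """Bytes needed to store one vector of `dim` components (hypothetical helper)."""
    return dim * bits_per_dim / 8


dim = 384  # assumed dimension, matching DENSE_DIM in the configuration
baseline = bytes_per_vector(dim, BITS_PER_DIM["float32"])
for name, bits in BITS_PER_DIM.items():
    size = bytes_per_vector(dim, bits)
    print(f"{name:<8} {size:>6.0f} bytes  {baseline / size:>3.0f}x reduction")
```

At this dimension a float32 vector takes 1536 bytes and a binary vector 48 bytes, which is where the 32x figure in the table comes from.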
Requirements
Endee running at http://127.0.0.1:8080.
Installation
pip install endee sentence-transformers
pip install numpy==2.0.0
Imports
Precision is an enum that sets the storage format when creating a collection. time and statistics measure and summarize query latency.
import time
import statistics
from getpass import getpass
from sentence_transformers import SentenceTransformer
from endee import Endee
from endee.constants import Precision
Configuration
All benchmark settings in one place. N_RUNS controls how many timed queries run per precision. QUERY is the single search query used across all five collections.
MODEL_NAME = "all-MiniLM-L6-v2"
DENSE_DIM = 384
SPACE_TYPE = "cosine"
TOP_K = 10
N_RUNS = 50
COLLECTION_PREFIX = "bench_precision"
QUERY = "AI applications in healthcare and medicine"
PRECISIONS = [
    ("float32", Precision.FLOAT32),
    ("float16", Precision.FLOAT16),
    ("int16", Precision.INT16),
    ("int8", Precision.INT8),
    ("binary", Precision.BINARY2),
]
Authentication
Local server
If NDD_AUTH_TOKEN is set, pass the same token:
client = Endee("ndd-auth-token")
client.set_base_url("http://127.0.0.1:8080/api/v1")
Endee Cloud
Create a token at app.endee.io:
client = Endee("your-serverless-token")
Creating benchmark collections
Five identical collections - one per precision. Existing collections with the same names are deleted first.
collections = {}
for prec_name, prec_enum in PRECISIONS:
    collection_name = f"{COLLECTION_PREFIX}_{prec_name}"
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass
    client.create_collection(
        name=collection_name,
        dimension=DENSE_DIM,
        space_type=SPACE_TYPE,
        precision=prec_enum,
    )
    collections[prec_name] = client.get_collection(collection_name)
    print(f"Created {collection_name} precision={prec_name}")
print(f"\n{len(collections)} collections ready")
Preparing the dataset
Replace the DOCUMENTS list with your own data to benchmark on real content. The same documents go into every collection so the only difference is the storage format.
DOCUMENTS = [
    {"id": "doc_001", "text": "Machine learning is a subset of artificial intelligence that focuses on building systems that learn from data.", "meta": {"title": "Machine Learning Basics"}},
    #...documents
]
assert len(DOCUMENTS) > 0, "DOCUMENTS is empty - add your corpus before running"
print(f"{len(DOCUMENTS)} documents ready")
Encoding documents and query
All vectors are computed once with all-MiniLM-L6-v2. The same float32 vectors are sent to every collection; Endee converts them to the target precision at write time.
model = SentenceTransformer(MODEL_NAME)
doc_vectors = {
    doc["id"]: model.encode(doc["text"]).tolist()
    for doc in DOCUMENTS
}
query_vec = model.encode(QUERY).tolist()
print(f"Encoded {len(doc_vectors)} documents")
print(f"Query: '{QUERY}'")
Indexing documents
The same float32 vectors are sent to all five collections. Endee converts them to the target precision at write time.
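For intuition about what a conversion to a lower precision can involve, here is a minimal NumPy sketch of two common schemes: symmetric int8 scalar quantization and sign-based binary packing. This is an illustration only and is not taken from Endee; the server's actual conversion may differ. The random vector stands in for a real embedding, and the helper names `to_int8`, `to_binary`, and `cosine` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vec = rng.standard_normal(384).astype(np.float32)  # stand-in for an embedding


def to_int8(v: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric scalar quantization: map [-max|v|, +max|v|] onto [-127, 127]."""
    scale = float(np.abs(v).max()) / 127.0
    q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)
    return q, scale


def to_binary(v: np.ndarray) -> np.ndarray:
    """Keep only the sign of each component, packed 8 components per byte."""
    return np.packbits(v > 0)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D arrays."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


q8, scale = to_int8(vec)
restored = q8.astype(np.float32) * scale  # dequantize to compare with the original
bits = to_binary(vec)

print(f"float32: {vec.nbytes} bytes")
print(f"int8:    {q8.nbytes} bytes, cosine to original = {cosine(vec, restored):.4f}")
print(f"binary:  {bits.nbytes} bytes")
```

The int8 copy stays very close in direction to the original, while the binary copy keeps only one bit per component, which is consistent with the larger recall drop expected for binary.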
for prec_name, _ in PRECISIONS:
    collection = collections[prec_name]
    payload = [
        {"id": doc["id"], "vector": doc_vectors[doc["id"]], "meta": doc["meta"], "filter": {}}
        for doc in DOCUMENTS
    ]
    collection.upsert(payload)
    print(f"Upserted {len(payload)} docs into {collection.name}")
Measuring query speed
Each collection runs the same query N_RUNS times. Three warm-up queries run first and are thrown away. Median and p95 latency are recorded.
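How the two summary statistics fall out of a list of timings can be checked on synthetic data first. The sketch below uses made-up latencies of 1 to 50 ms rather than real measurements, and applies the same median and index-based p95 that the benchmark loop uses.

```python
import statistics

n_runs = 50
# Synthetic latencies in ms: 1.0, 2.0, ..., 50.0 (not real measurements).
latencies = [float(ms) for ms in range(1, n_runs + 1)]

median_ms = statistics.median(latencies)
# Index-based p95: with 50 samples, int(0.95 * 50) = 47, i.e. the 48th smallest value.
p95_index = int(0.95 * n_runs)
p95_ms = sorted(latencies)[p95_index]

print(f"median={median_ms} ms  p95={p95_ms} ms (index {p95_index})")
```

With only 50 samples, p95 is determined by the three slowest runs, so it is noticeably noisier than the median; raising N_RUNS makes it more stable.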
results = {}
for prec_name, _ in PRECISIONS:
    collection = collections[prec_name]
    latencies = []
    for _ in range(3):
        collection.query(vector=query_vec, top_k=TOP_K)
    for _ in range(N_RUNS):
        t0 = time.perf_counter()
        hits = collection.query(vector=query_vec, top_k=TOP_K)
        t1 = time.perf_counter()
        latencies.append((t1 - t0) * 1000)
    results[prec_name] = {
        "latencies": latencies,
        "median_ms": statistics.median(latencies),
        "p95_ms": sorted(latencies)[int(0.95 * N_RUNS)],
        "ids": [r["id"] for r in hits],
        "hits": hits,
    }
    print(
        f" {prec_name:<8} "
        f"median={results[prec_name]['median_ms']:6.2f} ms "
        f"p95={results[prec_name]['p95_ms']:6.2f} ms "
        f"top-1={hits[0]['id']} sim={hits[0]['similarity']:.4f}"
    )
print("\nBenchmark complete")
Measuring recall
Recall@K measures how many of the top-K results from each precision match the float32 top-K. A score of 1.0 means the collection returned the exact same set.
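A toy example makes the definition concrete. The document IDs below are invented for illustration, and `recall_at_k` is a hypothetical helper, not part of the Endee client.

```python
def recall_at_k(returned_ids, truth_ids):
    """Fraction of the ground-truth IDs that also appear in the returned IDs."""
    truth = set(truth_ids)
    if not truth:
        raise ValueError("ground truth is empty")
    return len(set(returned_ids) & truth) / len(truth)


# Hypothetical top-4 results; the IDs are invented for illustration.
float32_top = ["doc_001", "doc_002", "doc_003", "doc_004"]
int8_top = ["doc_001", "doc_003", "doc_004", "doc_009"]

print(recall_at_k(float32_top, float32_top))  # 1.0
print(recall_at_k(int8_top, float32_top))     # 0.75
```

Because the comparison is set-based, the order of results within the top K does not affect the score; only membership does.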
ground_truth = set(results["float32"]["ids"])
print(f"Ground truth (float32) top-{TOP_K}: {sorted(ground_truth)}\n")
print(f"{'Precision':<10} {'Recall@'+str(TOP_K):<12} {'Median ms':<12} {'p95 ms':<10}")
print("-" * 56)
for prec_name, _ in PRECISIONS:
    returned = set(results[prec_name]["ids"])
    recall = len(returned & ground_truth) / len(ground_truth)
    results[prec_name]["recall"] = recall
    print(
        f"{prec_name:<10} {recall:<12.3f} "
        f"{results[prec_name]['median_ms']:<12.2f} "
        f"{results[prec_name]['p95_ms']:<10.2f}"
    )
Summary
One row per precision - bits per dimension, memory saving, latency, speedup over float32, and Recall@K.
BITS = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}
base_lat = results["float32"]["median_ms"]
print(f"{'Precision':<10} {'Bits':>5} {'Mem saving':>12} {'Median ms':>10} {'p95 ms':>8} {'Speedup':>9} {'Recall@'+str(TOP_K):>10}")
print("-" * 78)
for prec_name, _ in PRECISIONS:
    bits = BITS[prec_name]
    mem_save = f"{32 / bits:.1f}x"
    med = results[prec_name]["median_ms"]
    p95 = results[prec_name]["p95_ms"]
    speedup = f"{base_lat / med:.2f}x"
    recall = results[prec_name]["recall"]
    print(
        f"{prec_name:<10} {bits:>5} {mem_save:>12} {med:>10.2f} "
        f"{p95:>8.2f} {speedup:>9} {recall:>10.3f}"
    )
Choosing the best precision
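Keeps only the precisions whose recall meets RECALL_THRESHOLD (default 0.9). Among those, the recommendation is the one with the highest recall, with ties broken by lower median latency; the fastest passing precision is tagged separately: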
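The selection logic can be tried on made-up numbers before running it on real results. The measurements in this sketch are hypothetical; it filters by the recall threshold, then ranks by recall (descending) and median latency (ascending), the same sort key the selection code uses.

```python
# Hypothetical measurements: name -> (median latency in ms, recall). Not real results.
toy = {
    "float32": (10.0, 1.00),
    "int16": (6.0, 0.95),
    "int8": (4.0, 0.80),
}
threshold = 0.9

# Step 1: drop anything below the recall threshold.
passing = [(name, ms, rec) for name, (ms, rec) in toy.items() if rec >= threshold]

# Step 2: rank by recall (descending), then by latency (ascending) to break ties.
best = min(passing, key=lambda c: (-c[2], c[1]))
# The fastest passing format is tracked separately.
fastest = min(passing, key=lambda c: c[1])

print("passing:", [name for name, _, _ in passing])
print("best recall:", best[0])
print("fastest:", fastest[0])
```

Here the two tags land on different formats: float32 wins on recall while int16 is the fastest that still clears the threshold, so it is worth reading both before settling on one.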
RECALL_THRESHOLD = 0.9
labels = [p[0] for p in PRECISIONS]
candidates = [
    (p, results[p]["median_ms"], results[p]["recall"])
    for p in labels
    if results[p]["recall"] >= RECALL_THRESHOLD
]
print(f"Precisions with Recall@{TOP_K} >= {RECALL_THRESHOLD}:\n")
if candidates:
    best = sorted(candidates, key=lambda x: (-x[2], x[1]))[0]
    fastest = sorted(candidates, key=lambda x: x[1])[0]
    for name, med, rec in sorted(candidates, key=lambda x: x[1]):
        tags = []
        if name == best[0]:
            tags.append("best recall")
        if name == fastest[0]:
            tags.append("fastest")
        label_str = " <-- " + " + ".join(tags) if tags else ""
        print(f" {name:<10} median={med:.2f} ms recall={rec:.3f}{label_str}")
    print(f"\nRecommended: '{best[0]}'")
    print(f" Recall {best[2]:.3f} at {best[1]:.2f} ms median latency")
    if best[0] != "float32":
        speedup = results["float32"]["median_ms"] / best[1]
        mem = 32 / BITS[best[0]]
        print(f" {speedup:.2f}x faster and {mem:.1f}x less memory than float32")
else:
    print(f"No precision achieved recall >= {RECALL_THRESHOLD}.")
    print("Consider raising ef_search or using float32.")
Cleanup
for prec_name, _ in PRECISIONS:
    collection_name = f"{COLLECTION_PREFIX}_{prec_name}"
    try:
        client.delete_collection(collection_name)
        print(f"Deleted: {collection_name}")
    except Exception as e:
        print(f"Could not delete {collection_name}: {e}")
Key takeaways
| Precision | Speed gain over float32 | Memory saving | Recall impact |
|---|---|---|---|
| float32 | - | - | ground truth |
| float16 | minimal | 2x | none |
| int16 | significant | 2x | none |
| int8 | large | 4x | noticeable drop |
| binary | largest | 32x | substantial drop |
- The curve bends at int16: faster and no accuracy cost
- You always send float32 vectors; Endee handles the conversion at write time
- Use RECALL_THRESHOLD to let the numbers make the decision for your workload