Object quantization benchmark

Compare float32, float16, int16, int8, and binary object formats to find the best speed-accuracy trade-off for your workload.

Picking an object format is a balancing act. Moving from float32 down to binary makes queries faster and cuts memory use, but at some point the results start to degrade. This notebook tests all five formats that Endee supports and measures query speed and recall (how closely each format's results match those of float32).


The trade-off at a glance

Precision   Bits/dim   Memory vs float32   Speed                  Recall
float32     32         baseline            slowest                ground truth
float16     16         2x smaller          slightly faster        near-identical
int16       16         2x smaller          significantly faster   near-identical
int8        8          4x smaller          fast                   noticeably lower
binary      1          32x smaller         fastest                lowest

The curve bends at int16: same memory saving as float16, a much bigger speed gain, and results that stay almost the same as float32.
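
The memory column follows directly from the bits per dimension. As a rough sketch, assuming 384-dimensional vectors (the DENSE_DIM used later in this notebook) and counting only the raw vector payload, not any index or metadata overhead, the per-vector footprint can be worked out like this:

```python
# Raw storage per vector for each precision, ignoring any index overhead.
# BITS_PER_DIM mirrors the table above; DIM = 384 is an assumption taken
# from the embedding model used later in this notebook.
BITS_PER_DIM = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}
DIM = 384


def bytes_per_vector(dim: int, bits: int) -> int:
    """Bytes needed to store one vector, rounded up to a whole byte."""
    return (dim * bits + 7) // 8


footprint = {name: bytes_per_vector(DIM, bits) for name, bits in BITS_PER_DIM.items()}

for name, size in footprint.items():
    ratio = footprint["float32"] / size
    print(f"{name:<8} {size:>5} bytes/vector  {ratio:>4.0f}x vs float32")
```

At this dimension a float32 vector takes 1536 bytes and a binary one takes 48, which is where the 32x figure comes from.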


Requirements

Endee running at http://127.0.0.1:8080.


Installation

pip install endee sentence-transformers
pip install numpy==2.0.0

Imports

Precision is an enum that sets the storage format when creating a collection. time and statistics measure and summarize query latency.

import time
import statistics
from getpass import getpass

from sentence_transformers import SentenceTransformer
from endee import Endee
from endee.constants import Precision

Configuration

All benchmark settings in one place. N_RUNS controls how many timed queries run per precision. QUERY is the single search query used across all five collections.

MODEL_NAME = "all-MiniLM-L6-v2"
DENSE_DIM = 384
SPACE_TYPE = "cosine"
TOP_K = 10
N_RUNS = 50
COLLECTION_PREFIX = "bench_precision"
QUERY = "AI applications in healthcare and medicine"

# (label, enum) pairs; the label is reused in collection names and reports.
PRECISIONS = [
    ("float32", Precision.FLOAT32),
    ("float16", Precision.FLOAT16),
    ("int16", Precision.INT16),
    ("int8", Precision.INT8),
    ("binary", Precision.BINARY2),
]

Authentication

Local server

If the server was started with NDD_AUTH_TOKEN set, pass the same token to the client:

client = Endee("ndd-auth-token")
client.set_base_url("http://127.0.0.1:8080/api/v1")

Endee Cloud

Create a token at app.endee.io:

client = Endee("your-serverless-token")

Creating benchmark collections

Five collections are created, identical except for their precision. Existing collections with the same names are deleted first.

collections = {}

for prec_name, prec_enum in PRECISIONS:
    collection_name = f"{COLLECTION_PREFIX}_{prec_name}"

    # Drop any leftover collection from a previous run; ignore failures.
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass

    client.create_collection(
        name=collection_name,
        dimension=DENSE_DIM,
        space_type=SPACE_TYPE,
        precision=prec_enum,
    )
    collections[prec_name] = client.get_collection(collection_name)
    print(f"Created {collection_name} precision={prec_name}")

print(f"\n{len(collections)} collections ready")

Preparing the dataset

Replace the DOCUMENTS list with your own data to benchmark on real content. The same documents go into every collection so the only difference is the storage format.

DOCUMENTS = [
    {
        "id": "doc_001",
        "text": "Machine learning is a subset of artificial intelligence that focuses on building systems that learn from data.",
        "meta": {"title": "Machine Learning Basics"},
    },
    #...documents
]

assert len(DOCUMENTS) > 0, "DOCUMENTS is empty - add your corpus before running"
print(f"{len(DOCUMENTS)} documents ready")

Encoding documents and query

All objects (the embedding vectors) are computed once with all-MiniLM-L6-v2. The same float32 objects are sent to every collection, and Endee converts them to the target precision at write time.

model = SentenceTransformer(MODEL_NAME)

# One float32 vector per document, keyed by document ID.
doc_vectors = {
    doc["id"]: model.encode(doc["text"]).tolist()
    for doc in DOCUMENTS
}
query_vec = model.encode(QUERY).tolist()

print(f"Encoded {len(doc_vectors)} documents")
print(f"Query: '{QUERY}'")

Indexing documents

Each collection receives an identical payload: the document ID, its float32 vector, its metadata, and an empty filter.

for prec_name, _ in PRECISIONS:
    collection = collections[prec_name]
    payload = [
        {
            "id": doc["id"],
            "vector": doc_vectors[doc["id"]],
            "meta": doc["meta"],
            "filter": {},
        }
        for doc in DOCUMENTS
    ]
    collection.upsert(payload)
    print(f"Upserted {len(payload)} docs into {collection.name}")
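
The write-time conversion happens inside Endee and this notebook does not describe how it is done. To make the idea concrete, here is a minimal sketch of two common schemes: symmetric linear quantization to int8, and sign-based binarization. The function names and the choice of one scale per vector are assumptions for illustration, not Endee's actual implementation.

```python
from typing import List, Tuple


def quantize_int8(vec: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


def binarize(vec: List[float]) -> List[int]:
    """Keep only the sign of each dimension: one bit per value."""
    return [1 if x > 0 else 0 for x in vec]


vec = [0.5, -0.2, 0.1, 0.0]  # toy vector, illustration only
codes, scale = quantize_int8(vec)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))

print(codes)          # [127, -51, 25, 0]
print(binarize(vec))  # [1, 0, 1, 0]
print(f"max reconstruction error: {max_error:.5f}")
```

The int8 codes keep relative magnitudes with a small rounding error, while the binary codes keep only the sign. That difference is one intuition for why recall tends to fall as precision drops.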

Measuring query speed

Each collection runs the same query N_RUNS times. Three warm-up queries run first and are thrown away. Median and p95 latency are recorded.

results = {}

for prec_name, _ in PRECISIONS:
    collection = collections[prec_name]
    latencies = []

    # Warm-up queries: executed but not timed.
    for _ in range(3):
        collection.query(vector=query_vec, top_k=TOP_K)

    for _ in range(N_RUNS):
        t0 = time.perf_counter()
        hits = collection.query(vector=query_vec, top_k=TOP_K)
        t1 = time.perf_counter()
        latencies.append((t1 - t0) * 1000)  # seconds -> milliseconds

    results[prec_name] = {
        "latencies": latencies,
        "median_ms": statistics.median(latencies),
        "p95_ms": sorted(latencies)[int(0.95 * N_RUNS)],
        "ids": [r["id"] for r in hits],  # IDs from the last timed query
        "hits": hits,
    }
    print(
        f" {prec_name:<8} "
        f"median={results[prec_name]['median_ms']:6.2f} ms "
        f"p95={results[prec_name]['p95_ms']:6.2f} ms "
        f"top-1={hits[0]['id']} sim={hits[0]['similarity']:.4f}"
    )

print("\nBenchmark complete")
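
The timing pattern above (discard a few warm-up calls, time each call with time.perf_counter, then summarize) can be factored into a small reusable helper. This is a sketch only: time_calls and percentile are hypothetical names, and the nearest-rank percentile used here is one of several common definitions, so it can differ by one position from the index arithmetic in the loop above for some sample sizes.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(fn: Callable[[], object], n_runs: int, warmup: int = 3) -> Dict[str, float]:
    """Call fn warmup + n_runs times and summarize the timed runs in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    latencies = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - t0) * 1000)
    return {
        "runs": len(latencies),
        "median_ms": statistics.median(latencies),
        "p95_ms": percentile(latencies, 95),
    }


# Stand-in workload; in the notebook this would wrap a collection.query(...) call.
stats = time_calls(lambda: sum(range(10_000)), n_runs=50)
print(stats)
```

Because each measurement wraps the full client call, the numbers include network and serialization time as well as the search itself.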

Measuring recall

Recall@K measures how many of the top-K results from each precision match the float32 top-K. A score of 1.0 means the collection returned the exact same set.

ground_truth = set(results["float32"]["ids"])
print(f"Ground truth (float32) top-{TOP_K}: {sorted(ground_truth)}\n")

print(f"{'Precision':<10} {'Recall@'+str(TOP_K):<12} {'Median ms':<12} {'p95 ms':<10}")
print("-" * 56)

for prec_name, _ in PRECISIONS:
    returned = set(results[prec_name]["ids"])
    # Share of the float32 top-K that this precision also returned.
    recall = len(returned & ground_truth) / len(ground_truth)
    results[prec_name]["recall"] = recall
    print(
        f"{prec_name:<10} {recall:<12.3f} "
        f"{results[prec_name]['median_ms']:<12.2f} "
        f"{results[prec_name]['p95_ms']:<10.2f}"
    )
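
As a standalone illustration of the same set-overlap calculation, here is a small worked example. The recall_at_k helper and the document IDs are hypothetical, chosen only to show the arithmetic.

```python
from typing import Iterable


def recall_at_k(returned_ids: Iterable[str], truth_ids: Iterable[str]) -> float:
    """Fraction of ground-truth IDs that also appear in the returned IDs."""
    truth = set(truth_ids)
    if not truth:
        raise ValueError("ground truth must not be empty")
    return len(set(returned_ids) & truth) / len(truth)


# Hypothetical top-5 results, for illustration only.
float32_top5 = ["doc_001", "doc_004", "doc_007", "doc_010", "doc_013"]
int8_top5 = ["doc_001", "doc_007", "doc_004", "doc_021", "doc_013"]

print(recall_at_k(float32_top5, float32_top5))  # 1.0
print(recall_at_k(int8_top5, float32_top5))     # 0.8
```

Note that this measure ignores ranking: the int8 list above returns doc_004 and doc_007 in swapped order, yet only the missing doc_010 lowers the score.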

Summary

One row per precision: bits per dimension, memory saving, median and p95 latency, speedup over float32, and Recall@K.

BITS = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}
base_lat = results["float32"]["median_ms"]

print(f"{'Precision':<10} {'Bits':>5} {'Mem saving':>12} {'Median ms':>10} {'p95 ms':>8} {'Speedup':>9} {'Recall@'+str(TOP_K):>10}")
print("-" * 78)

for prec_name, _ in PRECISIONS:
    bits = BITS[prec_name]
    mem_save = f"{32 / bits:.1f}x"
    med = results[prec_name]["median_ms"]
    p95 = results[prec_name]["p95_ms"]
    speedup = f"{base_lat / med:.2f}x"
    recall = results[prec_name]["recall"]
    print(
        f"{prec_name:<10} {bits:>5} {mem_save:>12} {med:>10.2f} "
        f"{p95:>8.2f} {speedup:>9} {recall:>10.3f}"
    )

Choosing the best precision

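This step keeps only the precisions whose recall meets RECALL_THRESHOLD (0.9 here). Among those, it tags the fastest one and recommends the one with the highest recall, breaking ties by lower median latency: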

RECALL_THRESHOLD = 0.9

labels = [p[0] for p in PRECISIONS]
candidates = [
    (p, results[p]["median_ms"], results[p]["recall"])
    for p in labels
    if results[p]["recall"] >= RECALL_THRESHOLD
]

print(f"Precisions with Recall@{TOP_K} >= {RECALL_THRESHOLD}:\n")

if candidates:
    # Highest recall first; ties go to the lower median latency.
    best = sorted(candidates, key=lambda x: (-x[2], x[1]))[0]
    fastest = sorted(candidates, key=lambda x: x[1])[0]

    for name, med, rec in sorted(candidates, key=lambda x: x[1]):
        tags = []
        if name == best[0]:
            tags.append("best recall")
        if name == fastest[0]:
            tags.append("fastest")
        label_str = " <-- " + " + ".join(tags) if tags else ""
        print(f" {name:<10} median={med:.2f} ms recall={rec:.3f}{label_str}")

    print(f"\nRecommended: '{best[0]}'")
    print(f" Recall {best[2]:.3f} at {best[1]:.2f} ms median latency")
    if best[0] != "float32":
        speedup = results["float32"]["median_ms"] / best[1]
        mem = 32 / BITS[best[0]]
        print(f" {speedup:.2f}x faster and {mem:.1f}x less memory than float32")
else:
    print(f"No precision achieved recall >= {RECALL_THRESHOLD}.")
    print("Consider raising ef_search or using float32.")
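
The code above recommends the candidate with the highest recall and uses latency only as a tie-breaker. A different rule, "fastest candidate that passes the threshold", can give a different answer on the same data. The sketch below contrasts the two on made-up measurements; the numbers and function names are hypothetical and are not benchmark results.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical measurements, for illustration only: name -> (median_ms, recall).
SAMPLE: Dict[str, Tuple[float, float]] = {
    "float32": (4.0, 1.0),
    "float16": (3.6, 1.0),
    "int16": (2.5, 0.9),
    "int8": (2.0, 0.7),
    "binary": (1.2, 0.4),
}


def passing(results: Dict[str, Tuple[float, float]], threshold: float) -> List[str]:
    """Names whose recall meets the threshold, in insertion order."""
    return [name for name, (_, recall) in results.items() if recall >= threshold]


def highest_recall(results: Dict[str, Tuple[float, float]], threshold: float) -> Optional[str]:
    """Highest recall among passing candidates; ties go to the lower latency."""
    names = passing(results, threshold)
    if not names:
        return None
    return min(names, key=lambda n: (-results[n][1], results[n][0]))


def fastest_passing(results: Dict[str, Tuple[float, float]], threshold: float) -> Optional[str]:
    """Lowest median latency among passing candidates."""
    names = passing(results, threshold)
    if not names:
        return None
    return min(names, key=lambda n: results[n][0])


print(highest_recall(SAMPLE, 0.9))   # float16
print(fastest_passing(SAMPLE, 0.9))  # int16
```

Which rule fits depends on whether the threshold is meant as a floor to clear or as a target to maximize.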

Cleanup

for prec_name, _ in PRECISIONS:
    collection_name = f"{COLLECTION_PREFIX}_{prec_name}"
    try:
        client.delete_collection(collection_name)
        print(f"Deleted: {collection_name}")
    except Exception as e:
        print(f"Could not delete {collection_name}: {e}")

Key takeaways

Precision   Speed gain over float32   Memory saving   Recall impact
float32     -                         -               ground truth
float16     minimal                   2x              negligible
int16       significant               2x              negligible
int8        large                     4x              noticeable drop
binary      largest                   32x             substantial drop
  • The curve bends at int16: a much bigger speed gain than float16 for the same 2x memory saving, at almost no accuracy cost
  • You always send float32 objects; Endee handles the conversion at write time
  • Use RECALL_THRESHOLD to let the numbers make the decision for your workload