Vector quantization benchmark

Compare float32, float16, int16, int8e, int8, and binary vector formats to find the best speed-accuracy trade-off for your workload.

Picking a vector format is a trade-off. Moving from float32 down to binary makes queries faster and cuts memory use, but past some point recall starts to drop. This notebook benchmarks all six formats that Endee supports, measuring query latency and recall (how closely each format's top results match the float32 results).

The trade-off at a glance

Precision	Bits/dim	Memory vs float32	Speed	Recall
`float32`	32	baseline	slowest	ground truth
`float16`	16	2x smaller	slightly faster	near-identical
`int16`	16	2x smaller	significantly faster	near-identical
`int8e`	8	4x smaller	fast	near-identical
`int8`	8	4x smaller	fast	noticeably lower
`binary`	1	32x smaller	fastest	lowest

int8e is the enhanced int8 format: it gives the same 4x memory saving as int8, with recall on par with int16. The curve bends at int16, which brings large memory and speed gains while the results stay almost identical to float32.
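
To make the memory column concrete, here is a minimal sketch that turns bits per dimension into bytes per vector for the 384-dimensional model used later in this notebook. It counts raw vector values only. Any per-vector scale factors, index structures, or metadata that Endee stores are not included, and the helper name raw_vector_bytes is made up for this example.

```python
# Bits per dimension for each format, copied from the table above.
BITS_PER_DIM = {
    "float32": 32, "float16": 16, "int16": 16,
    "int8e": 8, "int8": 8, "binary": 1,
}


def raw_vector_bytes(dim: int, bits_per_dim: int) -> float:
    """Bytes needed for the raw values of one vector (no index or metadata)."""
    return dim * bits_per_dim / 8


dim = 384  # output size of all-MiniLM-L6-v2
for name, bits in BITS_PER_DIM.items():
    size = raw_vector_bytes(dim, bits)
    ratio = 32 / bits  # how many times smaller than float32
    print(f"{name:<8} {size:>6.0f} bytes/vector  float32/{name} = {ratio:g}x")
```

Multiply the bytes per vector by the number of documents to estimate the raw storage for a whole collection.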

Requirements

An Endee Serverless token from app.endee.io.

Installation


pip install endee sentence-transformers
pip install numpy==2.0.0

Imports

time and statistics measure and summarize query latency.


import time
import statistics
from getpass import getpass
 
from sentence_transformers import SentenceTransformer
from endee import Endee

Configuration

All benchmark settings in one place. N_RUNS controls how many timed queries run per precision. QUERY is the single search query used across all six collections.


MODEL_NAME        = "all-MiniLM-L6-v2"
DENSE_DIM         = 384
SPACE_TYPE        = "cosine"
LIMIT             = 10
N_RUNS            = 50
COLLECTION_PREFIX = "bench_precision"
QUERY             = "AI applications in healthcare and medicine"
 
PRECISIONS = ["float32", "float16", "int16", "int8e", "int8", "binary"]

Authentication

Create a token at app.endee.io and pass it to the client:


# Read the token at runtime so it is not saved in the notebook.
client = Endee(getpass("Endee Serverless token: "))

Creating benchmark collections

Six identical collections - one per precision. Existing collections with the same names are deleted first. Precision is set per-field in the params dict.


collections = {}
 
for prec_name in PRECISIONS:
    col_name = f"{COLLECTION_PREFIX}_{prec_name}"
 
    try:
        client.delete_collection(col_name)
    except Exception:
        # Nothing to delete on a first run; carry on and create the collection.
        pass
 
    client.create_collection(
        name=col_name,
        fields=[
            {
                "name": "embedding",
                "type": "vector",
                "params": {
                    "dimension": DENSE_DIM,
                    "space_type": SPACE_TYPE,
                    "precision": prec_name,
                },
            },
        ],
    )
    collections[prec_name] = client.get_collection(col_name)
    print(f"Created {col_name}  precision={prec_name}")
 
print(f"\n{len(collections)} collections ready")

Preparing the dataset

Replace the DOCUMENTS list with your own data to benchmark on real content. The same documents go into every collection so the only difference is the storage format.


DOCUMENTS = [
    {"id": "doc_001", "text": "Machine learning is a subset of artificial intelligence that focuses on building systems that learn from data.", "meta": {"title": "Machine Learning Basics"}},
    #...documents
]
 
assert len(DOCUMENTS) > 0, "DOCUMENTS is empty - add your corpus before running"
print(f"{len(DOCUMENTS)} documents ready")
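
If your corpus is a plain list of strings, a small helper can put it into the same shape as the entries above. This is a sketch: the name build_documents and the fallback titles are assumptions made for this example, not part of the Endee client.

```python
def build_documents(texts, titles=None):
    """Wrap raw strings in the id/text/meta structure used by DOCUMENTS."""
    docs = []
    for i, text in enumerate(texts, start=1):
        # Use the supplied title if there is one, otherwise a generic label.
        title = titles[i - 1] if titles else f"Document {i}"
        docs.append({"id": f"doc_{i:03d}", "text": text, "meta": {"title": title}})
    return docs


sample = build_documents(
    ["Neural networks learn layered representations.", "Vector search finds nearest neighbours."],
    titles=["Neural Networks", "Vector Search"],
)
print(sample[0]["id"], sample[1]["meta"]["title"])
```

Assigning the result to DOCUMENTS lets the rest of the notebook run unchanged.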

Encoding documents and query

All vectors are computed once with all-MiniLM-L6-v2. The same float32 vectors are sent to every collection - Endee converts them to the target precision at write time.


model = SentenceTransformer(MODEL_NAME)
 
doc_vectors = {
    doc["id"]: model.encode(doc["text"]).tolist()
    for doc in DOCUMENTS
}
query_vec = model.encode(QUERY).tolist()
 
print(f"Encoded {len(doc_vectors)} documents")
print(f"Query: '{QUERY}'")
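
To build intuition for what a conversion to a lower precision involves, here is a generic sketch of two common schemes: symmetric scalar quantization to 8-bit integers, and sign-based binary quantization. It illustrates the general technique only. Endee's actual int8, int8e, and binary encodings are not described in this notebook and may differ, and all function names here are made up for the example.

```python
def quantize_int8(vec):
    """Symmetric scalar quantization: map floats to integers in [-127, 127]."""
    # One scale per vector; fall back to 1.0 for an all-zero vector.
    scale = max(abs(x) for x in vec) / 127 or 1.0
    return [round(x / scale) for x in vec], scale


def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


def binarize(vec):
    """Sign-based binary quantization: 1 for positive components, else 0."""
    return [1 if x > 0 else 0 for x in vec]


vec = [0.12, -0.40, 0.03, 0.25]
codes, scale = quantize_int8(vec)
approx = dequantize_int8(codes, scale)
max_err = max(abs(a - b) for a, b in zip(vec, approx))

print("int8 codes:", codes)
print("binary code:", binarize(vec))
print(f"largest reconstruction error: {max_err:.5f}")
```

The rounding error per component is bounded by half the scale, which is why 8-bit formats can stay close to float32, while the binary code keeps only the sign of each component.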

Upserting objects

Every collection receives the same objects: an id, the document metadata, an empty filter, and the float32 embedding. The only thing that differs between collections is the precision Endee stores the embedding in.


for prec_name in PRECISIONS:
    collection = collections[prec_name]
    objects = [
        {
            "id": doc["id"],
            "meta": doc["meta"],
            "filter": {},
            "fields": {"embedding": doc_vectors[doc["id"]]},
        }
        for doc in DOCUMENTS
    ]
    collection.upsert(objects)
    print(f"Upserted {len(objects)} docs into {collection.name}")

Measuring query speed

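Each collection runs the same query N_RUNS times. Three warm-up queries run first and are discarded, so one-off startup costs are less likely to skew the numbers. Median and p95 latency are recorded in milliseconds. Latency is timed on the client, so it covers the whole search call, not just the search work inside Endee.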


results = {}
 
for prec_name in PRECISIONS:
    collection = collections[prec_name]
    latencies  = []
 
    for _ in range(3):
        collection.search(fields={"embedding": {"query": query_vec, "limit": LIMIT}})
 
    for _ in range(N_RUNS):
        t0   = time.perf_counter()
        hits = collection.search(fields={"embedding": {"query": query_vec, "limit": LIMIT}})
        t1   = time.perf_counter()
        latencies.append((t1 - t0) * 1000)
 
    field_hits = hits["results"]["embedding"]
    results[prec_name] = {
        "latencies": latencies,
        "median_ms": statistics.median(latencies),
        "p95_ms":    sorted(latencies)[int(0.95 * N_RUNS)],
        "ids":       [r["id"] for r in field_hits],
        "hits":      field_hits,
    }
 
    print(
        f"  {prec_name:<8}  "
        f"median={results[prec_name]['median_ms']:6.2f} ms  "
        f"p95={results[prec_name]['p95_ms']:6.2f} ms  "
        f"top-1={field_hits[0]['id']}  sim={field_hits[0]['similarity']:.4f}"
    )
 
print("\nBenchmark complete")
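
The p95 index used above, int(0.95 * N_RUNS), works for the default of 50 runs. As a sketch of a reusable alternative, the hypothetical helper below computes the median and a nearest-rank 95th percentile for any non-empty list of latencies. It uses only the standard library and is not part of the Endee client.

```python
import statistics


def summarize_latencies(latencies_ms):
    """Return the median and nearest-rank p95 of latencies in milliseconds."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest rank: ceil(0.95 * n), done in integer arithmetic to avoid
    # floating-point surprises at exact boundaries.
    rank = -(-95 * n // 100)
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
    }


samples = [float(ms) for ms in range(1, 101)]  # synthetic values, 1.0 to 100.0
print(summarize_latencies(samples))
```

With few samples the p95 is dominated by one or two slow requests, so raise N_RUNS if tail latency matters for your decision.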

Measuring recall

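Recall@K is the fraction of the float32 top-K results that also appear in a given precision's top-K. A score of 1.0 means the collection returned exactly the same set of documents; the order within the top K is not considered. With a single query and ten results, recall moves in steps of 0.1, so treat small differences with caution.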


ground_truth = set(results["float32"]["ids"])
 
print(f"Ground truth (float32) top-{LIMIT}: {sorted(ground_truth)}\n")
print(f"{'Precision':<10} {'Recall@'+str(LIMIT):<12} {'Median ms':<12} {'p95 ms':<10}")
print("-" * 56)
 
for prec_name in PRECISIONS:
    returned = set(results[prec_name]["ids"])
    recall   = len(returned & ground_truth) / len(ground_truth)
    results[prec_name]["recall"] = recall
 
    print(
        f"{prec_name:<10} {recall:<12.3f} "
        f"{results[prec_name]['median_ms']:<12.2f} "
        f"{results[prec_name]['p95_ms']:<10.2f}"
    )
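
The recall calculation above can be isolated into a small function, which makes it easy to check on a toy case. This is a sketch with made-up ids; the name recall_at_k is an assumption for this example.

```python
def recall_at_k(returned_ids, ground_truth_ids):
    """Fraction of ground-truth ids that also appear in the returned ids."""
    truth = set(ground_truth_ids)
    if not truth:
        raise ValueError("ground truth is empty")
    return len(set(returned_ids) & truth) / len(truth)


truth = ["doc_001", "doc_002", "doc_003", "doc_004"]
returned = ["doc_004", "doc_001", "doc_009", "doc_003"]

# Three of the four ground-truth ids were returned, in a different order.
print(recall_at_k(returned, truth))  # 0.75
```

Because the comparison uses sets, a collection that returns the right documents in a different order still scores 1.0.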

Summary

One row per precision - bits per dimension, memory saving, latency, speedup over float32, and Recall@K.


BITS     = {"float32": 32, "float16": 16, "int16": 16, "int8e": 8, "int8": 8, "binary": 1}
base_lat = results["float32"]["median_ms"]
 
print(f"{'Precision':<10} {'Bits':>5} {'Mem saving':>12} {'Median ms':>10} {'p95 ms':>8} {'Speedup':>9} {'Recall@'+str(LIMIT):>10}")
print("-" * 78)
 
for prec_name in PRECISIONS:
    bits     = BITS[prec_name]
    mem_save = f"{32 / bits:.1f}x"
    med      = results[prec_name]["median_ms"]
    p95      = results[prec_name]["p95_ms"]
    speedup  = f"{base_lat / med:.2f}x"
    recall   = results[prec_name]["recall"]
 
    print(
        f"{prec_name:<10} {bits:>5} {mem_save:>12} {med:>10.2f} "
        f"{p95:>8.2f} {speedup:>9} {recall:>10.3f}"
    )

Choosing the best precision

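Keeps only the precisions that reach RECALL_THRESHOLD (0.9 here), then recommends the one with the highest recall, breaking ties by median latency. The fastest precision that passes is tagged separately in the output: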


RECALL_THRESHOLD = 0.9
 
candidates = [
    (p, results[p]["median_ms"], results[p]["recall"])
    for p in PRECISIONS
    if results[p]["recall"] >= RECALL_THRESHOLD
]
 
print(f"Precisions with Recall@{LIMIT} >= {RECALL_THRESHOLD}:\n")
 
if candidates:
    best    = sorted(candidates, key=lambda x: (-x[2], x[1]))[0]
    fastest = sorted(candidates, key=lambda x: x[1])[0]
 
    for name, med, rec in sorted(candidates, key=lambda x: x[1]):
        tags = []
        if name == best[0]:    tags.append("best recall")
        if name == fastest[0]: tags.append("fastest")
        label_str = "  <-- " + " + ".join(tags) if tags else ""
        print(f"  {name:<10}  median={med:.2f} ms  recall={rec:.3f}{label_str}")
 
    print(f"\nRecommended: '{best[0]}'")
    print(f"  Recall {best[2]:.3f} at {best[1]:.2f} ms median latency")
    if best[0] != "float32":
        speedup = results["float32"]["median_ms"] / best[1]
        mem     = 32 / BITS[best[0]]
        print(f"  {speedup:.2f}x faster and {mem:.1f}x less memory than float32")
else:
    print(f"No precision achieved recall >= {RECALL_THRESHOLD}.")
    print("Consider raising ef_search or using float32.")
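
The code above ranks the passing precisions by recall first and latency second. If raw speed matters more than the last bit of recall, the ordering can be flipped. The sketch below shows both rules side by side. The function pick_precision is hypothetical, and the numbers in example are made up for illustration; they are not measurements.

```python
def pick_precision(results, threshold=0.9, prefer="recall"):
    """Pick a precision among those whose recall meets the threshold."""
    passing = {p: r for p, r in results.items() if r["recall"] >= threshold}
    if not passing:
        return None
    if prefer == "speed":
        # Lowest latency first, higher recall as the tie-breaker.
        key = lambda p: (passing[p]["median_ms"], -passing[p]["recall"])
    else:
        # Highest recall first, lower latency as the tie-breaker.
        key = lambda p: (-passing[p]["recall"], passing[p]["median_ms"])
    return min(passing, key=key)


# Illustrative values only, not benchmark results.
example = {
    "float32": {"median_ms": 10.0, "recall": 1.0},
    "int16":   {"median_ms": 6.0,  "recall": 1.0},
    "int8":    {"median_ms": 4.0,  "recall": 0.9},
    "binary":  {"median_ms": 2.0,  "recall": 0.6},
}

print(pick_precision(example, prefer="recall"))  # int16
print(pick_precision(example, prefer="speed"))   # int8
```

Passing the notebook's own results dict works as well, since each entry carries median_ms and recall once the recall step has run.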

Cleanup


for prec_name in PRECISIONS:
    col_name = f"{COLLECTION_PREFIX}_{prec_name}"
    try:
        client.delete_collection(col_name)
        print(f"Deleted: {col_name}")
    except Exception as e:
        print(f"Could not delete {col_name}: {e}")

Key takeaways

Precision	Speed gain over float32	Memory saving	Recall impact
`float32`	-	-	ground truth
`float16`	minimal	2x	near-identical
`int16`	significant	2x	near-identical
`int8e`	large	4x	near-identical
`int8`	large	4x	noticeable drop
`binary`	largest	32x	substantial drop

int8e is the sweet spot for memory-constrained deployments - 4x smaller with recall on par with int16
The curve bends at int16 / int8e - faster with little to no accuracy cost
You always send float32 vectors - Endee handles the conversion at write time
Precision is set per-field in the collection’s params dict
Use RECALL_THRESHOLD to let the numbers make the decision for your workload