Choosing the Right Vector Precision
Every vector database eventually forces you to make a decision nobody warns you about: what numeric precision should your embeddings be stored in? The default answer is usually “float32 — it’s safest.” But is it actually the best choice?
This tutorial runs a controlled benchmark across all five precision modes supported by Endee. Run it on your own corpus to find out which precision fits your speed and recall requirements.
Prerequisites: Endee running locally on http://127.0.0.1:8080
What We Measure
The benchmark holds everything constant except one thing — the storage precision of the index:
- Corpus: your documents — plug in any text corpus
- Embedding model: `all-MiniLM-L6-v2` (384-dimensional dense vectors)
- Space type: cosine similarity
- Query: configurable via `QUERY`
- Top-K: configurable via `TOP_K`
- Timing: `N_RUNS` timed queries per precision; warm-up queries discarded to eliminate cold-start bias
- Recall baseline: `float32` top-K is treated as ground truth; every other precision is scored against it
Five separate dense indexes are created — one per precision — with identical vectors upserted into each. Same data, same query, different storage format.
The Five Precisions
| Precision | Bits/dim | Memory vs float32 | What it stores |
|---|---|---|---|
| float32 | 32 | 1× (baseline) | Full IEEE 754 single-precision float |
| float16 | 16 | 2× smaller | Half-precision IEEE 754 float |
| int16 | 16 | 2× smaller | 16-bit signed integer (linear quantization) |
| int8 | 8 | 4× smaller | 8-bit signed integer (aggressive quantization) |
| binary | 1 | 32× smaller | Single bit per dimension (Hamming distance) |
The memory savings compound fast. A 1-million-vector index at 384 dimensions costs ~1.5 GB in float32. The same index in binary fits in under 50 MB.
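The arithmetic behind those numbers is easy to verify. A minimal sketch, assuming raw vector storage only (real indexes add graph links and metadata on top):

```python
# Raw storage footprint per precision for a 1M-vector, 384-dim index.
# Assumption: vectors only; index structures add further overhead.
N_VECTORS = 1_000_000
DIM = 384

BITS_PER_DIM = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}

for name, bits in BITS_PER_DIM.items():
    size_bytes = N_VECTORS * DIM * bits // 8
    print(f"{name:<8} {size_bytes / 1e6:>8.0f} MB")
# float32 → 1536 MB, binary → 48 MB: the 32× ratio from the table above
```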
Install and Import
endee.constants.Precision is an enum that selects the storage format for each index. SentenceTransformer encodes the corpus and query; time and statistics handle latency measurement.
```shell
pip install endee sentence-transformers
```

```python
import time
import statistics

from sentence_transformers import SentenceTransformer

from endee import Endee
from endee.constants import Precision

print("Imports OK")
```

Configuration
TOP_K, N_RUNS, and QUERY are the three knobs that control the benchmark. Increase N_RUNS for more stable timing estimates. Change QUERY to test a different retrieval scenario on your workload.
```python
MODEL_NAME = "all-MiniLM-L6-v2"  # 384-dim dense model
DENSE_DIM = 384
SPACE_TYPE = "cosine"

TOP_K = 10    # results to retrieve per query
N_RUNS = 50   # query repetitions for timing

# All five precisions to benchmark
PRECISIONS = [
    ("float32", Precision.FLOAT32),
    ("float16", Precision.FLOAT16),
    ("int16", Precision.INT16),
    ("int8", Precision.INT8),
    ("binary", Precision.BINARY2),
]

INDEX_PREFIX = "bench_precision"

# The single query used for every index
QUERY = "AI applications in healthcare and medicine"
```

Define Your Corpus
Replace the DOCUMENTS list with your own corpus. Each document needs an id, a text field, and a meta dict. The benchmark code is dataset-agnostic.
The same documents are upserted into every precision index so the comparison is always fair: only storage precision changes, not the source text or embedding model.
```python
DOCUMENTS = [
    {"id": "doc_001", "text": "AI coding assistants boost developer productivity and reduce boilerplate writing", "meta": {"title": "AI Coding Assistants"}},
    {"id": "doc_002", "text": "Differential privacy adds calibrated noise to datasets to protect individuals", "meta": {"title": "Differential Privacy"}},
    # ... add your own documents here
]

assert len(DOCUMENTS) > 0, "DOCUMENTS is empty — add your corpus before running"
print(f"Corpus: {len(DOCUMENTS)} documents")
```

Load Embedding Model and Encode Corpus
All document vectors and the query vector are pre-computed once and cached in doc_vectors. This avoids re-encoding across precision types — the same float32 values are sent to every index, quantized differently by each.
```python
print(f"Loading {MODEL_NAME} ...")
model = SentenceTransformer(MODEL_NAME)
print(f"Model loaded — dim={model.get_sentence_embedding_dimension()}\n")

# Encode every document once; reuse the vectors for all precision indexes
doc_vectors = {
    doc["id"]: model.encode(doc["text"]).tolist()
    for doc in DOCUMENTS
}

# Encode the benchmark query once
query_vec = model.encode(QUERY).tolist()

print(f"Encoded {len(doc_vectors)} documents")
print(f'Query : "{QUERY}"')
```

Connect to Endee and Prepare Indexes
One dense index is created per precision. Existing indexes with the same names are deleted first for a clean slate. precision=prec_enum is the only parameter that differs across indexes.
```python
client = Endee()
print("Connected to Endee\n")

indexes = {}  # precision_name -> Index object

for prec_name, prec_enum in PRECISIONS:
    index_name = f"{INDEX_PREFIX}_{prec_name}"

    # Delete if already exists (clean slate)
    try:
        client.delete_index(index_name)
        print(f"  Deleted existing index: {index_name}")
    except Exception:
        pass

    # Create a fresh dense index
    client.create_index(
        name=index_name,
        dimension=DENSE_DIM,
        space_type=SPACE_TYPE,
        precision=prec_enum,
        sparse_dim=0,
    )
    print(f"  Created {index_name:35s} precision={prec_name}")
    indexes[prec_name] = client.get_index(index_name)

print(f"\n{len(indexes)} indexes ready.")
```

Upsert Documents into Every Index
The identical float32 payload goes to all five indexes. Endee quantizes the input vectors to the target precision at upsert time — you always send float32; the index handles conversion internally.
```python
BATCH_SIZE = 1000  # Endee max vectors per upsert call

# The payload is identical for every index, so build it once
payload = [
    {
        "id": doc["id"],
        "vector": doc_vectors[doc["id"]],
        "meta": doc["meta"],
        "filter": {},
    }
    for doc in DOCUMENTS
]

for prec_name, _ in PRECISIONS:
    index = indexes[prec_name]
    # Upsert in batches of ≤1000 (Endee hard limit)
    for i in range(0, len(payload), BATCH_SIZE):
        index.upsert(payload[i : i + BATCH_SIZE])
    print(f"  Upserted {len(payload)} docs → {index.name}")

print("\nAll indexes populated.")
```

Benchmark — Speed and Recall
- Speed: run the same query `N_RUNS` times per index; record the median latency in milliseconds.
- Recall@K: use the `float32` result as ground truth, with Recall = |returned ∩ ground_truth| / K.

Three warm-up queries are run before timing begins to eliminate cold-start bias.
```python
results = {}  # prec_name -> {latencies, ids, hits, ...}

for prec_name, _ in PRECISIONS:
    index = indexes[prec_name]
    latencies = []

    # Warm-up: 3 un-timed queries to avoid cold-start bias
    for _ in range(3):
        index.query(vector=query_vec, top_k=TOP_K)

    # Timed runs
    for _ in range(N_RUNS):
        t0 = time.perf_counter()
        hits = index.query(vector=query_vec, top_k=TOP_K)
        t1 = time.perf_counter()
        latencies.append((t1 - t0) * 1000)  # ms

    results[prec_name] = {
        "latencies": latencies,
        "median_ms": statistics.median(latencies),
        "p95_ms": sorted(latencies)[int(0.95 * N_RUNS)],
        "ids": [r["id"] for r in hits],
        "hits": hits,
    }
    print(
        f"  {prec_name:<8} "
        f"median={results[prec_name]['median_ms']:6.2f} ms "
        f"p95={results[prec_name]['p95_ms']:6.2f} ms "
        f"top-1={hits[0]['id']} sim={hits[0]['similarity']:.4f}"
    )

print("\nBenchmark complete.")
```

Speed order (fastest → slowest): binary → int8 → int16 → float16 → float32

- binary — fastest. Hamming distance on packed bits uses the CPU SIMD popcount instruction, the cheapest distance operation available.
- int8 — second fastest. 8-bit integer arithmetic has lower overhead than wider integer or floating-point paths.
- int16 — third. Integer SIMD at 16 bits is faster than floating-point at the same bit width.
- float16 — fourth. Half-precision floating-point overhead is lower than float32 but higher than the integer paths.
- float32 — slowest. Full 32-bit IEEE 754 arithmetic is the baseline.
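To see why popcount is so cheap, here is a sketch of binary distance outside Endee: pack the sign bits of two vectors into bytes with NumPy, XOR, and count the set bits. This illustrates the mechanism only; Endee's internal bit layout may differ.

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Binarize: one sign bit per dimension, packed 8 to a byte (48 bytes total)
a_bits = np.packbits(a > 0)
b_bits = np.packbits(b > 0)

# Hamming distance = popcount of the XOR of the packed bit strings
hamming = int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())
print(f"Hamming distance: {hamming} / 384")
```

A single XOR plus popcount per 64-bit word replaces dozens of multiply-adds, which is where the speed advantage comes from.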
Compute Recall@K vs float32 Ground Truth
Recall@K = |returned ∩ ground_truth| / K. float32 results serve as ground truth. A recall of 1.0 means the quantized index returns the exact same top-K set.
```python
ground_truth = set(results["float32"]["ids"])
print(f"Ground truth (float32) top-{TOP_K}: {sorted(ground_truth)}\n")

print(f"{'Precision':<10} {'Recall@'+str(TOP_K):<12} {'Median ms':<12} {'p95 ms':<10} Returned IDs")
print("─" * 90)

for prec_name, _ in PRECISIONS:
    returned = set(results[prec_name]["ids"])
    recall = len(returned & ground_truth) / len(ground_truth)
    results[prec_name]["recall"] = recall
    print(
        f"{prec_name:<10} {recall:<12.3f} "
        f"{results[prec_name]['median_ms']:<12.2f} "
        f"{results[prec_name]['p95_ms']:<10.2f} "
        f"{results[prec_name]['ids']}"
    )
```

Recall order (highest → lowest): float32 → float16 → int16 → int8 → binary
Summary Table
Consolidates bits per dimension, memory saving, median/p95 latency, speedup vs float32, and Recall@K.
```python
BITS = {"float32": 32, "float16": 16, "int16": 16, "int8": 8, "binary": 1}
base_mem = BITS["float32"]
base_lat = results["float32"]["median_ms"]

print(f"{'Precision':<10} {'Bits':>5} {'Mem saving':>12} {'Median ms':>10} "
      f"{'p95 ms':>8} {'Speedup':>9} {'Recall@'+str(TOP_K):>10}")
print("─" * 72)

for prec_name, _ in PRECISIONS:
    bits = BITS[prec_name]
    mem_save = f"{base_mem / bits:.1f}×"
    med = results[prec_name]["median_ms"]
    p95 = results[prec_name]["p95_ms"]
    speedup = f"{base_lat / med:.2f}×"
    recall = results[prec_name]["recall"]
    marker = " ← baseline" if prec_name == "float32" else ""
    print(f"{prec_name:<10} {bits:>5} {mem_save:>12} {med:>10.2f} "
          f"{p95:>8.2f} {speedup:>9} {recall:>10.3f}{marker}")
```

Auto-recommendation
Filters precisions that meet RECALL_THRESHOLD (default 0.9), then recommends the qualifier with the highest recall, breaking ties by lower latency; the fastest qualifier is also flagged.
```python
RECALL_THRESHOLD = 0.9  # minimum acceptable recall

candidates = [
    (p, results[p]["median_ms"], results[p]["recall"])
    for p, _ in PRECISIONS
    if results[p]["recall"] >= RECALL_THRESHOLD
]

if candidates:
    # Highest recall (ties broken by lower latency) vs. outright fastest
    best = sorted(candidates, key=lambda x: (-x[2], x[1]))[0]
    fastest = sorted(candidates, key=lambda x: x[1])[0]
    for name, med, rec in sorted(candidates, key=lambda x: x[1]):
        tag = []
        if name == best[0]:
            tag.append("best recall")
        if name == fastest[0]:
            tag.append("fastest")
        label_str = (" ← " + " + ".join(tag)) if tag else ""
        print(f"  {name:<10} median={med:.2f} ms  recall={rec:.3f}{label_str}")
    print(f"\nRecommended: '{best[0]}'")
else:
    print(f"  No precision achieved recall ≥ {RECALL_THRESHOLD}.")
    print("  Consider raising ef_search or using float32.")
```

Cleanup
Deletes all five benchmark indexes to free storage.
```python
for prec_name, _ in PRECISIONS:
    index_name = f"{INDEX_PREFIX}_{prec_name}"
    try:
        client.delete_index(index_name)
        print(f"  Deleted {index_name}")
    except Exception as e:
        print(f"  Could not delete {index_name}: {e}")

print("\nCleanup complete.")
```

The Speed–Recall Trade-off
| Precision | Speed rank | Recall rank |
|---|---|---|
| float32 | Slowest (5th) | Highest — ground truth (1st) |
| float16 | 4th | 2nd |
| int16 | 3rd | 3rd |
| int8 | 2nd | 4th |
| binary | Fastest (1st) | Lowest (5th) |
Speed and recall move in opposite directions as quantization becomes more aggressive. No precision wins on both axes — every step away from float32 trades some recall for some speed.
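You can reproduce the shape of this trade-off without any database. A sketch on synthetic data (random Gaussian vectors, not Endee): compare a binary sign-bit top-K against the exact float32 cosine top-K. Real-corpus recall is typically higher, since embeddings are far from uniformly random.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 384)).astype(np.float32)
query = rng.normal(size=384).astype(np.float32)
K = 10

# float32 ground truth: exact cosine top-K
cos = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
ground_truth = set(np.argsort(-cos)[:K])

# binary: keep only the sign of each dimension;
# Hamming distance = number of disagreeing signs
hamming = ((docs > 0) != (query > 0)).sum(axis=1)
binary_top = set(np.argsort(hamming)[:K])

recall = len(ground_truth & binary_top) / K
print(f"Recall@{K} of binary vs float32: {recall:.2f}")
```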
Practical Recommendations
| If you need… | Use |
|---|---|
| Near-perfect recall + lower latency | float16 |
| Balance of speed and recall | int16 |
| High throughput, some recall loss ok | int8 |
| Maximum speed | binary |
| Binary + high final precision | binary → re-rank with float32 |
| Exact scores | float32 |
- float16 or int16 — best default for most production workloads. Both are faster than float32, use half the memory, and return results very close to float32.
- int8 — choose when speed matters more than recall. Good for high-throughput pipelines where a small recall loss is acceptable.
- binary — pair with a re-ranking step: retrieve top-K×5 with binary, then re-score with float32 vectors to recover precision.
- float32 — only when exact similarity scores are required downstream (score-threshold filtering, calibration, or audit logging).
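The binary-then-re-rank pattern above can be sketched in plain NumPy. This is a hypothetical standalone version; with Endee you would run the first stage against the binary index and keep float32 vectors on hand for re-scoring.

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(5000, 384)).astype(np.float32)
query = rng.normal(size=384).astype(np.float32)
K = 10

# Stage 1 (coarse): binary sign-bit Hamming retrieval of K*5 candidates
hamming = ((docs > 0) != (query > 0)).sum(axis=1)
candidates = np.argsort(hamming)[: K * 5]

# Stage 2 (fine): re-score only those candidates with exact float32 cosine
cand_vecs = docs[candidates]
cos = cand_vecs @ query / (np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(query))
final = candidates[np.argsort(-cos)[:K]]

print(f"Re-ranked top-{K}: {final.tolist()}")
```

The expensive cosine computation touches only K×5 vectors instead of the whole corpus, which is why this two-stage design recovers most of float32's precision at near-binary cost.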
Takeaways
Speed (fastest → slowest): binary → int8 → int16 → float16 → float32
Recall (highest → lowest): float32 → float16 → int16 → int8 → binary
Speed and recall are inversely ordered across all five precisions. Every gain in speed comes with a cost in recall. Choose the precision that sits at the right point on that curve for your workload.
Implementation uses Endee local mode and all-MiniLM-L6-v2 (384-dim).