Hybrid Search with BM25 and Dense Vectors
In this tutorial, you will:
- Load the SciFact dataset — ~5,000 scientific article abstracts and 1,100 search queries
- Create BM25 sparse embeddings for both documents and queries
- Build a hybrid Endee index that stores both keyword and semantic vectors
- Run hybrid queries that combine both signals and compare results
What is BM25? A classic keyword ranking algorithm. It scores documents by how often your search terms appear — and down-weights common words like “the” in favour of rare, specific ones.
What is hybrid search? Combining BM25 (exact keyword matches) with dense vectors (semantic meaning). BM25 catches exact term matches; dense vectors catch synonyms and paraphrases. Together they handle more queries well.
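To make the BM25 idea concrete, here is a toy scorer in plain Python — a simplified sketch using the standard k1/b defaults, not the endee-model implementation:

```python
# Minimal BM25 sketch: term frequency x inverse document frequency,
# dampened by document length. Toy corpus, standard hyperparameters.
import math

docs = [
    "vitamin d reduces cancer risk",
    "the trial measured vitamin d levels",
    "exercise reduces heart disease risk",
]
tokenized = [d.split() for d in docs]
avg_len = sum(len(t) for t in tokenized) / len(tokenized)
k1, b = 1.5, 0.75  # standard BM25 hyperparameters

def idf(term):
    # rarity weight: terms in fewer documents score higher
    n = sum(term in t for t in tokenized)
    return math.log((len(tokenized) - n + 0.5) / (n + 0.5) + 1)

def bm25(query, doc_tokens):
    score = 0.0
    for term in query.split():
        tf = doc_tokens.count(term)
        norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf(term) * tf * (k1 + 1) / (tf + norm)
    return score

ranked = sorted(range(len(docs)),
                key=lambda i: bm25("vitamin d cancer", tokenized[i]),
                reverse=True)
print([docs[i] for i in ranked])
# → the document containing all three query terms ranks first
```

Rare terms like "cancer" dominate the score, while a document with no query terms scores exactly zero — the behaviour the rest of this tutorial relies on.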
Query ──► [Dense Embed: all-MiniLM-L6-v2] ──► 384-dim vector ──┐
                                                               ├──► Endee Hybrid Rank ──► Top-K
Query ──► [Sparse Embed: BM25] ──► sparse vector ──────────────┘

Prerequisites: Endee running locally on http://127.0.0.1:8080
Why Two Separate Embedding Functions?
BM25 treats documents and queries differently. A long article should be scored differently from a 6-word query — applying the same formula to both would unfairly penalise short queries.
| Function | Use it for | What it applies |
|---|---|---|
| `SparseModel.embed(documents)` | Corpus / documents | Full BM25: word frequency × word rarity, adjusted for document length |
| `SparseModel.query_embed(query)` | Search queries | Word rarity only — no length penalty |
Rule: always use .embed() for documents and .query_embed() for queries. Mixing them produces incorrect BM25 scores.
Install Dependencies
Install endee and endee-model for the vector store and BM25 model, sentence-transformers for dense embeddings, and datasets to load SciFact from HuggingFace.
pip install "datasets<3.0.0" endee-model tqdm endee sentence-transformers

Import Libraries
import json
from pathlib import Path
from datasets import load_dataset
from endee_model import SparseModel
from tqdm import tqdm
print("Imports OK")

Configuration
| Variable | Default | Purpose |
|---|---|---|
| `DATASET_ID` | `BeIR/scifact` | HuggingFace dataset to load |
| `SPARSE_MODEL_ID` | `endee/bm25` | BM25 model identifier |
| `CORPUS_OUTPUT_PATH` | `scifact_corpus.jsonl` | Output file for document embeddings |
| `QUERIES_OUTPUT_PATH` | `scifact_queries.jsonl` | Output file for query embeddings |
| `BATCH_SIZE` | 256 | Documents per `.embed()` call |
DATASET_ID = "BeIR/scifact"
SPARSE_MODEL_ID = "endee/bm25"
CORPUS_OUTPUT_PATH = Path("scifact_corpus.jsonl")
QUERIES_OUTPUT_PATH = Path("scifact_queries.jsonl")
BATCH_SIZE = 256

Load the SciFact Corpus
SciFact contains ~5,000 PubMed article abstracts. Each record has an _id, a title, and a text field. The text (abstract) is what gets embedded.
print("Loading SciFact corpus ...")
corpus = load_dataset(DATASET_ID, "corpus", split="corpus")
print(f"Corpus loaded: {len(corpus):,} documents")

Load the SciFact Queries
SciFact queries are short scientific claim statements used to retrieve supporting or contradicting evidence from the corpus.
print("Loading SciFact queries ...")
queries = load_dataset(DATASET_ID, "queries", split="queries")
print(f"Queries loaded: {len(queries):,} queries")

Load the BM25 Sparse Model
SparseModel("endee/bm25") downloads the BM25 vocabulary and precomputed IDF weights on first use and caches them locally. Subsequent runs load from cache instantly.
print(f"Loading {SPARSE_MODEL_ID} ...")
sparse_model = SparseModel(model_name=SPARSE_MODEL_ID)
print("Sparse model ready")

What a Sparse Embedding Looks Like
Both .embed() and .query_embed() return a SparseEmbedding with two arrays:
| Attribute | Type | Meaning |
|---|---|---|
| `.indices` | `ndarray[int]` | Vocabulary token IDs with non-zero BM25 weight |
| `.values` | `ndarray[float]` | BM25 weight for each token ID |
Only the tokens that actually appear in the text get non-zero entries — everything else is zero and omitted. This is what makes it sparse. A typical abstract produces ~90 non-zero tokens; a short query produces ~9.
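Conceptually, a sparse vector is just a vocabulary-length dense vector with the zeros dropped — a minimal sketch:

```python
# Toy sketch: a sparse vector stores only the non-zero positions of a
# (conceptually vocabulary-sized) dense vector.
dense = [0.0, 0.82, 0.0, 0.0, 1.41, 0.0, 0.67]

indices = [i for i, v in enumerate(dense) if v != 0.0]
values = [dense[i] for i in indices]

print(indices)  # [1, 4, 6]
print(values)   # [0.82, 1.41, 0.67]
```

With a real vocabulary of tens of thousands of tokens, storing only the ~90 non-zero entries per abstract is what keeps the index compact.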
{
"id": "4983",
"sparse_vector": {
"indices": [412, 8901, 23445],
"values": [0.82, 1.41, 0.67]
},
"meta": {"text": "...", "title": "..."}
}

Create Corpus Embeddings — SparseModel.embed()
embed(documents, batch_size) applies full BM25 document-side weighting: TF × IDF with document-length normalisation. Each result is written to scifact_corpus.jsonl as one JSON record per line.
CORPUS_OUTPUT_PATH.unlink(missing_ok=True)
total_written = 0
total_skipped = 0
with open(CORPUS_OUTPUT_PATH, "w", encoding="utf-8") as f:
for start in tqdm(range(0, len(corpus), BATCH_SIZE), desc="Embedding corpus"):
batch = corpus[start : start + BATCH_SIZE]
texts = batch["text"] if isinstance(batch, dict) else [r["text"] for r in batch]
titles = batch["title"] if isinstance(batch, dict) else [r["title"] for r in batch]
ids = batch["_id"] if isinstance(batch, dict) else [r["_id"] for r in batch]
sparse_vecs = list(sparse_model.embed(texts, batch_size=BATCH_SIZE))
for i, sv in enumerate(sparse_vecs):
if sv is None or not sv.indices.tolist():
total_skipped += 1
continue
record = {
"id": ids[i],
"sparse_vector": {"indices": sv.indices.tolist(), "values": sv.values.tolist()},
"meta": {"text": texts[i], "title": titles[i]},
}
f.write(json.dumps(record) + "\n")
total_written += 1
print(f"\nCorpus embeddings saved → {CORPUS_OUTPUT_PATH}")
print(f" Written : {total_written:,} | Skipped: {total_skipped:,}")

Inspect a Corpus Embedding
Read the first line back to confirm the file is well-formed and see what a document-side BM25 sparse vector looks like.
with open(CORPUS_OUTPUT_PATH, "r", encoding="utf-8") as f:
sample = json.loads(f.readline())
print("Corpus sample")
print(f" id : {sample['id']}")
print(f" title : {sample['meta']['title']}")
print(f" text preview : {sample['meta']['text'][:100]}...")
print(f" non-zero tokens : {len(sample['sparse_vector']['indices'])}")
print(f" top-5 indices : {sample['sparse_vector']['indices'][:5]}")
print(f" top-5 values : {[round(v, 4) for v in sample['sparse_vector']['values'][:5]]}")

Create Query Embeddings — SparseModel.query_embed()
query_embed(query) applies BM25 query-side weighting: IDF-only, no term-frequency or document-length normalisation. Pass a single query string and consume the returned iterator with next().
Why not use .embed() for queries? .embed() applies length normalisation. A short query like “does vitamin D reduce cancer risk” would be penalised by its low token count, pushing all BM25 weights toward zero.
QUERIES_OUTPUT_PATH.unlink(missing_ok=True)
total_written = 0
total_skipped = 0
with open(QUERIES_OUTPUT_PATH, "w", encoding="utf-8") as f:
for record in tqdm(queries, desc="Embedding queries"):
qid = record["_id"]
text = record["text"]
sv = next(sparse_model.query_embed(text))
if sv is None or not sv.indices.tolist():
total_skipped += 1
continue
entry = {
"id": qid,
"sparse_vector": {"indices": sv.indices.tolist(), "values": sv.values.tolist()},
"meta": {"text": text},
}
f.write(json.dumps(entry) + "\n")
total_written += 1
print(f"\nQuery embeddings saved → {QUERIES_OUTPUT_PATH}")
print(f" Written : {total_written:,} | Skipped: {total_skipped:,}")

Inspect a Query Embedding
Read the first query back. Notice queries have far fewer non-zero tokens than documents — typically 5–10 — because queries are shorter. The values also carry no term-frequency component, so a word's weight does not grow if it repeats in the query.
with open(QUERIES_OUTPUT_PATH, "r", encoding="utf-8") as f:
sample_q = json.loads(f.readline())
print("Query sample")
print(f" id : {sample_q['id']}")
print(f" text : {sample_q['meta']['text']}")
print(f" non-zero tokens : {len(sample_q['sparse_vector']['indices'])}")
print(f" indices : {sample_q['sparse_vector']['indices']}")
print(f" values : {[round(v, 4) for v in sample_q['sparse_vector']['values']]}")

Summary Statistics
Verify both output files are complete and compute average sparsity. Documents average ~88 non-zero tokens; queries average ~9.
def file_stats(path):
count, total_nnz = 0, 0
with open(path, "r", encoding="utf-8") as f:
for line in f:
rec = json.loads(line)
total_nnz += len(rec["sparse_vector"]["indices"])
count += 1
return count, (total_nnz / count if count else 0)
corpus_count, corpus_avg = file_stats(CORPUS_OUTPUT_PATH)
queries_count, queries_avg = file_stats(QUERIES_OUTPUT_PATH)
print(f"{'File':<30} {'Records':>10} {'Avg non-zero tokens':>22}")
print("─" * 64)
print(f"{str(CORPUS_OUTPUT_PATH):<30} {corpus_count:>10,} {corpus_avg:>22.1f}")
print(f"{str(QUERIES_OUTPUT_PATH):<30} {queries_count:>10,} {queries_avg:>22.1f}")

Important — these embeddings only work with Endee. The BM25 sparse vectors generated here are designed specifically for Endee as the vector database. When creating the index you must set sparse_model="endee_bm25" — this tells Endee's server to apply the matching IDF weights on its side to pair with the TF weights stored in your JSONL files.
Connect to Endee and Create the Hybrid Index
A hybrid index needs two things beyond a standard dense index:
| Parameter | Value | Why |
|---|---|---|
| `dimension` | 384 | Dense vector size from all-MiniLM-L6-v2 |
| `space_type` | `"cosine"` | Similarity metric for dense vectors |
| `sparse_model` | `"endee_bm25"` | Tells Endee to apply BM25 server-side IDF weights |
sparse_model="endee_bm25" is what ties the client-side TF weights to the server-side IDF table. Without it, sparse scores will be incorrect.
from endee import Endee
INDEX_NAME = "scifact_bm25"
DENSE_DIM = 384
SPACE_TYPE = "cosine"
print("Connecting to Endee ...")
client = Endee()
print("Connected\n")
try:
client.delete_index(INDEX_NAME)
print(f" Deleted existing index: {INDEX_NAME}")
except Exception:
pass
client.create_index(
name=INDEX_NAME,
dimension=DENSE_DIM,
space_type=SPACE_TYPE,
sparse_model="endee_bm25",
)
index = client.get_index(INDEX_NAME)
print(f" Created index: {INDEX_NAME}")

Index the Corpus
Load each document from scifact_corpus.jsonl, compute its dense vector using all-MiniLM-L6-v2, and upsert both vectors into Endee. Documents are sent in batches of 1,000.
from sentence_transformers import SentenceTransformer
UPSERT_BATCH = 1000
print("Loading dense model for hybrid indexing ...")
dense_model = SentenceTransformer("all-MiniLM-L6-v2")
print(f"Dense model loaded — dim={dense_model.get_sentence_embedding_dimension()}\n")
total_indexed = 0
batch_records = []
batch_texts = []
def flush_batch(records, texts):
dense_vecs = dense_model.encode(texts)
points = []
for rec, dvec in zip(records, dense_vecs):
sv = rec["sparse_vector"]
if not sv["indices"]:
continue
points.append({
"id": rec["id"],
"vector": dvec.tolist(),
"sparse_indices": sv["indices"],
"sparse_values": sv["values"],
"meta": rec["meta"],
})
index.upsert(points)
return len(points)
with open(CORPUS_OUTPUT_PATH, "r", encoding="utf-8") as f:
for line in tqdm(f, desc="Indexing corpus"):
rec = json.loads(line)
if not rec["sparse_vector"]["indices"]:
continue
batch_records.append(rec)
batch_texts.append(rec["meta"]["text"])
if len(batch_records) >= UPSERT_BATCH:
total_indexed += flush_batch(batch_records, batch_texts)
batch_records, batch_texts = [], []
if batch_records:
total_indexed += flush_batch(batch_records, batch_texts)
print(f"\nIndexing complete — {total_indexed:,} documents indexed")

Run Hybrid Queries
Each query uses both signals at once:
- Dense vector — encodes the query text with `all-MiniLM-L6-v2` (semantic similarity)
- Sparse vector — loaded from `scifact_queries.jsonl`, produced by `query_embed()` (BM25 lexical match)
Endee fuses the two scores server-side using its hybrid ranking algorithm.
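For intuition, Reciprocal Rank Fusion (RRF) is one common way to combine a dense ranking with a sparse ranking. This is a hypothetical sketch — Endee's actual server-side fusion algorithm is not specified here:

```python
# Reciprocal Rank Fusion sketch: each document earns 1 / (k + rank) from
# every ranking it appears in; documents ranked well by BOTH signals
# rise to the top. This illustrates fusion in general, not Endee's
# specific algorithm.
def rrf(dense_ranked_ids, sparse_ranked_ids, k=60):
    scores = {}
    for ranking in (dense_ranked_ids, sparse_ranked_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_top = ["d3", "d1", "d2"]   # ids ranked by cosine similarity
sparse_top = ["d1", "d4", "d3"]  # ids ranked by BM25

print(rrf(dense_top, sparse_top))  # ['d1', 'd3', 'd4', 'd2']
```

Note how "d1" wins overall despite topping only the sparse list: appearing near the top of both rankings beats appearing first in one.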
TOP_K = 5
query_records = []
with open(QUERIES_OUTPUT_PATH, "r", encoding="utf-8") as f:
for line in f:
query_records.append(json.loads(line))
print(f"Loaded {len(query_records):,} queries\n")
def show_results(hits, label=""):
header = f" {'Rank':<5} {'Doc ID':<12} {'Score':<8} Title"
print(f"\n{'─' * len(header)}")
if label:
print(f" {label}")
print(header)
print(f" {'─' * 80}")
for rank, h in enumerate(hits, 1):
title = h["meta"].get("title", h["meta"].get("text", "")[:60])
print(f" {rank:<5} {h['id']:<12} {h['similarity']:<8.4f} {title}")
print()
for qrec in query_records[:3]:
query_text = qrec["meta"]["text"]
query_dense_vec = dense_model.encode(query_text).tolist()
hits = index.query(
vector=query_dense_vec,
sparse_indices=qrec["sparse_vector"]["indices"],
sparse_values=qrec["sparse_vector"]["values"],
top_k=TOP_K,
)
show_results(hits, label=f'Query [{qrec["id"]}]: {query_text}')

Cleanup
Removes the index from Endee. Safe to skip if you want to keep it for further experiments.
try:
client.delete_index(INDEX_NAME)
print(f"Deleted index: {INDEX_NAME}")
except Exception as e:
print(f"Could not delete {INDEX_NAME}: {e}")

Output File Format
Both JSONL files follow the same structure:
// A document record → scifact_corpus.jsonl
{
"id": "4983",
"sparse_vector": {
"indices": [412, 8901, 23445],
"values": [0.82, 1.41, 0.67]
},
"meta": {
"text": "Vitamin D supplementation reduces the risk of ...",
"title": "Vitamin D and Cancer Prevention"
}
}
// A query record → scifact_queries.jsonl
{
"id": "1",
"sparse_vector": {
"indices": [412, 23445],
"values": [1.12, 0.94]
},
"meta": {
"text": "Vitamin D supplementation is beneficial for cancer prevention."
}
}

| Field | What it stores |
|---|---|
| `id` | Original ID from the dataset |
| `sparse_vector.indices` | IDs of words that appear in the text |
| `sparse_vector.values` | BM25 score for each word |
| `meta.text` | The original text, kept for display |
| `meta.title` | Article title (documents only) |
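To see how these records combine at query time, the lexical part of the score reduces to a dot product over the token IDs that the query and document share — a conceptual sketch using the sample numbers above, not Endee's implementation (which also folds in its server-side IDF table):

```python
# Conceptual sketch: the lexical score between a query and a document is
# a dot product over shared token IDs. Values taken from the sample
# records above; Endee computes its real score server-side.
doc_vec = {412: 0.82, 8901: 1.41, 23445: 0.67}  # document sparse vector
query_vec = {412: 1.12, 23445: 0.94}            # query sparse vector

# only tokens present in BOTH vectors contribute; everything else is zero
score = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
print(round(score, 4))  # 1.5482
```

Token 8901 contributes nothing because it never appears in the query — exactly why sparse scoring rewards exact keyword overlap.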
Key Takeaways
- `.embed()` is for documents — full BM25 scoring with word frequency and length adjustment.
- `.query_embed()` is for queries — simplified BM25 with no length penalty, so short queries get fair scores.
- Never swap them — mixing up the two functions produces wrong BM25 scores.
- Sparse means most values are zero — only words that actually appear in the text get a score. Documents have more non-zero tokens than queries because they’re longer.
- The JSONL files are ready to index — load them directly into Endee with `index.upsert()`.
Dataset: BeIR/scifact via HuggingFace Datasets. Model: endee/bm25 via endee-model.