
Hybrid Search with BM25 and Dense Vectors

Time: 20–30 min · Level: Intermediate

In this tutorial, you will:

  • Load the SciFact dataset — ~5,000 scientific article abstracts and 1,100 search queries
  • Create BM25 sparse embeddings for both documents and queries
  • Build a hybrid Endee index that stores both keyword and semantic vectors
  • Run hybrid queries that combine both signals and compare results

What is BM25? A classic keyword ranking algorithm. It scores documents by how often your search terms appear — and down-weights common words like “the” in favour of rare, specific ones.
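
The idea can be sketched in a few lines of Python. This is a toy implementation of the standard BM25 scoring function with common k1/b defaults, not the endee/bm25 model used later in this tutorial:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Toy BM25: score one tokenised document against a query, given a small corpus."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)                         # documents containing the term
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)   # rare terms weigh more
        tf = doc_terms.count(term)                                  # occurrences in this document
        norm = k1 * (1 - b + b * len(doc_terms) / avg_len)          # length normalisation
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score

docs = [["vitamin", "d", "reduces", "cancer", "risk"],
        ["the", "cat", "sat", "on", "the", "mat"]]
print(bm25_score(["vitamin", "cancer"], docs[0], docs))  # matching document: positive score
print(bm25_score(["vitamin", "cancer"], docs[1], docs))  # no matching terms: 0.0
```

Common words occur in many documents, so their IDF (and hence their contribution) shrinks toward zero, which is the "down-weights common words" behaviour described above.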

What is hybrid search? Combining BM25 (exact keyword matches) with dense vectors (semantic meaning). BM25 catches exact term matches; dense vectors catch synonyms and paraphrases. Together they handle more queries well.

Query ──► [Dense Embed: all-MiniLM-L6-v2] ──► 384-dim vector ──┐
                                                               ├──► Endee Hybrid Rank ──► Top-K
Query ──► [Sparse Embed: BM25] ──► sparse vector ──────────────┘

Prerequisites: Endee running locally on http://127.0.0.1:8080


Why Two Separate Embedding Functions?

BM25 treats documents and queries differently. A long article should be scored differently from a 6-word query — applying the same formula to both would unfairly penalise short queries.

| Function | Use it for | What it applies |
|---|---|---|
| SparseModel.embed(documents) | Corpus / documents | Full BM25: word frequency × word rarity, adjusted for document length |
| SparseModel.query_embed(query) | Search queries | Word rarity only — no length penalty |

Rule: always use .embed() for documents and .query_embed() for queries. Mixing them produces incorrect BM25 scores.
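
To make the asymmetry concrete, here is a toy contrast of the two weighting schemes. The IDF values and corpus statistics below are made up for illustration; the real weights come from SparseModel:

```python
# Toy contrast of BM25 document-side vs query-side term weights.
# IDF table and corpus stats are invented; endee/bm25 computes its own.
IDF = {"vitamin": 1.8, "cancer": 1.2, "the": 0.01}   # made-up rarity weights
K1, B, AVG_LEN = 1.5, 0.75, 120                       # made-up BM25 constants / corpus stats

def doc_weight(term, tf, doc_len):
    """Document side: term frequency x rarity, damped by document length."""
    norm = K1 * (1 - B + B * doc_len / AVG_LEN)
    return IDF[term] * tf * (K1 + 1) / (tf + norm)

def query_weight(term):
    """Query side: rarity only -- no term-frequency or length component."""
    return IDF[term]

print(doc_weight("vitamin", tf=3, doc_len=200))  # boosted by repetition, damped by length
print(query_weight("vitamin"))                   # constant, regardless of query length
```

Because query_weight has no length term, a 6-word query and a 60-word query are treated the same, which is exactly why swapping the two functions breaks the scores.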


Install Dependencies

Install endee and endee-model for the vector store and BM25 model, sentence-transformers for dense embeddings, and datasets to load SciFact from HuggingFace.

pip install "datasets<3.0.0" endee-model tqdm endee sentence-transformers

Import Libraries

import json
from pathlib import Path

from datasets import load_dataset
from endee_model import SparseModel
from tqdm import tqdm

print("Imports OK")

Configuration

| Variable | Default | Purpose |
|---|---|---|
| DATASET_ID | BeIR/scifact | HuggingFace dataset to load |
| SPARSE_MODEL_ID | endee/bm25 | BM25 model identifier |
| CORPUS_OUTPUT_PATH | scifact_corpus.jsonl | Output file for document embeddings |
| QUERIES_OUTPUT_PATH | scifact_queries.jsonl | Output file for query embeddings |
| BATCH_SIZE | 256 | Documents per .embed() call |
DATASET_ID = "BeIR/scifact"
SPARSE_MODEL_ID = "endee/bm25"
CORPUS_OUTPUT_PATH = Path("scifact_corpus.jsonl")
QUERIES_OUTPUT_PATH = Path("scifact_queries.jsonl")
BATCH_SIZE = 256

Load the SciFact Corpus

SciFact contains ~5,000 PubMed article abstracts. Each record has an _id, a title, and a text field. The text (abstract) is what gets embedded.

print("Loading SciFact corpus ...")
corpus = load_dataset(DATASET_ID, "corpus", split="corpus")
print(f"Corpus loaded: {len(corpus):,} documents")

Load the SciFact Queries

SciFact queries are short scientific claim statements used to retrieve supporting or contradicting evidence from the corpus.

print("Loading SciFact queries ...")
queries = load_dataset(DATASET_ID, "queries", split="queries")
print(f"Queries loaded: {len(queries):,} queries")

Load the BM25 Sparse Model

SparseModel("endee/bm25") downloads the BM25 vocabulary and precomputed IDF weights on first use and caches them locally. Subsequent runs load from cache instantly.

print(f"Loading {SPARSE_MODEL_ID} ...")
sparse_model = SparseModel(model_name=SPARSE_MODEL_ID)
print("Sparse model ready")

What a Sparse Embedding Looks Like

Both .embed() and .query_embed() return a SparseEmbedding with two arrays:

| Attribute | Type | Meaning |
|---|---|---|
| .indices | ndarray[int] | Vocabulary token IDs with non-zero BM25 weight |
| .values | ndarray[float] | BM25 weight for each token ID |

Only the tokens that actually appear in the text get non-zero entries — everything else is zero and omitted. This is what makes it sparse. A typical abstract produces ~90 non-zero tokens; a short query produces ~9.

{
  "id": "4983",
  "sparse_vector": {
    "indices": [412, 8901, 23445],
    "values": [0.82, 1.41, 0.67]
  },
  "meta": {"text": "...", "title": "..."}
}
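
At retrieval time, matching a sparse query against a sparse document comes down to a dot product over the indices the two vectors share. A minimal, database-independent sketch using the record above (the query values are hypothetical):

```python
def sparse_dot(q_indices, q_values, d_indices, d_values):
    """Dot product of two sparse vectors stored as parallel index/value arrays."""
    doc = dict(zip(d_indices, d_values))
    # Only indices present in BOTH vectors contribute; everything else is zero.
    return sum(qv * doc[qi] for qi, qv in zip(q_indices, q_values) if qi in doc)

# Document tokens from the record above vs a query sharing two of them
doc_idx, doc_val = [412, 8901, 23445], [0.82, 1.41, 0.67]
qry_idx, qry_val = [412, 23445], [1.0, 1.0]

print(sparse_dot(qry_idx, qry_val, doc_idx, doc_val))  # 0.82*1.0 + 0.67*1.0
```

Token 8901 appears only in the document, so it contributes nothing; sparsity is what makes this cheap even with a vocabulary of tens of thousands of tokens.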

Create Corpus Embeddings — SparseModel.embed()

embed(documents, batch_size) applies full BM25 document-side weighting: TF × IDF with document-length normalisation. Each result is written to scifact_corpus.jsonl as one JSON record per line.

CORPUS_OUTPUT_PATH.unlink(missing_ok=True)

total_written = 0
total_skipped = 0

with open(CORPUS_OUTPUT_PATH, "w", encoding="utf-8") as f:
    for start in tqdm(range(0, len(corpus), BATCH_SIZE), desc="Embedding corpus"):
        batch = corpus[start : start + BATCH_SIZE]
        texts = batch["text"] if isinstance(batch, dict) else [r["text"] for r in batch]
        titles = batch["title"] if isinstance(batch, dict) else [r["title"] for r in batch]
        ids = batch["_id"] if isinstance(batch, dict) else [r["_id"] for r in batch]

        sparse_vecs = list(sparse_model.embed(texts, batch_size=BATCH_SIZE))

        for i, sv in enumerate(sparse_vecs):
            if sv is None or not sv.indices.tolist():
                total_skipped += 1
                continue
            record = {
                "id": ids[i],
                "sparse_vector": {"indices": sv.indices.tolist(), "values": sv.values.tolist()},
                "meta": {"text": texts[i], "title": titles[i]},
            }
            f.write(json.dumps(record) + "\n")
            total_written += 1

print(f"\nCorpus embeddings saved → {CORPUS_OUTPUT_PATH}")
print(f"  Written : {total_written:,} | Skipped: {total_skipped:,}")

Inspect a Corpus Embedding

Read the first line back to confirm the file is well-formed and see what a document-side BM25 sparse vector looks like.

with open(CORPUS_OUTPUT_PATH, "r", encoding="utf-8") as f:
    sample = json.loads(f.readline())

print("Corpus sample")
print(f"  id              : {sample['id']}")
print(f"  title           : {sample['meta']['title']}")
print(f"  text preview    : {sample['meta']['text'][:100]}...")
print(f"  non-zero tokens : {len(sample['sparse_vector']['indices'])}")
print(f"  top-5 indices   : {sample['sparse_vector']['indices'][:5]}")
print(f"  top-5 values    : {[round(v, 4) for v in sample['sparse_vector']['values'][:5]]}")

Create Query Embeddings — SparseModel.query_embed()

query_embed(query) applies BM25 query-side weighting: IDF-only, no term-frequency or document-length normalisation. Pass a single query string and consume the returned iterator with next().

Why not use .embed() for queries? .embed() applies length normalisation. A short query like “does vitamin D reduce cancer risk” would be penalised by its low token count, pushing all BM25 weights toward zero.

QUERIES_OUTPUT_PATH.unlink(missing_ok=True)

total_written = 0
total_skipped = 0

with open(QUERIES_OUTPUT_PATH, "w", encoding="utf-8") as f:
    for record in tqdm(queries, desc="Embedding queries"):
        qid = record["_id"]
        text = record["text"]

        sv = next(sparse_model.query_embed(text))
        if sv is None or not sv.indices.tolist():
            total_skipped += 1
            continue

        entry = {
            "id": qid,
            "sparse_vector": {"indices": sv.indices.tolist(), "values": sv.values.tolist()},
            "meta": {"text": text},
        }
        f.write(json.dumps(entry) + "\n")
        total_written += 1

print(f"\nQuery embeddings saved → {QUERIES_OUTPUT_PATH}")
print(f"  Written : {total_written:,} | Skipped: {total_skipped:,}")

Inspect a Query Embedding

Read the first query back. Notice queries have far fewer non-zero tokens than documents — typically 5–10 — because queries are shorter. Values are all 1.0 for short queries since there is no term-frequency component.

with open(QUERIES_OUTPUT_PATH, "r", encoding="utf-8") as f:
    sample_q = json.loads(f.readline())

print("Query sample")
print(f"  id              : {sample_q['id']}")
print(f"  text            : {sample_q['meta']['text']}")
print(f"  non-zero tokens : {len(sample_q['sparse_vector']['indices'])}")
print(f"  indices         : {sample_q['sparse_vector']['indices']}")
print(f"  values          : {[round(v, 4) for v in sample_q['sparse_vector']['values']]}")

Summary Statistics

Verify both output files are complete and compute average sparsity. Documents average ~88 non-zero tokens; queries average ~9.

def file_stats(path):
    count, total_nnz = 0, 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            total_nnz += len(rec["sparse_vector"]["indices"])
            count += 1
    return count, (total_nnz / count if count else 0)

corpus_count, corpus_avg = file_stats(CORPUS_OUTPUT_PATH)
queries_count, queries_avg = file_stats(QUERIES_OUTPUT_PATH)

print(f"{'File':<30} {'Records':>10} {'Avg non-zero tokens':>22}")
print("─" * 64)
print(f"{str(CORPUS_OUTPUT_PATH):<30} {corpus_count:>10,} {corpus_avg:>22.1f}")
print(f"{str(QUERIES_OUTPUT_PATH):<30} {queries_count:>10,} {queries_avg:>22.1f}")

Important — these embeddings only work with Endee. The BM25 sparse vectors generated here are designed specifically for Endee as the vector database. When creating the index you must set sparse_model="endee_bm25" — this tells Endee’s server to apply the matching IDF weights on its side to pair with the TF weights stored in your JSONL files.

Connect to Endee and Create the Hybrid Index

Creating a hybrid index takes the usual dense-index parameters plus one extra setting, the sparse model:

| Parameter | Value | Why |
|---|---|---|
| dimension | 384 | Dense vector size from all-MiniLM-L6-v2 |
| space_type | "cosine" | Similarity metric for dense vectors |
| sparse_model | "endee_bm25" | Tells Endee to apply BM25 server-side IDF weights |

sparse_model="endee_bm25" is what ties the client-side TF weights to the server-side IDF table. Without it, sparse scores will be incorrect.

from endee import Endee

INDEX_NAME = "scifact_bm25"
DENSE_DIM = 384
SPACE_TYPE = "cosine"

print("Connecting to Endee ...")
client = Endee()
print("Connected\n")

try:
    client.delete_index(INDEX_NAME)
    print(f"  Deleted existing index: {INDEX_NAME}")
except Exception:
    pass

client.create_index(
    name=INDEX_NAME,
    dimension=DENSE_DIM,
    space_type=SPACE_TYPE,
    sparse_model="endee_bm25",
)
index = client.get_index(INDEX_NAME)
print(f"  Created index: {INDEX_NAME}")

Index the Corpus

Load each document from scifact_corpus.jsonl, compute its dense vector using all-MiniLM-L6-v2, and upsert both vectors into Endee. Documents are sent in batches of 1,000.

from sentence_transformers import SentenceTransformer

UPSERT_BATCH = 1000

print("Loading dense model for hybrid indexing ...")
dense_model = SentenceTransformer("all-MiniLM-L6-v2")
print(f"Dense model loaded — dim={dense_model.get_sentence_embedding_dimension()}\n")

total_indexed = 0
batch_records = []
batch_texts = []

def flush_batch(records, texts):
    dense_vecs = dense_model.encode(texts)
    points = []
    for rec, dvec in zip(records, dense_vecs):
        sv = rec["sparse_vector"]
        if not sv["indices"]:
            continue
        points.append({
            "id": rec["id"],
            "vector": dvec.tolist(),
            "sparse_indices": sv["indices"],
            "sparse_values": sv["values"],
            "meta": rec["meta"],
        })
    index.upsert(points)
    return len(points)

with open(CORPUS_OUTPUT_PATH, "r", encoding="utf-8") as f:
    for line in tqdm(f, desc="Indexing corpus"):
        rec = json.loads(line)
        if not rec["sparse_vector"]["indices"]:
            continue
        batch_records.append(rec)
        batch_texts.append(rec["meta"]["text"])
        if len(batch_records) >= UPSERT_BATCH:
            total_indexed += flush_batch(batch_records, batch_texts)
            batch_records, batch_texts = [], []

if batch_records:
    total_indexed += flush_batch(batch_records, batch_texts)

print(f"\nIndexing complete — {total_indexed:,} documents indexed")

Run Hybrid Queries

Each query uses both signals at once:

  • Dense vector — encodes the query text with all-MiniLM-L6-v2 (semantic similarity)
  • Sparse vector — loaded from scifact_queries.jsonl, produced by query_embed() (BM25 lexical match)

Endee fuses the two scores server-side using its hybrid ranking algorithm.
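
Endee's exact fusion formula is internal to the server. One widely used technique for combining two ranked lists, shown here purely as an illustration of how fusion can work, is reciprocal rank fusion (RRF):

```python
def rrf(dense_ranking, sparse_ranking, k=60, top_k=5):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

dense = ["d3", "d1", "d7", "d2"]   # hypothetical IDs ranked by cosine similarity
sparse = ["d1", "d9", "d3", "d4"]  # the same corpus ranked by BM25
print(rrf(dense, sparse))          # documents ranked well in both lists rise to the top
```

RRF uses only ranks, never raw scores, so it sidesteps the problem of calibrating cosine similarities against BM25 magnitudes; score-based weighted sums are the other common option.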

TOP_K = 5

query_records = []
with open(QUERIES_OUTPUT_PATH, "r", encoding="utf-8") as f:
    for line in f:
        query_records.append(json.loads(line))
print(f"Loaded {len(query_records):,} queries\n")

def show_results(hits, label=""):
    header = f"  {'Rank':<5} {'Doc ID':<12} {'Score':<8} Title"
    print(f"\n{'─' * len(header)}")
    if label:
        print(f"  {label}")
    print(header)
    print(f"  {'─' * 80}")
    for rank, h in enumerate(hits, 1):
        title = h["meta"].get("title", h["meta"].get("text", "")[:60])
        print(f"  {rank:<5} {h['id']:<12} {h['similarity']:<8.4f} {title}")
    print()

for qrec in query_records[:3]:
    query_text = qrec["meta"]["text"]
    query_dense_vec = dense_model.encode(query_text).tolist()

    hits = index.query(
        vector=query_dense_vec,
        sparse_indices=qrec["sparse_vector"]["indices"],
        sparse_values=qrec["sparse_vector"]["values"],
        top_k=TOP_K,
    )
    show_results(hits, label=f'Query [{qrec["id"]}]: {query_text}')

Cleanup

Removes the index from Endee. Safe to skip if you want to keep it for further experiments.

try:
    client.delete_index(INDEX_NAME)
    print(f"Deleted index: {INDEX_NAME}")
except Exception as e:
    print(f"Could not delete {INDEX_NAME}: {e}")

Output File Format

Both JSONL files follow the same structure:

// A document record → scifact_corpus.jsonl
{
  "id": "4983",
  "sparse_vector": {
    "indices": [412, 8901, 23445],
    "values": [0.82, 1.41, 0.67]
  },
  "meta": {
    "text": "Vitamin D supplementation reduces the risk of ...",
    "title": "Vitamin D and Cancer Prevention"
  }
}

// A query record → scifact_queries.jsonl
{
  "id": "1",
  "sparse_vector": {
    "indices": [412, 23445],
    "values": [1.12, 0.94]
  },
  "meta": {
    "text": "Vitamin D supplementation is beneficial for cancer prevention."
  }
}
| Field | What it stores |
|---|---|
| id | Original ID from the dataset |
| sparse_vector.indices | IDs of words that appear in the text |
| sparse_vector.values | BM25 score for each word |
| meta.text | The original text, kept for display |
| meta.title | Article title (documents only) |

Key Takeaways

  • .embed() is for documents — full BM25 scoring with word frequency and length adjustment.
  • .query_embed() is for queries — simplified BM25 with no length penalty, so short queries get fair scores.
  • Never swap them — mixing up the two functions produces wrong BM25 scores.
  • Sparse means most values are zero — only words that actually appear in the text get a score. Documents have more non-zero tokens than queries because they’re longer.
  • The JSONL files are ready to index — load them directly into Endee with index.upsert().

Dataset: BeIR/scifact via HuggingFace Datasets. Model: endee/bm25 via endee-model.