Hybrid Search with BM25 and Dense Vectors
In this tutorial, you will:
- Load the SciFact dataset — ~5,000 scientific article abstracts and 1,100 search queries
- Create BM25 sparse embeddings for both documents and queries
- Build a hybrid Endee index that stores both keyword and semantic vectors
- Run hybrid queries that combine both signals and compare results
What is BM25? A classic keyword ranking algorithm. It scores documents by how often your search terms appear — and down-weights common words like “the” in favour of rare, specific ones.
What is hybrid search? Combining BM25 (exact keyword matches) with dense vectors (semantic meaning). BM25 catches exact term matches; dense vectors catch synonyms and paraphrases. Together they handle more queries well.
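To make the BM25 idea concrete, here is a toy scorer in plain Python — a simplified sketch using the standard k1/b defaults, not the endee-model implementation:

```python
# Minimal BM25 sketch: term frequency x inverse document frequency,
# dampened by document length. Toy corpus, standard hyperparameters.
import math

docs = [
    "vitamin d reduces cancer risk",
    "the trial measured vitamin d levels",
    "exercise reduces heart disease risk",
]
tokenized = [d.split() for d in docs]
avg_len = sum(len(t) for t in tokenized) / len(tokenized)
k1, b = 1.5, 0.75  # standard BM25 hyperparameters

def idf(term):
    # rarity weight: terms in fewer documents score higher
    n = sum(term in t for t in tokenized)
    return math.log((len(tokenized) - n + 0.5) / (n + 0.5) + 1)

def bm25(query, doc_tokens):
    score = 0.0
    for term in query.split():
        tf = doc_tokens.count(term)
        norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf(term) * tf * (k1 + 1) / (tf + norm)
    return score

ranked = sorted(range(len(docs)),
                key=lambda i: bm25("vitamin d cancer", tokenized[i]),
                reverse=True)
print([docs[i] for i in ranked])
# → the document containing all three query terms ranks first
```

Rare terms like "cancer" dominate the score, while a document with no query terms scores exactly zero — the behaviour the rest of this tutorial relies on.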
Query ──► [Dense Embed: all-MiniLM-L6-v2] ──► 384-dim vector ──┐
                                                               ├──► Endee Hybrid Rank ──► Top-K
Query ──► [Sparse Embed: BM25] ──► sparse vector ──────────────┘

Prerequisites: Endee running locally on http://127.0.0.1:8080
Why Two Separate Embedding Functions?
BM25 treats documents and queries differently. A long article should be scored differently from a 6-word query — applying the same formula to both would unfairly penalise short queries.
| Function | Use it for | What it applies |
|---|---|---|
| `SparseModel.embed(documents)` | Corpus / documents | Full BM25: word frequency × word rarity, adjusted for document length |
| `SparseModel.query_embed(query)` | Search queries | Word rarity only — no length penalty |
Rule: always use .embed() for documents and .query_embed() for queries. Mixing them produces incorrect BM25 scores.
Install Dependencies
Install endee and endee-model for the vector store and BM25 model, sentence-transformers for dense embeddings, and datasets to load SciFact from HuggingFace.
pip install "datasets<3.0.0" endee-model tqdm endee sentence-transformers

Import Libraries
import json
from pathlib import Path
from datasets import load_dataset
from endee_model import SparseModel
from tqdm import tqdm
print("Imports OK")

Configuration
| Variable | Default | Purpose |
|---|---|---|
| `DATASET_ID` | `BeIR/scifact` | HuggingFace dataset to load |
| `SPARSE_MODEL_ID` | `endee/bm25` | BM25 model identifier |
| `CORPUS_OUTPUT_PATH` | `scifact_corpus.jsonl` | Output file for document embeddings |
| `QUERIES_OUTPUT_PATH` | `scifact_queries.jsonl` | Output file for query embeddings |
| `BATCH_SIZE` | 256 | Documents per `.embed()` call |
DATASET_ID = "BeIR/scifact"
SPARSE_MODEL_ID = "endee/bm25"
CORPUS_OUTPUT_PATH = Path("scifact_corpus.jsonl")
QUERIES_OUTPUT_PATH = Path("scifact_queries.jsonl")
BATCH_SIZE = 256

Load the SciFact Corpus
SciFact contains ~5,000 PubMed article abstracts. Each record has an _id, a title, and a text field. The text (abstract) is what gets embedded.
print("Loading SciFact corpus ...")
corpus = load_dataset(DATASET_ID, "corpus", split="corpus")
print(f"Corpus loaded: {len(corpus):,} documents")

Load the SciFact Queries
SciFact queries are short scientific claim statements used to retrieve supporting or contradicting evidence from the corpus.
print("Loading SciFact queries ...")
queries = load_dataset(DATASET_ID, "queries", split="queries")
print(f"Queries loaded: {len(queries):,} queries")

Load the BM25 Sparse Model
SparseModel("endee/bm25") downloads the BM25 vocabulary and precomputed IDF weights on first use and caches them locally. Subsequent runs load from cache instantly.
print(f"Loading {SPARSE_MODEL_ID} ...")
sparse_model = SparseModel(model_name=SPARSE_MODEL_ID)
print("Sparse model ready")

What a Sparse Embedding Looks Like
Both .embed() and .query_embed() return a SparseEmbedding with two arrays:
| Attribute | Type | Meaning |
|---|---|---|
| `.indices` | `ndarray[int]` | Vocabulary token IDs with non-zero BM25 weight |
| `.values` | `ndarray[float]` | BM25 weight for each token ID |
Only the tokens that actually appear in the text get non-zero entries — everything else is zero and omitted. This is what makes it sparse. A typical abstract produces ~90 non-zero tokens; a short query produces ~9.
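Conceptually, a sparse vector is just a vocabulary-length dense vector with the zeros dropped — a minimal sketch:

```python
# Toy sketch: a sparse vector stores only the non-zero positions of a
# (conceptually vocabulary-sized) dense vector.
dense = [0.0, 0.82, 0.0, 0.0, 1.41, 0.0, 0.67]

indices = [i for i, v in enumerate(dense) if v != 0.0]
values = [dense[i] for i in indices]

print(indices)  # [1, 4, 6]
print(values)   # [0.82, 1.41, 0.67]
```

With a real vocabulary of tens of thousands of tokens, storing only the ~90 non-zero entries per abstract is what keeps the index compact.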
{
"id": "4983",
"sparse_vector": {
"indices": [412, 8901, 23445],
"values": [0.82, 1.41, 0.67]
},
"meta": {"text": "...", "title": "..."}
}

Create Corpus Embeddings — SparseModel.embed()
embed(documents, batch_size) applies full BM25 document-side weighting: TF × IDF with document-length normalisation. Each result is written to scifact_corpus.jsonl as one JSON record per line.
CORPUS_OUTPUT_PATH.unlink(missing_ok=True)
total_written = 0
total_skipped = 0
with open(CORPUS_OUTPUT_PATH, "w", encoding="utf-8") as f:
for start in tqdm(range(0, len(corpus), BATCH_SIZE), desc="Embedding corpus"):
batch = corpus[start : start + BATCH_SIZE]
texts = batch["text"] if isinstance(batch, dict) else [r["text"] for r in batch]
titles = batch["title"] if isinstance(batch, dict) else [r["title"] for r in batch]
ids = batch["_id"] if isinstance(batch, dict) else [r["_id"] for r in batch]
sparse_vecs = list(sparse_model.embed(texts, batch_size=BATCH_SIZE))
for i, sv in enumerate(sparse_vecs):
if sv is None or not sv.indices.tolist():
total_skipped += 1
continue
record = {
"id": ids[i],
"sparse_vector": {"indices": sv.indices.tolist(), "values": sv.values.tolist()},
"meta": {"text": texts[i], "title": titles[i]},
}
f.write(json.dumps(record) + "\n")
total_written += 1
print(f"\nCorpus embeddings saved → {CORPUS_OUTPUT_PATH}")
print(f" Written : {total_written:,} | Skipped: {total_skipped:,}")

Inspect a Corpus Embedding
Read the first line back to confirm the file is well-formed and see what a document-side BM25 sparse vector looks like.
with open(CORPUS_OUTPUT_PATH, "r", encoding="utf-8") as f:
sample = json.loads(f.readline())
print("Corpus sample")
print(f" id : {sample['id']}")
print(f" title : {sample['meta']['title']}")
print(f" text preview : {sample['meta']['text'][:100]}...")
print(f" non-zero tokens : {len(sample['sparse_vector']['indices'])}")
print(f" top-5 indices : {sample['sparse_vector']['indices'][:5]}")
print(f" top-5 values : {[round(v, 4) for v in sample['sparse_vector']['values'][:5]]}")

Create Query Embeddings — SparseModel.query_embed()
query_embed(query) applies BM25 query-side weighting: IDF-only, no term-frequency or document-length normalisation. Pass a single query string and consume the returned iterator with next().
Why not use .embed() for queries? .embed() applies length normalisation. A short query like “does vitamin D reduce cancer risk” would be penalised by its low token count, pushing all BM25 weights toward zero.
QUERIES_OUTPUT_PATH.unlink(missing_ok=True)
total_written = 0
total_skipped = 0
with open(QUERIES_OUTPUT_PATH, "w", encoding="utf-8") as f:
for record in tqdm(queries, desc="Embedding queries"):
qid = record["_id"]
text = record["text"]
sv = next(sparse_model.query_embed(text))
if sv is None or not sv.indices.tolist():
total_skipped += 1
continue
entry = {
"id": qid,
"sparse_vector": {"indices": sv.indices.tolist(), "values": sv.values.tolist()},
"meta": {"text": text},
}
f.write(json.dumps(entry) + "\n")
total_written += 1
print(f"\nQuery embeddings saved → {QUERIES_OUTPUT_PATH}")
print(f" Written : {total_written:,} | Skipped: {total_skipped:,}")

Inspect a Query Embedding
Read the first query back. Notice queries have far fewer non-zero tokens than documents — typically 5–10 — because queries are shorter. The values also carry no term-frequency component, so a word's weight does not grow if it repeats in the query.
with open(QUERIES_OUTPUT_PATH, "r", encoding="utf-8") as f:
sample_q = json.loads(f.readline())
print("Query sample")
print(f" id : {sample_q['id']}")
print(f" text : {sample_q['meta']['text']}")
print(f" non-zero tokens : {len(sample_q['sparse_vector']['indices'])}")
print(f" indices : {sample_q['sparse_vector']['indices']}")
print(f" values : {[round(v, 4) for v in sample_q['sparse_vector']['values']]}")

Summary Statistics
Verify both output files are complete and compute average sparsity. Documents average ~88 non-zero tokens; queries average ~9.
def file_stats(path):
count, total_nnz = 0, 0
with open(path, "r", encoding="utf-8") as f:
for line in f:
rec = json.loads(line)
total_nnz += len(rec["sparse_vector"]["indices"])
count += 1
return count, (total_nnz / count if count else 0)
corpus_count, corpus_avg = file_stats(CORPUS_OUTPUT_PATH)
queries_count, queries_avg = file_stats(QUERIES_OUTPUT_PATH)
print(f"{'File':<30} {'Records':>10} {'Avg non-zero tokens':>22}")
print("─" * 64)
print(f"{str(CORPUS_OUTPUT_PATH):<30} {corpus_count:>10,} {corpus_avg:>22.1f}")
print(f"{str(QUERIES_OUTPUT_PATH):<30} {queries_count:>10,} {queries_avg:>22.1f}")

Important — these embeddings only work with Endee. The BM25 sparse vectors generated here are designed specifically for Endee as the vector database. When creating the index you must set sparse_model="endee_bm25" — this tells Endee's server to apply the matching IDF weights on its side to pair with the TF weights stored in your JSONL files.
Connect to Endee and Create the Hybrid Index
A hybrid index needs two things beyond a standard dense index:
| Parameter | Value | Why |
|---|---|---|
| `dimension` | 384 | Dense vector size from all-MiniLM-L6-v2 |
| `space_type` | `"cosine"` | Similarity metric for dense vectors |
| `sparse_model` | `"endee_bm25"` | Tells Endee to apply BM25 server-side IDF weights |
sparse_model="endee_bm25" is what ties the client-side TF weights to the server-side IDF table. Without it, sparse scores will be incorrect.
from endee import Endee
INDEX_NAME = "scifact_bm25"
DENSE_DIM = 384
SPACE_TYPE = "cosine"
print("Connecting to Endee ...")
client = Endee()
print("Connected\n")
try:
client.delete_index(INDEX_NAME)
print(f" Deleted existing index: {INDEX_NAME}")
except Exception:
pass
client.create_index(
name=INDEX_NAME,
dimension=DENSE_DIM,
space_type=SPACE_TYPE,
sparse_model="endee_bm25",
)
index = client.get_index(INDEX_NAME)
print(f" Created index: {INDEX_NAME}")

Index the Corpus
Load each document from scifact_corpus.jsonl, compute its dense vector using all-MiniLM-L6-v2, and upsert both vectors into Endee. Documents are sent in batches of 1,000.
from sentence_transformers import SentenceTransformer
UPSERT_BATCH = 1000
print("Loading dense model for hybrid indexing ...")
dense_model = SentenceTransformer("all-MiniLM-L6-v2")
print(f"Dense model loaded — dim={dense_model.get_sentence_embedding_dimension()}\n")
total_indexed = 0
batch_records = []
batch_texts = []
def flush_batch(records, texts):
dense_vecs = dense_model.encode(texts)
points = []
for rec, dvec in zip(records, dense_vecs):
sv = rec["sparse_vector"]
if not sv["indices"]:
continue
points.append({
"id": rec["id"],
"vector": dvec.tolist(),
"sparse_indices": sv["indices"],
"sparse_values": sv["values"],
"meta": rec["meta"],
})
index.upsert(points)
return len(points)
with open(CORPUS_OUTPUT_PATH, "r", encoding="utf-8") as f:
for line in tqdm(f, desc="Indexing corpus"):
rec = json.loads(line)
if not rec["sparse_vector"]["indices"]:
continue
batch_records.append(rec)
batch_texts.append(rec["meta"]["text"])
if len(batch_records) >= UPSERT_BATCH:
total_indexed += flush_batch(batch_records, batch_texts)
batch_records, batch_texts = [], []
if batch_records:
total_indexed += flush_batch(batch_records, batch_texts)
print(f"\nIndexing complete — {total_indexed:,} documents indexed")

Run Hybrid Queries
Each query uses both signals at once:
- Dense vector — encodes the query text with `all-MiniLM-L6-v2` (semantic similarity)
- Sparse vector — loaded from `scifact_queries.jsonl`, produced by `query_embed()` (BM25 lexical match)
Endee fuses the two scores server-side using its hybrid ranking algorithm.
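For intuition, Reciprocal Rank Fusion (RRF) is one common way to combine a dense ranking with a sparse ranking. This is a hypothetical sketch — Endee's actual server-side fusion algorithm is not specified here:

```python
# Reciprocal Rank Fusion sketch: each document earns 1 / (k + rank) from
# every ranking it appears in; documents ranked well by BOTH signals
# rise to the top. This illustrates fusion in general, not Endee's
# specific algorithm.
def rrf(dense_ranked_ids, sparse_ranked_ids, k=60):
    scores = {}
    for ranking in (dense_ranked_ids, sparse_ranked_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_top = ["d3", "d1", "d2"]   # ids ranked by cosine similarity
sparse_top = ["d1", "d4", "d3"]  # ids ranked by BM25

print(rrf(dense_top, sparse_top))  # ['d1', 'd3', 'd4', 'd2']
```

Note how "d1" wins overall despite topping only the sparse list: appearing near the top of both rankings beats appearing first in one.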
TOP_K = 5
query_records = []
with open(QUERIES_OUTPUT_PATH, "r", encoding="utf-8") as f:
for line in f:
query_records.append(json.loads(line))
print(f"Loaded {len(query_records):,} queries\n")
def show_results(hits, label=""):
header = f" {'Rank':<5} {'Doc ID':<12} {'Score':<8} Title"
print(f"\n{'─' * len(header)}")
if label:
print(f" {label}")
print(header)
print(f" {'─' * 80}")
for rank, h in enumerate(hits, 1):
title = h["meta"].get("title", h["meta"].get("text", "")[:60])
print(f" {rank:<5} {h['id']:<12} {h['similarity']:<8.4f} {title}")
print()
for qrec in query_records[:3]:
query_text = qrec["meta"]["text"]
query_dense_vec = dense_model.encode(query_text).tolist()
hits = index.query(
vector=query_dense_vec,
sparse_indices=qrec["sparse_vector"]["indices"],
sparse_values=qrec["sparse_vector"]["values"],
top_k=TOP_K,
)
show_results(hits, label=f'Query [{qrec["id"]}]: {query_text}')

Cleanup
Removes the index from Endee. Safe to skip if you want to keep it for further experiments.
try:
client.delete_index(INDEX_NAME)
print(f"Deleted index: {INDEX_NAME}")
except Exception as e:
print(f"Could not delete {INDEX_NAME}: {e}")

Output File Format
Both JSONL files follow the same structure:
// A document record → scifact_corpus.jsonl
{
"id": "4983",
"sparse_vector": {
"indices": [412, 8901, 23445],
"values": [0.82, 1.41, 0.67]
},
"meta": {
"text": "Vitamin D supplementation reduces the risk of ...",
"title": "Vitamin D and Cancer Prevention"
}
}
// A query record → scifact_queries.jsonl
{
"id": "1",
"sparse_vector": {
"indices": [412, 23445],
"values": [1.12, 0.94]
},
"meta": {
"text": "Vitamin D supplementation is beneficial for cancer prevention."
}
}

| Field | What it stores |
|---|---|
| `id` | Original ID from the dataset |
| `sparse_vector.indices` | IDs of words that appear in the text |
| `sparse_vector.values` | BM25 score for each word |
| `meta.text` | The original text, kept for display |
| `meta.title` | Article title (documents only) |
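To see how these records combine at query time, the lexical part of the score reduces to a dot product over the token IDs that the query and document share — a conceptual sketch using the sample numbers above, not Endee's implementation (which also folds in its server-side IDF table):

```python
# Conceptual sketch: the lexical score between a query and a document is
# a dot product over shared token IDs. Values taken from the sample
# records above; Endee computes its real score server-side.
doc_vec = {412: 0.82, 8901: 1.41, 23445: 0.67}  # document sparse vector
query_vec = {412: 1.12, 23445: 0.94}            # query sparse vector

# only tokens present in BOTH vectors contribute; everything else is zero
score = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
print(round(score, 4))  # 1.5482
```

Token 8901 contributes nothing because it never appears in the query — exactly why sparse scoring rewards exact keyword overlap.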
Key Takeaways
- `.embed()` is for documents — full BM25 scoring with word frequency and length adjustment.
- `.query_embed()` is for queries — simplified BM25 with no length penalty, so short queries get fair scores.
- Never swap them — mixing up the two functions produces wrong BM25 scores.
- Sparse means most values are zero — only words that actually appear in the text get a score. Documents have more non-zero tokens than queries because they’re longer.
- The JSONL files are ready to index — load them directly into Endee with `index.upsert()`.
Dataset: BeIR/scifact via HuggingFace Datasets. Model: endee/bm25 via endee-model.