Endee BM25
The endee-model package provides Endee’s BM25 model for generating sparse vector embeddings. These sparse vectors capture keyword/term importance and can be combined with dense embeddings for powerful hybrid search.
Why BM25 + Dense?
| Approach | Finds |
|---|---|
| Dense only | Semantically similar content (different words, same meaning) |
| Sparse (BM25) only | Exact keyword matches |
| Hybrid | Both — best of both worlds |
Query: "vitamin D cancer prevention"
Sparse match → finds docs containing exact words: vitamin, cancer, prevention
Dense match → finds docs about "sun exposure reduces tumour risk" (same meaning, different words)
Hybrid → combines both for best results

Installation

```shell
pip install endee-model
```

Quick Start
```python
from endee_model import SparseModel

# Load the BM25 model
model = SparseModel(model_name="endee/bm25")

# Generate document embeddings (for indexing)
documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
]
doc_embeddings = list(model.embed(documents))

# Generate query embedding (for searching)
query = "vitamin D and cancer prevention"
query_embedding = next(model.query_embed(query))
```

Understanding .embed() vs .query_embed()
BM25 is asymmetric — documents and queries are weighted differently.
| Method | Use For | TF (term frequency) | IDF (rarity) | Length Norm |
|---|---|---|---|---|
| .embed() | Documents/corpus | Yes | Yes | Yes |
| .query_embed() | Search queries | No | Yes | No |
Why the difference?
- Documents benefit from TF weighting — if a term appears multiple times, it’s likely important
- Documents are length-normalized — a long document shouldn’t score higher just because it has more words
- Queries are short (5-10 words typically) — length normalization would unfairly penalize them
- Queries use IDF-only weighting — each term counts once, weighted by how rare it is in the corpus
Rule of thumb: Use .embed() on text you’re storing. Use .query_embed() on text you’re searching with.
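The asymmetry can be illustrated with a toy scorer. This is a minimal sketch using standard BM25 formulas over a made-up two-document corpus — it is not how endee-model computes weights internally:

```python
import math

# Toy tokenized corpus (illustrative only)
corpus = [
    "vitamin d reduces cancer risk".split(),
    "machine learning predicts protein folding".split(),
]
N = len(corpus)
avg_len = sum(len(d) for d in corpus) / N

def idf(term):
    # Rarer terms get higher weight
    df = sum(term in doc for doc in corpus)
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

def doc_weight(term, doc, k1=1.5, b=0.75):
    # Document side: TF saturation plus length normalization
    tf = doc.count(term)
    return idf(term) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))

def query_weight(term):
    # Query side: IDF only — no TF, no length normalization
    return idf(term)

print(doc_weight("cancer", corpus[0]))  # depends on TF and document length
print(query_weight("cancer"))           # depends only on corpus rarity
```

Note that the query weight of a term is the same no matter which query it appears in, while the document weight of the same term shifts with term frequency and document length.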
Working with Sparse Embeddings
The model returns sparse embedding objects with indices and values arrays:
```python
from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

# Single document
doc = "Vitamin D supplementation reduces cancer risk."
embedding = next(model.embed([doc]))

# Access the sparse vector components
print(f"Indices: {embedding.indices[:5]}...")  # Token IDs
print(f"Values: {embedding.values[:5]}...")    # BM25 weights

# Convert to dictionary format (token_id -> weight)
token_weights = embedding.as_dict()
print(f"Non-zero tokens: {len(token_weights)}")
```

Complete Workflow: Endee + BM25
Here’s a complete example combining endee-model with the Endee client for hybrid search:
Step 1: Create a Hybrid Index
```python
from endee import Endee, Precision

client = Endee()

# Create hybrid index with BM25 sparse model
client.create_index(
    name="documents",
    dimension=384,              # Your dense embedding dimension
    sparse_model="endee_bm25",  # Enable BM25 sparse vectors
    space_type="cosine",
    precision=Precision.INT8
)

index = client.get_index(name="documents")
```

Step 2: Generate Embeddings and Upsert
```python
from endee_model import SparseModel

# Load models
sparse_model = SparseModel(model_name="endee/bm25")
# dense_model = your preferred dense embedding model (e.g., sentence-transformers)

documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
    "Regular exercise lowers the risk of cardiovascular disease.",
]

# Generate sparse embeddings
sparse_embeddings = list(sparse_model.embed(documents))

# Prepare vectors for upsert
vectors = []
for i, (doc, sparse_emb) in enumerate(zip(documents, sparse_embeddings)):
    vectors.append({
        "id": f"doc_{i}",
        "vector": [...],  # Your dense embedding here (384-dim)
        "sparse_indices": sparse_emb.indices.tolist(),
        "sparse_values": sparse_emb.values.tolist(),
        "meta": {"text": doc}
    })

# Upsert to Endee
index.upsert(vectors)
```

Step 3: Query with Hybrid Search
```python
query = "vitamin D and cancer prevention"

# Generate query embeddings
query_sparse = next(sparse_model.query_embed(query))
# query_dense = your dense model embedding

# Hybrid query
results = index.query(
    vector=[...],  # Your dense query embedding
    sparse_indices=query_sparse.indices.tolist(),
    sparse_values=query_sparse.values.tolist(),
    top_k=5
)

for item in results:
    print(f"ID: {item['id']}, Similarity: {item['similarity']}")
```

Batch Processing with JSONL
For large datasets, process embeddings in batches and save to JSONL format:
```python
import json
from pathlib import Path

from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

documents = [
    {"id": "1", "text": "First document content...", "title": "Doc 1"},
    {"id": "2", "text": "Second document content...", "title": "Doc 2"},
    # ... more documents
]

output_path = Path("corpus_embeddings.jsonl")

with open(output_path, "w") as f:
    texts = [doc["text"] for doc in documents]
    embeddings = model.embed(texts)
    for doc, emb in zip(documents, embeddings):
        record = {
            "id": doc["id"],
            "sparse_vector": {
                "indices": emb.indices.tolist(),
                "values": emb.values.tolist(),
            },
            "meta": {
                "text": doc["text"],
                "title": doc["title"]
            }
        }
        f.write(json.dumps(record) + "\n")

print(f"Saved {len(documents)} embeddings to {output_path}")
```

JSONL Record Format:
```json
{
  "id": "doc_123",
  "sparse_vector": {
    "indices": [354307472, 794129062, 242156862],
    "values": [1.0887, 1.4566, 1.7527]
  },
  "meta": {
    "text": "Original document text...",
    "title": "Document Title"
  }
}
```

BM25 Scoring Explained
BM25 (Best Match 25) scores documents based on:
- Term Frequency (TF) — How often a term appears in the document (with diminishing returns)
- Inverse Document Frequency (IDF) — How rare the term is across all documents (rare = more informative)
- Length Normalization — Prevents long documents from having an unfair advantage
```
BM25 score = Σ over query terms:

                 IDF(term) × TF(term, doc) × (k1 + 1)
    ──────────────────────────────────────────────────────
    TF(term, doc) + k1 × (1 - b + b × doc_len / avg_len)
```

| Parameter | Meaning | Typical Value |
|---|---|---|
| k1 | TF saturation (diminishing returns) | 1.2–2.0 |
| b | Length normalization strength | 0.75 |
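To see these parameters in action, here is a small self-contained scorer implementing the formula above. This is an illustrative sketch, not the endee-model implementation; the toy corpus and parameter defaults are made up:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against query terms with the BM25 formula."""
    N = len(corpus)
    avg_len = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)              # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # rarity weight
        tf = doc.count(term)                             # term frequency
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score

short_doc = "cancer prevention study".split()
long_doc = ("cancer prevention study " + "filler " * 20).split()
corpus = [short_doc, long_doc]
query = ["cancer", "prevention"]

# With b=0.75 the short document outranks the padded one (same TF, shorter length)
print(bm25_score(query, short_doc, corpus), bm25_score(query, long_doc, corpus))
# With b=0, length normalization is disabled and the two score identically
print(bm25_score(query, short_doc, corpus, b=0), bm25_score(query, long_doc, corpus, b=0))
```

Raising k1 makes repeated terms keep earning score for longer before saturating; raising b penalizes long documents more aggressively.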
Sparse vs Dense Embeddings
| | Sparse (BM25) | Dense (e.g. MiniLM) |
|---|---|---|
| Vector size | 30,000+ dimensions (vocab size) | 384–1536 dimensions |
| Non-zero entries | Only tokens in the text | All entries non-zero |
| Dimension meaning | Weight of a specific vocabulary token | Learned semantic feature |
| Matches | Exact keywords | Semantic/conceptual similarity |
| Example | {"cancer": 1.7, "treatment": 1.2} | [0.12, -0.34, 0.89, ...] |
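The two representations are also scored differently at query time. The sketch below computes a sparse dot product (only shared indices contribute), a dense cosine similarity, and then fuses them with a weighted sum. All vectors and the fusion weight alpha are made up for illustration, and this is not necessarily how Endee combines the two signals internally:

```python
import numpy as np

# Sparse vectors as parallel (indices, values) arrays
doc_idx = np.array([101, 205, 309]); doc_val = np.array([1.7, 1.2, 0.4])
qry_idx = np.array([101, 309, 777]); qry_val = np.array([0.9, 0.5, 1.1])

def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors: only shared indices contribute."""
    a = dict(zip(idx_a.tolist(), val_a.tolist()))
    return sum(a.get(i, 0.0) * v for i, v in zip(idx_b.tolist(), val_b.tolist()))

sparse_score = sparse_dot(doc_idx, doc_val, qry_idx, qry_val)

# Dense cosine similarity on toy 4-dim vectors
doc_dense = np.array([0.12, -0.34, 0.89, 0.05])
qry_dense = np.array([0.10, -0.30, 0.85, 0.00])
dense_score = float(doc_dense @ qry_dense /
                    (np.linalg.norm(doc_dense) * np.linalg.norm(qry_dense)))

# One common fusion strategy: convex combination (alpha is a tuning knob)
alpha = 0.5
hybrid_score = alpha * dense_score + (1 - alpha) * sparse_score
print(sparse_score, dense_score, hybrid_score)
```

Only indices 101 and 309 appear in both sparse vectors, so only those two terms contribute to the sparse score — exactly the "exact keyword match" behavior from the table above.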
API Reference
SparseModel
```python
model = SparseModel(model_name="endee/bm25")
```

| Method | Description |
|---|---|
| .embed(texts) | Generate document embeddings (TF × IDF + length norm) |
| .query_embed(text) | Generate query embedding (IDF only) |
SparseEmbedding
| Property/Method | Description |
|---|---|
| .indices | NumPy array of non-zero token IDs |
| .values | NumPy array of BM25 weights |
| .as_dict() | Dictionary mapping token_id → weight |
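To make the data shapes concrete, here is a minimal mock with the same surface as the table above. It is an illustrative stand-in only, not the real SparseEmbedding class:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class MockSparseEmbedding:
    """Illustrative stand-in mirroring the SparseEmbedding surface."""
    indices: np.ndarray  # non-zero token IDs
    values: np.ndarray   # BM25 weights

    def as_dict(self):
        # token_id -> weight
        return dict(zip(self.indices.tolist(), self.values.tolist()))

emb = MockSparseEmbedding(
    indices=np.array([354307472, 794129062]),
    values=np.array([1.0887, 1.4566]),
)
print(emb.as_dict())
```

The parallel-array layout is what makes sparse vectors compact: only the handful of tokens that actually occur in the text are stored, rather than a full vocabulary-sized vector.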