Endee BM25
The endee-model package provides Endee’s BM25 model for generating sparse vector embeddings. These sparse vectors capture keyword/term importance and can be combined with dense embeddings for powerful hybrid search.
Why BM25 + Dense?
| Approach | Finds |
|---|---|
| Dense only | Semantically similar content (different words, same meaning) |
| Sparse (BM25) only | Exact keyword matches |
| Hybrid | Both — best of both worlds |
Query: "vitamin D cancer prevention"
Sparse match → finds docs containing exact words: vitamin, cancer, prevention
Dense match → finds docs about "sun exposure reduces tumour risk" (same meaning, different words)
Hybrid → combines both for best results

Installation

```shell
pip install endee-model
```

Quick Start
```python
from endee_model import SparseModel

# Load the BM25 model
model = SparseModel(model_name="endee/bm25")

# Generate document embeddings (for indexing)
documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
]
doc_embeddings = list(model.embed(documents))

# Generate query embedding (for searching)
query = "vitamin D and cancer prevention"
query_embedding = next(model.query_embed(query))
```

Understanding .embed() vs .query_embed()
BM25 is asymmetric — documents and queries are weighted differently.
| Method | Use For | TF (term frequency) | IDF (rarity) | Length Norm |
|---|---|---|---|---|
| .embed() | Documents/corpus | Yes | Yes | Yes |
| .query_embed() | Search queries | No | Yes | No |
Why the difference?
- Documents benefit from TF weighting — if a term appears multiple times, it’s likely important
- Documents are length-normalized — a long document shouldn’t score higher just because it has more words
- Queries are short (5-10 words typically) — length normalization would unfairly penalize them
- Queries use IDF-only weighting — each term counts once, weighted by how rare it is in the corpus
Rule of thumb: Use .embed() on text you’re storing. Use .query_embed() on text you’re searching with.
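The asymmetry can be illustrated with a toy scorer. This is a minimal sketch using standard BM25 formulas over a made-up two-document corpus — it is not how endee-model computes weights internally:

```python
import math

# Toy tokenized corpus (illustrative only)
corpus = [
    "vitamin d reduces cancer risk".split(),
    "machine learning predicts protein folding".split(),
]
N = len(corpus)
avg_len = sum(len(d) for d in corpus) / N

def idf(term):
    # Rarer terms get higher weight
    df = sum(term in doc for doc in corpus)
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

def doc_weight(term, doc, k1=1.5, b=0.75):
    # Document side: TF saturation plus length normalization
    tf = doc.count(term)
    return idf(term) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))

def query_weight(term):
    # Query side: IDF only — no TF, no length normalization
    return idf(term)

print(doc_weight("cancer", corpus[0]))  # depends on TF and document length
print(query_weight("cancer"))           # depends only on corpus rarity
```

Note that the query weight of a term is the same no matter which query it appears in, while the document weight of the same term shifts with term frequency and document length.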
Working with Sparse Embeddings
The model returns sparse embedding objects with indices and values arrays:
```python
from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

# Single document
doc = "Vitamin D supplementation reduces cancer risk."
embedding = next(model.embed([doc]))

# Access the sparse vector components
print(f"Indices: {embedding.indices[:5]}...")  # Token IDs
print(f"Values: {embedding.values[:5]}...")    # BM25 weights

# Convert to dictionary format (token_id -> weight)
token_weights = embedding.as_dict()
print(f"Non-zero tokens: {len(token_weights)}")
```

Complete Workflow: Endee + BM25
Here’s a complete example combining endee-model with the Endee client for hybrid search:
Step 1: Create a Hybrid Index
```python
from endee import Endee, Precision

client = Endee()

# Create hybrid index with BM25 sparse model
client.create_index(
    name="documents",
    dimension=384,              # Your dense embedding dimension
    sparse_model="endee_bm25",  # Enable BM25 sparse vectors
    space_type="cosine",
    precision=Precision.INT8
)

index = client.get_index(name="documents")
```

Step 2: Generate Embeddings and Upsert
```python
from endee_model import SparseModel

# Load models
sparse_model = SparseModel(model_name="endee/bm25")
# dense_model = your preferred dense embedding model (e.g., sentence-transformers)

documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
    "Regular exercise lowers the risk of cardiovascular disease.",
]

# Generate sparse embeddings
sparse_embeddings = list(sparse_model.embed(documents))

# Prepare vectors for upsert
vectors = []
for i, (doc, sparse_emb) in enumerate(zip(documents, sparse_embeddings)):
    vectors.append({
        "id": f"doc_{i}",
        "vector": [...],  # Your dense embedding here (384-dim)
        "sparse_indices": sparse_emb.indices.tolist(),
        "sparse_values": sparse_emb.values.tolist(),
        "meta": {"text": doc}
    })

# Upsert to Endee
index.upsert(vectors)
```

Step 3: Query with Hybrid Search
```python
query = "vitamin D and cancer prevention"

# Generate query embeddings
query_sparse = next(sparse_model.query_embed(query))
# query_dense = your dense model embedding

# Hybrid query
results = index.query(
    vector=[...],  # Your dense query embedding
    sparse_indices=query_sparse.indices.tolist(),
    sparse_values=query_sparse.values.tolist(),
    top_k=5
)

for item in results:
    print(f"ID: {item['id']}, Similarity: {item['similarity']}")
```

Batch Processing with JSONL
For large datasets, process embeddings in batches and save to JSONL format:
```python
import json
from pathlib import Path

from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

documents = [
    {"id": "1", "text": "First document content...", "title": "Doc 1"},
    {"id": "2", "text": "Second document content...", "title": "Doc 2"},
    # ... more documents
]

output_path = Path("corpus_embeddings.jsonl")

with open(output_path, "w") as f:
    texts = [doc["text"] for doc in documents]
    embeddings = model.embed(texts)
    for doc, emb in zip(documents, embeddings):
        record = {
            "id": doc["id"],
            "sparse_vector": {
                "indices": emb.indices.tolist(),
                "values": emb.values.tolist(),
            },
            "meta": {
                "text": doc["text"],
                "title": doc["title"]
            }
        }
        f.write(json.dumps(record) + "\n")

print(f"Saved {len(documents)} embeddings to {output_path}")
```

JSONL Record Format:
```json
{
  "id": "doc_123",
  "sparse_vector": {
    "indices": [354307472, 794129062, 242156862],
    "values": [1.0887, 1.4566, 1.7527]
  },
  "meta": {
    "text": "Original document text...",
    "title": "Document Title"
  }
}
```

BM25 Scoring Explained
BM25 (Best Match 25) scores documents based on:
- Term Frequency (TF) — How often a term appears in the document (with diminishing returns)
- Inverse Document Frequency (IDF) — How rare the term is across all documents (rare = more informative)
- Length Normalization — Prevents long documents from having an unfair advantage
```
BM25 score = Σ over query terms:

                 IDF(term) × TF(term, doc) × (k1 + 1)
    ──────────────────────────────────────────────────────
    TF(term, doc) + k1 × (1 - b + b × doc_len / avg_len)
```

| Parameter | Meaning | Typical Value |
|---|---|---|
| k1 | TF saturation (diminishing returns) | 1.2–2.0 |
| b | Length normalization strength | 0.75 |
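To see these parameters in action, here is a small self-contained scorer implementing the formula above. This is an illustrative sketch, not the endee-model implementation; the toy corpus and parameter defaults are made up:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against query terms with the BM25 formula."""
    N = len(corpus)
    avg_len = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)              # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # rarity weight
        tf = doc.count(term)                             # term frequency
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score

short_doc = "cancer prevention study".split()
long_doc = ("cancer prevention study " + "filler " * 20).split()
corpus = [short_doc, long_doc]
query = ["cancer", "prevention"]

# With b=0.75 the short document outranks the padded one (same TF, shorter length)
print(bm25_score(query, short_doc, corpus), bm25_score(query, long_doc, corpus))
# With b=0, length normalization is disabled and the two score identically
print(bm25_score(query, short_doc, corpus, b=0), bm25_score(query, long_doc, corpus, b=0))
```

Raising k1 makes repeated terms keep earning score for longer before saturating; raising b penalizes long documents more aggressively.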
Sparse vs Dense Embeddings
| | Sparse (BM25) | Dense (e.g. MiniLM) |
|---|---|---|
| Vector size | 30,000+ dimensions (vocab size) | 384–1536 dimensions |
| Non-zero entries | Only tokens in the text | All entries non-zero |
| Dimension meaning | Weight of a specific vocabulary token | Learned semantic feature |
| Matches | Exact keywords | Semantic/conceptual similarity |
| Example | {"cancer": 1.7, "treatment": 1.2} | [0.12, -0.34, 0.89, ...] |
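The two representations are also scored differently at query time. The sketch below computes a sparse dot product (only shared indices contribute), a dense cosine similarity, and then fuses them with a weighted sum. All vectors and the fusion weight alpha are made up for illustration, and this is not necessarily how Endee combines the two signals internally:

```python
import numpy as np

# Sparse vectors as parallel (indices, values) arrays
doc_idx = np.array([101, 205, 309]); doc_val = np.array([1.7, 1.2, 0.4])
qry_idx = np.array([101, 309, 777]); qry_val = np.array([0.9, 0.5, 1.1])

def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors: only shared indices contribute."""
    a = dict(zip(idx_a.tolist(), val_a.tolist()))
    return sum(a.get(i, 0.0) * v for i, v in zip(idx_b.tolist(), val_b.tolist()))

sparse_score = sparse_dot(doc_idx, doc_val, qry_idx, qry_val)

# Dense cosine similarity on toy 4-dim vectors
doc_dense = np.array([0.12, -0.34, 0.89, 0.05])
qry_dense = np.array([0.10, -0.30, 0.85, 0.00])
dense_score = float(doc_dense @ qry_dense /
                    (np.linalg.norm(doc_dense) * np.linalg.norm(qry_dense)))

# One common fusion strategy: convex combination (alpha is a tuning knob)
alpha = 0.5
hybrid_score = alpha * dense_score + (1 - alpha) * sparse_score
print(sparse_score, dense_score, hybrid_score)
```

Only indices 101 and 309 appear in both sparse vectors, so only those two terms contribute to the sparse score — exactly the "exact keyword match" behavior from the table above.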
API Reference
SparseModel
```python
model = SparseModel(model_name="endee/bm25")
```

| Method | Description |
|---|---|
| .embed(texts) | Generate document embeddings (TF × IDF + length norm) |
| .query_embed(text) | Generate query embedding (IDF only) |
SparseEmbedding
| Property/Method | Description |
|---|---|
| .indices | NumPy array of non-zero token IDs |
| .values | NumPy array of BM25 weights |
| .as_dict() | Dictionary mapping token_id → weight |
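To make the data shapes concrete, here is a minimal mock with the same surface as the table above. It is an illustrative stand-in only, not the real SparseEmbedding class:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class MockSparseEmbedding:
    """Illustrative stand-in mirroring the SparseEmbedding surface."""
    indices: np.ndarray  # non-zero token IDs
    values: np.ndarray   # BM25 weights

    def as_dict(self):
        # token_id -> weight
        return dict(zip(self.indices.tolist(), self.values.tolist()))

emb = MockSparseEmbedding(
    indices=np.array([354307472, 794129062]),
    values=np.array([1.0887, 1.4566]),
)
print(emb.as_dict())
```

The parallel-array layout is what makes sparse vectors compact: only the handful of tokens that actually occur in the text are stored, rather than a full vocabulary-sized vector.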