
Endee BM25

The endee-model package provides Endee’s BM25 model for generating sparse vector embeddings. These sparse vectors capture keyword/term importance and can be combined with dense embeddings for powerful hybrid search.

Why BM25 + Dense?

| Approach | Finds |
|---|---|
| Dense only | Semantically similar content (different words, same meaning) |
| Sparse (BM25) only | Exact keyword matches |
| Hybrid | Both: the best of both worlds |
Query: "vitamin D cancer prevention"

  • Sparse match → finds docs containing the exact words: vitamin, cancer, prevention
  • Dense match → finds docs about "sun exposure reduces tumour risk" (same meaning, different words)
  • Hybrid → combines both for best results

Installation

```shell
pip install endee-model
```

Quick Start

```python
from endee_model import SparseModel

# Load the BM25 model
model = SparseModel(model_name="endee/bm25")

# Generate document embeddings (for indexing)
documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
]
doc_embeddings = list(model.embed(documents))

# Generate query embedding (for searching)
query = "vitamin D and cancer prevention"
query_embedding = next(model.query_embed(query))
```

Understanding .embed() vs .query_embed()

BM25 is asymmetric — documents and queries are weighted differently.

| Method | Use For | TF (term frequency) | IDF (rarity) | Length Norm |
|---|---|---|---|---|
| .embed() | Documents/corpus | Yes | Yes | Yes |
| .query_embed() | Search queries | No | Yes | No |

Why the difference?

  • Documents benefit from TF weighting — if a term appears multiple times, it’s likely important
  • Documents are length-normalized — a long document shouldn’t score higher just because it has more words
  • Queries are short (5-10 words typically) — length normalization would unfairly penalize them
  • Queries use IDF-only weighting — each query term gets equal importance

Rule of thumb: Use .embed() on text you’re storing. Use .query_embed() on text you’re searching with.
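The asymmetry above can be sketched in plain Python. This is a simplified illustration, not the endee-model internals; `K1` and `B` are the usual BM25 defaults, and `doc_weight`/`query_weight` are hypothetical helper names:

```python
K1, B = 1.5, 0.75  # common BM25 defaults (assumed, not endee-specific)

def doc_weight(tf, idf, doc_len, avg_len):
    """Document-side weight: TF saturation plus length normalization."""
    return idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * doc_len / avg_len))

def query_weight(idf):
    """Query-side weight: IDF only; no TF, no length normalization."""
    return idf

idf = 2.0
# A term repeated 3 times in an average-length doc scores higher than a
# single occurrence, but with diminishing returns (not 3x the weight):
print(doc_weight(1, idf, 100, 100))  # 2.0
print(doc_weight(3, idf, 100, 100))  # ~3.33, not 6.0
print(query_weight(idf))             # 2.0
```

Note how a longer document (larger `doc_len`) shrinks the document-side weight, while the query side ignores length entirely.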

Working with Sparse Embeddings

The model returns sparse embedding objects with indices and values arrays:

```python
from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

# Single document
doc = "Vitamin D supplementation reduces cancer risk."
embedding = next(model.embed([doc]))

# Access the sparse vector components
print(f"Indices: {embedding.indices[:5]}...")  # Token positions
print(f"Values: {embedding.values[:5]}...")    # BM25 weights

# Convert to dictionary format (token_id -> weight)
token_weights = embedding.as_dict()
print(f"Non-zero tokens: {len(token_weights)}")
```
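If you need to manipulate the indices/values layout outside the library, the conversion to a dict is a simple zip. The `SparseVec` class below is a hypothetical stand-in mirroring the `SparseEmbedding` interface, so the example runs without the model:

```python
# Hypothetical stand-in for SparseEmbedding: parallel arrays of
# token indices and BM25 weights (illustration only).
class SparseVec:
    def __init__(self, indices, values):
        assert len(indices) == len(values)
        self.indices = indices
        self.values = values

    def as_dict(self):
        """Map token_id -> weight, like SparseEmbedding.as_dict()."""
        return dict(zip(self.indices, self.values))

emb = SparseVec([354307472, 794129062, 242156862], [1.0887, 1.4566, 1.7527])
weights = emb.as_dict()

# Inspect the highest-weighted tokens first:
top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
print(top[0])  # (242156862, 1.7527)
```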

Complete Workflow: Endee + BM25

Here’s a complete example combining endee-model with the Endee client for hybrid search:

Step 1: Create a Hybrid Index

```python
from endee import Endee, Precision

client = Endee()

# Create hybrid index with BM25 sparse model
client.create_index(
    name="documents",
    dimension=384,              # Your dense embedding dimension
    sparse_model="endee_bm25",  # Enable BM25 sparse vectors
    space_type="cosine",
    precision=Precision.INT8,
)

index = client.get_index(name="documents")
```

Step 2: Generate Embeddings and Upsert

```python
from endee_model import SparseModel

# Load models
sparse_model = SparseModel(model_name="endee/bm25")
# dense_model = your preferred dense embedding model (e.g., sentence-transformers)

documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
    "Regular exercise lowers the risk of cardiovascular disease.",
]

# Generate sparse embeddings
sparse_embeddings = list(sparse_model.embed(documents))

# Prepare vectors for upsert
vectors = []
for i, (doc, sparse_emb) in enumerate(zip(documents, sparse_embeddings)):
    vectors.append({
        "id": f"doc_{i}",
        "vector": [...],  # Your dense embedding here (384-dim)
        "sparse_indices": sparse_emb.indices.tolist(),
        "sparse_values": sparse_emb.values.tolist(),
        "meta": {"text": doc},
    })

# Upsert to Endee
index.upsert(vectors)
```
Step 3: Hybrid Query

```python
query = "vitamin D and cancer prevention"

# Generate query embeddings
query_sparse = next(sparse_model.query_embed(query))
# query_dense = your dense model embedding

# Hybrid query
results = index.query(
    vector=[...],  # Your dense query embedding
    sparse_indices=query_sparse.indices.tolist(),
    sparse_values=query_sparse.values.tolist(),
    top_k=5,
)

for item in results:
    print(f"ID: {item['id']}, Similarity: {item['similarity']}")
```
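Endee computes the sparse side of the hybrid score server-side, but the underlying operation is just a dot product over the token ids shared by query and document. A self-contained sketch (illustrative only, with made-up ids and weights):

```python
def sparse_dot(q_indices, q_values, d_indices, d_values):
    """Dot product of two sparse vectors given as parallel index/value
    arrays. Only token ids present in BOTH vectors contribute."""
    d = dict(zip(d_indices, d_values))
    return sum(qv * d[qi] for qi, qv in zip(q_indices, q_values) if qi in d)

# Query terms (IDF-only weights) vs. document terms (full BM25 weights):
score = sparse_dot(
    [101, 202, 303], [1.2, 0.8, 2.0],  # query: token ids + weights
    [101, 303, 404], [1.5, 0.5, 3.0],  # document: token ids + weights
)
print(score)  # 1.2*1.5 + 2.0*0.5 = 2.8 (ids 202 and 404 never match)
```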

Batch Processing with JSONL

For large datasets, process embeddings in batches and save to JSONL format:

```python
import json
from pathlib import Path

from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

documents = [
    {"id": "1", "text": "First document content...", "title": "Doc 1"},
    {"id": "2", "text": "Second document content...", "title": "Doc 2"},
    # ... more documents
]

output_path = Path("corpus_embeddings.jsonl")

with open(output_path, "w") as f:
    texts = [doc["text"] for doc in documents]
    embeddings = model.embed(texts)
    for doc, emb in zip(documents, embeddings):
        record = {
            "id": doc["id"],
            "sparse_vector": {
                "indices": emb.indices.tolist(),
                "values": emb.values.tolist(),
            },
            "meta": {"text": doc["text"], "title": doc["title"]},
        }
        f.write(json.dumps(record) + "\n")

print(f"Saved {len(documents)} embeddings to {output_path}")
```
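Reading the JSONL back for upserting can be done in fixed-size batches with only the standard library. A minimal sketch; `iter_batches` is a hypothetical helper and the batch size is arbitrary:

```python
import json

def iter_batches(path, batch_size=100):
    """Yield lists of parsed JSONL records, batch_size at a time."""
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final partial batch

# Example usage against an Endee index (sketch):
# for batch in iter_batches("corpus_embeddings.jsonl", batch_size=200):
#     vectors = [
#         {
#             "id": r["id"],
#             "sparse_indices": r["sparse_vector"]["indices"],
#             "sparse_values": r["sparse_vector"]["values"],
#             "meta": r["meta"],
#         }
#         for r in batch
#     ]
#     index.upsert(vectors)  # plus your dense vectors
```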

JSONL Record Format:

```json
{
  "id": "doc_123",
  "sparse_vector": {
    "indices": [354307472, 794129062, 242156862],
    "values": [1.0887, 1.4566, 1.7527]
  },
  "meta": {
    "text": "Original document text...",
    "title": "Document Title"
  }
}
```

BM25 Scoring Explained

BM25 (Best Match 25) scores documents based on:

  1. Term Frequency (TF) — How often a term appears in the document (with diminishing returns)
  2. Inverse Document Frequency (IDF) — How rare the term is across all documents (rare = more informative)
  3. Length Normalization — Prevents long documents from having an unfair advantage
```text
                            TF(term, doc) × (k1 + 1)
BM25 score = Σ  IDF(term) × ──────────────────────────────────────────────────
           term             TF(term, doc) + k1 × (1 − b + b × doc_len/avg_len)
```
| Parameter | Meaning | Typical Value |
|---|---|---|
| k1 | TF saturation (diminishing returns) | 1.2–2.0 |
| b | Length normalization strength | 0.75 |
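Putting the formula and parameters together, here is a worked example over a toy corpus. This is a simplified sketch (real implementations, including endee-model's, may differ in tokenization and IDF smoothing details):

```python
import math

K1, B = 1.5, 0.75  # typical defaults from the table above

def bm25_score(query_terms, doc, corpus):
    """Score one tokenized document against a query over a tiny corpus."""
    n = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)  # document frequency
        if df == 0:
            continue  # term unseen in corpus: no IDF, no contribution
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))  # smoothed IDF
        tf = doc.count(term)
        score += idf * tf * (K1 + 1) / (
            tf + K1 * (1 - B + B * len(doc) / avg_len)
        )
    return score

corpus = [
    "vitamin d reduces cancer risk".split(),
    "machine learning predicts protein folding".split(),
    "exercise lowers cardiovascular risk".split(),
]
query = "vitamin cancer".split()
scores = [bm25_score(query, d, corpus) for d in corpus]
print(scores)  # only the first document matches any query term
```

Documents containing none of the query terms score exactly zero, which is what makes the resulting vectors sparse.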

Sparse vs Dense Embeddings

|  | Sparse (BM25) | Dense (e.g. MiniLM) |
|---|---|---|
| Vector size | 30,000+ dimensions (vocab size) | 384–1536 dimensions |
| Non-zero entries | Only tokens in the text | All entries non-zero |
| Dimension meaning | Weight of a specific vocabulary token | Learned semantic feature |
| Matches | Exact keywords | Semantic/conceptual similarity |
| Example | `{"cancer": 1.7, "treatment": 1.2}` | `[0.12, -0.34, 0.89, ...]` |

API Reference

SparseModel

```python
model = SparseModel(model_name="endee/bm25")
```

| Method | Description |
|---|---|
| .embed(texts) | Generate document embeddings (TF × IDF + length norm) |
| .query_embed(text) | Generate query embedding (IDF only) |

SparseEmbedding

| Property/Method | Description |
|---|---|
| .indices | NumPy array of non-zero token positions |
| .values | NumPy array of BM25 weights |
| .as_dict() | Dictionary mapping token_id → weight |