Sparse Vectors (BM25)
The endee-model package provides Endee’s BM25 model for generating sparse vector embeddings. These sparse vectors capture keyword/term importance and are used in hybrid search alongside dense embeddings.
Why Sparse Vectors?
| Approach | Finds |
|---|---|
| Dense only | Semantically similar content (different words, similar meaning) |
| Sparse (BM25) only | Exact keyword matches |
| Hybrid | Both: exact keyword matches and semantically similar content |
Query: "vitamin D cancer prevention"
Dense match → finds docs about "sun exposure reduces tumour risk" (similar meaning)
Sparse match → finds docs containing exact words: vitamin, cancer, prevention
Hybrid → combines both for best results
Installation
Python
pip install endee-model
The endee-model BM25 sparse embedding library is available for Python and TypeScript. You can also bring your own sparse vectors generated from any BM25 implementation.
The sparse_model you set at collection creation controls how Endee interprets these values: use "endee_bm25" to send TF weights only (Endee applies IDF server-side), or use "default" to send final scores as-is for SPLADE or custom BM25 models.
Quick Start
Python
from endee_model import SparseModel
# Load the BM25 model
model = SparseModel(model_name="endee/bm25")
# Generate document embeddings (for indexing)
documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
]
doc_embeddings = list(model.embed(documents))
# Generate query embedding (for searching)
query = "vitamin D and cancer prevention"
query_embedding = next(model.query_embed(query))
.embed() vs .query_embed()
BM25 is asymmetric: documents and queries are weighted differently.
| Method | Use For | TF (term frequency) | IDF (rarity) | Length Norm |
|---|---|---|---|---|
| .embed() | Documents/corpus | Yes | Yes | Yes |
| .query_embed() | Search queries | No | Yes | No |
- Documents benefit from TF weighting: if a term appears multiple times, it’s likely important
- Documents are length-normalized: a long document shouldn’t score higher just because it has more words
- Queries use IDF-only weighting: term frequency is ignored, so every query term starts with the same weight and only its rarity (IDF) changes its contribution
Rule of thumb: Use .embed() on text you’re storing. Use .query_embed() on text you’re searching with.
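The asymmetry can be illustrated with a toy calculation. The sketch below is not the endee-model implementation: the helper names, the whitespace tokenizer, and the values k1 = 1.2, b = 0.75, and avg_len are assumptions chosen for illustration. It gives document terms a saturating, length-normalized TF weight and gives every query term a weight of 1.0.

```python
from collections import Counter

K1 = 1.2   # assumed TF saturation parameter
B = 0.75   # assumed length-normalization strength


def doc_weights(text: str, avg_len: float) -> dict[str, float]:
    """Document side: saturating TF with length normalization (no IDF here)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    norm = K1 * (1 - B + B * len(tokens) / avg_len)
    return {t: tf * (K1 + 1) / (tf + norm) for t, tf in counts.items()}


def query_weights(text: str) -> dict[str, float]:
    """Query side: every distinct term gets the same weight."""
    return {t: 1.0 for t in set(text.lower().split())}


doc = doc_weights("vitamin d vitamin d cancer risk", avg_len=6)
query = query_weights("vitamin cancer cancer")

print(doc)    # "vitamin" outweighs "cancer" because it appears twice
print(query)  # both terms weigh 1.0, even though "cancer" is repeated
```

Repeating a word in the query changes nothing, while repeating it in a document raises its weight by less than the raw count would suggest.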
Working with Sparse Embeddings
The model returns sparse embedding objects with indices and values arrays:
Python
from endee_model import SparseModel
model = SparseModel(model_name="endee/bm25")
doc = "Vitamin D supplementation reduces cancer risk."
embedding = next(model.embed([doc]))
# Access the sparse vector components
print(f"Indices: {embedding.indices[:5]}...") # Token positions
print(f"Values: {embedding.values[:5]}...") # BM25 weights
# Convert to dictionary format (token_id -> weight)
token_weights = embedding.as_dict()
print(f"Non-zero tokens: {len(token_weights)}")
Complete Workflow
Create a Hybrid Collection
Python
from endee import Endee, Precision
client = Endee()
client.create_collection(
    name="documents",
    dimension=384,  # Your dense embedding dimension
    sparse_model="endee_bm25",  # Enable BM25 sparse vectors
    space_type="cosine",
    precision=Precision.INT8
)
collection = client.get_collection(name="documents")
The sparse_model parameter
For sparse_model you have two options depending on which sparse model you use:
- sparse_model="endee_bm25": use this when your sparse objects come from endee/bm25. Endee holds the IDF weights on its server and applies them automatically, so you only need to send the TF weights from your client.
- sparse_model="default": use this for SPLADE models or any other BM25 model. In this case Endee treats the values you send as final scores and does no further calculation. If you are using another BM25 model (not endee/bm25), you must compute the full scores, including IDF, yourself on the client before sending them.
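For sparse_model="default" with your own BM25 model, the values you send must already include IDF. A minimal sketch of that client-side step follows. The function names and the smoothed IDF variant are assumptions, not part of endee-model, and the token IDs and corpus statistics are made up.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Smoothed BM25 IDF (assumed variant): rarer terms score higher."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)


def apply_idf(tf_weights: dict[int, float],
              doc_freqs: dict[int, int],
              num_docs: int) -> dict[int, float]:
    """Turn TF-only weights into final scores for sparse_model="default"."""
    return {
        token_id: weight * idf(doc_freqs.get(token_id, 0), num_docs)
        for token_id, weight in tf_weights.items()
    }


# Hypothetical corpus statistics: token 7 is rare, token 42 is common.
tf_weights = {7: 1.0, 42: 1.0}
doc_freqs = {7: 2, 42: 900}
final_scores = apply_idf(tf_weights, doc_freqs, num_docs=1000)

# Parallel lists in the shape used for sparse_indices / sparse_values.
sparse_indices = list(final_scores.keys())
sparse_values = list(final_scores.values())
print(sparse_indices, sparse_values)
```

With sparse_model="endee_bm25" this step is unnecessary, because Endee applies IDF on the server.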
Generate Embeddings and Upsert
Python
from endee_model import SparseModel
sparse_model = SparseModel(model_name="endee/bm25")
# dense_model = your preferred dense embedding model (e.g., sentence-transformers)
documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
    "Regular exercise lowers the risk of cardiovascular disease.",
]
sparse_embeddings = list(sparse_model.embed(documents))
vectors = []
for i, (doc, sparse_emb) in enumerate(zip(documents, sparse_embeddings)):
    vectors.append({
        "id": f"doc_{i}",
        "vector": [...],  # Your dense embedding here (384-dim)
        "sparse_indices": sparse_emb.indices.tolist(),
        "sparse_values": sparse_emb.values.tolist(),
        "meta": {"text": doc}
    })
collection.upsert(vectors)
Query with Hybrid Search
Python
query = "vitamin D and cancer prevention"
query_sparse = next(sparse_model.query_embed(query))
# query_dense = dense_model.encode(query)
results = collection.query(
    vector=[...],  # Your dense query embedding
    sparse_indices=query_sparse.indices.tolist(),
    sparse_values=query_sparse.values.tolist(),
    top_k=5
)
for item in results:
    print(f"ID: {item['id']}, Similarity: {item['similarity']:.3f}")
Batch Processing
For large datasets, process embeddings in batches and save to JSONL format:
Python
import json
from pathlib import Path
from endee_model import SparseModel
model = SparseModel(model_name="endee/bm25")
documents = [
    {"id": "1", "text": "First document content...", "title": "Doc 1"},
    {"id": "2", "text": "Second document content...", "title": "Doc 2"},
]
output_path = Path("corpus_embeddings.jsonl")
with open(output_path, "w") as f:
    texts = [doc["text"] for doc in documents]
    embeddings = model.embed(texts)
    # Write one JSON record per line
    for doc, emb in zip(documents, embeddings):
        record = {
            "id": doc["id"],
            "sparse_vector": {
                "indices": emb.indices.tolist(),
                "values": emb.values.tolist(),
            },
            "meta": {"text": doc["text"], "title": doc["title"]}
        }
        f.write(json.dumps(record) + "\n")
JSONL record format:
{
  "id": "doc_123",
  "sparse_vector": {
    "indices": [354307472, 794129062, 242156862],
    "values": [1.0887, 1.4566, 1.7527]
  },
  "meta": {
    "text": "Original document text...",
    "title": "Document Title"
  }
}
BM25 Scoring
BM25 (Best Matching 25) scores documents based on:
- Term Frequency (TF): how often a term appears (with diminishing returns)
- Inverse Document Frequency (IDF): how rare the term is across the corpus (rare = more informative)
- Length Normalization: prevents long documents from having an unfair advantage
| Parameter | Meaning | Typical Value |
|---|---|---|
| k1 | TF saturation (diminishing returns) | 1.2–2.0 |
| b | Length normalization strength | 0.75 |
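The effect of k1 and b can be seen in the standard BM25 term-frequency factor. The sketch below uses the textbook formula with assumed defaults (k1 = 1.2, b = 0.75); the function name is hypothetical, and the source does not state which exact variant or parameter values endee/bm25 uses. In full BM25, this factor is multiplied by the term's IDF and summed over the query terms.

```python
def tf_component(tf: int, doc_len: int, avg_len: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 TF factor: tf*(k1+1) / (tf + k1*(1 - b + b*doc_len/avg_len))."""
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return tf * (k1 + 1.0) / (tf + norm)


# Diminishing returns: each extra occurrence adds less than the previous one,
# and the factor never exceeds k1 + 1.
for tf in (1, 2, 5, 10):
    print(tf, round(tf_component(tf, doc_len=100, avg_len=100), 3))

# Length normalization: the same tf counts for less in a longer document.
short_doc = tf_component(3, doc_len=50, avg_len=100)
long_doc = tf_component(3, doc_len=400, avg_len=100)
print(round(short_doc, 3), round(long_doc, 3))

# With b = 0, document length has no effect.
no_norm_short = tf_component(3, doc_len=50, avg_len=100, b=0.0)
no_norm_long = tf_component(3, doc_len=400, avg_len=100, b=0.0)
print(no_norm_short == no_norm_long)
```

A larger k1 lets repeated terms keep adding weight for longer before saturating, and a larger b penalizes long documents more strongly.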
Sparse vs Dense Comparison
| | Sparse (BM25) | Dense (e.g. MiniLM) |
|---|---|---|
| Object size | 30,000+ dimensions (vocab size) | 384–1536 dimensions |
| Non-zero entries | Only tokens present in the text | All entries non-zero |
| Dimension meaning | Weight of a specific vocabulary token | Learned semantic feature |
| Matches | Exact keywords | Semantic/conceptual similarity |
| Example | {"cancer": 1.7, "treatment": 1.2} | [0.12, -0.34, 0.89, ...] |
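The difference in how the two representations are compared can be sketched in plain Python. This illustrates the general idea only and makes no claim about how Endee scores vectors internally; the helper names, token IDs, and weights are made up.

```python
def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {token_id: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[t] for t, w in a.items() if t in b)


def dense_dot(a: list[float], b: list[float]) -> float:
    """Dot product of two dense vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical token IDs and weights.
doc_sparse = {101: 1.7, 202: 1.2, 303: 0.9}
query_sparse = {101: 1.0, 404: 1.0}
print(sparse_dot(doc_sparse, query_sparse))  # only token 101 overlaps

doc_dense = [0.12, -0.34, 0.89]
query_dense = [0.10, -0.30, 0.80]
print(dense_dot(doc_dense, query_dense))  # every dimension contributes
```

Two texts with no tokens in common have a sparse score of zero, which is why sparse matching alone misses paraphrases that dense embeddings can still catch.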
SDK Reference
Model Class
Python
from endee_model import SparseModel
model = SparseModel(model_name="endee/bm25")
Methods
Python
| Method | Signature | Returns | Description |
|---|---|---|---|
| .embed() | embed(documents: list[str], batch_size: int = 256) | Iterable[SparseEmbedding] | Embed documents (applies TF + length normalization) |
| .query_embed() | query_embed(query: str \| list[str]) | Iterable[SparseEmbedding] | Embed a query (all term weights set to 1.0) |
| .token_count() | token_count(texts: str \| list[str]) | int | Total token count across all provided texts |
SparseEmbedding
The object returned by .embed() and .query_embed() (.queryEmbed() in the TypeScript package).
Python
from endee_model import SparseEmbedding
Properties
Python
| Property | Type | Description |
|---|---|---|
| .indices | np.ndarray (int32/int64) | Token IDs (positions in the vocabulary) |
| .values | np.ndarray (float32) | BM25 TF weights for each token |
Methods
Python
| Method | Returns | Description |
|---|---|---|
| .as_dict() | dict[int, float] | {token_id: weight} (useful for inspection) |
| .as_object() | dict[str, np.ndarray] | {"indices": array, "values": array} (numpy arrays) |
| SparseEmbedding.from_dict(data) | SparseEmbedding | Construct from a {token_id: weight} dict |
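When sparse vectors come from another BM25 implementation, the same two shapes (parallel indices/values lists and a {token_id: weight} dict) can be converted in plain Python. The helpers below are hypothetical stand-ins that do not use endee-model; sorting by token ID is only there to make the output deterministic and is not a documented requirement.

```python
def to_dict(indices: list[int], values: list[float]) -> dict[int, float]:
    """Combine parallel lists into a {token_id: weight} dict."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    return dict(zip(indices, values))


def from_dict(weights: dict[int, float]) -> tuple[list[int], list[float]]:
    """Split a dict into parallel lists sorted by token ID, dropping zero weights."""
    items = sorted((t, w) for t, w in weights.items() if w != 0.0)
    return [t for t, _ in items], [w for _, w in items]


# Hypothetical token IDs and weights.
weights = to_dict([794129062, 354307472], [1.4566, 1.0887])
indices, values = from_dict(weights)
print(indices, values)
```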