
Sparse Vectors (BM25)

The endee-model package provides Endee’s BM25 model for generating sparse vector embeddings. These sparse vectors capture keyword/term importance and are used in hybrid search alongside dense embeddings.

Why Sparse Vectors?

| Approach | Finds |
|---|---|
| Dense only | Semantically similar content (different words, similar meaning) |
| Sparse (BM25) only | Exact keyword matches |
| Hybrid | Best of both worlds |
Query: "vitamin D cancer prevention"

  • Dense match → finds docs about "sun exposure reduces tumour risk" (similar meaning)
  • Sparse match → finds docs containing the exact words: vitamin, cancer, prevention
  • Hybrid → combines both for the best results
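One common way to combine the two signals is a weighted sum of the dense and sparse scores. The sketch below is illustrative only: the `hybrid_score` helper and its `alpha` parameter are not part of the Endee API, and real hybrid indexes may fuse scores differently (e.g. server-side).

```python
# Sketch: fusing dense and sparse scores with a weighted sum.
# The helper and the alpha parameter are illustrative, not Endee API.

def hybrid_score(dense_score: float, sparse_score: float, alpha: float = 0.7) -> float:
    """Blend a dense (semantic) score with a sparse (BM25) score.

    alpha = 1.0 -> pure dense, alpha = 0.0 -> pure sparse.
    """
    return alpha * dense_score + (1 - alpha) * sparse_score

# A doc with a strong exact-keyword match can outrank a purely semantic one:
keyword_heavy = hybrid_score(dense_score=0.62, sparse_score=0.91)
semantic_only = hybrid_score(dense_score=0.80, sparse_score=0.10)
```

Tuning `alpha` lets you trade off exact-keyword precision against semantic recall for your corpus.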

Installation

```
pip install endee-model
```

The endee-model BM25 sparse embedding library is currently available for Python and TypeScript only. Java and Go support hybrid indexes using the default sparse model — bring your own sparse vectors generated from any BM25 implementation.

Quick Start

```python
from endee_model import SparseModel

# Load the BM25 model
model = SparseModel(model_name="endee/bm25")

# Generate document embeddings (for indexing)
documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
]
doc_embeddings = list(model.embed(documents))

# Generate query embedding (for searching)
query = "vitamin D and cancer prevention"
query_embedding = next(model.query_embed(query))
```

.embed() vs .query_embed()

BM25 is asymmetric — documents and queries are weighted differently.

| Method | Use For | TF (term frequency) | IDF (rarity) | Length Norm |
|---|---|---|---|---|
| .embed() | Documents/corpus | Yes | Yes | Yes |
| .query_embed() | Search queries | No | Yes | No |
  • Documents benefit from TF weighting — if a term appears multiple times, it’s likely important
  • Documents are length-normalized — a long document shouldn’t score higher just because it has more words
  • Queries use IDF-only weighting — term rarity still matters, but repeating a term in the query doesn't change its weight

Rule of thumb: Use .embed() on text you’re storing. Use .query_embed() on text you’re searching with.
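The asymmetry can be made concrete with a toy calculation. The helpers below are simplified stand-ins for illustration, not the endee-model internals (IDF, which applies on both sides, is omitted here).

```python
# Toy illustration of why .embed() and .query_embed() weight terms differently.
# Simplified stand-ins, not the endee-model implementation; IDF omitted.

def doc_term_weight(tf: int, doc_len: int, avg_len: float,
                    k1: float = 1.5, b: float = 0.75) -> float:
    """Document-side TF weight: saturating TF plus length normalization."""
    norm = 1 - b + b * (doc_len / avg_len)
    return (tf * (k1 + 1)) / (tf + k1 * norm)

def query_term_weight() -> float:
    """Query-side weight: flat -- repeating a query term changes nothing."""
    return 1.0

# TF saturates: the 2nd occurrence adds less than the 1st, the 3rd less still
w1 = doc_term_weight(tf=1, doc_len=100, avg_len=100)
w2 = doc_term_weight(tf=2, doc_len=100, avg_len=100)
w3 = doc_term_weight(tf=3, doc_len=100, avg_len=100)
```

Here `w2 - w1 > w3 - w2`, showing the diminishing returns, and the same `tf` in a document twice the average length scores lower thanks to length normalization.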

Working with Sparse Embeddings

The model returns sparse embedding objects with indices and values arrays:

```python
from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

doc = "Vitamin D supplementation reduces cancer risk."
embedding = next(model.embed([doc]))

# Access the sparse vector components
print(f"Indices: {embedding.indices[:5]}...")  # Token IDs
print(f"Values: {embedding.values[:5]}...")    # BM25 weights

# Convert to dictionary format (token_id -> weight)
token_weights = embedding.as_dict()
print(f"Non-zero tokens: {len(token_weights)}")
```

Complete Workflow

Create a Hybrid Index

```python
from endee import Endee, Precision

client = Endee()

client.create_index(
    name="documents",
    dimension=384,              # Your dense embedding dimension
    sparse_model="endee_bm25",  # Enable BM25 sparse vectors
    space_type="cosine",
    precision=Precision.INT8,
)

index = client.get_index(name="documents")
```

Generate Embeddings and Upsert

```python
from endee_model import SparseModel

sparse_model = SparseModel(model_name="endee/bm25")
# dense_model = your preferred dense embedding model (e.g., sentence-transformers)

documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
    "Regular exercise lowers the risk of cardiovascular disease.",
]

sparse_embeddings = list(sparse_model.embed(documents))

vectors = []
for i, (doc, sparse_emb) in enumerate(zip(documents, sparse_embeddings)):
    vectors.append({
        "id": f"doc_{i}",
        "vector": [...],  # Your dense embedding here (384-dim)
        "sparse_indices": sparse_emb.indices.tolist(),
        "sparse_values": sparse_emb.values.tolist(),
        "meta": {"text": doc},
    })

index.upsert(vectors)
```
Query the Index

```python
query = "vitamin D and cancer prevention"

query_sparse = next(sparse_model.query_embed(query))
# query_dense = dense_model.encode(query)

results = index.query(
    vector=[...],  # Your dense query embedding
    sparse_indices=query_sparse.indices.tolist(),
    sparse_values=query_sparse.values.tolist(),
    top_k=5,
)

for item in results:
    print(f"ID: {item['id']}, Similarity: {item['similarity']:.3f}")
```

Batch Processing

For large datasets, process embeddings in batches and save to JSONL format:

```python
import json
from pathlib import Path

from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

documents = [
    {"id": "1", "text": "First document content...", "title": "Doc 1"},
    {"id": "2", "text": "Second document content...", "title": "Doc 2"},
]

output_path = Path("corpus_embeddings.jsonl")

with open(output_path, "w") as f:
    texts = [doc["text"] for doc in documents]
    embeddings = model.embed(texts)
    for doc, emb in zip(documents, embeddings):
        record = {
            "id": doc["id"],
            "sparse_vector": {
                "indices": emb.indices.tolist(),
                "values": emb.values.tolist(),
            },
            "meta": {"text": doc["text"], "title": doc["title"]},
        }
        f.write(json.dumps(record) + "\n")
```

JSONL record format:

```json
{
  "id": "doc_123",
  "sparse_vector": {
    "indices": [354307472, 794129062, 242156862],
    "values": [1.0887, 1.4566, 1.7527]
  },
  "meta": {
    "text": "Original document text...",
    "title": "Document Title"
  }
}
```
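When it's time to load the corpus into an index, the JSONL file can be streamed back in batches. This is a minimal sketch using only the standard library; the `read_jsonl_batches` helper and the batch size are illustrative, not part of the SDK.

```python
import json
from itertools import islice

# Sketch: stream JSONL records back and rebuild upsert payloads in batches.
# The helper is illustrative; the record layout matches the format above.

def read_jsonl_batches(path, batch_size=100):
    with open(path) as f:
        records = (json.loads(line) for line in f)
        while batch := list(islice(records, batch_size)):
            yield [
                {
                    "id": rec["id"],
                    "sparse_indices": rec["sparse_vector"]["indices"],
                    "sparse_values": rec["sparse_vector"]["values"],
                    "meta": rec["meta"],
                }
                for rec in batch
            ]

# for batch in read_jsonl_batches("corpus_embeddings.jsonl"):
#     index.upsert(batch)  # add your dense vectors before upserting
```

Streaming line by line keeps memory flat even for corpora far larger than RAM.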

BM25 Scoring

BM25 (Best Matching 25) scores documents based on:

  1. Term Frequency (TF) — how often a term appears (with diminishing returns)
  2. Inverse Document Frequency (IDF) — how rare the term is across the corpus (rare = more informative)
  3. Length Normalization — prevents long documents from having an unfair advantage
$$
\text{BM25}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{\text{TF}(t, d) \cdot (k_1 + 1)}{\text{TF}(t, d) + k_1 \cdot \left(1 - b + b \cdot \dfrac{|d|}{\text{avg\_len}}\right)}
$$
| Parameter | Meaning | Typical Value |
|---|---|---|
| k1 | TF saturation (diminishing returns) | 1.2–2.0 |
| b | Length normalization strength | 0.75 |
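The formula above can be checked in a few lines of plain Python over a toy corpus. This is a sketch for intuition only: tokenization is naive whitespace splitting, and the IDF variant shown is one common choice, not necessarily the one endee-model uses.

```python
import math

# Minimal sketch of the BM25 formula over a toy corpus.
# Naive whitespace tokenization; IDF variant is one common choice.

corpus = [
    "vitamin d reduces cancer risk".split(),
    "machine learning predicts protein folding".split(),
    "exercise lowers cardiovascular risk".split(),
]
N = len(corpus)
avg_len = sum(len(d) for d in corpus) / N

def idf(term):
    """Rarer terms across the corpus get higher weight."""
    df = sum(term in doc for doc in corpus)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(doc, query, k1=1.5, b=0.75):
    score = 0.0
    for t in query:
        tf = doc.count(t)
        norm = 1 - b + b * len(doc) / avg_len   # length normalization
        score += idf(t) * (tf * (k1 + 1)) / (tf + k1 * norm)
    return score

query = "cancer risk".split()
scores = [bm25(doc, query) for doc in corpus]
```

The first document matches both query terms and scores highest; the second matches neither and scores zero.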

Sparse vs Dense Comparison

| | Sparse (BM25) | Dense (e.g. MiniLM) |
|---|---|---|
| Vector size | 30,000+ dimensions (vocab size) | 384–1536 dimensions |
| Non-zero entries | Only tokens present in the text | All entries non-zero |
| Dimension meaning | Weight of a specific vocabulary token | Learned semantic feature |
| Matches | Exact keywords | Semantic/conceptual similarity |
| Example | {"cancer": 1.7, "treatment": 1.2} | [0.12, -0.34, 0.89, ...] |
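Because only the non-zero entries are stored, sparse similarity reduces to a dot product over the token IDs the two vectors share. A minimal sketch (the `sparse_dot` helper is illustrative, not part of the Endee SDK):

```python
# Sketch: sparse similarity as a dot product over shared token IDs.
# Illustrative helper, not part of the Endee SDK.

def sparse_dot(indices_a, values_a, indices_b, values_b):
    a = dict(zip(indices_a, values_a))
    return sum(a.get(i, 0.0) * v for i, v in zip(indices_b, values_b))

doc_idx, doc_val = [101, 205, 309], [1.7, 1.2, 0.8]  # hypothetical token IDs
qry_idx, qry_val = [101, 309, 999], [1.0, 1.0, 1.0]  # query terms, weight 1.0

score = sparse_dot(doc_idx, doc_val, qry_idx, qry_val)
# only token IDs 101 and 309 overlap, so score = 1.7*1.0 + 0.8*1.0
```

Tokens that appear in only one of the two vectors contribute nothing, which is why sparse vectors can be this large yet cheap to score.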

SDK Reference

Model Class

```python
from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")
```

Methods

| Method | Signature | Returns | Description |
|---|---|---|---|
| .embed() | embed(documents: list[str], batch_size: int = 256) | Iterable[SparseEmbedding] | Embed documents — applies TF + length normalization |
| .query_embed() | query_embed(query: str \| list[str]) | Iterable[SparseEmbedding] | Embed a query — all term weights set to 1.0 |
| .token_count() | token_count(texts: str \| list[str]) | int | Total token count across all provided texts |

SparseEmbedding

The object returned by .embed() and .query_embed() (.queryEmbed() in the TypeScript SDK).

```python
from endee_model import SparseEmbedding
```

Properties

| Property | Type | Description |
|---|---|---|
| .indices | np.ndarray (int32/int64) | Token IDs — positions in the vocabulary |
| .values | np.ndarray (float32) | BM25 TF weights for each token |

Methods

| Method | Returns | Description |
|---|---|---|
| .as_dict() | dict[int, float] | {token_id: weight} — useful for inspection |
| .as_object() | dict[str, np.ndarray] | {"indices": array, "values": array} — numpy arrays |
| SparseEmbedding.from_dict(data) | SparseEmbedding | Construct from a {token_id: weight} dict |
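To show how these properties and methods fit together, here is a minimal stand-in built with plain lists (the real class holds NumPy arrays); `ToySparseEmbedding` is a hypothetical name for illustration only.

```python
# A minimal stand-in for SparseEmbedding using plain lists, to show how
# .as_dict(), .as_object(), and .from_dict() relate. Not the real class,
# which stores NumPy arrays.

class ToySparseEmbedding:
    def __init__(self, indices, values):
        self.indices = indices  # token IDs
        self.values = values    # BM25 weights

    def as_dict(self):
        """{token_id: weight} view, handy for inspection."""
        return dict(zip(self.indices, self.values))

    def as_object(self):
        """{"indices": ..., "values": ...} view, handy for serialization."""
        return {"indices": self.indices, "values": self.values}

    @classmethod
    def from_dict(cls, data):
        """Rebuild parallel indices/values arrays from a dict."""
        items = sorted(data.items())
        return cls([i for i, _ in items], [v for _, v in items])

emb = ToySparseEmbedding([354307472, 794129062], [1.0887, 1.4566])
roundtrip = ToySparseEmbedding.from_dict(emb.as_dict())
```

Round-tripping through .as_dict() and .from_dict() preserves the vector, which is useful when records pass through JSON.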