
Sparse Vectors (BM25)

The endee-model package provides Endee’s BM25 model for generating sparse vector embeddings. These sparse vectors capture keyword/term importance and are used in hybrid search alongside dense embeddings.

Why Sparse Vectors?

| Approach | Finds |
| --- | --- |
| Dense only | Semantically similar content (different words, same meaning) |
| Sparse (BM25) only | Exact keyword matches |
| Hybrid | Both — best of both worlds |
Query: "vitamin D cancer prevention"
  • Dense match → finds docs about "sun exposure reduces tumour risk" (same meaning)
  • Sparse match → finds docs containing the exact words: vitamin, cancer, prevention
  • Hybrid → combines both for best results

Installation

```shell
pip install endee-model
```

Quick Start

```python
from endee_model import SparseModel

# Load the BM25 model
model = SparseModel(model_name="endee/bm25")

# Generate document embeddings (for indexing)
documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
]
doc_embeddings = list(model.embed(documents))

# Generate query embedding (for searching)
query = "vitamin D and cancer prevention"
query_embedding = next(model.query_embed(query))
```

.embed() vs .query_embed()

BM25 is asymmetric — documents and queries are weighted differently.

| Method | Use For | TF (term frequency) | IDF (rarity) | Length Norm |
| --- | --- | --- | --- | --- |
| .embed() | Documents/corpus | Yes | Yes | Yes |
| .query_embed() | Search queries | No | Yes | No |
  • Documents benefit from TF weighting — if a term appears multiple times, it’s likely important
  • Documents are length-normalized — a long document shouldn’t score higher just because it has more words
  • Queries use IDF-only weighting — each query term gets equal importance

Rule of thumb: Use .embed() on text you’re storing. Use .query_embed() on text you’re searching with.
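The asymmetry can be sketched in a few lines of plain Python. This is illustrative only, not the endee-model internals; `k1` and `b` are the standard BM25 parameters and the IDF value is a toy number:

```python
import math

K1, B = 1.5, 0.75  # illustrative BM25 parameter values

def doc_weight(tf, idf, doc_len, avg_len):
    # .embed()-style weighting: TF with diminishing returns plus length normalization
    return idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * doc_len / avg_len))

def query_weight(idf):
    # .query_embed()-style weighting: IDF only; how often the term appears is ignored
    return idf

idf = math.log(1000 / 25)  # toy IDF for a fairly rare term

# Same term count, but the longer document gets a smaller weight,
# while the query-side weight depends on neither length nor frequency.
w_short = doc_weight(tf=3, idf=idf, doc_len=80, avg_len=120)
w_long = doc_weight(tf=3, idf=idf, doc_len=400, avg_len=120)
```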

Working with Sparse Embeddings

The model returns sparse embedding objects with indices and values arrays:

```python
from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

doc = "Vitamin D supplementation reduces cancer risk."
embedding = next(model.embed([doc]))

# Access the sparse vector components
print(f"Indices: {embedding.indices[:5]}...")  # Token positions
print(f"Values: {embedding.values[:5]}...")    # BM25 weights

# Convert to dictionary format (token_id -> weight)
token_weights = embedding.as_dict()
print(f"Non-zero tokens: {len(token_weights)}")
```
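Because most entries are zero, sparse similarity reduces to a dot product over the tokens two vectors share. A minimal sketch using the `as_dict()` form (plain Python for illustration, not the engine's implementation; the string keys stand in for numeric token IDs):

```python
def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors in token_id -> weight form.

    Only tokens present in both vectors contribute to the score.
    """
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[t] for t, w in a.items() if t in b)

doc_vec = {"cancer": 1.7, "treatment": 1.2, "risk": 0.9}
query_vec = {"cancer": 4.6, "prevention": 5.1}
score = sparse_dot(query_vec, doc_vec)  # only "cancer" overlaps: 4.6 * 1.7
```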

Complete Workflow

Create a Hybrid Index

```python
from endee import Endee, Precision

client = Endee()
client.create_index(
    name="documents",
    dimension=384,              # Your dense embedding dimension
    sparse_model="endee_bm25",  # Enable BM25 sparse vectors
    space_type="cosine",
    precision=Precision.INT8,
)
index = client.get_index(name="documents")
```

Generate Embeddings and Upsert

```python
from endee_model import SparseModel

sparse_model = SparseModel(model_name="endee/bm25")
# dense_model = your preferred dense embedding model (e.g., sentence-transformers)

documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
    "Regular exercise lowers the risk of cardiovascular disease.",
]

sparse_embeddings = list(sparse_model.embed(documents))

vectors = []
for i, (doc, sparse_emb) in enumerate(zip(documents, sparse_embeddings)):
    vectors.append({
        "id": f"doc_{i}",
        "vector": [...],  # Your dense embedding here (384-dim)
        "sparse_indices": sparse_emb.indices.tolist(),
        "sparse_values": sparse_emb.values.tolist(),
        "meta": {"text": doc},
    })

index.upsert(vectors)
```
Query the Index

```python
query = "vitamin D and cancer prevention"

query_sparse = next(sparse_model.query_embed(query))
# query_dense = dense_model.encode(query)

results = index.query(
    vector=[...],  # Your dense query embedding
    sparse_indices=query_sparse.indices.tolist(),
    sparse_values=query_sparse.values.tolist(),
    top_k=5,
)

for item in results:
    print(f"ID: {item['id']}, Similarity: {item['similarity']:.3f}")
```

Batch Processing

For large datasets, process embeddings in batches and save to JSONL format:

```python
import json
from pathlib import Path

from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

documents = [
    {"id": "1", "text": "First document content...", "title": "Doc 1"},
    {"id": "2", "text": "Second document content...", "title": "Doc 2"},
]

output_path = Path("corpus_embeddings.jsonl")

with open(output_path, "w") as f:
    texts = [doc["text"] for doc in documents]
    embeddings = model.embed(texts)
    for doc, emb in zip(documents, embeddings):
        record = {
            "id": doc["id"],
            "sparse_vector": {
                "indices": emb.indices.tolist(),
                "values": emb.values.tolist(),
            },
            "meta": {"text": doc["text"], "title": doc["title"]},
        }
        f.write(json.dumps(record) + "\n")
```

JSONL record format:

```json
{
  "id": "doc_123",
  "sparse_vector": {
    "indices": [354307472, 794129062, 242156862],
    "values": [1.0887, 1.4566, 1.7527]
  },
  "meta": {
    "text": "Original document text...",
    "title": "Document Title"
  }
}
```
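To feed these records back into a hybrid index, they can be streamed from disk and upserted in batches. A minimal sketch: the loader name and batch size are illustrative, `index.upsert` is the method from the workflow above, and the dense "vector" field is left for you to add:

```python
import json
from itertools import islice

def iter_jsonl_vectors(path, batch_size=100):
    """Yield batches of upsert-ready dicts from a JSONL embeddings file.

    Assumes each record follows the JSONL format shown above.
    """
    with open(path) as f:
        records = (json.loads(line) for line in f if line.strip())
        while True:
            batch = list(islice(records, batch_size))
            if not batch:
                break
            yield [
                {
                    "id": rec["id"],
                    "sparse_indices": rec["sparse_vector"]["indices"],
                    "sparse_values": rec["sparse_vector"]["values"],
                    "meta": rec["meta"],
                }
                for rec in batch
            ]

# Usage sketch, with `index` from the "Create a Hybrid Index" step:
# for batch in iter_jsonl_vectors("corpus_embeddings.jsonl", batch_size=200):
#     index.upsert(batch)
```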

BM25 Scoring

BM25 (Best Matching 25) scores documents based on:

  1. Term Frequency (TF) — how often a term appears (with diminishing returns)
  2. Inverse Document Frequency (IDF) — how rare the term is across the corpus (rare = more informative)
  3. Length Normalization — prevents long documents from having an unfair advantage
```
BM25 score = Σ over query terms of:

    IDF(term) × TF(term, doc) × (k1 + 1)
    ────────────────────────────────────────────────────
    TF(term, doc) + k1 × (1 - b + b × doc_len / avg_len)
```
| Parameter | Meaning | Typical Value |
| --- | --- | --- |
| k1 | TF saturation (diminishing returns) | 1.2–2.0 |
| b | Length normalization strength | 0.75 |
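The formula above can be reimplemented directly for a single term. This is an illustrative sketch, not the endee-model internals; the smoothed IDF shown is one common BM25 variant, and the exact variant the package uses is not documented here:

```python
import math

def idf(df, n_docs):
    """Smoothed inverse document frequency (one common BM25 variant).

    df: number of documents containing the term; n_docs: corpus size.
    """
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    """One term's contribution to the BM25 sum in the formula above."""
    norm = 1 - b + b * (doc_len / avg_len)
    return idf(df, n_docs) * (tf * (k1 + 1)) / (tf + k1 * norm)

# k1 gives diminishing returns: the second occurrence adds less than the first
first = bm25_term(tf=1, df=10, n_docs=1000, doc_len=100, avg_len=100)
second = bm25_term(tf=2, df=10, n_docs=1000, doc_len=100, avg_len=100) - first
```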

Sparse vs Dense Comparison

| | Sparse (BM25) | Dense (e.g. MiniLM) |
| --- | --- | --- |
| Vector size | 30,000+ dimensions (vocab size) | 384–1536 dimensions |
| Non-zero entries | Only tokens present in the text | All entries non-zero |
| Dimension meaning | Weight of a specific vocabulary token | Learned semantic feature |
| Matches | Exact keywords | Semantic/conceptual similarity |
| Example | {"cancer": 1.7, "treatment": 1.2} | [0.12, -0.34, 0.89, ...] |
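Hybrid search needs a way to merge the two kinds of results. How Endee fuses scores internally is not specified here; reciprocal rank fusion (RRF) is one common approach that works on ranks rather than raw scores, so the two scales never need to be calibrated. A sketch with illustrative names:

```python
def rrf_fuse(dense_ranked, sparse_ranked, k=60):
    """Merge two ranked ID lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document; documents ranked
    highly in both lists accumulate the largest fused score.
    """
    scores = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d1", "d3"]   # ranked by dense similarity
sparse = ["d1", "d4"]        # ranked by BM25
fused = rrf_fuse(dense, sparse)  # "d1" ranks first: high in both lists
```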

API Reference

SparseModel

```python
model = SparseModel(model_name="endee/bm25")
```
| Method | Description |
| --- | --- |
| .embed(texts) | Generate document embeddings — applies TF × IDF + length normalization |
| .query_embed(text) | Generate query embedding — applies IDF only |

SparseEmbedding

| Property/Method | Type | Description |
| --- | --- | --- |
| .indices | np.ndarray | Non-zero token positions |
| .values | np.ndarray | BM25 weights for each position |
| .as_dict() | dict | Dictionary mapping token_id → weight |