Sparse Vectors (BM25)

The endee-model package provides Endee’s BM25 model for generating sparse vector embeddings. These sparse vectors capture keyword/term importance and are used in hybrid search alongside dense embeddings.

Why Sparse Vectors?

Approach	Finds
Dense only	Semantically similar content (different words, similar meaning)
Sparse (BM25) only	Exact keyword matches
Hybrid	Both: exact keyword matches and semantically similar content


Query: "vitamin D cancer prevention"

Dense match   → finds docs about "sun exposure reduces tumour risk" (similar meaning)
Sparse match  → finds docs containing exact words: vitamin, cancer, prevention
Hybrid        → combines both for best results
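
To make the "exact words" side of this comparison concrete, here is a minimal sketch of literal keyword overlap. The tokenizer and the `keyword_overlap` helper are hypothetical illustrations, not part of endee-model, and the first sample document is made up for the demo.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split on non-alphanumeric characters (a toy tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap(query: str, doc: str) -> set[str]:
    """Return the query terms that literally appear in the document."""
    return tokenize(query) & tokenize(doc)

query = "vitamin D cancer prevention"
docs = [
    "Vitamin D supplementation and cancer prevention in adults",
    "Sun exposure reduces tumour risk",
]

for doc in docs:
    # The second document shares no literal terms with the query,
    # which is the case a dense model is meant to cover.
    print(sorted(keyword_overlap(query, doc)), "<-", doc)
```

A keyword-only matcher sees nothing in common between the query and the second document, even though a reader would consider it relevant. That gap is what the dense side of a hybrid search fills.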

Installation

Python


pip install endee-model

The endee-model BM25 sparse embedding library is available for Python and TypeScript. You can also bring your own sparse vectors generated from any BM25 implementation. The sparse_model you set at collection creation controls how Endee interprets these values: use endee_bm25 to send TF weights only (Endee applies IDF server-side), or default to send final scores as-is for SPLADE or custom BM25 models.

Quick Start

Python


from endee_model import SparseModel
 
# Load the BM25 model
model = SparseModel(model_name="endee/bm25")
 
# Generate document embeddings (for indexing)
documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
]
doc_embeddings = list(model.embed(documents))
 
# Generate query embedding (for searching)
query = "vitamin D and cancer prevention"
query_embedding = next(model.query_embed(query))

`.embed()` vs `.query_embed()`

BM25 is asymmetric: documents and queries are weighted differently.

Method	Use For	TF (term frequency)	IDF (rarity)	Length Norm
.embed()	Documents/corpus	Yes	Yes	Yes
.query_embed()	Search queries	No	Yes	Yes

Documents benefit from TF weighting: if a term appears multiple times, it’s likely important.
Documents are length-normalized: a long document shouldn’t score higher just because it has more words.
Queries skip TF weighting: every query term is sent with the same weight (1.0), so a term’s contribution depends on its rarity (IDF), not on how often it appears in the query.

Rule of thumb: Use .embed() on text you’re storing. Use .query_embed() on text you’re searching with.
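
The asymmetry can be sketched in a few lines of plain Python. This is an illustration only, not the endee-model implementation: `toy_doc_weights`, `toy_query_weights`, and the `K1`/`B` constants are assumptions chosen to mirror the standard BM25 term-frequency component.

```python
from collections import Counter

K1, B = 1.2, 0.75  # common BM25 defaults; assumed here, not read from endee-model

def toy_doc_weights(tokens: list[str], avg_len: float) -> dict[str, float]:
    """Saturating, length-normalized TF weight for each distinct term."""
    norm = K1 * (1 - B + B * len(tokens) / avg_len)
    return {t: tf * (K1 + 1) / (tf + norm) for t, tf in Counter(tokens).items()}

def toy_query_weights(tokens: list[str]) -> dict[str, float]:
    """Every distinct query term gets the same weight."""
    return {t: 1.0 for t in set(tokens)}

doc = "vitamin d vitamin d reduces cancer risk".split()
doc_weights = toy_doc_weights(doc, avg_len=7.0)
query_weights = toy_query_weights("vitamin d cancer".split())

print(doc_weights)    # repeated terms weigh more, but less than twice as much
print(query_weights)  # all 1.0
```

In the document, a term that appears twice ends up with a higher weight than a term that appears once, but not double: that is the diminishing-returns behaviour of TF saturation.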

Working with Sparse Embeddings

The model returns sparse embedding objects with indices and values arrays:

Python


from endee_model import SparseModel
 
model = SparseModel(model_name="endee/bm25")
 
doc = "Vitamin D supplementation reduces cancer risk."
embedding = next(model.embed([doc]))
 
# Access the sparse vector components
print(f"Indices: {embedding.indices[:5]}...")  # Token IDs (vocabulary positions, not positions in the text)
print(f"Values: {embedding.values[:5]}...")    # BM25 weights
 
# Convert to dictionary format (token_id -> weight)
token_weights = embedding.as_dict()
print(f"Non-zero tokens: {len(token_weights)}")
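
Once two sparse vectors are in `{token_id: weight}` form, comparing them comes down to a dot product over the token IDs they share. The sketch below uses made-up IDs and weights and a hypothetical `sparse_dot` helper; it is meant to show the idea, not how Endee scores internally.

```python
def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product over the token IDs that both vectors share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[i] for i, w in a.items() if i in b)

doc_vec = {101: 1.4, 205: 0.9, 333: 1.1}    # made-up token IDs and weights
query_vec = {101: 1.0, 333: 1.0, 999: 1.0}  # token 999 is absent from the document

print(sparse_dot(query_vec, doc_vec))
```

Only tokens 101 and 333 contribute. Tokens present in just one of the two vectors add nothing, which is why sparse matching rewards exact keyword overlap.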

Complete Workflow

Create a Hybrid Collection

Python


from endee import Endee
 
client = Endee("your-serverless-token")
 
client.create_collection(
    name="documents",
    fields=[
        # Dense field for semantic search
        {"name": "embedding", "type": "vector",
         "params": {"dimension": 384, "space_type": "cosine", "precision": "int8"}},
        # Sparse field for BM25 keyword search
        {"name": "keywords", "type": "sparse", "sparse_model": "endee_bm25"},
    ],
)
 
collection = client.get_collection("documents")

The sparse_model parameter

For sparse_model you have two options depending on which sparse model you use:

sparse_model="endee_bm25": use this when your sparse objects come from endee/bm25. Endee holds the IDF weights on its server and applies them automatically, so you only need to send the TF weights from your client.
sparse_model="default": use this for SPLADE models or any other BM25 model. In this case Endee treats the values you send as final scores and does no further calculation. If you are using another BM25 model (not endee/bm25), you must compute the full IDF scores yourself on the client before sending them.
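
The difference between the two settings can be pictured with a toy scorer. Everything here is an assumption made for illustration: `toy_score` is not Endee's server code, and the IDF table and weights are invented. It only mirrors the description above, where `endee_bm25` multiplies in a server-held IDF and `default` uses the values as sent.

```python
def toy_score(query, doc, sparse_model, idf=None):
    """Dot product of two {token_id: weight} dicts under either sparse_model setting."""
    idf = idf or {}
    total = 0.0
    for token_id, query_weight in query.items():
        if token_id not in doc:
            continue
        term = query_weight * doc[token_id]
        if sparse_model == "endee_bm25":
            # Hypothetical stand-in for the IDF table Endee keeps server-side.
            term *= idf.get(token_id, 0.0)
        total += term
    return total

idf = {1: 0.5, 2: 2.0}            # made-up IDF values
query = {1: 1.0, 2: 1.0}
tf_only_doc = {1: 1.375, 2: 1.0}  # what a client would send with "endee_bm25"
# With "default", the client folds IDF into the values before sending them.
final_doc = {t: w * idf[t] for t, w in tf_only_doc.items()}

print(toy_score(query, tf_only_doc, "endee_bm25", idf))
print(toy_score(query, final_doc, "default"))
```

Both calls produce the same score. The setting changes where the IDF multiplication happens, not what the final ranking is meant to express.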

Generate Embeddings and Upsert

Python


from endee_model import SparseModel
 
sparse_model = SparseModel(model_name="endee/bm25")
# dense_model = your preferred dense embedding model (e.g., sentence-transformers)
 
documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
    "Regular exercise lowers the risk of cardiovascular disease.",
]
 
sparse_embeddings = list(sparse_model.embed(documents))
 
objects = []
for i, (doc, sparse_emb) in enumerate(zip(documents, sparse_embeddings)):
    objects.append({
        "id": f"doc_{i}",
        "meta": {"text": doc},
        "fields": {
            "embedding": [...],  # Your dense embedding here (384-dim)
            "keywords": {
                "indices": sparse_emb.indices.tolist(),
                "values": sparse_emb.values.tolist(),
            },
        },
    })
 
collection.upsert(objects)
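
If your sparse vectors come from another source as `{token_id: weight}` dicts, a small helper can put them into the `indices`/`values` shape used above. `to_sparse_payload` is a hypothetical name, and dropping zero weights and sorting by token ID are assumptions made for tidiness, not documented Endee requirements.

```python
def to_sparse_payload(weights: dict[int, float]) -> dict[str, list]:
    """Turn {token_id: weight} into parallel "indices" and "values" lists."""
    # Keep only non-zero weights and cast to plain Python types so the
    # result is JSON-serializable.
    items = sorted((int(i), float(w)) for i, w in weights.items() if w != 0.0)
    return {
        "indices": [i for i, _ in items],
        "values": [w for _, w in items],
    }

payload = to_sparse_payload({5: 1.0, 2: 0.5, 9: 0.0})
print(payload)
```

The two lists stay aligned position by position, so `values[n]` is always the weight of `indices[n]`.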

Query with Hybrid Search

Python


from endee import rerank
 
query = "vitamin D and cancer prevention"
 
query_sparse = next(sparse_model.query_embed(query))
# query_dense = dense_model.encode(query)
 
# Query both fields, then fuse with RRF into a single ranked list.
res = collection.search(
    fields={
        "embedding": {"query": [...], "limit": 50},  # Your dense query embedding
        "keywords": {
            "query": {
                "indices": query_sparse.indices.tolist(),
                "values": query_sparse.values.tolist(),
            },
            "limit": 50,
        },
    },
)
 
fused = rerank(res, limit=5)
 
for item in fused["results"]:
    print(f"ID: {item['id']}, Similarity: {item['similarity']:.3f}")
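
For intuition about the fusion step, here is a minimal sketch of Reciprocal Rank Fusion (RRF) over two ranked lists of IDs. It is a generic illustration, not the internals of `rerank`: the `rrf_fuse` name is hypothetical, and `k=60` is a commonly used constant rather than a value confirmed for Endee.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, limit: int = 5) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each appearance adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

dense_hits = ["doc_0", "doc_2", "doc_1"]  # made-up dense ranking
sparse_hits = ["doc_0", "doc_1"]          # made-up sparse ranking

fused_toy = rrf_fuse([dense_hits, sparse_hits])
for doc_id, score in fused_toy:
    print(f"{doc_id}: {score:.4f}")
```

RRF only looks at ranks, not raw scores, so it can combine dense similarities and BM25 scores without putting them on a common scale first. A document found by both fields (`doc_1`) overtakes one found by a single field (`doc_2`) even though `doc_2` ranked higher in the dense list.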

Batch Processing

For large datasets, process embeddings in batches and save to JSONL format:

Python


import json
from pathlib import Path
from endee_model import SparseModel
 
model = SparseModel(model_name="endee/bm25")
 
documents = [
    {"id": "1", "text": "First document content...", "title": "Doc 1"},
    {"id": "2", "text": "Second document content...", "title": "Doc 2"},
]
 
output_path = Path("corpus_embeddings.jsonl")
 
with open(output_path, "w") as f:
    texts = [doc["text"] for doc in documents]
    embeddings = model.embed(texts)
 
    for doc, emb in zip(documents, embeddings):
        record = {
            "id": doc["id"],
            "sparse_vector": {
                "indices": emb.indices.tolist(),
                "values": emb.values.tolist(),
            },
            "meta": {"text": doc["text"], "title": doc["title"]}
        }
        f.write(json.dumps(record) + "\n")

JSONL record format:


{
  "id": "doc_123",
  "sparse_vector": {
    "indices": [354307472, 794129062, 242156862],
    "values":  [1.0887, 1.4566, 1.7527]
  },
  "meta": {
    "text": "Original document text...",
    "title": "Document Title"
  }
}
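
A file in this format can be read back lazily and grouped into batches, for example before upserting. The sketch below uses only the standard library; `read_jsonl_batches` is a hypothetical helper, and the demo writes a throwaway file so the example runs on its own.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path
from typing import Iterator

def read_jsonl_batches(path: Path, batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of up to batch_size parsed records without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        records = (json.loads(line) for line in f if line.strip())
        while batch := list(islice(records, batch_size)):
            yield batch

# Demo on a temporary file with three records in the format shown above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "corpus_embeddings.jsonl"
    with open(demo_path, "w", encoding="utf-8") as f:
        for i in range(3):
            record = {
                "id": str(i),
                "sparse_vector": {"indices": [i], "values": [1.0]},
                "meta": {"text": f"Document {i}"},
            }
            f.write(json.dumps(record) + "\n")
    batch_sizes = [len(batch) for batch in read_jsonl_batches(demo_path, batch_size=2)]

print(batch_sizes)
```

Each yielded batch is a plain list of dicts, so it can be reshaped into whatever object layout your collection expects before being sent.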

BM25 Scoring

BM25 (Best Matching 25) scores documents based on:

Term Frequency (TF): how often a term appears (with diminishing returns)
Inverse Document Frequency (IDF): how rare the term is across the corpus (rare = more informative)
Length Normalization: prevents long documents from having an unfair advantage

\text{BM25}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{\text{TF}(t, d) \cdot (k_1 + 1)}{\text{TF}(t, d) + k_1 \cdot \left(1 - b + b \cdot \dfrac{|d|}{\text{avg\_len}}\right)}

Parameter	Meaning	Typical Value
k1	TF saturation (diminishing returns)	1.2–2.0
b	Length normalization strength	0.75
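
The formula can be traced with a small reference implementation. This is a sketch under stated assumptions, not Endee's code: the page does not define IDF, so the smoothed variant `ln(1 + (N - n_t + 0.5) / (n_t + 0.5))` is assumed here, and the whitespace-tokenized corpus is a toy.

```python
import math
from collections import Counter

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.2, b: float = 0.75) -> float:
    """Score one tokenized document against a tokenized query."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    tf = Counter(doc)
    norm = k1 * (1 - b + b * len(doc) / avg_len)  # length normalization term
    score = 0.0
    for term in set(query):
        n_t = sum(1 for d in corpus if term in d)  # documents containing the term
        # Assumed IDF variant; other BM25 implementations may differ.
        idf = math.log(1 + (len(corpus) - n_t + 0.5) / (n_t + 0.5))
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score

corpus = [
    "vitamin d reduces cancer risk".split(),
    "machine learning predicts protein folding".split(),
    "exercise lowers cardiovascular risk".split(),
]
query = "vitamin cancer".split()

scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)
```

Only the first document contains a query term, so it is the only one with a non-zero score. Raising `k1` lets repeated terms keep adding weight for longer, while `b = 0` would switch length normalization off entirely.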

Sparse vs Dense Comparison

	Sparse (BM25)	Dense (e.g. MiniLM)
Dimensionality	30,000+ dimensions (vocab size)	384–1536 dimensions
Non-zero entries	Only tokens present in the text	All entries non-zero
Dimension meaning	Weight of a specific vocabulary token	Learned semantic feature
Matches	Exact keywords	Semantic/conceptual similarity
Example	`{"cancer": 1.7, "treatment": 1.2}`	`[0.12, -0.34, 0.89, ...]`

SDK Reference

Model Class

Python


from endee_model import SparseModel
 
model = SparseModel(model_name="endee/bm25")

Methods

Python

Method	Signature	Returns	Description
`.embed()`	`embed(documents: list[str], batch_size: int = 256)`	`Iterable[SparseEmbedding]`	Embed documents (applies TF + length normalization)
`.query_embed()`	`query_embed(query: str | list[str])`	`Iterable[SparseEmbedding]`	Embed a query (all term weights set to 1.0)
`.token_count()`	`token_count(texts: str | list[str])`	`int`	Total token count across all provided texts

TypeScript

Method	Signature	Returns	Description
`.embed()`	`embed(documents: string[], batchSize?: number)`	`Generator<SparseEmbedding>`	Embed documents (applies TF + length normalization)
`.queryEmbed()`	`queryEmbed(query: string | string[])`	`Generator<SparseEmbedding>`	Embed a query (all term weights set to 1.0)
`.tokenCount()`	`tokenCount(texts: string | string[])`	`number`	Total token count across all provided texts

`SparseEmbedding`

The object returned by .embed() and .query_embed() / .queryEmbed().

Python


from endee_model import SparseEmbedding

Properties

Python

Property	Type	Description
`.indices`	`np.ndarray` (int32/int64)	Token IDs (positions in the vocabulary)
`.values`	`np.ndarray` (float32)	BM25 TF weights for each token

TypeScript

Property	Type	Description
`.indices`	`Int32Array`	Token IDs (positions in the vocabulary)
`.values`	`Float32Array`	BM25 TF weights for each token

Methods

Python

Method	Returns	Description
`.as_dict()`	`dict[int, float]`	`{token_id: weight}` (useful for inspection)
`.as_object()`	`dict[str, np.ndarray]`	`{"indices": array, "values": array}` (numpy arrays)
`SparseEmbedding.from_dict(data)`	`SparseEmbedding`	Construct from a `{token_id: weight}` dict

TypeScript

Method	Returns	Description
`.asObject()`	`{ indices: Int32Array, values: Float32Array }`	Typed arrays (use `Array.from()` when passing to the Endee client)
`.asDict()`	`Record<number, number>`	`{token_id: weight}` (useful for inspection)
`SparseEmbedding.fromDict(data)`	`SparseEmbedding`	Construct from a `{token_id: weight}` object
