
Sparse Vectors (BM25)

The endee-model package provides Endee’s BM25 model for generating sparse vector embeddings. These sparse vectors capture keyword/term importance and are used in hybrid search alongside dense embeddings.

Why Sparse Vectors?

| Approach | Finds |
| --- | --- |
| Dense only | Semantically similar content (different words, same meaning) |
| Sparse (BM25) only | Exact keyword matches |
| Hybrid | Both — best of both worlds |
Query: "vitamin D cancer prevention"
  • Dense match → finds docs about "sun exposure reduces tumour risk" (same meaning)
  • Sparse match → finds docs containing the exact words: vitamin, cancer, prevention
  • Hybrid → combines both for best results

Installation

```shell
pip install endee-model
```

Quick Start

```python
from endee_model import SparseModel

# Load the BM25 model
model = SparseModel(model_name="endee/bm25")

# Generate document embeddings (for indexing)
documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
]
doc_embeddings = list(model.embed(documents))

# Generate query embedding (for searching)
query = "vitamin D and cancer prevention"
query_embedding = next(model.query_embed(query))
```

.embed() vs .query_embed()

BM25 is asymmetric — documents and queries are weighted differently.

| Method | Use For | TF (term frequency) | IDF (rarity) | Length Norm |
| --- | --- | --- | --- | --- |
| .embed() | Documents/corpus | Yes | Yes | Yes |
| .query_embed() | Search queries | No | Yes | No |
  • Documents benefit from TF weighting — if a term appears multiple times, it’s likely important
  • Documents are length-normalized — a long document shouldn’t score higher just because it has more words
  • Queries use IDF-only weighting — each query term gets equal importance

Rule of thumb: Use .embed() on text you’re storing. Use .query_embed() on text you’re searching with.
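The asymmetry can be sketched in a few lines of plain Python. This is illustrative only, not the endee-model internals; `k1` and `b` are the standard BM25 parameters and the IDF value is a toy number:

```python
import math

K1, B = 1.5, 0.75  # illustrative BM25 parameter values

def doc_weight(tf, idf, doc_len, avg_len):
    # .embed()-style weighting: TF with diminishing returns plus length normalization
    return idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * doc_len / avg_len))

def query_weight(idf):
    # .query_embed()-style weighting: IDF only; how often the term appears is ignored
    return idf

idf = math.log(1000 / 25)  # toy IDF for a fairly rare term

# Same term count, but the longer document gets a smaller weight,
# while the query-side weight depends on neither length nor frequency.
w_short = doc_weight(tf=3, idf=idf, doc_len=80, avg_len=120)
w_long = doc_weight(tf=3, idf=idf, doc_len=400, avg_len=120)
```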

Working with Sparse Embeddings

The model returns sparse embedding objects with indices and values arrays:

```python
from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

doc = "Vitamin D supplementation reduces cancer risk."
embedding = next(model.embed([doc]))

# Access the sparse vector components
print(f"Indices: {embedding.indices[:5]}...")  # Token positions
print(f"Values: {embedding.values[:5]}...")    # BM25 weights

# Convert to dictionary format (token_id -> weight)
token_weights = embedding.as_dict()
print(f"Non-zero tokens: {len(token_weights)}")
```
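Because most entries are zero, sparse similarity reduces to a dot product over the tokens two vectors share. A minimal sketch using the `as_dict()` form (plain Python for illustration, not the engine's implementation; the string keys stand in for numeric token IDs):

```python
def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors in token_id -> weight form.

    Only tokens present in both vectors contribute to the score.
    """
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[t] for t, w in a.items() if t in b)

doc_vec = {"cancer": 1.7, "treatment": 1.2, "risk": 0.9}
query_vec = {"cancer": 4.6, "prevention": 5.1}
score = sparse_dot(query_vec, doc_vec)  # only "cancer" overlaps: 4.6 * 1.7
```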

Complete Workflow

Create a Hybrid Index

```python
from endee import Endee, Precision

client = Endee()
client.create_index(
    name="documents",
    dimension=384,              # Your dense embedding dimension
    sparse_model="endee_bm25",  # Enable BM25 sparse vectors
    space_type="cosine",
    precision=Precision.INT8,
)
index = client.get_index(name="documents")
```

Generate Embeddings and Upsert

```python
from endee_model import SparseModel

sparse_model = SparseModel(model_name="endee/bm25")
# dense_model = your preferred dense embedding model (e.g., sentence-transformers)

documents = [
    "Vitamin D supplementation reduces cancer risk in elderly patients.",
    "Machine learning models can predict protein folding accurately.",
    "Regular exercise lowers the risk of cardiovascular disease.",
]

sparse_embeddings = list(sparse_model.embed(documents))

vectors = []
for i, (doc, sparse_emb) in enumerate(zip(documents, sparse_embeddings)):
    vectors.append({
        "id": f"doc_{i}",
        "vector": [...],  # Your dense embedding here (384-dim)
        "sparse_indices": sparse_emb.indices.tolist(),
        "sparse_values": sparse_emb.values.tolist(),
        "meta": {"text": doc},
    })

index.upsert(vectors)
```
Query the Index

```python
query = "vitamin D and cancer prevention"

query_sparse = next(sparse_model.query_embed(query))
# query_dense = dense_model.encode(query)

results = index.query(
    vector=[...],  # Your dense query embedding
    sparse_indices=query_sparse.indices.tolist(),
    sparse_values=query_sparse.values.tolist(),
    top_k=5,
)

for item in results:
    print(f"ID: {item['id']}, Similarity: {item['similarity']:.3f}")
```

Batch Processing

For large datasets, process embeddings in batches and save to JSONL format:

```python
import json
from pathlib import Path

from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

documents = [
    {"id": "1", "text": "First document content...", "title": "Doc 1"},
    {"id": "2", "text": "Second document content...", "title": "Doc 2"},
]

output_path = Path("corpus_embeddings.jsonl")

with open(output_path, "w") as f:
    texts = [doc["text"] for doc in documents]
    embeddings = model.embed(texts)
    for doc, emb in zip(documents, embeddings):
        record = {
            "id": doc["id"],
            "sparse_vector": {
                "indices": emb.indices.tolist(),
                "values": emb.values.tolist(),
            },
            "meta": {"text": doc["text"], "title": doc["title"]},
        }
        f.write(json.dumps(record) + "\n")
```

JSONL record format:

```json
{
  "id": "doc_123",
  "sparse_vector": {
    "indices": [354307472, 794129062, 242156862],
    "values": [1.0887, 1.4566, 1.7527]
  },
  "meta": {
    "text": "Original document text...",
    "title": "Document Title"
  }
}
```
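To feed these records back into a hybrid index, they can be streamed from disk and upserted in batches. A minimal sketch: the loader name and batch size are illustrative, `index.upsert` is the method from the workflow above, and the dense "vector" field is left for you to add:

```python
import json
from itertools import islice

def iter_jsonl_vectors(path, batch_size=100):
    """Yield batches of upsert-ready dicts from a JSONL embeddings file.

    Assumes each record follows the JSONL format shown above.
    """
    with open(path) as f:
        records = (json.loads(line) for line in f if line.strip())
        while True:
            batch = list(islice(records, batch_size))
            if not batch:
                break
            yield [
                {
                    "id": rec["id"],
                    "sparse_indices": rec["sparse_vector"]["indices"],
                    "sparse_values": rec["sparse_vector"]["values"],
                    "meta": rec["meta"],
                }
                for rec in batch
            ]

# Usage sketch, with `index` from the "Create a Hybrid Index" step:
# for batch in iter_jsonl_vectors("corpus_embeddings.jsonl", batch_size=200):
#     index.upsert(batch)
```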

BM25 Scoring

BM25 (Best Matching 25) scores documents based on:

  1. Term Frequency (TF) — how often a term appears (with diminishing returns)
  2. Inverse Document Frequency (IDF) — how rare the term is across the corpus (rare = more informative)
  3. Length Normalization — prevents long documents from having an unfair advantage
```
BM25 score = Σ over query terms of:

    IDF(term) × TF(term, doc) × (k1 + 1)
    ────────────────────────────────────────────────────
    TF(term, doc) + k1 × (1 - b + b × doc_len / avg_len)
```
| Parameter | Meaning | Typical Value |
| --- | --- | --- |
| k1 | TF saturation (diminishing returns) | 1.2–2.0 |
| b | Length normalization strength | 0.75 |
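The formula above can be reimplemented directly for a single term. This is an illustrative sketch, not the endee-model internals; the smoothed IDF shown is one common BM25 variant, and the exact variant the package uses is not documented here:

```python
import math

def idf(df, n_docs):
    """Smoothed inverse document frequency (one common BM25 variant).

    df: number of documents containing the term; n_docs: corpus size.
    """
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    """One term's contribution to the BM25 sum in the formula above."""
    norm = 1 - b + b * (doc_len / avg_len)
    return idf(df, n_docs) * (tf * (k1 + 1)) / (tf + k1 * norm)

# k1 gives diminishing returns: the second occurrence adds less than the first
first = bm25_term(tf=1, df=10, n_docs=1000, doc_len=100, avg_len=100)
second = bm25_term(tf=2, df=10, n_docs=1000, doc_len=100, avg_len=100) - first
```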

Sparse vs Dense Comparison

| | Sparse (BM25) | Dense (e.g. MiniLM) |
| --- | --- | --- |
| Vector size | 30,000+ dimensions (vocab size) | 384–1536 dimensions |
| Non-zero entries | Only tokens present in the text | All entries non-zero |
| Dimension meaning | Weight of a specific vocabulary token | Learned semantic feature |
| Matches | Exact keywords | Semantic/conceptual similarity |
| Example | {"cancer": 1.7, "treatment": 1.2} | [0.12, -0.34, 0.89, ...] |
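Hybrid search needs a way to merge the two kinds of results. How Endee fuses scores internally is not specified here; reciprocal rank fusion (RRF) is one common approach that works on ranks rather than raw scores, so the two scales never need to be calibrated. A sketch with illustrative names:

```python
def rrf_fuse(dense_ranked, sparse_ranked, k=60):
    """Merge two ranked ID lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document; documents ranked
    highly in both lists accumulate the largest fused score.
    """
    scores = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d1", "d3"]   # ranked by dense similarity
sparse = ["d1", "d4"]        # ranked by BM25
fused = rrf_fuse(dense, sparse)  # "d1" ranks first: high in both lists
```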

API Reference

SparseModel

```python
model = SparseModel(model_name="endee/bm25")
```
| Method | Description |
| --- | --- |
| .embed(texts) | Generate document embeddings — applies TF × IDF + length normalization |
| .query_embed(text) | Generate query embedding — applies IDF only |

SparseEmbedding

| Property/Method | Type | Description |
| --- | --- | --- |
| .indices | np.ndarray | Non-zero token positions |
| .values | np.ndarray | BM25 weights for each position |
| .as_dict() | dict | Dictionary mapping token_id → weight |