Filtered search

Combine semantic vector search with server-side metadata filters ($eq, $in, $range) to build precise retrieval pipelines.

Filtered search answers the question: which documents, among those I care about, are most similar to this query? Filters are eligibility gates evaluated server-side before ranking: a document that fails the filter is never scored.
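
The gate-then-rank order can be illustrated with a small pure-Python sketch. It is not how the server is implemented; the toy vectors, the `fields` key, and the `filtered_search` helper are all invented here for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filtered_search(query_vec, docs, predicate, top_k):
    """Gate first, rank second: only docs passing `predicate` are scored."""
    eligible = [d for d in docs if predicate(d["fields"])]
    scored = [(cosine(query_vec, d["vector"]), d["id"]) for d in eligible]
    scored.sort(reverse=True)
    return scored[:top_k]

# Toy 2-dimensional vectors, invented for illustration only.
toy_docs = [
    {"id": "a", "vector": [1.0, 0.0], "fields": {"category": "tech"}},
    {"id": "b", "vector": [0.9, 0.1], "fields": {"category": "health"}},
    {"id": "c", "vector": [0.0, 1.0], "fields": {"category": "health"}},
]

hits = filtered_search(
    [1.0, 0.0], toy_docs, lambda f: f["category"] == "health", top_k=2
)
print([doc_id for _, doc_id in hits])
```

Document "a" is the closest vector to the query, yet it never appears in the output because it fails the gate before any score is computed.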


Filter operators

| Operator | Description | Example |
| --- | --- | --- |
| $eq | Exact match - string, bool, or int | {"category": {"$eq": "tech"}} |
| $in | List membership - OR within a field | {"category": {"$in": ["health", "science"]}} |
| $range | Inclusive numeric range [start, end] | {"year": {"$range": [2022, 2024]}} |

Multiple entries in the filter list are always ANDed:

filter=[
    {"category": {"$eq": "tech"}},        # must match
    {"year": {"$range": [2022, 2024]}},   # AND must match
    {"premium": {"$eq": True}},           # AND must match
]

OR across different fields requires separate queries merged client-side. OR within a single field is what $in is for.
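
A minimal sketch of the client-side merge follows. It assumes each result is a dict carrying an "id" key next to "similarity"; the "id" key is an assumption, so adjust it to whatever the query response actually contains. The `merge_or` helper and the sample scores are hypothetical.

```python
def merge_or(result_lists, top_k):
    """Union several result lists, keep each id's best score, re-rank."""
    best = {}
    for results in result_lists:
        for r in results:
            current = best.get(r["id"])
            if current is None or r["similarity"] > current["similarity"]:
                best[r["id"]] = r
    ranked = sorted(best.values(), key=lambda r: r["similarity"], reverse=True)
    return ranked[:top_k]

# Stand-ins for two separate query responses; the values are invented.
tech_hits = [
    {"id": "doc_01", "similarity": 0.61},
    {"id": "doc_02", "similarity": 0.48},
]
premium_hits = [
    {"id": "doc_02", "similarity": 0.48},
    {"id": "doc_09", "similarity": 0.72},
]

merged = merge_or([tech_hits, premium_hits], top_k=3)
print([r["id"] for r in merged])
```

Each underlying query should request at least top_k results, so that the merged top_k cannot miss a document that ranked high in only one of the lists.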


Installation

pip install --upgrade endee sentence-transformers
pip install numpy==2.0.0

Imports

from getpass import getpass
from endee import Endee
from sentence_transformers import SentenceTransformer

Authentication

Local server

If NDD_AUTH_TOKEN is set on the server, pass the same token:

client = Endee("ndd-auth-token")
client.set_base_url("http://0.0.0.0:8080/api/v1")

Endee Cloud

Create a token at app.endee.io:

client = Endee("your-serverless-token")

Creating a collection

COLLECTION_NAME = "dense_filter_demo"

# Start from a clean slate: drop any collection left over from a previous run.
try:
    client.delete_collection(COLLECTION_NAME)
except Exception:
    pass

client.create_collection(
    name=COLLECTION_NAME,
    dimension=384,
    space_type="cosine",
)
collection = client.get_collection(COLLECTION_NAME)
print(f"Collection '{COLLECTION_NAME}' ready")

Loading the embedding model

all-MiniLM-L6-v2 converts text into 384-dimensional vectors. Load it once and reuse for both indexing and querying.

dense_model = SentenceTransformer("all-MiniLM-L6-v2")

Preparing the dataset

The dataset contains 16 research articles across four categories. Each document has five metadata fields used as filter dimensions: category, year, rating, author, and premium.

Every field you want to filter on must be declared in the filter dict at upsert time. Fields missing from filter cannot be queried later.
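
One way to avoid forgetting a field is to derive the filter dict from meta with a fixed field list and fail fast on gaps. The sketch below is plain Python and makes no Endee calls; `FILTER_FIELDS` and `build_filter` are hypothetical names introduced for illustration.

```python
FILTER_FIELDS = ("category", "year", "rating", "author", "premium")

def build_filter(meta, fields=FILTER_FIELDS):
    """Copy the filterable fields out of `meta`, raising if any is absent."""
    missing = [name for name in fields if name not in meta]
    if missing:
        raise KeyError(f"meta is missing filter fields: {missing}")
    return {name: meta[name] for name in fields}

meta = {
    "title": "Quantum Computing",
    "category": "tech",
    "year": 2024,
    "rating": 4.8,
    "author": "alice",
    "premium": True,
}
print(build_filter(meta))
```

The title stays in meta only, since it is display data rather than a filter dimension.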

DOCUMENTS = [
    # tech
    {
        "id": "doc_01",
        "text": "Neural networks are revolutionising image recognition and computer vision tasks",
        "meta": {"title": "Neural Nets & Vision", "category": "tech", "year": 2023, "rating": 4.5, "author": "alice", "premium": True},
    },
    {
        "id": "doc_02",
        "text": "Quantum computing promises exponential speedup for optimisation and cryptography",
        "meta": {"title": "Quantum Computing", "category": "tech", "year": 2024, "rating": 4.8, "author": "alice", "premium": True},
    },
    # ...more documents
]
print(f"{len(DOCUMENTS)} documents ready")

Indexing documents

For each document, encode the text and build a payload with two separate dicts:

  • meta - returned with results, not searchable
  • filter - declares every field available for filtering at query time

payload = []
for doc in DOCUMENTS:
    vec = dense_model.encode(doc["text"]).tolist()
    m = doc["meta"]
    payload.append({
        "id": doc["id"],
        "vector": vec,
        "meta": m,
        "filter": {
            "category": m["category"],
            "year": m["year"],
            "rating": m["rating"],
            "author": m["author"],
            "premium": m["premium"],
        },
    })

collection.upsert(payload)
print(f"{len(payload)} documents indexed")

Setting up queries

All queries use the same text. The show_results helper prints rank, score, and metadata.

QUERY = "AI applications in healthcare and medicine"
query_vec = dense_model.encode(QUERY).tolist()
TOP_K = 5

def show_results(results, label=""):
    if label:
        print(f"Filter: {label}")
    for rank, r in enumerate(results, 1):
        m = r["meta"]
        print(f" {rank}. score={r['similarity']:.4f} [{m['category']}] {m['title']} ({m['author']}, {m['year']}, rating={m['rating']}, premium={m['premium']})")
    print()

Baseline search (no filter)

Search all documents to see what the dense model considers relevant before any filtering.

results = collection.query(vector=query_vec, top_k=TOP_K)
show_results(results, label="none - all documents are candidates")

$eq - Exact match

Restrict to documents where a field exactly equals a given value:

results = collection.query(
    vector=query_vec,
    top_k=TOP_K,
    filter=[{"category": {"$eq": "health"}}],
)
show_results(results, label='category == "health" (4 candidates)')

$range - Numeric range

Takes a two-value list [start, end] - both ends inclusive. Works on any numeric field.

results = collection.query(
    vector=query_vec,
    top_k=TOP_K,
    filter=[{"year": {"$range": [2022, 2024]}}],
)
show_results(results, label="year in [2022, 2024] (13 candidates)")

$in - Match any value from a list

OR within a single field. A document passes if its value matches any item in the list.

results = collection.query(
    vector=query_vec,
    top_k=TOP_K,
    filter=[{"category": {"$in": ["health", "science"]}}],
)
show_results(results, label='category in ["health", "science"] (8 candidates)')

Combining multiple filters

Multiple filters are ANDed - a document must satisfy every condition:

results = collection.query(
    vector=query_vec,
    top_k=TOP_K,
    filter=[
        {"category": {"$in": ["health", "science"]}},
        {"year": {"$range": [2021, 2023]}},
        {"premium": {"$eq": True}},
    ],
)
show_results(results, label='category in ["health","science"] AND year in [2021,2023] AND premium == True')
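
To sanity-check how many documents a filter list should admit, the operator semantics described in this tutorial can be mirrored locally. The sketch below is a hypothetical re-implementation for checking expectations, not the server's code. It assumes a document that lacks a filtered field fails the clause, which the tutorial does not state.

```python
def passes(fields, filter_list):
    """Return True when `fields` satisfies every clause (AND semantics)."""
    for clause in filter_list:
        for name, condition in clause.items():
            if name not in fields:
                return False  # assumption: a missing field fails the clause
            value = fields[name]
            for op, arg in condition.items():
                if op == "$eq":
                    ok = value == arg
                elif op == "$in":
                    ok = value in arg
                elif op == "$range":
                    start, end = arg
                    ok = start <= value <= end  # both ends inclusive
                else:
                    raise ValueError(f"unsupported operator: {op}")
                if not ok:
                    return False
    return True

# Invented filter fields for four toy documents.
toy = [
    {"category": "health", "year": 2023, "premium": True},
    {"category": "science", "year": 2020, "premium": True},
    {"category": "tech", "year": 2022, "premium": True},
    {"category": "health", "year": 2021, "premium": False},
]
flt = [
    {"category": {"$in": ["health", "science"]}},
    {"year": {"$range": [2021, 2023]}},
    {"premium": {"$eq": True}},
]
print(sum(passes(d, flt) for d in toy))
```

Only the first toy document clears all three gates. The second fails on year, the third on category, and the fourth on premium.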

Filter reference

Common filter patterns ready to copy:

| What you want | Filter |
| --- | --- |
| Only tech articles | [{"category": {"$eq": "tech"}}] |
| Only bob’s articles | [{"author": {"$eq": "bob"}}] |
| Only premium content | [{"premium": {"$eq": True}}] |
| Health articles that are premium | [{"category": {"$eq": "health"}}, {"premium": {"$eq": True}}] |
| Alice’s tech articles only | [{"category": {"$eq": "tech"}}, {"author": {"$eq": "alice"}}] |
| Rating 4.0 and above | [{"rating": {"$range": [4.0, 5.0]}}] |
| Top-rated articles only (4.5+) | [{"rating": {"$range": [4.5, 5.0]}}] |
| Articles by alice or bob | [{"author": {"$in": ["alice", "bob"]}}] |
| Only 2022 and 2024 (skip 2023) | [{"year": {"$in": [2022, 2024]}}] |
| Recent high-quality tech articles | [{"category": {"$eq": "tech"}}, {"year": {"$range": [2022, 2024]}}, {"rating": {"$range": [4.3, 5.0]}}] |
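
If these patterns are built in code rather than copied, a few small helpers can cut down on bracket typos. The names `eq`, `any_of`, and `between` are hypothetical conveniences, not part of the Endee client. They only produce the plain dicts shown in the patterns above.

```python
def eq(field, value):
    """Exact-match clause."""
    return {field: {"$eq": value}}

def any_of(field, values):
    """List-membership clause (OR within one field)."""
    return {field: {"$in": list(values)}}

def between(field, start, end):
    """Inclusive numeric range clause."""
    if start > end:
        raise ValueError("start must not exceed end")
    return {field: {"$range": [start, end]}}

# Same filter as "Recent high-quality tech articles" above.
recent_quality_tech = [
    eq("category", "tech"),
    between("year", 2022, 2024),
    between("rating", 4.3, 5.0),
]
print(recent_quality_tech)
```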

Cleanup

client.delete_collection(COLLECTION_NAME)
print(f"Deleted: {COLLECTION_NAME}")

Key takeaways

  • Filters run before ranking - only documents that pass the filter are scored against the query vector
  • Declare filter fields at upsert time - any field you want to filter later must be in the filter dict when indexing
  • meta and filter are separate - meta is returned with results; filter controls what enters ranking
  • All filters are ANDed - for OR across different fields, run separate queries and merge client-side