Sparse Retrieval

TL;DR

Retrieval methods that use explicit keywords and term matching to find relevant documents, contrasting with semantic similarity-based approaches.

Sparse retrieval is keyword-based search: it finds documents that contain the specific words or phrases in the query. It's called "sparse" because each document is represented as a vector over the whole vocabulary, and most entries in that vector are zero (most words don't appear in any given document). This contrasts with dense retrieval (semantic search using embeddings).

Examples of sparse retrieval include: BM25 (a probabilistic ranking algorithm), TF-IDF (term frequency-inverse document frequency), and boolean keyword search. These methods are well-understood, widely deployed, and extremely fast.
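To make the ranking idea concrete, here is a minimal sketch of BM25 scoring. It assumes whitespace tokenization, the commonly used defaults k1=1.5 and b=0.75, and one common variant of the IDF formula; the function name `bm25_scores` and the sample documents are illustrative, not taken from any particular library.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 variant.

    Assumptions: whitespace tokenization, lowercase matching, and
    conventional default values for k1 and b.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term that is absent contributes nothing
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "covid-19 vaccine trial results",
    "influenza vaccine schedule for adults",
    "truck maintenance and repair guide",
]
scores = bm25_scores("covid-19 vaccine", docs)
```

The first document matches both query terms and scores highest, the second matches only "vaccine", and the third matches nothing and scores zero.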

Sparse retrieval excels at exact matches and phrase searches. If you search for "COVID-19 vaccine," sparse retrieval finds documents containing those exact terms. It's predictable and interpretable: you know why documents were retrieved (they contain your search terms).

The weakness of sparse retrieval is that it doesn't understand meaning. If you search for "vehicle," sparse retrieval won't find documents about "car" or "truck" (unless they also contain "vehicle"). It's rigid and can't handle synonyms, abbreviations, or conceptual relationships.

In the era of dense retrieval (semantic search using embeddings), sparse retrieval got less attention. But smart systems use both. Hybrid search combines sparse and dense retrieval: use sparse retrieval to quickly filter large document collections, then use dense retrieval to rank remaining documents. Or use them in parallel and combine results.
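One way to combine parallel results is reciprocal rank fusion (RRF), which merges ranked lists using only each document's position in each list. The sketch below is an illustration rather than a prescribed method: the choice of RRF, the constant k=60, and the document ids are all assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    k=60 is a conventional default, assumed here.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a sparse and a dense retriever for one query.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
# fused == ["d1", "d3", "d2", "d4"]
```

Documents that both retrievers rank well ("d1", "d3") rise to the top, while documents found by only one retriever are kept but ranked lower.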

Sparse retrieval is also useful when you have specialized vocabulary. In medical search, you might have specific medical terms. Sparse retrieval reliably finds them. Semantic search might miss them if the terms aren't well-represented in embedding space.

Implementation is straightforward. Inverted indexes (mapping words to documents containing them) enable fast lookup. Once candidate documents are identified, ranking algorithms order them by relevance.
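The inverted index described above can be sketched in a few lines. This is a toy version that assumes whitespace tokenization and in-memory sets as posting lists; the helper names `build_inverted_index` and `lookup_all` are illustrative.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup_all(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # .get avoids creating empty entries for unseen terms.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = [
    "the vehicle was parked outside",
    "a red car and a blue truck",
    "vehicle registration for a truck",
]
index = build_inverted_index(docs)

vehicle_hits = lookup_all(index, "vehicle")        # {0, 2}
both_hits = lookup_all(index, "vehicle truck")     # {2}
```

The lookup for "vehicle" skips document 1 even though it is about a car and a truck, which is the vocabulary mismatch described earlier. A ranking function would then order the candidate ids by relevance.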

The recent trend is hybrid approaches. Dense retrieval captures meaning but typically costs more per query; sparse retrieval is cheaper but limited to term matching. Modern systems often use sparse retrieval as a first stage (narrowing millions of documents down to hundreds), then dense retrieval as a second stage (reranking those hundreds).
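The two-stage pattern can be sketched as follows. The first stage here is a simple term-overlap count standing in for a real sparse ranker, and the embedding vectors are hand-made placeholders rather than output from an actual model; all names and values are hypothetical.

```python
import math


def sparse_candidates(query, docs, top_n):
    """Stage 1: keep the top_n documents by number of shared query terms."""
    q_terms = set(query.lower().split())
    overlap = {
        doc_id: len(q_terms & set(text.lower().split()))
        for doc_id, text in docs.items()
    }
    matched = [d for d in overlap if overlap[d] > 0]
    return sorted(matched, key=lambda d: (-overlap[d], d))[:top_n]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank_dense(query_vec, candidates, doc_vecs):
    """Stage 2: order the surviving candidates by embedding similarity."""
    return sorted(candidates, key=lambda d: -cosine(query_vec, doc_vecs[d]))


docs = {
    "a": "vehicle registration claim form",
    "b": "accident claim after a truck collision",
    "c": "garden tools catalog",
}
# Placeholder 2-d embeddings (assumed, not produced by a real model).
doc_vecs = {"a": [0.3, 0.9], "b": [0.9, 0.2], "c": [0.0, 1.0]}
query, query_vec = "vehicle accident claim", [1.0, 0.0]

candidates = sparse_candidates(query, docs, top_n=2)      # ["a", "b"]
reranked = rerank_dense(query_vec, candidates, doc_vecs)  # ["b", "a"]
```

Document "c" never reaches the dense stage, which is where the cost saving comes from: the more expensive similarity computation runs only on the candidates that survive the sparse filter.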

There has also been a renaissance of interest in sparse retrieval, for two reasons: large language models have become more effective at reranking sparse retrieval results, and hybrid approaches often outperform pure dense retrieval.

In enterprise search, sparse retrieval is often the foundation. Companies have search infrastructure built on sparse retrieval technologies. Migrating to pure dense retrieval is costly. Hybrid approaches that enhance existing sparse retrieval with dense components are more practical.

Why It Matters

Sparse retrieval remains valuable despite the rise of dense retrieval. It's fast, interpretable, and effective for queries that hinge on exact terms. Hybrid approaches combining both often outperform either alone.

Example

A legal database uses hybrid search. Sparse retrieval finds cases containing specific legal terms (statute numbers, attorney names, court names), and dense retrieval reranks them by semantic relevance. The combined result is a set of cases that match both by keyword and by meaning.
