A Beginner’s Guide to Semantic Search by Pietro Mele
Summary
# Overview
Pietro Mele introduces semantic search for developers who already know the basics of search systems. The talk starts with a review of lexical search, then explains why keyword-based retrieval can fail on context, vocabulary mismatch, and long queries. From there, it moves into embeddings, dense vectors, vector spaces, and the core idea behind semantic retrieval: comparing a query vector to indexed document vectors by similarity. The talk stays beginner-friendly and focuses on concepts rather than implementation details. It also contrasts semantic search with lexical search, notes that both can be combined in a hybrid approach, and briefly touches on nearest-neighbor search and similarity metrics such as Euclidean distance, cosine similarity, and Hamming distance.
Key Takeaways
- Lexical search is keyword-based and relies on text analysis (tokenization, stemming, synonyms, and lemmatization) before indexing.
- Semantic search uses embeddings to represent documents and queries as vectors in a shared vector space.
- Semantic retrieval helps with context-aware matching, vocabulary mismatch, and long queries where lexical search can struggle.
- Dense vectors were the main focus; sparse vectors and deeper implementation details were deferred to a later talk.
- Similarity search is commonly done with k-nearest neighbors or approximate nearest neighbors for large vector collections.
- Cosine similarity is a common choice for high-dimensional vectors, while Euclidean distance and Hamming distance were also mentioned.
- Semantic search is not a replacement for lexical search in all cases; hybrid search can combine both approaches.
Sections
What lexical search does well, and where it fails
The talk begins by describing lexical search as keyword-based retrieval over analyzed documents. Analysis filters such as stemming, synonyms, and lemmatization improve matching, and search engines can explain how each relevance score was computed. However, lexical search is limited when the query requires context, when different words refer to the same concept, or when the query is long and specific. Examples included "chocolate milk" returning poor matches and "God particle" versus "Higgs boson" producing very different results, even though both phrases refer to the same thing.
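The vocabulary-mismatch problem can be illustrated with a minimal sketch. This is not how a real search engine scores documents; `analyze` and `lexical_score` are hypothetical helpers that only count shared tokens, and the two documents are invented for illustration.

```python
import re

def analyze(text: str) -> set[str]:
    """Lowercase the text and split it into word tokens (a stand-in for a real analyzer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_score(query: str, document: str) -> int:
    """Count how many distinct query tokens also appear in the document."""
    return len(analyze(query) & analyze(document))

# Hypothetical one-line documents, invented for this example.
docs = [
    "The Higgs boson discovery was announced at CERN in 2012",
    "A recipe for hot chocolate with milk and cinnamon",
]

# Same concept, different words: only one query shares tokens with the physics document.
scores_god = [lexical_score("God particle", d) for d in docs]
scores_higgs = [lexical_score("Higgs boson", d) for d in docs]
print(scores_god)    # [0, 0]
print(scores_higgs)  # [2, 0]
```

Without a synonym rule that links the two phrases, pure token matching has no way to know that "God particle" and "Higgs boson" mean the same thing.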
Semantic search and embeddings
Semantic search is presented as a way to represent documents and queries with embeddings, or vectors, generated by machine learning models. Those vectors capture semantic meaning rather than exact words. The talk notes that semantic search can handle multiple content types, including text, images, and audio, as long as they can be embedded into a vector space and indexed in systems such as Elasticsearch or OpenSearch.
Vector space intuition
To make the idea concrete, the speaker uses a simple three-dimensional example with features like size, friendliness, and intelligence. In that simplified space, similar animals are placed closer together. The key point is that real embedding spaces have far more dimensions (often around a thousand or more), so similarity is determined by proximity in a high-dimensional vector space, not by human-readable labels.
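A toy version of that three-dimensional space might look like the sketch below. The animals and their scores are hand-picked, hypothetical values rather than figures from the talk, and `closest` is an illustrative helper.

```python
import math

# Hypothetical, hand-assigned scores in [0, 1] for (size, friendliness, intelligence).
animals = {
    "dog":      (0.4, 0.9, 0.7),
    "wolf":     (0.5, 0.2, 0.7),
    "cat":      (0.2, 0.6, 0.6),
    "elephant": (1.0, 0.6, 0.8),
}

def closest(name: str) -> str:
    """Return the other animal whose vector lies nearest to the given one."""
    target = animals[name]
    distances = [
        (math.dist(target, vec), other)
        for other, vec in animals.items()
        if other != name
    ]
    return min(distances)[1]

print(closest("dog"))  # cat
```

In a real system the coordinates come from an embedding model and carry no human-readable meaning, but the principle is the same: nearby points are treated as similar.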
Retrieval with nearest neighbors
At query time, the user query is also converted into a vector and compared against the indexed vectors. The talk introduces two broad retrieval families: exact k-nearest neighbors (kNN), which compares the query against every indexed vector, and approximate nearest neighbors (ANN), which trades some accuracy for speed on very large collections. ANN was mentioned as the more practical option for billion-scale vector search.
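Exact kNN can be sketched as a brute-force scan, shown below under a few assumptions: the vectors are tiny, made-up three-dimensional "embeddings", cosine similarity is used as the score, and `knn_search` is a hypothetical helper rather than any engine's API. Vectors are assumed to be non-zero.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_search(query_vec, index, k=2):
    """Exact kNN: score every indexed vector, then keep the k most similar."""
    scored = (
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in index.items()
    )
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Hypothetical document vectors, invented for illustration.
index = {
    "doc_a": (0.9, 0.1, 0.0),
    "doc_b": (0.8, 0.3, 0.1),
    "doc_c": (0.0, 0.2, 0.9),
    "doc_d": (0.1, 0.9, 0.2),
}

query_vec = (1.0, 0.2, 0.0)
top = knn_search(query_vec, index, k=2)
print(top)  # ['doc_a', 'doc_b']
```

The scan touches every vector, so its cost grows linearly with the size of the collection. That is the cost ANN methods avoid by searching only part of the index and accepting that a true neighbor may occasionally be missed.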
Measuring similarity
The presentation briefly compares common similarity measures. Euclidean distance measures the straight-line distance between two vectors, but the speaker notes that it does not scale well in high-dimensional spaces. Cosine similarity measures the angle between two vectors, ignoring their magnitude, and is commonly used for vector search. Hamming distance was also mentioned; it counts the positions at which two binary vectors differ and appears in some vector-search contexts.
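The three measures can be written in a few lines each. The sketch below uses made-up vectors and compares a vector with a scaled copy of itself, to show that cosine similarity ignores magnitude while Euclidean distance does not.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hamming_distance(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

v = (1.0, 2.0, 3.0)
scaled = (2.0, 4.0, 6.0)  # same direction as v, twice the length

euclid = euclidean_distance(v, scaled)  # about 3.74: the points are far apart
cosine = cosine_similarity(v, scaled)   # about 1.0: the directions are identical

bits_a = (1, 0, 1, 1, 0, 0, 1, 0)
bits_b = (1, 1, 1, 0, 0, 0, 1, 1)
hamming = hamming_distance(bits_a, bits_b)
print(hamming)  # 3
```

Note the difference in direction of the scores: for the two distances, smaller means more similar, whereas for cosine similarity, larger means more similar.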
Practical conclusions
The talk closes by emphasizing that semantic search is powerful because it adds context awareness, but it is not mandatory for every search problem. Strong lexical search can solve many of the same problems with the right configuration. The recommended path is often hybrid search, combining lexical and semantic techniques. The speaker also points out that real vector generation is done by models, not manual feature scoring, and that high dimensionality creates optimization challenges.
Keywords: semantic search, lexical search, embeddings, dense vectors, vector search, vector space, elasticsearch, opensearch, solr, k-nearest neighbors, approximate nearest neighbors, cosine similarity, euclidean distance, hamming distance, hybrid search, document retrieval, vocabulary mismatch, stemming, synonyms, lemmatization