cotalks.dev

Behind the hype: managing billion-scale embeddings in Elasticsearch and OpenSearch by Pietro Mele

Channel: Devoxx

Summary

Pietro Mele explains the practical limits of semantic search when embeddings grow to hundreds of millions or billions of vectors. The talk focuses on how Elasticsearch and OpenSearch handle indexing, storage, ANN search, and memory pressure, with concrete guidance on chunking, model choice, quantization, and disk-based search. It compares the two stacks where relevant, while emphasizing shared Lucene-based concepts and the operational tradeoffs that matter in production. The main message is that semantic search is powerful, but scale changes everything: RAM, disk, segment count, index layout, and retrieval strategy become the real constraints. The talk outlines techniques to keep clusters usable, including approximate nearest neighbor search, GPU-assisted indexing, source-field handling, and reranking after quantized retrieval.

Key Takeaways

  • Billion-scale vector search is usually constrained by RAM before disk; quantization and disk-based retrieval are key mitigation strategies.
  • Long texts, documents, or multi-field content must be chunked by token count before embedding, so that each chunk fits the embedding model’s context window and keeps its context.
  • Approximate nearest neighbor search is required at scale; brute-force kNN compares the query against every stored vector and becomes too slow for very large embedding collections.
  • Model choice should be validated on a representative test set using semantic search metrics such as recall, MRR, and NDCG.
  • Elasticsearch and OpenSearch share many underlying concepts, but differ in available processors, quantization options, and disk-oriented search features.
  • Reducing segment count with force merge and using source-field optimizations can materially improve search performance and storage usage.

Sections

Semantic search basics and why scale hurts

The talk starts with a quick refresher on semantic search: documents and queries are converted into vectors by an embedding model, then matched by similarity instead of keywords. This works well for verbose queries and multilingual scenarios, but the data volume grows quickly because a single document may produce multiple vectors due to multiple fields, chunking, or multiple models. At high scale, the problem is no longer the concept of vector search, but the cost of storing and retrieving very large vector collections.

Chunking and model selection for embeddings

A key step is chunking long text before embedding it. The speaker notes that chunk size should be measured in tokens rather than raw characters or words, because an embedding model’s fixed context window is defined in tokens. He recommends testing candidate models on a small but representative dataset before indexing everything, and validating relevance with metrics such as recall, MRR, and NDCG. Smaller embedding models can be sufficient, and sometimes outperform larger ones for a given use case.
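As a rough illustration of the two steps above, the sketch below splits a token sequence into overlapping windows and computes recall@k, reciprocal rank (averaged over queries to obtain MRR), and NDCG@k. It is not taken from the talk: the function names, the overlap parameter, and the whitespace split used in place of a real tokenizer are all assumptions. In practice the tokens would come from the tokenizer of the embedding model being evaluated.

```python
import math


def chunk_by_tokens(tokens, max_tokens, overlap=0):
    """Split a token list into windows of at most max_tokens tokens.

    Consecutive windows share `overlap` tokens so context is not cut abruptly.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over an iterable of (ranked_ids, relevant_ids) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k, where gains maps doc id -> graded relevance (missing ids count as 0)."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Whitespace splitting is only a stand-in for the model's real tokenizer.
tokens = "a b c d e f g h i j".split()
for chunk in chunk_by_tokens(tokens, max_tokens=4, overlap=1):
    print(chunk)
```

Running the same metric functions over the results of each candidate model on the same labelled query set gives a like-for-like comparison before committing to a full reindex.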

Indexing workflows in Elasticsearch and OpenSearch

The presentation compares ingestion approaches in both products. OpenSearch offers ingest pipelines with a text chunking processor, while Elasticsearch offers equivalent chunking options such as recursive chunking. For embedding generation, both systems can use local models or external inference endpoints/connectors. The speaker also discusses deployment options for model artifacts, noting support for common formats such as TorchScript and ONNX in OpenSearch, plus built-in and third-party model support in Elasticsearch.

GPU-assisted index creation and vector storage

For building large vector indexes, GPU-based indexing is significantly faster than CPU-based processing. The talk highlights NVIDIA cuVS/CAGRA as the mechanism both products use to accelerate index creation and graph construction. It also covers a common storage issue: vectors may be stored both in the search index structure and in _source, doubling storage usage. Excluding vectors from _source, or using derived source behavior, reduces disk use and can simplify large deployments.
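One way to keep vectors out of _source is the `_source.excludes` mapping option, which both products accept. The sketch below builds index bodies as Python dictionaries and prints them as JSON. The field name `embedding`, the `text` field, and the 768 dimensions are placeholders chosen for illustration, not values from the talk.

```python
import json

VECTOR_FIELD = "embedding"  # hypothetical field name
DIMS = 768                  # hypothetical dimension; depends on the embedding model

# Elasticsearch: dense_vector field, indexed for ANN search, excluded from _source.
elasticsearch_index = {
    "mappings": {
        "_source": {"excludes": [VECTOR_FIELD]},
        "properties": {
            "text": {"type": "text"},
            VECTOR_FIELD: {
                "type": "dense_vector",
                "dims": DIMS,
                "index": True,
                "similarity": "cosine",
            },
        },
    }
}

# OpenSearch: knn_vector field on a k-NN enabled index, excluded from _source.
opensearch_index = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "_source": {"excludes": [VECTOR_FIELD]},
        "properties": {
            "text": {"type": "text"},
            VECTOR_FIELD: {"type": "knn_vector", "dimension": DIMS},
        },
    },
}

print(json.dumps(elasticsearch_index, indent=2))
print(json.dumps(opensearch_index, indent=2))
```

The tradeoff is that an excluded field is no longer returned with search hits and cannot be recovered from _source for a later reindex, so the original vectors (or the means to regenerate them) need to live somewhere else.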

Similarity search, ANN, and indexing libraries

At query time, the system embeds the user query and searches for the closest vectors using a distance metric. The speaker mentions cosine similarity, Euclidean distance, and Hamming distance, but recommends cosine for most use cases and advises against Euclidean in high-dimensional spaces. Since brute-force kNN does not scale, the talk focuses on approximate nearest neighbor (ANN) methods such as IVF and HNSW. It also explains that Elasticsearch and OpenSearch rely on libraries such as Lucene, Faiss, and NMSLIB to implement these algorithms.
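To make the scaling problem concrete, here is a minimal sketch of exact (brute-force) kNN with cosine similarity in plain Python. It is an illustration, not how either engine implements search: the function names and the in-memory dictionary of vectors are assumptions. Each query touches every stored vector, so the cost grows linearly with both collection size and dimension, which is what ANN structures such as HNSW and IVF are designed to avoid.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_knn(query, vectors, k):
    """Score the query against every stored vector and keep the k best.

    vectors maps doc id -> vector. Returns (doc_id, score) pairs, best first.
    """
    scored = ((cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(exact_knn([1.0, 0.1], vectors, k=2))
```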

Quantization and memory reduction techniques

The most important operational topic in the talk is memory reduction. Quantization reduces the precision of vectors from float values to smaller representations such as int8 or binary forms. The speaker walks through scalar quantization, product quantization, and binary quantization, showing how each approach lowers RAM usage at the cost of some recall. He emphasizes reranking as a second-stage retrieval strategy: search the compressed index first, then rerank a smaller candidate set using the full-precision vectors.
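The sketch below illustrates both ideas under stated assumptions: the raw memory arithmetic behind quantization, and a two-stage search that scores int8 vectors first and then reranks an oversampled candidate set with the original floats. The symmetric scaling scheme, the dot-product scoring, the `oversample` parameter, and the example size of one billion 768-dimensional vectors are illustrative choices, not the exact scheme used by Elasticsearch or OpenSearch.

```python
import heapq


def raw_vector_bytes(num_vectors, dims, bits_per_dim):
    """Bytes needed for the raw vectors alone (no graph or index overhead)."""
    return num_vectors * dims * bits_per_dim // 8


def fit_scale(vectors):
    """Scale factor that maps the largest absolute component to 127."""
    max_abs = max(abs(x) for vec in vectors for x in vec)
    return max_abs / 127 if max_abs > 0 else 1.0


def quantize(vector, scale):
    """Symmetric scalar quantization of floats to integers in [-127, 127]."""
    return [max(-127, min(127, round(x / scale))) for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_with_rerank(query, vectors, k, oversample=3):
    """Stage 1: rank by int8 dot product. Stage 2: rerank with full precision."""
    scale = fit_scale(vectors.values())
    # In a real system the quantized vectors are built once, at index time.
    quantized = {doc_id: quantize(vec, scale) for doc_id, vec in vectors.items()}
    q_query = quantize(query, scale)
    candidates = heapq.nlargest(
        k * oversample, quantized, key=lambda doc_id: dot(q_query, quantized[doc_id])
    )
    return heapq.nlargest(k, candidates, key=lambda doc_id: dot(query, vectors[doc_id]))


# Assumed example size: one billion 768-dimensional vectors.
for label, bits in (("float32", 32), ("int8", 8), ("1-bit", 1)):
    print(label, raw_vector_bytes(1_000_000_000, 768, bits) / 1e9, "GB")

vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
print(search_with_rerank([1.0, 0.05], vectors, k=2))
```

Oversampling in the first stage is what lets the second stage recover results that quantization error would otherwise have pushed just outside the top k.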

Disk-based vector search and operational tradeoffs

When memory is still not enough, the talk moves to disk-oriented search. Elasticsearch’s disk-based BBQ and OpenSearch’s memory-optimized and disk-based search approaches are presented as ways to shift some of the burden from RAM to disk. The speaker also warns that large vector indexes interact with JVM heap sizing, filesystem cache, and the number of shards/segments. Keeping the number of segments low with force merge can reduce the number of graphs the search engine must traverse and improve latency.
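Both products expose force merge through the `_forcemerge` endpoint with a `max_num_segments` parameter. The sketch below only builds the HTTP request and does not send it. The base URL and index name are placeholders, and a real cluster would also need authentication and a suitable timeout.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def force_merge_request(base_url, index, max_num_segments=1):
    """Build (but do not send) a POST request that merges an index down to N segments."""
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_forcemerge?{query}"
    return Request(url, method="POST")


# Hypothetical host and index name, for illustration only.
request = force_merge_request("http://localhost:9200", "docs-vectors")
print(request.get_method(), request.full_url)
```

Force merge is expensive and is best run on indices that no longer receive writes, for example after a bulk load has finished.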

Practical conclusion for production teams

The closing advice is pragmatic: semantic search is useful, but expensive at scale. Lexical search remains relevant and often simpler to operate. For billion-scale embeddings, teams should combine careful model selection, chunking, ANN search, quantization, reranking, and source-field optimization, and keep clusters on up-to-date versions. The speaker’s overall recommendation is to treat vector search as an evolving storage-and-retrieval system, not just a model problem.

Keywords: elasticsearch vector search, opensearch vector search, billion-scale embeddings, semantic search at scale, approximate nearest neighbor, ann search, hnsw, ivf indexing, faiss, lucene vector search, nmslib, vector quantization, scalar quantization, product quantization, binary quantization, disk-based vector search, reranking, embedding model selection, text chunking, token-based chunking, gpu indexing, cuvs, cagra, vector storage optimization, source field vector storage
