Data · September 11, 2023 · 3 min read

What to understand about embeddings before launching vector search

Why choosing an embedding model is not a technical detail for later, but an architectural decision with long-term consequences.

When a company decides to build a document search system using language models, the conversation quickly moves to the vector database: Pinecone, Weaviate, Chroma, something else. That is natural - the vector database is visible, it gets chosen, it gets evaluated. But there is a more fundamental choice that happens earlier and that you live with longer. That is the choice of embedding model.

Embeddings are numerical representations of text that vector search operates on. The quality of the entire search system depends on how they are generated. And changing them later means re-indexing the entire document corpus from scratch.

Why this decision should not be deferred

The process is simple: text goes through the embedding model, the model produces a vector, and the vector is stored in the vector database. When a user searches, their query goes through the same embedding model, producing a query vector, and the system finds the nearest vectors in the index.
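The flow above can be sketched in a few lines of Python. This is only an illustration: `embed` here is a hypothetical stand-in (a hashed bag of words), not a real embedding model, and the in-memory list stands in for a vector database. The point it shows is that documents and queries pass through the same function.

```python
import hashlib
import math

DIM = 512  # toy dimensionality, chosen only for this illustration


def embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: a hashed bag of words."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length


def cosine(a: list[float], b: list[float]) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


documents = [
    "invoice payment terms and late fees",
    "employee vacation policy and sick leave",
    "server backup schedule and retention",
]

# Indexing: every document goes through embed() once and is stored with its vector.
index = [(doc, embed(doc)) for doc in documents]


def search(query: str, k: int = 1) -> list[str]:
    # Querying: the query goes through the same embed() as the documents.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


top_hit = search("vacation policy")[0]
print(top_hit)
```

A real system replaces `embed` with a model call and the list with a vector database, but the shape of the pipeline stays the same.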

The key point: queries and documents must go through the same model. If you use one model for indexing today and want to try a better one tomorrow, you need to re-index everything. For a corpus of a few thousand documents that is manageable. For hundreds of thousands it is a substantial operation that has to be planned and paid for.

This is why the embedding model choice is an architectural decision, not an implementation detail.
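One way to make this constraint explicit in code is to record which model produced the vectors and refuse to mix models. The sketch below is a toy in-memory index under that assumption; the class `VectorIndex` and the model names `model-a-v1` and `model-b-v2` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Toy in-memory index that remembers which model produced its vectors."""
    model_id: str
    vectors: dict = field(default_factory=dict)

    def add(self, doc_id: str, vector: list, model_id: str) -> None:
        self._require_same_model(model_id)
        self.vectors[doc_id] = vector

    def query(self, query_vector: list, model_id: str) -> list:
        self._require_same_model(model_id)

        # Rank stored documents by dot product with the query vector.
        def score(doc_id: str) -> float:
            return sum(q * d for q, d in zip(query_vector, self.vectors[doc_id]))

        return sorted(self.vectors, key=score, reverse=True)

    def _require_same_model(self, model_id: str) -> None:
        if model_id != self.model_id:
            raise ValueError(
                f"index was built with {self.model_id!r}, got {model_id!r}: "
                "re-index the corpus before switching models"
            )


index = VectorIndex(model_id="model-a-v1")  # hypothetical model name
index.add("doc-1", [0.1, 0.9], model_id="model-a-v1")
index.add("doc-2", [0.8, 0.2], model_id="model-a-v1")

try:
    # A query vector from a different (hypothetical) model is rejected.
    index.query([0.7, 0.3], model_id="model-b-v2")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Switching models then becomes a visible step: build a new index under the new model identifier, fill it, and only then point queries at it.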

The main axes of choice

Vector dimensionality. Higher dimensionality lets the vector carry more information, but it also increases memory use and slows search. Modern models typically produce 768 or 1536 dimensions, which is a workable range for most tasks.
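The memory side of this trade-off is simple arithmetic. The sketch below assumes raw vectors stored as 32-bit floats (4 bytes per value) and a hypothetical corpus of 500,000 vectors. It ignores index structures and metadata, which add overhead on top.

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone, assuming float32 by default."""
    return num_vectors * dims * bytes_per_value


NUM_VECTORS = 500_000  # hypothetical corpus size

for dims in (768, 1536):
    size = index_size_bytes(NUM_VECTORS, dims)
    print(f"{dims} dims: {size / 1e9:.2f} GB")

# 768 dims: 1.54 GB
# 1536 dims: 3.07 GB
```

Doubling the dimensionality doubles the raw storage, so the choice between 768 and 1536 dimensions is worth checking against the expected corpus size.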

Language support. If the documents are in a language other than English, you need a model trained on that language. Multilingual models exist, but their quality for individual languages is usually lower than that of specialised ones. A mixed-language corpus is a separate question that deserves its own analysis.

Domain specialisation. General models are trained on the internet. If your documents are legal, medical, or technical, a general model may handle specialist terminology poorly. Domain-tuned models give better quality, but there are fewer of them and they are more expensive to maintain.

Local model or API. Some embedding models are available through an API (OpenAI's text-embedding-ada-002, for example). Others run locally. This brings us back to the data confidentiality question: if documents cannot leave the perimeter, a local model is needed.

What is commonly missed

Evaluating embedding quality is a step that cannot be skipped. A model may look good on standard benchmarks and perform poorly on your specific corpus and your typical queries. The right approach is to take a few candidate models, build a small test index, and run real queries through it. This takes time but saves disappointment later.
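A comparison like this can be sketched as a small harness: index the same test corpus with each candidate, run labelled queries, and count how often the expected document lands in the top k. Everything below is illustrative. The two `embed_*` functions are deliberately crude stand-ins for real candidate models, and the corpus and queries are invented.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hit_rate_at_k(embed, corpus: dict, labelled_queries: list, k: int = 1) -> float:
    """Share of queries whose expected document appears in the top k results."""
    doc_vectors = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for query, expected_doc_id in labelled_queries:
        q = embed(query)
        ranked = sorted(doc_vectors, key=lambda d: cosine(q, doc_vectors[d]), reverse=True)
        if expected_doc_id in ranked[:k]:
            hits += 1
    return hits / len(labelled_queries)


def embed_letters(text: str) -> list[float]:
    """Toy candidate A: letter-frequency vector (stand-in for a real model)."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def embed_length_only(text: str) -> list[float]:
    """Toy candidate B: deliberately weak, looks only at text length."""
    return [float(len(text)), 1.0]


# Invented test corpus and labelled queries: (query, id of the expected document).
corpus = {
    "contract": "termination clause in the supplier contract",
    "hr": "parental leave policy for employees",
    "ops": "database backup and restore procedure",
}
labelled_queries = [
    ("supplier contract termination", "contract"),
    ("parental leave", "hr"),
    ("restore database backup", "ops"),
]

candidates = {"letters": embed_letters, "length-only": embed_length_only}
results = {
    name: hit_rate_at_k(fn, corpus, labelled_queries, k=1)
    for name, fn in candidates.items()
}
for name, score in results.items():
    print(f"{name}: hit rate@1 = {score:.2f}")
```

The value of the harness comes from the queries: they should be the ones real users type, paired with the documents those users would expect to find.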

The cost of generating embeddings. If you use an API, you pay for every document at every re-indexing. For large corpora with frequent changes this can add up to significant amounts.
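A rough estimate is easy to script. The sketch below assumes per-token pricing. The price, corpus size, average document length, and number of re-indexing runs are all placeholder values, not quoted rates. Substitute the numbers from your provider and your corpus.

```python
def embedding_cost(
    num_documents: int,
    avg_tokens_per_document: int,
    price_per_1k_tokens: float,
    reindex_runs: int = 1,
) -> float:
    """Total API cost of embedding the whole corpus `reindex_runs` times."""
    total_tokens = num_documents * avg_tokens_per_document * reindex_runs
    return total_tokens / 1000 * price_per_1k_tokens


# All inputs below are hypothetical placeholders.
cost = embedding_cost(
    num_documents=300_000,
    avg_tokens_per_document=2_000,
    price_per_1k_tokens=0.0001,
    reindex_runs=4,
)
print(f"Estimated cost: {cost:.2f}")
# Estimated cost: 240.00
```

The `reindex_runs` parameter is the one that tends to be underestimated: every model change or chunking change multiplies the bill by the full corpus size again.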

A practical approach

Before choosing a tool, answer these questions:

What is the corpus size - now and in a year?
What languages are the documents in?
Are there restrictions on sending data outside the company?
How often will the corpus be updated?
Is there specific terminology the model must understand?

Answers to these questions will narrow the choice to a few real candidates. Then you can experiment with specific models on specific data - rather than choosing by internet popularity.
