Data | October 7, 2024 | 4 min read

Vector databases: what they actually store and when you need one

A plain explanation of what vector embeddings are, what vector databases do differently from relational or document stores, and when the technology is worth adding to your stack.

"We need a vector database" has become a common line in AI project proposals. Sometimes it is justified. Often it is cargo-cult infrastructure: a blog post about RAG mentioned Pinecone, and now Pinecone sits in the architecture diagram without anyone having explained what it stores or why.

This post is my attempt to give a manager or technical owner a clear enough model to ask the right questions.

What an embedding actually is

When a language model or an embedding model processes text, it converts that text into a list of numbers - typically a few hundred to a few thousand of them. This list is called a vector, and it encodes something like the "meaning" of the text in a geometric space.

The key property: texts that mean similar things end up with vectors that are close to each other in that space. "Contract termination clause" and "agreement cancellation provision" will have vectors that are nearby. "Invoice payment terms" will be further away. "Dog food recall" will be far away.

This similarity is measurable, most commonly as the cosine similarity or the Euclidean distance between two vectors. That measurement is what vector databases are built to perform efficiently at scale.
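To make "close to each other" concrete, here is a minimal sketch of cosine similarity in plain Python. The four-dimensional vectors are made-up numbers chosen for illustration, not the output of any real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (assumption: invented values, for illustration only).
termination_clause = [0.9, 0.8, 0.1, 0.0]
cancellation_provision = [0.8, 0.9, 0.2, 0.0]
payment_terms = [0.5, 0.2, 0.8, 0.1]
dog_food_recall = [0.0, 0.1, 0.1, 0.9]

sim_cancellation = cosine_similarity(termination_clause, cancellation_provision)
sim_payment = cosine_similarity(termination_clause, payment_terms)
sim_dog_food = cosine_similarity(termination_clause, dog_food_recall)

print(f"cancellation provision: {sim_cancellation:.2f}")
print(f"payment terms:          {sim_payment:.2f}")
print(f"dog food recall:        {sim_dog_food:.2f}")
```

With these toy values the scores fall in the same order as the examples above: the cancellation provision scores highest, payment terms lower, and the dog food recall lowest.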

What a vector database does

A vector database stores vectors (and usually the original text or document alongside them) and is optimized for one specific kind of query: "given this vector, find the N vectors in the database that are most similar to it."

A traditional relational database can store vectors too, but its standard indexes, such as B-trees, are built for exact matches and range queries. Similarity search over millions of high-dimensional vectors requires different data structures - approximate nearest neighbor (ANN) indexes, which give up a small amount of accuracy in exchange for much faster lookups - and that is what vector databases specialize in.
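For reference, this is roughly what the exact, brute-force version of that query looks like: score every stored vector against the query and keep the best N. It is a sketch with hypothetical document ids and toy vectors, not how any particular product implements search. The cost grows linearly with the number of stored vectors, which is the problem approximate indexes exist to avoid.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_exact(query, stored, n):
    """Exact nearest-neighbor search: compare the query with every stored vector."""
    scored = ((cosine_similarity(query, vec), doc_id) for doc_id, vec in stored.items())
    # nlargest keeps only the n highest-scoring (score, doc_id) pairs.
    return heapq.nlargest(n, scored)

# Hypothetical store: document id -> toy vector (invented values).
stored = {
    "cancellation_provision": [0.8, 0.9, 0.2, 0.0],
    "payment_terms": [0.5, 0.2, 0.8, 0.1],
    "dog_food_recall": [0.0, 0.1, 0.1, 0.9],
}
query = [0.9, 0.8, 0.1, 0.0]  # stands in for an embedded "contract termination clause"

results = top_n_exact(query, stored, n=2)
for score, doc_id in results:
    print(f"{doc_id}: {score:.2f}")
```

At a few thousand vectors this loop is perfectly adequate. At millions of vectors and many queries per second it is not, and that is the point at which an ANN index starts to pay for itself.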

When you actually need one

You need vector similarity search when:

You are building a semantic search feature where users expect to find documents by meaning, not keyword.
You are building a RAG system and your corpus is large enough that putting everything in the context window is not an option.
You are doing recommendation at scale based on content similarity.
You are detecting near-duplicates or clustering documents by topic.

You probably do not need a dedicated vector database when:

Your corpus is small enough to fit in the context window of the model you are using.
You only need keyword search with some ranking.
You are prototyping - pgvector in Postgres handles modest workloads perfectly well and removes one operational dependency.

The practical architecture

In a typical RAG setup:

1. You process your documents through an embedding model and store the resulting vectors in a vector database alongside the original text chunks.
2. At query time, you embed the user's question and retrieve the N most similar chunks from the vector database.
3. You put those chunks in the context window of the generative model along with the question.
4. The model answers based on that context.
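The steps above can be sketched end to end in a few dozen lines. Everything here is an assumption made for illustration: `make_toy_embedder` is a hypothetical stand-in that counts shared words rather than capturing meaning, the "vector database" is a Python list, and the last step stops at building the prompt instead of calling a generative model.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def make_toy_embedder(corpus):
    """Hypothetical stand-in for an embedding model: word counts over the corpus vocabulary."""
    words = sorted({w for doc in corpus for w in tokenize(doc)})
    vocab = {word: i for i, word in enumerate(words)}

    def embed(text):
        vec = [0.0] * len(vocab)
        for word in tokenize(text):
            if word in vocab:
                vec[vocab[word]] += 1.0
        return vec

    return embed

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(index, embed, question, n):
    """Embed the question and return the n most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine_similarity(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:n]]

def build_prompt(question, context_chunks):
    """Put the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

chunks = [
    "Either party may terminate the contract with 30 days written notice.",
    "Invoices are payable within 45 days of receipt.",
    "The supplier recalled one batch of dog food in March.",
]
embed = make_toy_embedder(chunks)
index = [(embed(chunk), chunk) for chunk in chunks]  # store vectors next to the text

question = "How many days of notice are needed to terminate the contract?"
top_chunks = retrieve(index, embed, question, n=2)
prompt = build_prompt(question, top_chunks)  # this string would go to the generative model
print(prompt)
```

In a real system the embedder is a model call and the list is a database with an index, but the shape of the pipeline is the same.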

The quality of this pipeline depends heavily on how you chunk your documents and which embedding model you use - not just on the vector database itself. A lot of poor RAG performance I see in practice traces back to naive chunking (splitting by character count with no regard for semantic boundaries), not to database choice.
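To show what I mean by naive chunking, here is a small comparison. Both functions are illustrative sketches: the paragraph-based one assumes paragraphs are separated by blank lines and simply packs whole paragraphs up to a size limit, which is only one of many possible ways to respect semantic boundaries.

```python
def chunk_by_characters(text, size):
    """Naive chunking: cut every `size` characters, even in the middle of a sentence."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_by_paragraphs(text, max_chars):
    """Pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    chunks, current = [], ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

document = (
    "Termination. Either party may terminate with 30 days written notice.\n\n"
    "Payment. Invoices are payable within 45 days of receipt.\n\n"
    "Liability. Neither party is liable for indirect damages."
)

naive = chunk_by_characters(document, 80)
by_paragraph = chunk_by_paragraphs(document, 80)

print(repr(naive[0]))         # ends partway into the payment paragraph
print(repr(by_paragraph[0]))  # the termination paragraph, intact
```

The naive version mixes the end of one clause with the start of the next, so the resulting vector represents neither clause cleanly. The paragraph version keeps each clause in its own chunk.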

The operational reality

Vector databases are a relatively young category. The leading options - Pinecone, Weaviate, Qdrant, Chroma, Milvus - are all reasonable but differ in hosting model (managed vs self-hosted), filtering capabilities, and update performance.

pgvector remains my default recommendation for projects where the corpus stays within a few million vectors and the team is already running Postgres. It is one less service to operate. Once you exceed that scale or need features like multi-tenancy with strong isolation, purpose-built options become worth their operational cost.
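A rough back-of-the-envelope calculation helps put "a few million vectors" in perspective. The figures below are assumptions picked for illustration (3 million vectors, 1536 dimensions, 4-byte floats), and the result covers raw vector data only, not index structures or the stored text.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming 4-byte floats by default."""
    return num_vectors * dimensions * bytes_per_value

# Hypothetical workload: 3 million chunks embedded into 1536 dimensions.
estimate = raw_vector_bytes(3_000_000, 1536)
print(f"{estimate / 1e9:.1f} GB")  # prints "18.4 GB"
```

Running the same arithmetic on your own chunk count and embedding size is a quick way to check whether you are anywhere near the scale where a dedicated service becomes a serious consideration.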

The question to ask is not "which vector database should we use" but "do we need a dedicated one at all, given our actual data volume?"


