RAG in production: what architects actually need to decide
Retrieval-augmented generation has become the default pattern for corporate AI. I break down which architectural decisions actually matter and where the traps are.
Retrieval-augmented generation - RAG - has become the standard answer to the question "how do we connect an LLM to our company's data?" The pattern is genuinely useful. But in early 2025 I see the same mistake repeated: teams choose a vector database, embed some documents, and consider the architecture done. Then the system goes live, quality is poor, and no one is sure where to look.
The issue is that RAG is not a technology choice. It is a set of design decisions that interact with each other. Getting one wrong degrades the whole chain.
What RAG actually consists of
A working RAG system has at least four distinct stages: ingestion, retrieval, augmentation (assembling the retrieved chunks into the prompt), and generation. Most attention goes to the last one - the model. But the problems almost always live in the first two.
Ingestion is where documents become searchable chunks. How you split a document matters enormously. If chunks are too small, context is lost. If they are too large, the retrieved chunk is diluted with irrelevant text. The right chunking strategy depends on document type - a legal contract chunks differently from a product FAQ.
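To make the per-document-type idea concrete, here is a minimal sketch of fixed-size chunking with overlap. The names `ChunkingConfig`, `CONFIG_BY_TYPE` and `chunk_text` are my own for illustration, and the sizes are placeholders rather than recommendations. Real pipelines usually split on structure (clauses, headings, question/answer pairs) rather than raw character counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingConfig:
    max_chars: int  # upper bound on the length of one chunk
    overlap: int    # characters shared between neighbouring chunks


# Hypothetical per-type settings; the numbers are illustrative only.
CONFIG_BY_TYPE = {
    "contract": ChunkingConfig(max_chars=1200, overlap=200),
    "faq": ChunkingConfig(max_chars=400, overlap=0),
}


def chunk_text(text: str, config: ChunkingConfig) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Each window starts (max_chars - overlap) characters after the previous
    one, so consecutive chunks share `overlap` characters of context.
    """
    if config.overlap >= config.max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = config.max_chars - config.overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + config.max_chars])
        # Stop once this window has reached the end of the text.
        if start + config.max_chars >= len(text):
            break
    return chunks
```

The point of the lookup table is that the chunking decision is made per document type at ingestion time, instead of one global setting applied to everything.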
Retrieval is where a user query is matched to stored chunks. Pure vector similarity works well when the question and the answer use similar vocabulary. It fails when the question is abstract and the answer is specific, or when exact terminology matters. Hybrid retrieval - combining vector search with keyword search - handles a much wider range of real queries.
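One common way to combine the two result lists is reciprocal rank fusion, which needs only the rank positions from each retriever and not their raw scores. The sketch below assumes each retriever has already returned an ordered list of chunk IDs; the function name and the IDs are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each chunk scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Chunks found by both retrievers therefore
    rise above chunks found by only one.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a vector index and a keyword index.
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here `fused` puts "a" and "c" ahead of "b" and "d", because both retrievers returned them.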
The metadata problem
Vector search finds semantically similar text. It does not know that a policy document was superseded last quarter, that a price list applies only to region X, or that a technical specification is under revision. That context has to come from metadata.
In most first implementations, metadata is an afterthought. The result is a system that retrieves content that is outdated or scoped to the wrong audience. Users see a confident answer built on the wrong source.
Designing the metadata schema before ingestion - not after - is one of the highest-leverage decisions in a RAG project.
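As a rough sketch of what such a schema might look like, the example below covers the three cases mentioned above: superseded documents, region-scoped documents, and documents under revision. The field names, status values and the `is_eligible` helper are assumptions for illustration. In practice the same conditions would normally be expressed as a filter in the vector store's query rather than checked in application code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    doc_id: str
    status: str                          # e.g. "approved" or "under_revision"
    regions: frozenset[str]              # empty set means "applies everywhere"
    superseded_on: Optional[date] = None  # None means still current


def is_eligible(meta: ChunkMetadata, region: str, today: date) -> bool:
    """Return True if a chunk may be used to answer a query from `region`."""
    if meta.status != "approved":
        return False  # drafts and documents under revision are excluded
    if meta.superseded_on is not None and meta.superseded_on <= today:
        return False  # a newer version has replaced this document
    if meta.regions and region not in meta.regions:
        return False  # document is scoped to other regions
    return True
```

Every field here has to be populated at ingestion time, which is why the schema needs to exist before the first document is indexed.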
Evaluation is not optional
A common pattern I see: the team runs a demo, it looks good, and the system ships. Then, slowly, complaints about wrong answers start coming in.
RAG systems need an evaluation framework from the start. That means a test set of real questions with known good answers, and metrics for both retrieval quality (did we find the right chunks?) and generation quality (did the model use them correctly?). Without this, tuning is guesswork and regressions are invisible.
This is not complicated to set up, but it requires someone to own it. Usually nobody does until there is already a problem.
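The retrieval half of that framework can start very small. The sketch below assumes a test set of questions paired with the IDs of the chunks that should be retrieved, and a `retrieve` callable that returns ranked chunk IDs. Both are hypothetical stand-ins for whatever your system exposes. It reports recall@k and mean reciprocal rank. Generation quality needs a separate check that is not shown here.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    test_set: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average both metrics over every (question, relevant_ids) pair."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    recalls, ranks = [], []
    for question, relevant in test_set:
        retrieved = retrieve(question)
        recalls.append(recall_at_k(retrieved, relevant, k))
        ranks.append(reciprocal_rank(retrieved, relevant))
    n = len(test_set)
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(ranks) / n}
```

Running this on every change to chunking, retrieval or metadata filters is what makes regressions visible.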
Where to start
When I advise a team starting a RAG project in 2025, I prioritise in this order:
- Define the use case tightly - "access to all company knowledge" is not a use case.
- Audit the source documents: format, update frequency, access scope, version history.
- Design the metadata schema that supports your filtering requirements.
- Choose chunking strategy per document type, not one size for everything.
- Build a small evaluation set before writing the first line of retrieval code.
The model choice is usually the last decision, not the first. Most quality problems in enterprise RAG have nothing to do with which LLM you picked.