AI July 5, 2024 3 min read

RAG in production: why a large context window does not solve the problem

Why RAG architecture often disappoints in production, and where the real bottleneck sits.

Among the architectures that have gained popularity alongside the rise of language models, RAG (Retrieval-Augmented Generation) is one of the most talked about. The idea is simple: instead of fine-tuning a model on your data, find the relevant chunks and pass them into the prompt context. It sounds like a ready-made answer for enterprise search, internal knowledge bases, or documentation Q&A.

In pilots it looks convincing. At real scale the picture changes.

What RAG is and why it attracts

The approach works like this: break corporate documents into fragments, turn them into vector embeddings, and when a question arrives, retrieve the closest fragments by meaning and feed them to the model alongside the question. The model answers drawing on the retrieved context.
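To make the flow concrete, here is a minimal sketch of the retrieve-then-prompt loop. It is an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and the fragments, the question, and the function names are all hypothetical. A production system would use a trained embedding model and a vector index instead.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, fragments: list[str], k: int = 2) -> list[str]:
    """Return the k fragments most similar to the question."""
    q = embed(question)
    ranked = sorted(fragments, key=lambda f: cosine(q, embed(f)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Put the retrieved fragments in front of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )


# Hypothetical knowledge base fragments.
fragments = [
    "Vacation requests must be submitted two weeks in advance.",
    "Expense reports are reimbursed within thirty days.",
    "The office is closed on public holidays.",
]
question = "How far in advance must vacation requests be submitted?"
top = retrieve(question, fragments, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The final step, sending `prompt` to a language model, is left out on purpose: it depends on whichever model client you use.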

This removes several pain points: there is no need to fine-tune an expensive model again every time the documentation changes, data can be updated independently, and answers can be linked back to a source.

The appeal is clear. But appeal is not the same as production readiness.

Where it actually breaks

The first bottleneck is chunking quality. Corporate documents are not written to be split into convenient pieces. Long policies with cross-references, tables filled with acronyms, context that only makes sense if you know the background: all of this gets cut without regard for meaning. The model receives fragments that cannot be understood, let alone used to answer a question, without knowledge of what surrounds them.
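A small sketch shows how this happens with naive fixed-size splitting. The `chunk_fixed` helper and the policy text are hypothetical and chosen only to illustrate the failure mode. Real pipelines usually split on sentences or headings, but the same problem appears whenever a reference and its referent land in different chunks.

```python
def chunk_fixed(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` characters, ignoring meaning."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Hypothetical policy text with a backward reference in the second sentence.
policy = (
    "Section 4.2 applies only to contractors hired after 2020. "
    "In that case, the approval limit described above is halved."
)

chunks = chunk_fixed(policy, size=60)
for n, chunk in enumerate(chunks):
    print(n, repr(chunk))
```

The second chunk talks about "that case" and a limit "described above", but neither the case nor the limit is in it. Retrieved on its own, it gives the model nothing to anchor an answer to.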

The second bottleneck is retrieval quality. Vector search finds semantically similar content, but not always the right content. If a user asks "what to do when a supplier fails" and the document describes this as "counterparty risk management procedure", the search may miss it entirely.

The third bottleneck is the quality of the documents themselves. RAG amplifies what is already there. If the documentation is outdated, contradictory, or written with unstated assumptions for an internal audience, the model will produce confident answers based on bad material.
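One way to see this bottleneck rather than guess at it is to measure retrieval separately from generation. Below is a minimal sketch of recall@k over a small hand-labeled set. The queries, document ids, and search results are invented for illustration, and `recall_at_k` is a hypothetical helper, not part of any library.

```python
def recall_at_k(
    results: dict[str, list[str]], expected: dict[str, str], k: int
) -> float:
    """Share of queries whose expected document id is in the top-k results."""
    if not expected:
        raise ValueError("expected must not be empty")
    hits = sum(
        1 for query, doc_id in expected.items()
        if doc_id in results.get(query, [])[:k]
    )
    return hits / len(expected)


# Hand-labeled pairs: query -> the document that should be found (hypothetical).
expected = {
    "what to do when a supplier fails": "counterparty-risk-procedure",
    "how to request vacation": "vacation-policy",
}

# Ranked ids as returned by the search step (hypothetical output).
results = {
    "what to do when a supplier fails": [
        "supplier-onboarding", "invoice-faq", "counterparty-risk-procedure",
    ],
    "how to request vacation": ["vacation-policy", "holiday-calendar"],
}

r1 = recall_at_k(results, expected, k=1)
r3 = recall_at_k(results, expected, k=3)
print(f"recall@1={r1:.2f} recall@3={r3:.2f}")  # recall@1=0.50 recall@3=1.00
```

Even a few dozen labeled pairs written in the users' own vocabulary will show whether wording mismatches like the supplier example are rare or systematic.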

Why growing context windows do not fix the problem

I often hear the argument: "Soon models will hold the entire document corpus in context - then RAG will not be needed." This is technically possible in narrow cases, but it does not remove the source quality problem. A larger context passes more material, but the model still works with whatever is in it. Bad data in a large context produces the same bad results, just at higher cost.

Beyond that, long context creates its own effects: models extract information less reliably from the middle of a long context than from the beginning or end. This is a documented phenomenon, often referred to as the "lost in the middle" effect, not a theory.
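If you want to check this effect on your own model rather than take it on faith, a simple probe is to plant one known fact at different depths of a long filler context and ask about it. The sketch below only builds the probe contexts. The `build_probe` helper, the filler text, and the planted fact are all hypothetical, and the model call itself is omitted because it depends on your client.

```python
def build_probe(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` into the filler at a relative depth (0.0 start, 1.0 end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    sentences = filler[:position] + [needle] + filler[position:]
    return " ".join(sentences)


# Hypothetical filler and planted fact; real probes would use far longer text.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The access code for the archive room is 7421."

probes = {depth: build_probe(filler, needle, depth) for depth in (0.0, 0.5, 1.0)}
for depth, context in probes.items():
    print(depth, context.index(needle))
```

Ask the same question against each context ("What is the access code for the archive room?") and compare how often the answer is correct at each depth.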

What to check before launching

Before investing in a RAG system, I ask a few questions:

Who is responsible for keeping the documents in the knowledge base current?
Is there an update process, or is this a one-time load followed by "see what happens"?
How is answer quality monitored: who catches hallucinations, and how?
Do users understand the system's limits, or will they trust any answer unconditionally?
What happens when the system answers confidently and incorrectly, and what is the cost of that?
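The first two questions can be turned into an automated check. The sketch below flags documents that have no owner or have not been reviewed recently. The `DocRecord` fields, the 180-day threshold, and the sample records are assumptions made for illustration; your knowledge base will have its own metadata.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class DocRecord:
    doc_id: str
    owner: Optional[str]   # who keeps this document current, if anyone
    last_reviewed: date    # when someone last confirmed it is accurate


def needs_attention(
    docs: list[DocRecord], today: date, max_age_days: int = 180
) -> list[tuple[str, list[str]]]:
    """Return (doc_id, reasons) for documents with no owner or a stale review."""
    cutoff = today - timedelta(days=max_age_days)
    flagged = []
    for doc in docs:
        reasons = []
        if not doc.owner:
            reasons.append("no owner")
        if doc.last_reviewed < cutoff:
            reasons.append("stale")
        if reasons:
            flagged.append((doc.doc_id, reasons))
    return flagged


# Hypothetical knowledge base metadata.
docs = [
    DocRecord("vacation-policy", "hr", date(2024, 5, 1)),
    DocRecord("supplier-procedure", None, date(2024, 6, 1)),
    DocRecord("it-onboarding", "it", date(2022, 1, 10)),
]

flagged = needs_attention(docs, today=date(2024, 7, 5))
for doc_id, reasons in flagged:
    print(doc_id, ", ".join(reasons))
```

A report like this does not fix the documents, but it makes the ownership question visible before the system goes live.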

RAG is a good technique. But like any tool, it solves a specific part of the problem. The part involving data quality and the processes around it does not go away.

If this resonated, write to me. I reply personally.