AI March 7, 2023 3 min read

RAG: how retrieval-augmented generation actually works

Before building a chatbot over your own documents, it helps to understand what RAG does, what it does not do, and where the failure points are.

After GPT-4 launched, the most common question I started hearing was: "Can we make it answer questions using our own internal documents?" The short answer is yes. The longer answer is that there is a specific architectural pattern for this, and if you skip any of its steps, the results look impressive in a demo and break in production.

That pattern is called RAG - retrieval-augmented generation.

What the name means

The idea is straightforward. A language model, on its own, only knows what it was trained on. It has no access to your contracts, your product wiki, your support history. RAG fixes that by adding a retrieval step before the model generates an answer.

The flow looks like this: the user asks a question, the system searches your documents for the most relevant passages, those passages are inserted into the prompt alongside the question, and only then does the model generate an answer. The model does not "read your database" - it reads a small slice of retrieved context each time.

The three components you actually need to build

An indexing pipeline. Your documents need to be chunked into smaller pieces, converted into vector embeddings, and stored in a vector database. This runs ahead of time rather than per question, and you rerun it whenever your documents change. The chunking strategy matters more than most people expect - chunks that are too long lose precision, chunks that are too short lose context.
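To make the chunking trade-off concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_words` and the default sizes are my own assumptions for illustration. Real pipelines often split on headings, paragraphs, or tokens rather than whitespace-separated words.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence that straddles
    a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

Raising `chunk_size` gives each chunk more context but makes retrieval less precise. Raising `overlap` reduces the chance of cutting an answer in half, at the cost of a larger index.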

A retrieval step. When a question arrives, it is converted into a vector embedding using the same embedding model that was used at indexing time, and the database returns the most similar chunks. Retrieval quality sets the ceiling on the quality of the final answer. A great model cannot rescue poor retrieval.
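The retrieval step can be sketched in a few lines. This is an illustration under loud assumptions: `embed` here is a toy bag-of-words stand-in for a real embedding model, and the in-memory list stands in for a vector database. Only the shape of the logic carries over: embed the question the same way as the chunks, score by cosine similarity, keep the top k.

```python
import math
import re
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: L2-normalised word counts.

    A real system would call an embedding model here. The important
    property is that chunks and questions go through the SAME function.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two normalised sparse vectors (a dot product)."""
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


def build_index(chunks: list[str]) -> list[tuple[str, dict[str, float]]]:
    """Indexing time: embed every chunk once and keep it next to its text."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question: str, index, k: int = 3) -> list[tuple[float, str]]:
    """Query time: return the k most similar chunks as (score, chunk) pairs."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


index = build_index([
    "Refunds are issued within 14 days of purchase.",
    "The office is closed on public holidays.",
    "Passwords must be rotated every 90 days.",
])
for score, chunk in retrieve("How many days do refunds take?", index, k=2):
    print(f"{score:.2f}  {chunk}")
```

Swapping the toy `embed` for a real model changes the quality of the ranking, not the structure of the code, which is why retrieval can be tested and tuned separately from generation.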

A generation step. The retrieved chunks and the question are combined into a prompt. The model synthesises an answer from what it was given. If the right information was not retrieved, the model will either say it does not know (good) or will confabulate something that sounds plausible (bad).
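As a sketch of how the prompt might be assembled, the function below numbers the retrieved chunks and tells the model what to do when the context does not contain the answer. The wording of the instructions and the name `build_prompt` are assumptions of mine, not a standard. The resulting string would be sent to whichever model API you use.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    # Number the chunks so the answer can point back to its sources.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_prompt(
    "How long do refunds take?",
    ["Refunds are issued within 14 days of purchase."],
))
```

An explicit "I don't know" instruction makes the good outcome more likely, but it does not guarantee it, so the behaviour still needs to be checked on questions whose answers are missing from the documents.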

Where projects fail

The most common mistake is treating retrieval as a solved problem and spending all engineering effort on the model integration. In practice, improving retrieval - better chunking, better embeddings, better reranking - delivers far more value than switching between model versions.

The second common mistake is ignoring the document pipeline. If your source documents are inconsistent, poorly structured, or full of stale information, RAG will retrieve and present that stale information with confidence.

The third is skipping evaluation. It is not enough to click through a few example questions manually. You need a small but representative test set, a way to measure retrieval recall (how often the passages that contain the answer are actually retrieved), and a way to catch regressions when you change the pipeline.
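A minimal sketch of such a measurement is recall@k: for each test question, the fraction of known-relevant chunks that show up in the top k results, averaged over the test set. The chunk IDs, the test set format, and the `retrieve_ids` callable are hypothetical and would be replaced by your own retriever.

```python
from typing import Callable


def recall_at_k(
    test_set: list[tuple[str, set[str]]],
    retrieve_ids: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Mean recall@k over (question, relevant_chunk_ids) pairs."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    total = 0.0
    for question, relevant in test_set:
        if not relevant:
            raise ValueError(f"no relevant chunks listed for: {question!r}")
        retrieved = set(retrieve_ids(question, k)[:k])
        total += len(relevant & retrieved) / len(relevant)
    return total / len(test_set)


# Hypothetical retriever with canned results, standing in for the real one.
canned = {
    "How long do refunds take?": ["refund-policy", "faq-3", "pricing"],
    "How often do passwords expire?": ["onboarding", "security-policy", "faq-1"],
}


def fake_retrieve_ids(question: str, k: int) -> list[str]:
    return canned[question][:k]


test_set = [
    ("How long do refunds take?", {"refund-policy", "refund-faq"}),
    ("How often do passwords expire?", {"security-policy"}),
]
print(recall_at_k(test_set, fake_retrieve_ids, k=3))  # 0.75
```

Storing this number for the current pipeline and rerunning it after every change to chunking, embeddings, or reranking is one simple way to catch regressions before users do.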

A practical starting point

If you are considering RAG for an internal use case, the questions worth answering first are:

What are the actual documents - their format, update frequency, and who owns them?
What does a "good answer" look like, and who will judge it?
How will you handle the case where the answer is genuinely not in the documents?

Getting those three questions answered before writing any code saves a lot of backtracking later.

Contact

If this resonated, write to me. I reply personally.