m@ksim.pro

Long context in LLMs: what changes for architects and decision-makers

Models with hundred-thousand-token context windows look like a solution to many problems. I break down what this actually changes in practice - and where the traps are.

A few years ago, the standard context limit of a few thousand tokens was a constant source of architectural trade-offs. You had to split documents into chunks, build retrieval over fragments, and carefully manage what context made it into each request. That was its own engineering problem.

Now models offer windows of hundreds of thousands of tokens. At first glance, this removes the problem. Load the whole document and get to work. But in practice, a long context swaps one set of trade-offs for another, and some of the new ones are less obvious.

I want to break down what this means practically for people who are deciding how to build systems on top of LLMs.

What long context actually solves

There are tasks where a large window works well: analysing a single long document, summarising months of email threads, auditing the codebase of a small project. Wherever everything needed fits into one request and requires no complex navigation, long context removes moving parts.

That is genuinely valuable. Fewer parts means fewer failure points and simpler debugging.

Where long context creates new problems

First - cost. Hosted models are typically billed per input token, so a request with hundreds of thousands of tokens costs roughly in proportion to its length, and every call pays for the full prompt again. When a system calls the model frequently, this becomes a material expense quickly. I have seen projects where the euphoria of "now we can load everything" ended with an unpleasant bill.
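To make the cost point concrete, here is a minimal back-of-the-envelope sketch. It assumes simple linear per-token pricing for input; the price, request sizes, and call volume are hypothetical placeholders, not any provider's real figures, and output tokens are ignored.

```python
def monthly_input_cost(
    tokens_per_request: int,
    requests_per_day: int,
    usd_per_million_input_tokens: float,
    days: int = 30,
) -> float:
    """Estimate monthly input-token spend, assuming linear per-token pricing."""
    tokens_per_month = tokens_per_request * requests_per_day * days
    return tokens_per_month / 1_000_000 * usd_per_million_input_tokens


# Hypothetical price: substitute the current rate for your model.
PRICE_USD_PER_MILLION = 3.00

# Same call volume, two designs: load everything vs. send a few retrieved chunks.
long_context_cost = monthly_input_cost(200_000, 1_000, PRICE_USD_PER_MILLION)
retrieval_cost = monthly_input_cost(4_000, 1_000, PRICE_USD_PER_MILLION)

print(f"long context: ${long_context_cost:,.0f} per month")
print(f"retrieval:    ${retrieval_cost:,.0f} per month")
```

Under these assumed numbers the gap is a factor of fifty, which is simply the ratio of the two prompt sizes. The useful part is not the exact figure but running the calculation before the architecture is fixed.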

Second - answer quality. Research on long-context models (Liu et al., 2023, "Lost in the Middle") describes an effect where models use information at the start and end of the context more reliably than information sitting in the middle. For precise tasks, such as pulling one clause or one figure out of a long document, this matters.
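One way to check whether this effect matters for your own model and data is a simple position probe: place a known fact at different depths in filler text and compare the answers. The sketch below only builds the prompts; the filler, the "needle" sentence, and the function name `build_probe` are all hypothetical, and the model call itself is left out because it depends on your provider.

```python
def build_probe(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` into `filler` at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    paragraphs = filler[:index] + [needle] + filler[index:]
    return "\n\n".join(paragraphs)


# Hypothetical test data: in a real probe the filler would be long, realistic text.
filler = [f"Filler paragraph {i}." for i in range(10)]
needle = "The access code is 7421."

# One prompt per position; send each to the model with the same question
# and compare how often the needle is recovered.
probes = {depth: build_probe(filler, needle, depth) for depth in (0.0, 0.5, 1.0)}
```

If accuracy drops noticeably for the middle position, that is an argument for shorter, more targeted prompts on that task.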

Third - latency. The model has to process the whole prompt before it produces the first token of the answer, so a longer context means a longer wait. For user-facing interfaces where responses are expected in real time, this is noticeable.

Fourth - debugging. When a model gives an unexpected result, finding the cause in a hundred-page prompt is harder than in a short, structured input.

Long context does not replace retrieval

The natural question: if the context is large enough, why build RAG systems with vector search at all?

The answer is that there is a difference between "loading a document" and "working with a corpus". A corpus of thousands of documents still does not fit into any model's window. Retrieval is necessary wherever the scale exceeds a single request.

Retrieval also lets you point precisely to the source of a specific fact - which matters for audit trails and explainability. Long context does not replace that.
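The traceability point can be illustrated with a deliberately tiny retrieval sketch. It uses plain word-count cosine similarity from the standard library rather than a real embedding model or vector store, and the corpus, file names, and query are invented for the example. What it shows is the shape of the result: every fragment comes back with the identifier of the document it came from.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[tuple[str, float]]:
    """Return the k best-matching (source_id, score) pairs, best first."""
    query_vec = Counter(tokenize(query))
    scored = [
        (source_id, cosine(query_vec, Counter(tokenize(text))))
        for source_id, text in corpus.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical corpus: keys are source identifiers that can go into an audit log.
corpus = {
    "contracts/acme.txt": "The Acme contract renews every twelve months unless cancelled.",
    "policies/travel.txt": "Travel expenses require approval from a line manager.",
    "contracts/globex.txt": "The Globex contract has a fixed term of three years.",
}

hits = retrieve("When does the Acme contract renew?", corpus, k=2)
for source_id, score in hits:
    print(f"{score:.2f}  {source_id}")
```

Only the retrieved fragments and their source identifiers go into the prompt, so an answer can be traced back to specific documents. With a whole corpus pasted into one window, that link has to be reconstructed after the fact.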

How to think about this when designing

Before choosing between long context and retrieval, it is worth answering a few questions:

  1. How much data does a single request actually need in a real usage scenario?
  2. How often will the system call the model?
  3. Is traceability required - "where did this answer come from"?
  4. What latency is acceptable for the user?
  5. How does cost change if load increases tenfold?
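The questions above can be turned into a rough pre-design check. This is a sketch under stated assumptions, not a decision procedure: the window size, price, prompt-processing speed, and budget in `Assumptions` are hypothetical defaults that you would replace with measured values, and the names `Workload`, `Assumptions`, and `concerns` are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    tokens_per_request: int   # question 1
    requests_per_day: int     # question 2
    needs_traceability: bool  # question 3
    latency_budget_s: float   # question 4


@dataclass
class Assumptions:
    # All values are hypothetical placeholders; measure or look up your own.
    context_window_tokens: int = 200_000
    usd_per_million_input_tokens: float = 3.00
    prompt_tokens_per_second: float = 5_000.0
    monthly_budget_usd: float = 2_000.0


def concerns(w: Workload, a: Assumptions, load_multiplier: int = 1) -> list[str]:
    """List the reasons a pure long-context design looks risky for this workload."""
    found = []
    if w.tokens_per_request > a.context_window_tokens:
        found.append("window")
    if w.needs_traceability:
        found.append("traceability")
    if w.tokens_per_request / a.prompt_tokens_per_second > w.latency_budget_s:
        found.append("latency")
    monthly_tokens = w.tokens_per_request * w.requests_per_day * load_multiplier * 30
    if monthly_tokens / 1_000_000 * a.usd_per_million_input_tokens > a.monthly_budget_usd:
        found.append("cost")
    return found


workload = Workload(
    tokens_per_request=150_000,
    requests_per_day=50,
    needs_traceability=False,
    latency_budget_s=60.0,
)
today = concerns(workload, Assumptions())
at_tenfold_load = concerns(workload, Assumptions(), load_multiplier=10)  # question 5
print(today, at_tenfold_load)
```

In this invented scenario the design passes every check at current load and fails only on cost once load grows tenfold, which is exactly the kind of result the fifth question is meant to surface early.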

Long context is not a replacement for architectural thinking. It is a new tool with its own characteristics. Using it where it gives an advantage and not using it where it only adds cost and complexity - that is sound design.
