LLM context windows: what the limit means for business applications
Why the context window constraint in language models is not a technical footnote but an architectural decision that determines what can actually be built.
When a conversation turns to integrating a language model into a real product or process, one of the first technical terms that comes up is "context window". Engineers mention it as though it is obvious. Managers nod without fully grasping why it matters.
I want to explain why it matters - and why this specific constraint determines what you can actually build versus what only looks buildable in a demo.
What a context window means in plain language
A language model does not "remember" in the ordinary sense. It processes what you hand it right now. The context window is the maximum volume of text the model sees in a single pass: your question, the conversation history, any documents you attached, and the system instruction.
At the start of 2024, a typical context window for mainstream models is several thousand tokens, with some models reaching one hundred thousand or more. A token is roughly 0.7 English words. Several thousand tokens is about four to six pages of text at around five hundred words per page. One hundred thousand tokens is around seventy thousand words, or roughly 140 pages at the same density.
That sounds large until you start attaching real documents.
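The conversion between tokens and pages is simple enough to sketch in a few lines. Both constants below are rule-of-thumb assumptions (0.7 words per token, 500 words per page), not properties of any particular model, and the function names are illustrative only.

```python
# Rule-of-thumb conversion between tokens, words, and pages.
# Both constants are assumptions; real ratios vary by model and language.
WORDS_PER_TOKEN = 0.7   # rough average for English text
WORDS_PER_PAGE = 500    # a fairly dense page of prose


def tokens_to_pages(tokens: int) -> float:
    """Estimate how many pages of text a given number of tokens holds."""
    return tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE


def pages_to_tokens(pages: float) -> int:
    """Estimate how many tokens a document of the given length needs."""
    return round(pages * WORDS_PER_PAGE / WORDS_PER_TOKEN)


if __name__ == "__main__":
    for window in (4_000, 100_000):
        print(f"{window} tokens ~ {tokens_to_pages(window):.0f} pages")
    print(f"500 pages ~ {pages_to_tokens(500)} tokens")
```

Under these assumptions a 4,000-token window holds about six pages and a 100,000-token window about 140, while a 500-page document needs several hundred thousand tokens.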
Where this limit shows up in practice
The most common scenario I see with clients: "we want a chatbot that answers questions about our documentation." The documentation is five hundred pages of procedures, a product catalogue, and an FAQ base. None of that fits in the context window at once.
This forces three fundamentally different architectural choices:
The first: you select and pass only the relevant fragments in response to each specific question. This is called RAG, retrieval-augmented generation. The model sees not the entire documentation but only what the search found for the query. It works well when the search works well.
The second: you fine-tune the model on your data so the knowledge is baked into the weights rather than passed in context. This is expensive, requires specialists, and the knowledge is frozen at training time, so it goes stale as the documentation changes.
The third: you accept the constraint as a given and design the task so it fits in the window. Often this is the best choice. It just requires an honest look at what the task actually is.
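The first option can be illustrated with a minimal sketch: rank documentation fragments against the question and keep only those that fit a token budget. The word-overlap scoring here is a deliberately naive stand-in for a real search component (which would more likely use embeddings or a search engine), and `estimate_tokens` relies on the rough 0.7 words-per-token assumption. All names are hypothetical.

```python
import re


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 0.7 words per token.
    return round(len(text.split()) / 0.7)


def words(text: str) -> set[str]:
    # Lowercased alphanumeric words, punctuation ignored.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_fragments(question: str, fragments: list[str], budget_tokens: int) -> list[str]:
    """Pick the fragments sharing the most words with the question,
    skipping any fragment that would push the total over the budget."""
    q = words(question)
    ranked = sorted(fragments, key=lambda f: len(q & words(f)), reverse=True)
    chosen, used = [], 0
    for frag in ranked:
        if not q & words(frag):
            break  # no overlap: this and all lower-ranked fragments are irrelevant
        cost = estimate_tokens(frag)
        if used + cost > budget_tokens:
            continue  # would overflow the window; try a smaller fragment
        chosen.append(frag)
        used += cost
    return chosen


docs = [
    "Refunds are issued within 14 days of a return request.",
    "The catalogue lists all pump models and spare parts.",
    "To request a return, fill in the return form in your account.",
]

if __name__ == "__main__":
    for frag in select_fragments("How do I request a return?", docs, budget_tokens=40):
        print(frag)
```

With a budget of 40 tokens both return-related fragments are selected and the catalogue fragment is dropped. Shrinking the budget drops relevant fragments too, which is exactly the point at which answer quality starts to depend on the search rather than on the model.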
Why the demo looks better than production
In a demo, the engineer carefully picks questions that the context prepared in advance answers precisely. Everything runs smoothly. In production, a user asks something that requires information from five different parts of the documentation, plus the history of previous conversations, plus the current status of their order. All of that together does not fit.
I have seen several projects where the team only discovered three months into development that the intended usage scenario physically did not fit the architecture built on demo logic.
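One way to catch this before month three is to write the production scenario down as a token budget. The sketch below is only an illustration: every number is invented, the names are hypothetical, and it assumes, as is common, that the model's answer has to fit in the same window as the input.

```python
from dataclasses import dataclass


@dataclass
class ContextPart:
    name: str
    tokens: int


def remaining_budget(parts: list[ContextPart], window: int, reserve_for_answer: int) -> int:
    """Tokens left after all parts and the answer reserve are counted.
    A negative result means the request does not fit."""
    used = sum(p.tokens for p in parts)
    return window - reserve_for_answer - used


# All figures are illustrative, not measurements.
production_request = [
    ContextPart("system instruction", 400),
    ContextPart("five documentation fragments", 5 * 900),
    ContextPart("conversation history", 2_500),
    ContextPart("order status from an external system", 300),
    ContextPart("user question", 100),
]

remaining = remaining_budget(production_request, window=8_000, reserve_for_answer=1_000)

if __name__ == "__main__":
    if remaining >= 0:
        print(f"fits, {remaining} tokens to spare")
    else:
        print(f"over budget by {-remaining} tokens")
```

With these invented figures the request is 800 tokens over an 8,000-token window. A demo question that needs a single fragment and no history would fit comfortably, which is why the problem stays invisible until real usage arrives.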
How to think about this when evaluating a project
A few questions worth asking before the team starts building:
- How much context does a single working interaction need - how much text must the model see at once?
- Where does that text come from: a knowledge base, conversation history, external systems?
- What happens when not all the needed context fits in the window - what failures does that produce?
- Who is responsible for making sure the search for the right fragments works reliably?
- How will the data volume change in a year - will today's "fits" become "does not fit"?
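The last question can be turned into a rough projection. This is a sketch under a strong assumption, steady compound growth of the context a typical interaction needs; the function name and all numbers are hypothetical.

```python
from typing import Optional


def months_until_overflow(
    current_tokens: int,
    monthly_growth: float,
    budget_tokens: int,
    horizon_months: int = 60,
) -> Optional[int]:
    """Return the first month in which the needed context exceeds the budget.
    Returns 0 if it already does, or None if it stays within budget
    for the whole horizon."""
    if current_tokens > budget_tokens:
        return 0
    tokens = float(current_tokens)
    for month in range(1, horizon_months + 1):
        tokens *= 1 + monthly_growth  # assumed steady compound growth
        if tokens > budget_tokens:
            return month
    return None


if __name__ == "__main__":
    # Illustrative: 5,000 tokens today, growing 5% a month, 8,000-token budget.
    print(months_until_overflow(5_000, 0.05, 8_000))
```

In this example, today's "fits" becomes "does not fit" in month 10, well inside a one-year planning horizon.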
A simple test
Ask the team to describe a typical production scenario: what the user will actually ask, what text the model needs to answer it, and where that text comes from. If that description does not exist, the architectural decision has not been made yet - even if development is already underway.