LLM operational economics: how to model costs before you scale
Why token costs for language models need to be modelled in advance, and how to avoid an unexpected invoice when load grows.
Many companies that start building products on language models follow the same path. The pilot works, the results are good, and the decision is made to scale. Then the API invoice turns out to be ten times larger than expected, not because the technology is bad but because nobody modelled the operational economics.
This is not an exotic problem. It is a question worth asking before a feature goes into production.
What makes up the cost
LLM APIs are priced by tokens. A token is roughly 4 characters of English text. The price is calculated separately for input tokens (the request, including any system prompt and context) and output tokens (the model's response).
For a single request the cost is negligible. The problem appears at scale: a thousand requests a day costs a thousand times as much, and a hundred thousand requests a day is two orders of magnitude beyond that.
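The per-request arithmetic can be sketched in a few lines. This is a minimal illustration, not a provider's billing formula: the function names are made up for this example, the prices are hypothetical placeholders, and the 4-characters-per-token figure is only the rough English-text heuristic mentioned above. Incoming tokens are called input tokens here and outgoing tokens output tokens.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from character length (English-text heuristic)."""
    return math.ceil(len(text) / chars_per_token)


def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float,
                 output_price_per_1k: float) -> float:
    """Cost of one call, in the same currency as the prices."""
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Hypothetical prices per 1,000 tokens, for illustration only.
cost = request_cost(1500, 500,
                    input_price_per_1k=0.01,
                    output_price_per_1k=0.03)
print(f"{cost:.4f}")  # prints 0.0300
```

For real estimates, the provider's own tokenizer and current price list should replace both the heuristic and the placeholder prices.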
Several factors strongly influence the total:
Context length. If every request carries a long system prompt or conversation history, that is hundreds or thousands of tokens of overhead per call, multiplied by the number of requests.
RAG context. If retrieval-augmented generation is used, each request also carries retrieved document fragments. Several fragments of 500 tokens each add another 1-2 thousand tokens per request.
Response length. With most providers, output tokens cost more per token than input tokens. If the task involves detailed answers, this is an important multiplier.
Model choice. GPT-4 is many times more expensive per token than GPT-3.5. For tasks where GPT-3.5 quality is sufficient, paying for GPT-4 is not justified.
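To see how these four factors combine, here is a small sketch that breaks one call into its parts and prices it under two tiers. Everything in it is an assumption made for illustration: the `CallProfile` class, the token counts, and the two price tiers are hypothetical and do not correspond to any provider's real rates.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    """Token counts for one typical call (all values are illustrative)."""
    system_prompt: int
    history: int
    rag_fragments: int      # number of retrieved fragments
    fragment_tokens: int    # tokens per fragment
    user_query: int
    response: int

    @property
    def input_tokens(self) -> int:
        return (self.system_prompt + self.history
                + self.rag_fragments * self.fragment_tokens
                + self.user_query)

    def overhead_share(self) -> float:
        """Fraction of input tokens that is not the user's own query."""
        return 1 - self.user_query / self.input_tokens


def call_cost(profile: CallProfile, price_in_per_1k: float,
              price_out_per_1k: float) -> float:
    """Cost of one call at the given per-1,000-token prices."""
    return (profile.input_tokens / 1000 * price_in_per_1k
            + profile.response / 1000 * price_out_per_1k)


# Hypothetical (input, output) prices per 1,000 tokens; not real rates.
TIERS = {"larger model": (0.01, 0.03), "smaller model": (0.001, 0.002)}

profile = CallProfile(system_prompt=400, history=600, rag_fragments=3,
                      fragment_tokens=500, user_query=100, response=300)

print(f"input tokens: {profile.input_tokens}, "
      f"overhead share: {profile.overhead_share():.0%}")
for name, (p_in, p_out) in TIERS.items():
    print(f"{name}: {call_cost(profile, p_in, p_out):.4f} per call")
```

In this made-up profile the user's own question is a small fraction of what is billed; most of the input is system prompt, history and retrieved fragments, which is where trimming tends to pay off first.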
A simple model for estimation
Before going into production, it is worth doing a basic calculation.
Take a typical request and add up its tokens: system prompt + context + user query on the input side, plus the expected response on the output side. Price each side at its own rate, then multiply the result by the expected number of requests per day, per week, per month.
This gives an order of magnitude. If the order of magnitude is unpleasantly surprising, optimise before launch, not after.
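The projection itself is plain multiplication. The sketch below assumes a hypothetical per-request cost of 0.03 (in whatever currency the provider bills) and a 30-day month; `projected_spend` is an illustrative helper, not part of any library.

```python
def projected_spend(cost_per_request: float,
                    requests_per_day: int) -> dict[str, float]:
    """Project spend for a day, a week and a 30-day month."""
    daily = cost_per_request * requests_per_day
    return {"day": daily, "week": daily * 7, "month": daily * 30}


# Hypothetical cost per request; replace with the figure from your own
# typical call before drawing conclusions.
COST_PER_REQUEST = 0.03

for load in (1_000, 100_000):
    spend = projected_spend(COST_PER_REQUEST, load)
    print(f"{load:>7} requests/day -> "
          f"{spend['day']:.0f}/day, {spend['month']:.0f}/month")
```

Running the same function with ten times the expected load is a cheap way to check whether the budget would still hold if usage grows faster than planned.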
A few levers that actually work: caching responses to common requests; compressing the system prompt without losing meaning; using a cheaper model where the expensive one's quality is not needed; limiting response length where a short answer is sufficient.
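As an illustration of the first lever, here is a minimal in-memory cache sketch. It assumes that identical questions (after trivial normalisation of case and whitespace) can safely receive the same answer, which is not true for every product. `ResponseCache` and `fake_model` are hypothetical names; `fake_model` merely stands in for a real API call.

```python
import hashlib
from typing import Callable


class ResponseCache:
    """Caches answers keyed by a hash of the normalised prompt."""

    def __init__(self, call_model: Callable[[str], str]):
        self._call_model = call_model
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variants match.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def ask(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1          # served from cache, no tokens billed
            return self._store[key]
        self.misses += 1            # paid call to the model
        answer = self._call_model(prompt)
        self._store[key] = answer
        return answer


def fake_model(prompt: str) -> str:
    """Stand-in for a real, billed model call."""
    return f"answer to: {prompt}"


cache = ResponseCache(fake_model)
cache.ask("What are your opening hours?")
cache.ask("what are your  opening hours?")
print(cache.hits, cache.misses)  # prints 1 1
```

A production cache would also need expiry and a size limit; both are left out here to keep the idea visible.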
About unexpected scenarios
Real usage often differs from expected usage. Users ask longer questions. Conversation context grows. Someone on the team runs an experimental script that makes a thousand calls.
Spend limits at the API key level are a required tool. Not out of paranoia, but as basic operational hygiene. An unexpected invoice for a few thousand dollars is unpleasant. An unexpected invoice for tens of thousands is an incident.
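Provider-side limits are configured in the provider's own console and differ between vendors, so they are not shown here. As a complement, a client-side guard can be sketched as follows. The `SpendGuard` class, the 80% alert threshold and the limit value are all assumptions for illustration, and such a guard only sees the calls that are routed through it.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the hard limit."""


class SpendGuard:
    """Tracks spend against a limit and signals an alert threshold."""

    def __init__(self, limit: float, alert_fraction: float = 0.8):
        self.limit = limit
        self.alert_fraction = alert_fraction
        self.spent = 0.0

    def record(self, cost: float) -> bool:
        """Record a call's cost.

        Returns True once the alert threshold has been reached, and
        raises BudgetExceeded instead of recording if the hard limit
        would be crossed.
        """
        if self.spent + cost > self.limit:
            raise BudgetExceeded(
                f"limit {self.limit} would be exceeded "
                f"(spent {self.spent}, requested {cost})")
        self.spent += cost
        return self.spent >= self.alert_fraction * self.limit


guard = SpendGuard(limit=100.0)      # hypothetical monthly limit
print(guard.record(50.0))            # prints False (below the threshold)
print(guard.record(35.0))            # prints True (85 of 100 spent)
```

In practice the alert branch would notify someone rather than just return a flag, and the hard-limit branch would stop the experimental script before it stops the budget.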
Questions before launch
- What is the expected load: how many requests per day at normal and at peak usage?
- What is the typical total length of a single model call including all context?
- Is the quality of the chosen model worth the price difference for this specific task?
- Is there a mechanism for alerting when a spend threshold is crossed?
- Is there an answer to "what if usage grows tenfold": will the budget hold?
Answering these questions does not require complex modelling. But leaving them unanswered regularly leads to uncomfortable conversations with the CFO.