m@ksim.pro
AI 4 min read

Controlling LLM costs in production: token budgets and request design

Practical ways to reduce the API bill for LLM-powered features without degrading quality - focused on what actually moves the needle in real deployments.

The AI feature looked cheap in the prototype. A few thousand tokens per request, a handful of daily users on the team, barely a blip on the invoice. Then the feature went live, usage grew, and suddenly the line item was noticeable. Then it was uncomfortable. This is a common trajectory, and the fix is almost never "use a cheaper model" - at least not as a first step.

Token cost is a design problem before it is a model selection problem.

Where the tokens actually go

The first step is measurement. Most teams know their total monthly spend but do not know which requests account for what share. In my experience, the distribution is almost always skewed: a small number of request types consume a disproportionate share of tokens.

Instrument your calls to log: prompt token count, completion token count, which feature or flow triggered the call, and whether the response was actually used. "Response was actually used" matters more than you might expect - I regularly find cases where a model is called on every page load and the result is displayed only when the user clicks something that most of them never click.
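
As a rough illustration of that logging, here is a minimal sketch in Python. The `CallRecord` fields and the `summarize_by_feature` helper are names invented for this example, not part of any provider SDK; the token counts are assumed to come from whatever usage data your API returns.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    feature: str            # which feature or flow triggered the call
    prompt_tokens: int
    completion_tokens: int
    response_used: bool     # was the result ever shown to or acted on by a user?


def summarize_by_feature(records):
    """Aggregate calls, total tokens, unused tokens and share of spend per feature."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0, "unused_tokens": 0})
    for r in records:
        tokens = r.prompt_tokens + r.completion_tokens
        row = totals[r.feature]
        row["calls"] += 1
        row["tokens"] += tokens
        if not r.response_used:
            row["unused_tokens"] += tokens

    # Avoid division by zero when there are no records at all.
    grand_total = sum(row["tokens"] for row in totals.values()) or 1
    for row in totals.values():
        row["share_of_total"] = row["tokens"] / grand_total
    return dict(totals)
```

Sorting the result by `share_of_total` usually makes the skew visible at a glance, and a high `unused_tokens` figure points at calls that could be made lazily instead.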

System prompt inflation

The most common source of avoidable token cost is a system prompt that has grown without discipline. Every clarification, every edge case, every "and also make sure to..." added over six months of iteration is still there, being sent with every request.

Audit your system prompts as text. Remove instructions that are covered by other instructions. Remove context that is only relevant for edge cases you can handle with post-processing instead. Remove examples that were added to fix a specific case and are no longer needed.

A well-curated 500-token system prompt often performs as well as a 2,000-token one - and costs 75% less in the fixed portion of every request.
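
To put a number on that fixed portion, a back-of-the-envelope calculation is enough. The request volume and the per-token price below are made-up placeholders for illustration; substitute your own figures.

```python
def monthly_fixed_prompt_cost(system_prompt_tokens, requests_per_month,
                              price_per_million_input_tokens):
    """Cost of sending the system prompt alone, ignoring user input and output."""
    total_tokens = system_prompt_tokens * requests_per_month
    return total_tokens * price_per_million_input_tokens / 1_000_000


# Hypothetical workload: 1M requests per month at 3.00 per million input tokens.
before = monthly_fixed_prompt_cost(2_000, 1_000_000, 3.0)
after = monthly_fixed_prompt_cost(500, 1_000_000, 3.0)
saving_ratio = 1 - after / before   # fraction of the fixed cost removed
```

The ratio depends only on the two prompt lengths, so the 75% holds regardless of the price or volume you plug in.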

Request structure and caching

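If your requests have a stable prefix - a system prompt, a large knowledge base section, a fixed set of examples - check whether the API you are using supports prompt caching. Anthropic, OpenAI, and Google all offer prefix caching in some form, typically billing cached prefix tokens at a reduced rate. Caching only applies when the prefix is identical from the first token onward, so keep the stable parts at the front of the request and the per-request parts at the end. For workloads where the same prefix appears in thousands of daily requests, the savings are significant.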
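
A minimal sketch of cache-friendly request assembly, assuming a single-string prompt for simplicity. `build_prompt` and `shared_prefix_chars` are hypothetical helpers, and the sketch deliberately makes no provider API calls; how caching is enabled differs per provider.

```python
import os


def build_prompt(system_prompt, knowledge_base, examples, user_context, user_question):
    """Put everything that is identical across requests first, variable parts last."""
    stable_prefix = "\n\n".join([system_prompt, knowledge_base, *examples])
    variable_suffix = "\n\n".join([user_context, user_question])
    return stable_prefix + "\n\n" + variable_suffix


def shared_prefix_chars(prompt_a, prompt_b):
    """Length of the common leading text of two prompts, in characters."""
    return len(os.path.commonprefix([prompt_a, prompt_b]))
```

Running `shared_prefix_chars` over pairs of real production prompts is a cheap way to check whether something variable, such as a timestamp or a user name, has crept into the front of the request and is shortening the cacheable prefix.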

For RAG workloads: review how many retrieved chunks you are sending. More context is not always better. A retrieval pipeline that sends the top 10 chunks when the top 3 would answer the question is costing you more and potentially degrading quality through noise.
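
One way to cap retrieved context is to combine a chunk limit, a relevance floor, and a token budget. The sketch below assumes your retriever already returns a relevance score and a token count per chunk; the thresholds are arbitrary defaults for illustration, not recommendations.

```python
def select_chunks(scored_chunks, max_chunks=3, min_score=0.5, token_budget=1500):
    """Pick the most relevant chunks that fit the limits.

    scored_chunks: iterable of (score, token_count, text) tuples in any order.
    """
    selected = []
    used_tokens = 0
    ranked = sorted(scored_chunks, key=lambda chunk: chunk[0], reverse=True)
    for score, token_count, text in ranked:
        # Stop once we have enough chunks or relevance drops below the floor.
        if len(selected) >= max_chunks or score < min_score:
            break
        # Skip a chunk that would blow the budget, but keep trying smaller ones.
        if used_tokens + token_count > token_budget:
            continue
        selected.append(text)
        used_tokens += token_count
    return selected
```

Whether three chunks are enough is an empirical question: compare answer quality at different `max_chunks` values on a sample of real queries before lowering the limit.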

Model routing

Not every request needs your best and most expensive model. A common pattern:

  • Use a large, capable model for complex tasks: drafting, synthesis, multi-step reasoning.
  • Use a smaller, cheaper model for classification, intent detection, formatting, and simple extraction.
  • Cache responses for identical or near-identical inputs where freshness is not critical.

The challenge is building the routing logic. A simple classifier that decides which tier a request goes to can pay for itself quickly if a large share of your traffic consists of simple tasks. The key question: what are the requests where quality degradation is acceptable, and what are the ones where it is not?
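
The pattern above can be sketched with a rule-based router and a small response cache. The model names are placeholders, the task labels are assumed to come from your own classifier or from the calling code, and `call_fn` stands in for whatever function actually calls your provider.

```python
import hashlib

SMALL_MODEL = "small-model"   # placeholder name, not a real model identifier
LARGE_MODEL = "large-model"   # placeholder name, not a real model identifier

SIMPLE_TASKS = {"classification", "intent_detection", "formatting", "extraction"}


def choose_model(task_type, quality_critical=False):
    """Route simple tasks to the cheap tier unless quality loss is unacceptable."""
    if quality_critical:
        return LARGE_MODEL
    return SMALL_MODEL if task_type in SIMPLE_TASKS else LARGE_MODEL


class ResponseCache:
    """In-memory cache keyed on the model plus a normalized form of the prompt."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model, prompt):
        # Collapse whitespace and case so near-identical inputs share an entry.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, call_fn):
        key = self._key(model, prompt)
        if key not in self._store:
            self._store[key] = call_fn(model, prompt)
        return self._store[key]
```

A production cache would also need expiry and a size limit; this version only shows the keying idea. Unknown task types fall through to the large model on purpose, so a routing mistake costs money rather than quality.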

Output length control

Completion tokens are usually priced higher than prompt tokens, often several times higher per token at the major providers. Long, verbose responses to requests that only need a short answer are an avoidable cost that adds up quickly.

Explicit length instructions in the system prompt help: "respond in two to three sentences unless the complexity of the question requires more." JSON output schemas with constrained fields help more - they prevent the model from wrapping the actual answer in narrative it does not need.
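
As an illustration, a constrained output schema might look like the sketch below. The field names and limits are invented for this example. Whether a provider enforces such constraints during generation varies, so the sketch also checks the response after the fact using only the standard library.

```python
import json

# Hypothetical schema: a short answer plus a fixed-vocabulary confidence label.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string", "maxLength": 400},
        "confidence": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["answer", "confidence"],
    "additionalProperties": False,
}


def parse_constrained_answer(raw_text, schema=ANSWER_SCHEMA):
    """Parse a model response and hand-check it against the schema's limits."""
    data = json.loads(raw_text)  # raises a ValueError subclass on invalid JSON
    props = schema["properties"]
    if not isinstance(data, dict) or set(data) != set(schema["required"]):
        raise ValueError("missing or unexpected fields")
    if (not isinstance(data["answer"], str)
            or len(data["answer"]) > props["answer"]["maxLength"]):
        raise ValueError("answer is not a string or is too long")
    if data["confidence"] not in props["confidence"]["enum"]:
        raise ValueError("confidence is not an allowed value")
    return data
```

Pairing a schema like this with a hard cap on output tokens in the request gives a second line of defense if the model ignores the instructions.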

The monitoring requirement

None of this works without ongoing monitoring. Token costs shift as usage patterns change. A new user segment may interact with your feature in a way that generates much longer completions than you designed for. A prompt change to improve quality may have doubled token usage without anyone noticing.

Set budget alerts. Review the distribution of request sizes monthly. Treat token efficiency as a quality metric alongside accuracy and latency - not because every cent matters, but because runaway costs are usually a symptom of design drift worth catching early.
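
A budget alert does not need to be sophisticated. The sketch below assumes you can already compute month-to-date spend from your logs or billing export; the status labels and the 80% warning threshold are arbitrary choices for illustration.

```python
def check_budget(spend_so_far, monthly_budget, day_of_month, days_in_month,
                 warn_ratio=0.8):
    """Classify month-to-date spend using a simple linear projection."""
    projected = spend_so_far / day_of_month * days_in_month
    if spend_so_far >= monthly_budget:
        return "over_budget"
    if projected >= monthly_budget:
        return "projected_over"
    if spend_so_far >= warn_ratio * monthly_budget:
        return "warning"
    return "ok"
```

The linear projection is crude, but it flags a doubled prompt or a new heavy user segment in the first week of the month instead of on the invoice.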
