AI February 15, 2019 3 min read

From hype to inference cost: why AI must be measured as a production function

How to move from evaluating AI by its demo effect to evaluating it by the real economics of running a model in production.

Most conversations about AI projects revolve around two moments: the excitement of the demonstration, and the tally of what it cost to train the model. What gets overlooked, between those moments and after them, is the set of questions that matters most in real operation.

What does a single model call cost? How many such calls are needed per day? What happens to the economics when the load grows ten times?

These are questions about inference: the work a trained model does while running in production. This is where the real cost of AI lives.

Why inference is undervalued

When a model is trained and shows good results on test sets, it feels like the hard part is behind you. The model exists. It works. All that remains is to "just integrate it."

But training is largely a one-time effort, repeated only when the model is retrained. Inference is a continuous operational process. Every time a user or a system requests a result from the model, a computation happens. That computation requires resources, and those resources cost money and time.

At small scale this is invisible. At industrial scale it can be the main line item in operational costs.

How to think about inference as a production function

Manufacturing has the concept of unit cost: the cost of producing one item. For an AI system the equivalent is the cost of one inference request, and it depends on several factors.

Model size. Larger, more complex architectures often give better quality but require more computation per request. Sometimes a smaller model with slightly lower quality delivers the same business result at a tenth of the cost per request.

Infrastructure. GPUs are significantly faster than CPUs for most AI tasks, but they typically cost more per hour, so the saving only appears when they are kept busy. Cloud GPU prices vary considerably by type, region, and pricing scheme (billed by time or by request).
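As a rough illustration of the two pricing schemes, the sketch below computes the unit cost of a request on an instance billed by the hour and the load at which hourly billing becomes cheaper than per-request billing. The function names and all prices are hypothetical, and the sketch assumes a single instance can actually serve the stated load.

```python
def cost_per_request_hourly(price_per_hour: float, requests_per_hour: float) -> float:
    """Unit cost of one request on an instance that is billed by the hour."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return price_per_hour / requests_per_hour


def break_even_requests_per_hour(price_per_hour: float, price_per_request: float) -> float:
    """Hourly load above which hourly billing is cheaper than per-request billing."""
    if price_per_request <= 0:
        raise ValueError("price_per_request must be positive")
    return price_per_hour / price_per_request


# Hypothetical prices, for illustration only.
gpu_price_per_hour = 2.0        # currency units per instance-hour
api_price_per_request = 0.001   # currency units per request

unit_cost = cost_per_request_hourly(gpu_price_per_hour, requests_per_hour=10_000)
break_even = break_even_requests_per_hour(gpu_price_per_hour, api_price_per_request)
print(f"unit cost: {unit_cost:.5f}  break-even load: {break_even:.0f} requests/hour")
```

Below the break-even load the hourly instance sits partly idle and each request effectively costs more than the per-request price.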

Batching. Processing requests in groups reduces the per-unit cost. But it introduces a delay while the batch accumulates, which is not always acceptable for real-time use.
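The batching trade-off can be sketched with a simple model. It assumes that each batch has a fixed overhead plus a per-item compute time, that requests arrive at a steady rate, and that you pay only for compute seconds. All of these are simplifying assumptions, and the numbers are hypothetical.

```python
from typing import Tuple


def batch_tradeoff(
    batch_size: int,
    arrival_rate_per_s: float,
    fixed_overhead_s: float,
    per_item_s: float,
    price_per_compute_s: float,
) -> Tuple[float, float]:
    """Return (cost per request, worst-case latency in seconds) for one batch size."""
    if batch_size < 1 or arrival_rate_per_s <= 0:
        raise ValueError("batch_size must be >= 1 and arrival_rate_per_s > 0")
    compute_s = fixed_overhead_s + per_item_s * batch_size
    # The fixed overhead is shared by every request in the batch.
    cost_per_request = compute_s * price_per_compute_s / batch_size
    # The first request in a batch waits for the remaining batch_size - 1 arrivals.
    worst_wait_s = (batch_size - 1) / arrival_rate_per_s
    return cost_per_request, worst_wait_s + compute_s


# Hypothetical workload, for illustration only.
for size in (1, 4, 16):
    cost, latency = batch_tradeoff(size, 50.0, 0.02, 0.002, 0.001)
    print(f"batch={size:>2}  cost/request={cost:.7f}  worst latency={latency:.3f}s")
```

Under these assumptions the cost per request falls as the batch grows while the worst-case latency rises, which is the tension the latency requirement has to resolve.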

Caching. Many AI tasks have repeating patterns. When the same inputs recur often, caching results can reduce the number of actual computations substantially, in favourable cases by an order of magnitude, without any loss of quality for the user.
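A minimal sketch of the idea, assuming the model is deterministic and that an identical input should always get an identical answer. `CachedModel` and `fake_model` are hypothetical names, and the stand-in model only upper-cases its input.

```python
from typing import Callable, Dict


class CachedModel:
    """Wraps a predict function and serves repeated inputs from memory."""

    def __init__(self, predict: Callable[[str], str]) -> None:
        self._predict = predict
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, text: str) -> str:
        if text in self._cache:
            self.hits += 1
            return self._cache[text]
        # Only a cache miss triggers a real, billable model call.
        self.misses += 1
        result = self._predict(text)
        self._cache[text] = result
        return result


def fake_model(text: str) -> str:
    """Stand-in for an expensive model call (an assumption for this example)."""
    return text.upper()


model = CachedModel(fake_model)
for request in ["reset password", "pricing", "reset password", "reset password"]:
    model(request)

print(f"model calls: {model.misses}, served from cache: {model.hits}")
```

The cache here is unbounded and never expires, which is fine for a sketch. A production version would need a size limit and an invalidation rule for when the model changes.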

Questions to ask before going to production

Before an AI solution moves from pilot to production, I recommend having clear answers to a few questions.

What is the expected load: how many requests per hour, on average and at peak? Without this, any cost estimate is a guess.

What is the acceptable response latency? This determines whether batching is possible or whether real-time mode is required.

What is the cost of one request at the current architecture? If there is no answer, nobody has counted.

How does the cost change when load grows two, five, ten times? Is it linear, or are there thresholds where the cost jumps, for example when another instance has to be added?

Is there a lighter model alternative that gives acceptable quality? Often something like 80% of a large model's quality is achievable at around 10% of the cost, but the actual ratio has to be measured on your own task.
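The load-growth question above can be made concrete with a small sketch. It assumes fixed-capacity instances billed by the hour, with at least one instance always running. The capacity, price, and baseline load are hypothetical.

```python
import math


def hourly_cost(
    requests_per_hour: float,
    capacity_per_instance: float,
    price_per_instance_hour: float,
) -> float:
    """Total hourly cost when capacity can only be added in whole instances."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # At least one instance stays up even when there is little traffic.
    instances = max(1, math.ceil(requests_per_hour / capacity_per_instance))
    return instances * price_per_instance_hour


# Hypothetical baseline, for illustration only.
base_load = 3_000    # requests per hour today
capacity = 10_000    # requests per hour that one instance can serve
price = 2.0          # currency units per instance-hour

for multiplier in (1, 2, 5, 10):
    load = base_load * multiplier
    total = hourly_cost(load, capacity, price)
    print(f"x{multiplier:<2}  load={load:>6}  cost/hour={total:.2f}  cost/request={total / load:.6f}")
```

Under these assumptions the total cost moves in steps rather than linearly, so the cost per request falls until an instance fills up and jumps when a new one is added.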

Production thinking versus research thinking

Academic and demonstration thinking about AI optimises for model quality. Production thinking optimises for the ratio of quality to cost per unit of useful work.

This does not mean taking the cheapest model. It means understanding what you are paying for and what you are getting. For most business tasks, sufficient accuracy and predictable cost matter more than maximum accuracy with unpredictable spend.
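One way to express "sufficient accuracy at predictable cost" is to fix the accuracy the business needs and then pick the cheapest model that meets it. The sketch below does that. The candidate models, their accuracy figures, and their costs are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float          # measured on your own evaluation set
    cost_per_request: float  # currency units per request


def cheapest_sufficient(candidates: List[Candidate], min_accuracy: float) -> Optional[Candidate]:
    """Return the cheapest candidate meeting the accuracy bar, or None if none does."""
    eligible = [c for c in candidates if c.accuracy >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.cost_per_request)


# Hypothetical candidates, for illustration only.
candidates = [
    Candidate("large", accuracy=0.95, cost_per_request=0.0100),
    Candidate("medium", accuracy=0.92, cost_per_request=0.0020),
    Candidate("small", accuracy=0.85, cost_per_request=0.0005),
]

choice = cheapest_sufficient(candidates, min_accuracy=0.90)
print(choice.name if choice else "no candidate meets the bar")
```

The accuracy threshold is the business decision. Once it is set, the choice of model becomes a cost comparison rather than a contest for the highest score.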

An AI system in production is an asset with operating costs. It needs to be managed the same way you manage any other operational asset.

If this resonated, write to me. I reply personally.