GPT-4 and a new conversation about quality, multimodality and the cost of errors
The release of GPT-4 changes not only what language models can do but the conversation about when AI is acceptable in production systems. I look at three key shifts.
On March 14, 2023, OpenAI released GPT-4. The model improved across many dimensions compared to its predecessor. But more important than the technical specifications is how the conversation about the applicability of language models in business contexts has changed.
I will not focus on benchmarks - those are well covered in the technical press. I am more interested in three shifts that matter to people making adoption decisions.
First shift: quality has crossed a threshold for new classes of tasks
GPT-4 significantly reduced the rate of obvious errors and hallucinations compared to GPT-3.5. The model still makes mistakes. But for a range of tasks, quality has crossed the threshold at which the tool becomes practically usable.
Concretely, this means two things. Tasks that previously required significant time checking model output now need less supervision. And for tasks with a low cost of error - drafts, summarisation, initial classification - the model has become viable as part of a workflow.
This does not eliminate the need for review. It changes its scope.
Second shift: multimodality opens new scenarios
GPT-4 accepts images as input in addition to text. At launch this capability was limited, but its existence alone is an important signal.
For business, this expands the class of tasks that can be considered for AI automation. Documents with tables, diagrams, product images, scans - all of these can potentially be processed directly, rather than only as plain text.
In practice, I would treat this as something to pilot over a 6-12 month horizon, not something to deploy immediately.
Third shift: cost of errors becomes the central question
As quality improves, so does the temptation to trust the model more. This is a dangerous shift unless it is thought through explicitly.
The question for each specific application now has a sharper form: what is the cost of an error from this model in this context? For a draft email - low, an editor will fix it. For a customer-facing response on behalf of the company - a different story. For a document with legal or financial consequences - another category entirely.
Higher model quality does not remove the need for this analysis. It makes the analysis more nuanced.
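To make the cost-of-error question concrete, here is a rough sketch in Python. Every number in it is an illustrative assumption, not a measurement: the error rate, the cost per error (in arbitrary units) and the share of errors a human reviewer catches are all placeholders you would replace with your own estimates. The names `UseCase` and `expected_cost_per_1000` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    error_rate: float         # assumed fraction of model outputs that are wrong
    cost_per_error: float     # assumed cost of one uncaught error, arbitrary units
    review_catch_rate: float  # assumed fraction of errors caught by human review


def expected_cost_per_1000(uc: UseCase) -> float:
    """Expected cost of uncaught errors per 1,000 model outputs."""
    uncaught_rate = uc.error_rate * (1.0 - uc.review_catch_rate)
    return 1000 * uncaught_rate * uc.cost_per_error


# Illustrative values only: the same error rate in three different contexts.
use_cases = [
    UseCase("draft email", 0.10, 1.0, 0.90),
    UseCase("customer-facing reply", 0.10, 50.0, 0.50),
    UseCase("legal or financial document", 0.10, 5000.0, 0.0),
]

for uc in use_cases:
    cost = expected_cost_per_1000(uc)
    print(f"{uc.name}: {cost:,.0f} cost units per 1,000 outputs")
```

The point of the sketch is that the error rate is identical in all three rows. What differs is the cost of a single error and how much of it review absorbs, and that is where the analysis belongs.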
What to do now
A practical filter for deciding whether to try GPT-4 in a specific workflow:
- What is the scale of a potential error? What happens if the model produces the wrong result one time in ten?
- Is there a human review point in the process - before the model's output has consequences?
- Can you start with tasks where the cost of error is minimal, and expand gradually?
- How will you monitor model quality in production - not just once at test time, but regularly in operation?
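The last question, regular monitoring in operation, can be sketched as a rolling error rate over human-reviewed samples. This is a minimal illustration under stated assumptions: a reviewer marks a sample of outputs as correct or wrong, and the window size, threshold and minimum sample count are placeholder values. The class `RollingErrorMonitor` is hypothetical, not part of any library.

```python
from collections import deque
from typing import Optional


class RollingErrorMonitor:
    """Tracks the error rate over the most recent human-reviewed outputs."""

    def __init__(self, window: int = 200, threshold: float = 0.10,
                 min_samples: int = 20) -> None:
        # Oldest results drop out automatically once the window is full.
        self.results = deque(maxlen=window)
        self.threshold = threshold      # assumed acceptable error rate
        self.min_samples = min_samples  # avoid alerting on too little data

    def record(self, is_error: bool) -> None:
        """Store the reviewer's verdict for one sampled output."""
        self.results.append(is_error)

    def error_rate(self) -> Optional[float]:
        """Error rate in the current window, or None if nothing was reviewed."""
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_attention(self) -> bool:
        """True when there is enough data and the rate exceeds the threshold."""
        if len(self.results) < self.min_samples:
            return False
        return self.error_rate() > self.threshold


# Usage: feed in reviewer verdicts as they arrive.
monitor = RollingErrorMonitor(window=50, threshold=0.10, min_samples=10)
for verdict in [False] * 9 + [True, True]:
    monitor.record(verdict)
print(monitor.error_rate(), monitor.needs_attention())
```

A rolling window is only one possible design choice. It reacts to recent drift instead of averaging it away over the whole history, which matters when the inputs or the model version change after launch.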
GPT-4 is a new bar for what is possible. But the bar for what is possible and the bar for what is acceptable in a specific business are different things.