GPT-4o and the normalisation of real-time multimodal UX
What the GPT-4o announcement means for companies designing AI-powered interfaces: voice, vision, and text in a single stream is becoming a standard expectation.
In May 2024, OpenAI announced GPT-4o - a model that processes text, voice, and images in a single unified stream, without switching between separate systems. The demonstration was convincing: real-time conversation with the model, responsiveness to emotional tone in the voice, discussion of what is visible on screen.
I want to talk not about the technical details of the model, but about what this announcement means for any product or service that has a user-facing interface.
What changed in perception
Before GPT-4o, multimodality in AI existed but was composite: one model for transcribing speech, another for generating the response, and a third for synthesising the voice back. This produced noticeable pauses and breaks in the interaction. The user experienced the system as a pipeline, not as a conversational partner.
GPT-4o removes those seams. Voice, image, and text are processed together, and the lag drops from several seconds to around a second or less. This is a different class of experience.
What matters to the user is not how it works internally. What matters is that the interaction starts to feel natural. And once enough people have that feeling, it becomes the new baseline expectation.
How this affects product decisions
Companies building products with voice or visual interaction now have to think not just about "does the feature work" but about "how smooth does it feel."
A three-second pause on a voice query used to be acceptable - the user understood the system was "thinking." Now some users will have experience with GPT-4o at one-second latency, and three seconds will start to feel slow.
This creates pressure on several levels:
Latency becomes a UX metric. The speed of an AI component's response is no longer just a technical characteristic. It affects how the quality of the product is perceived.
Voice stops being exotic. If a voice interface in a business application once looked like innovation, now it is simply one way to interact. The question is not "should we do this" but "when and for which scenarios."
Multimodality changes how scenarios are designed. If the system can see the screen and hear the question at the same time, then support, onboarding, and an operator's workspace all call for a different design.
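To make the latency point above concrete, here is a minimal Python sketch of how a team might time an AI component and summarise the results. It is an illustration under stated assumptions, not a recommended implementation: the function names (timed_call, percentile, summarize) are hypothetical, the one-second budget is borrowed from the figure used in this article rather than from any standard, and the use of percentiles instead of an average is one common choice among several.

```python
import math
import time
from typing import Callable, Dict, List


def timed_call(fn: Callable[[], object]) -> float:
    """Run fn once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of samples; pct is in the range (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: List[float], budget_s: float = 1.0) -> Dict[str, float]:
    """Summarise latencies against a budget (the 1.0 s default is an assumption)."""
    over = sum(1 for s in samples if s > budget_s)
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "over_budget": over / len(samples),
    }


if __name__ == "__main__":
    # Stand-in for a real AI call; replace the lambda with your own request.
    samples = [timed_call(lambda: time.sleep(0.01)) for _ in range(20)]
    print(summarize(samples))
```

In practice the lambda would wrap whatever request the product actually makes, and the summary would be collected from real user sessions rather than a loop. The share of interactions over budget is often the number that maps most directly onto how slow the product feels.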
What does not need to happen immediately
There is no need to urgently rebuild all interfaces. GPT-4o is publicly available, but not all features are available everywhere at once. Production deployment with controlled latency and predictable behaviour is a separate challenge from a demo.
The right response is neither panic nor an immediate relaunch of UX redesign projects. It is to accept that user expectations for AI interfaces continue to rise, and to build that into product planning on a twelve-to-eighteen-month horizon.
Questions for the product team
- Which interaction scenarios in our product would benefit from voice or visual input?
- What is the current latency in the AI components of our product - are we measuring it?
- How have our users' expectations around the speed and naturalness of AI interactions changed over the past six months?
- Is there room in our 2024-2025 product plan for multimodal scenarios?
The normalisation of multimodal UX is happening faster than it seems. It is better to have it in the plan than to be catching up in two years.