Multimodal models: what is actually useful for business right now
A practical look at AI models that work with text and images together - without the marketing fog.
Language models that work with not just text but also images, video, and audio are called multimodal. GPT-4 with Vision, Claude with image analysis, Gemini - these all belong to this class. In 2024, they have become accessible and stable enough to talk about real workflow applications, not just experiments.
I am deliberately not writing about "revolution" and "transformation." I am writing about where this already works and what a manager needs to understand before deploying it.
What multimodality means in practice
A multimodal model accepts and processes multiple types of input within a single request. You can pass an image together with a text question, and the model answers taking both into account.
Practical implications: the model can read a screenshot and explain what is in it, interpret a diagram or chart, describe a photograph, compare two visual documents, extract data from a scanned document or a handwritten table.
This is not magic. It is a specific tool with specific limitations.
Where this already works
The most practically mature scenarios are those where a human was previously needed to translate visual information into text or structured form.
Document processing: invoices, acts, bills, PDF reports, scanned contracts. Instead of manual data entry - a request to the model with an image of the document. Accuracy depends on scan quality and format complexity, but for standard documents it is already sufficient for initial processing.
Quality inspection by photo: in manufacturing, construction, logistics - comparing photos against a standard, detecting deviations, describing damage. This does not replace a specialist for complex cases, but it handles the routine volume.
Extracting structured data from non-standard forms: when suppliers send documents in their own format, a multimodal model can extract the needed fields without manual mapping.
Review of visual materials: describing slide content, analyzing marketing materials, interpreting interface screenshots.
Where limitations remain
Accuracy with small text, complex tables, or non-standard fonts remains unstable - especially at low resolution.
Result verification is mandatory where errors have consequences. A multimodal model can confidently misread a number. Automatic document processing without a quality control layer is a risk.
Regulatory constraints: passing document images to external APIs touches on confidentiality and in some cases regulatory requirements. This needs to be verified before deployment, not after.
How to evaluate a potential use case
Before launching a pilot, I recommend answering these questions:
- What specific manual process involving visual data do we want to automate?
- What is the volume - does the savings justify the implementation cost?
- What does an error look like and what does it cost?
- Is passing images to an external service acceptable from a confidentiality standpoint?
- Who will verify output quality and how?
Multimodality is an expansion of the toolset, not a separate revolution. Where a model previously could not work with a document without preprocessing it into text, now it can. This narrows the set of tasks that require special preprocessing and expands what can be automated directly.