AI November 21, 2024 3 min read

How to evaluate an AI vendor when buying: a working filter

A set of questions and criteria for a manager choosing an AI solution or contractor - without relying on demos and marketing materials.

The AI solutions market right now has a structural feature: there are many offerings, demos work convincingly, and the gap between a product and a prototype is often invisible without the right questions. This is not a complaint about the market - it is simply where the market is at this stage.

For a manager choosing an AI solution or contractor, this means standard procurement procedures are not enough. You need an additional filter specific to this class of technology.

Why a standard tender process does not work

A standard tender evaluates functional requirements coverage, price, company reputation, and delivery timeline. For AI systems this is insufficient for several reasons.

First: functionality in a demo and functionality on your data are different things. A language model that answers general questions brilliantly may produce unacceptable results in your corporate context.

Second: AI system quality degrades over time if not maintained. Data changes, context changes, models become outdated. This creates an operational burden that does not exist in traditional software.

Third: responsibility for AI system errors is an open question that needs to be addressed in the contract, not assumed by default.

Block 1: Technical maturity assessment

The first group of questions checks whether there is a real product behind the demo.

What data was the system trained or configured on? Is data similar to ours represented in the training set?
How does the system behave on inputs that differ from the demo? Show us queries where it fails.
What is the quality metric - and who measures it? How has it changed over the last 6 months?
How is feedback and improvement structured - is there a retraining or fine-tuning process?

Block 2: Operational readiness

The second group evaluates what happens after launch.

What does the SLA cover - not just uptime, but also how quickly the vendor reacts when answer quality degrades?
How is answer quality monitored in production? Who notices if the system starts producing poor responses?
What does the model update plan look like, and how is it coordinated with us in advance?
What is the rollback process if quality degrades after an update?
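
To make the monitoring and rollback questions concrete, here is a minimal Python sketch of one possible production check. It assumes that someone (your team or the vendor) grades a sample of responses on a 0.0 to 1.0 scale; the QualityMonitor class, the baseline value, and the max_drop tolerance are hypothetical names and numbers for illustration, not part of any vendor's product.

```python
from collections import deque
from statistics import mean


class QualityMonitor:
    """Tracks a rolling window of per-response quality scores (0.0 to 1.0)."""

    def __init__(self, baseline: float, max_drop: float = 0.05, window: int = 200):
        self.baseline = baseline    # quality measured before launch or before an update
        self.max_drop = max_drop    # hypothetical tolerance agreed with the vendor
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> None:
        """Add the grade of one sampled production response."""
        self.scores.append(score)

    def degraded(self) -> bool:
        """True when the rolling average falls below baseline minus tolerance."""
        # Stay silent until the window is full, to avoid alerts on a few samples.
        if len(self.scores) < self.scores.maxlen:
            return False
        return mean(self.scores) < self.baseline - self.max_drop
```

A check like this could trigger the rollback conversation automatically: if degraded() returns True after a model update, the previous version is restored while the cause is investigated. The point to settle with the vendor is who runs such a check and who is notified.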

Block 3: Data and privacy

The third group covers questions about data passed to the system.

Which data from our queries is used for model fine-tuning, and does that happen by default or only with our consent?
Where is our data stored? In which jurisdictions is it processed?
How is data isolation structured between clients in a multi-tenant system?
Does the data processing comply with our regulators' requirements?

Block 4: Accountability and contract

The fourth group covers what is often left for later, but is better discussed before signing.

How does the contract describe accountability for system errors in critical decisions?
What happens to our data upon contract termination?
Are there clauses allowing unilateral changes to terms - especially around API access and pricing?

A practical test

The best way to evaluate an AI vendor is to ask for a pilot on your real data with a measurable result. Not a general demo - a specific task from your actual context.

If a vendor avoids such a pilot or cannot agree on evaluation metrics in advance, that is an informative answer in itself.
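
Agreeing on metrics in advance can be as simple as fixing a labeled test set and a pass threshold before the pilot starts. The Python sketch below illustrates the idea under strong simplifying assumptions: PilotCase, pilot_accuracy, and pilot_passed are hypothetical names, and exact text matching stands in for whatever grading method you and the vendor actually agree on (expert review, a rubric, a task-specific metric).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotCase:
    query: str      # a real query from your own context
    expected: str   # the answer your experts consider correct
    actual: str     # the answer the vendor's system returned


def pilot_accuracy(cases: list[PilotCase]) -> float:
    """Share of cases where the system's answer matches the expected one."""
    if not cases:
        raise ValueError("pilot needs at least one labeled case")
    # Assumption: case-insensitive exact match; replace with the agreed grading method.
    hits = sum(c.actual.strip().lower() == c.expected.strip().lower() for c in cases)
    return hits / len(cases)


def pilot_passed(cases: list[PilotCase], threshold: float) -> bool:
    """Compare the measured result with the threshold fixed before the pilot."""
    return pilot_accuracy(cases) >= threshold
```

The threshold and the test set are written down before the vendor sees the data, so the result cannot be reinterpreted afterwards.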
