How to evaluate an AI vendor before you buy: a working filter
A set of questions and criteria for a manager choosing an AI solution or contractor - without relying on demos and marketing materials.
The AI solutions market currently has a structural feature: offerings are plentiful, demos are convincing, and the gap between a finished product and a prototype is often invisible unless you ask the right questions. This is not a complaint about the market; it is simply the stage the market is at.
For a manager choosing an AI solution or contractor, this means standard procurement procedures are not enough. You need an additional filter specific to this class of technology.
Why a standard tender process does not work
A standard tender evaluates functional requirements coverage, price, company reputation, and delivery timeline. For AI systems this is insufficient for several reasons.
First: functionality in a demo and functionality on your data are different things. A language model that answers general questions brilliantly may produce unacceptable results in your corporate context.
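Second: AI system quality degrades over time if the system is not maintained. Data changes, context changes, models become outdated. This creates an ongoing operational burden that traditional software does not carry in the same form: answers can get worse even when nothing in the code has changed.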
Third: responsibility for AI system errors is an open question that needs to be addressed in the contract, not assumed by default.
Block 1: Technical maturity assessment
The first group of questions checks whether there is a real product behind the demo.
- What data was the system trained or configured on? Is data similar to ours represented in the training set?
- How does the system behave on inputs that differ from the demo? Show us queries where it fails.
- What is the quality metric, and who measures it? How has it changed over the last 6 months?
- How is feedback and improvement structured - is there a retraining or fine-tuning process?
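One way to make the first two questions concrete is to score the system separately on the vendor's demo queries and on a sample of your own. The sketch below is a minimal illustration, not any vendor's API: `EvalCase`, `accuracy_by_source`, the `source` labels, and the `answer_fn` callable that wraps the vendor's system are all hypothetical names, and exact-match scoring is a simplifying assumption that would normally be replaced by a metric agreed with the vendor.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    query: str
    expected: str
    source: str  # hypothetical labels, e.g. "demo" or "own_data"


def accuracy_by_source(cases, answer_fn):
    """Return the share of exactly matching answers for each source label."""
    totals, correct = {}, {}
    for case in cases:
        totals[case.source] = totals.get(case.source, 0) + 1
        got = answer_fn(case.query).strip().lower()
        if got == case.expected.strip().lower():
            correct[case.source] = correct.get(case.source, 0) + 1
    return {src: correct.get(src, 0) / n for src, n in totals.items()}


# Stand-in for the vendor's system; replace with a real call during a pilot.
def stub_system(query):
    canned = {"capital of France?": "Paris"}
    return canned.get(query, "I don't know")


cases = [
    EvalCase("capital of France?", "Paris", "demo"),
    EvalCase("refund window for contract type B?", "30 days", "own_data"),
]
print(accuracy_by_source(cases, stub_system))
```

A large gap between the two scores suggests the demo is not representative of your context.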
Block 2: Operational readiness
The second group evaluates what happens after launch.
- What does the SLA cover: not just uptime, but how quickly the vendor must react when answer quality degrades?
- How is answer quality monitored in production? Who notices if the system starts producing poor responses?
- What does the model update plan look like, and how is it coordinated with us in advance?
- What is the rollback process if quality degrades after an update?
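The monitoring and rollback questions above can be pictured as a rolling quality check against an agreed baseline. This is a minimal sketch under stated assumptions: the `QualityMonitor` class, the `baseline`, `max_drop`, and `window` parameters, and the idea that each response receives a numeric score between 0 and 1 (from human review or an automated check) are illustrative, not a description of how any particular vendor monitors production.

```python
from collections import deque


class QualityMonitor:
    """Tracks recent quality scores and flags a drop below the baseline."""

    def __init__(self, baseline, max_drop=0.05, window=200):
        self.baseline = baseline          # quality level agreed at launch
        self.max_drop = max_drop          # tolerated drop before acting
        self.scores = deque(maxlen=window)  # keeps only the latest scores

    def record(self, score):
        self.scores.append(score)

    def rolling_mean(self):
        if not self.scores:
            return None
        return sum(self.scores) / len(self.scores)

    def should_roll_back(self):
        # Decide only once the window is full, to avoid reacting to noise.
        if len(self.scores) < self.scores.maxlen:
            return False
        return self.rolling_mean() < self.baseline - self.max_drop


monitor = QualityMonitor(baseline=0.90, max_drop=0.05, window=4)
for score in [0.9, 0.9, 0.6, 0.5, 0.5, 0.6]:
    monitor.record(score)
print(monitor.rolling_mean(), monitor.should_roll_back())
```

Whoever owns this check, and what happens when it fires, is exactly what the SLA and rollback answers should spell out.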
Block 3: Data and privacy
The third group covers questions about data passed to the system.
- Is any data from our queries used for model fine-tuning? If so, does that happen by default or only with our consent?
- Where is our data stored? In which jurisdictions is it processed?
- How is data isolation structured between clients in a multi-tenant system?
- Does the data processing comply with our regulators' requirements?
Block 4: Accountability and contract
The fourth group covers what is often left for later, but is better discussed before signing.
- How does the contract describe accountability for system errors in critical decisions?
- What happens to our data upon contract termination?
- Are there clauses allowing unilateral changes to terms, especially around API access and pricing?
A practical test
The best way to evaluate an AI vendor is to ask for a pilot on your real data with a measurable result. Not a general demo - a specific task from your actual context.
If a vendor avoids such a pilot or cannot agree on evaluation metrics in advance, that is an informative answer in itself.
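Agreeing on metrics in advance can be as simple as writing the thresholds down before the pilot starts and checking the measured results against them afterwards. The sketch below assumes a hypothetical `evaluate_pilot` helper and made-up metric names and thresholds (`accuracy`, `p95_latency_s`); the real criteria would come from your own task.

```python
def evaluate_pilot(criteria, measured):
    """Check measured pilot results against thresholds fixed in advance.

    criteria: {metric_name: (direction, threshold)}, where direction is
              "min" (value must be at least the threshold) or
              "max" (value must be at most the threshold).
    measured: {metric_name: value}
    Returns (overall_pass, per_metric_results).
    """
    results = {}
    for name, (direction, threshold) in criteria.items():
        if direction not in ("min", "max"):
            raise ValueError(f"unknown direction for {name}: {direction}")
        value = measured.get(name)
        if value is None:
            results[name] = False  # an unmeasured metric counts as a failure
        elif direction == "min":
            results[name] = value >= threshold
        else:
            results[name] = value <= threshold
    return all(results.values()), results


# Illustrative thresholds only; agree on real ones before the pilot starts.
criteria = {"accuracy": ("min", 0.90), "p95_latency_s": ("max", 2.0)}
measured = {"accuracy": 0.93, "p95_latency_s": 2.4}
passed, details = evaluate_pilot(criteria, measured)
print(passed, details)
```

Fixing the thresholds before seeing results keeps the pilot from turning into another demo.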