Use Cases · May 25, 2026 · 11 min read

The AI vendor evaluation framework: 15-criterion scorecard

Most AI RFPs miss the things that matter in production. A scorecard you can copy, the red flags to watch, and reference-call questions that surface the truth.

By Xwits Editorial · Reviewed by Deep Parmar, Founder · Last reviewed May 25, 2026

TL;DR

Most AI RFPs miss the things that matter in production.
15 criteria across four areas: capability, operations, contract, and team. Score each 1-5.
Five red flags in any demo. Reference-call questions that surface the truth.
Take this on your next AI vendor call. It will earn its place in 30 minutes.

Quick answer

How do I evaluate an AI vendor?

Score every AI vendor on 15 criteria across capability (does it work?), operations (will it stay working?), contract (can I leave?), and team (who will I actually be working with?). The total tells you who to shortlist. The pattern of low scores tells you where each vendor is weak. Most teams skip 8-10 of these criteria and find out the hard way after signing.

Buying AI software is harder than buying regular SaaS. The demo always looks impressive. The production reality varies wildly. Below is the 15-criterion scorecard we use internally — and what we would want every customer evaluating Xwits to use against us.

The four areas

Capability — does the AI actually work for your case?

5 criteria, 25 points.

Domain fit. Has the vendor shipped this AI in your specific industry / use case before? "We can build this" is a different answer from "we already have this running."
Eval transparency. Can the vendor show you their internal eval set with specific accuracy numbers for tasks similar to yours? Vague "high accuracy" answers are a flag.
Custom data handling. How does the AI handle your specific data — fine-tune, RAG, prompt? Is this configurable, or fixed in the platform?
Edge case behaviour. What does the AI do when it is uncertain? Refuse? Guess? Escalate to a human? Ask the vendor to show you cases where it has gone wrong in production.
Multi-language / multi-region. If you operate across languages or regions, does the AI handle them, or does it support only English and a single region?
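
To make the edge-case question concrete, here is a minimal Python sketch of what an explicit uncertainty policy could look like. It assumes the vendor's system exposes a confidence score between 0 and 1. The thresholds, the `Action` names, and the `route` function are hypothetical illustrations, not any vendor's actual API. A vendor with a real policy should be able to describe something similar and tell you where its thresholds sit.

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ESCALATE = "escalate_to_human"
    REFUSE = "refuse"


def route(confidence: float, answer_at: float = 0.8, escalate_at: float = 0.5) -> Action:
    """Pick an action from a confidence score.

    The thresholds are illustrative assumptions; a real system would
    tune them against an eval set.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= answer_at:
        return Action.ANSWER      # confident enough to respond directly
    if confidence >= escalate_at:
        return Action.ESCALATE    # uncertain: hand off to a human reviewer
    return Action.REFUSE          # too uncertain to guess


print(route(0.9).value)   # answer
print(route(0.6).value)   # escalate_to_human
print(route(0.2).value)   # refuse
```

The point of the sketch is that "guess" never appears as an action. If a vendor cannot say which branch their system takes for a low-confidence case, score this criterion low.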

Operations — will it stay working in your environment?

4 criteria, 20 points.

Observability. Can you see every AI action — input, output, model version, cost, human reviewer? Without this, debugging is impossible. See our production-AI properties.
Guardrails. What inputs does the AI refuse? What outputs are blocked? Are these configurable per tenant?
Cost ceilings. Can you cap AI spend at request, tenant, feature, and global levels? Or is spend uncapped until you discover the overrun?
Off-switch. Can you disable an AI feature per tenant / per feature without breaking the rest of the product? Test this in the demo.
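
The cost-ceiling and off-switch criteria can be sketched as a single check that runs before every AI call. This is a minimal Python illustration under stated assumptions: `SpendGuard`, its field names, and the dollar figures are hypothetical, and a production platform would persist spend and reset it per billing period. It shows the four cap levels named above plus a per-tenant, per-feature off-switch.

```python
from dataclasses import dataclass, field


@dataclass
class SpendGuard:
    # Ceilings in USD. All values passed in are illustrative assumptions.
    request_cap: float
    tenant_cap: float
    feature_cap: float
    global_cap: float
    disabled: set = field(default_factory=set)        # (tenant, feature) pairs switched off
    tenant_spend: dict = field(default_factory=dict)
    feature_spend: dict = field(default_factory=dict)
    global_spend: float = 0.0

    def disable(self, tenant: str, feature: str) -> None:
        """Off-switch: turn one feature off for one tenant only."""
        self.disabled.add((tenant, feature))

    def authorize(self, tenant: str, feature: str, est_cost: float) -> tuple[bool, str]:
        """Return (allowed, reason). Spend is recorded only when allowed."""
        if (tenant, feature) in self.disabled:
            return False, "off-switch"
        if est_cost > self.request_cap:
            return False, "request cap"
        if self.tenant_spend.get(tenant, 0.0) + est_cost > self.tenant_cap:
            return False, "tenant cap"
        if self.feature_spend.get(feature, 0.0) + est_cost > self.feature_cap:
            return False, "feature cap"
        if self.global_spend + est_cost > self.global_cap:
            return False, "global cap"
        self.tenant_spend[tenant] = self.tenant_spend.get(tenant, 0.0) + est_cost
        self.feature_spend[feature] = self.feature_spend.get(feature, 0.0) + est_cost
        self.global_spend += est_cost
        return True, "ok"


guard = SpendGuard(request_cap=1.0, tenant_cap=2.0, feature_cap=10.0, global_cap=100.0)
print(guard.authorize("acme", "summarise", 0.75))  # (True, 'ok')
print(guard.authorize("acme", "summarise", 0.75))  # (True, 'ok')
print(guard.authorize("acme", "summarise", 0.75))  # (False, 'tenant cap')
guard.disable("beta", "summarise")
print(guard.authorize("beta", "summarise", 0.25))  # (False, 'off-switch')
print(guard.authorize("beta", "search", 0.25))     # (True, 'ok')
```

In a demo, the equivalent test is to ask the vendor to switch one feature off for one tenant and confirm that other features and tenants keep working.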

Contract — can you leave?

3 criteria, 15 points.

Data export. What format? How fast? Does it include the embeddings, the fine-tunes, the model weights?
IP ownership. Who owns the code, the fine-tuned model, the embeddings created from your data?
Exit cost. What is the realistic 6-month cost of migrating to an alternative? Include retraining, data migration, integration work.

Team — who will you actually work with?

3 criteria, 15 points.

Engineer access. When something breaks, do you talk to an engineer or a support agent reading from a script? Test this — file a hard question and see who responds.
Roadmap influence. What weight do customer requests have on the roadmap? Founder-led companies usually do this well; enterprise SaaS usually does not.
Time to first reply. Send one question through the sales channel and one through the support channel. Measure the actual time to a reply on each. AI vendors who reply in 24 hours behave differently in production than those who reply in a week.

The scorecard

Each criterion: 1 = bad / 5 = excellent.

60-75 (out of 75): Strong shortlist candidate. Pursue.
45-59: Probably workable for non-critical use cases. Negotiate the contract aggressively on the weak criteria.
30-44: Risky. Only pursue if there are no alternatives or if you can accept the gaps in writing.
Below 30: Skip. The hidden cost of fixing weak fundamentals exceeds the visible cost of the AI itself.
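
If you would rather keep the scorecard in a script than on paper, the following Python sketch totals the 15 criteria, maps the total to the bands above, and lists the weak criteria so the pattern stays visible. The criterion keys and the `weak_threshold` of 2 are assumptions made for this example. The areas, the 1-5 scale, and the band cut-offs come from the text.

```python
# Criteria per area, named after the four areas described earlier.
AREAS = {
    "capability": ["domain_fit", "eval_transparency", "custom_data_handling",
                   "edge_case_behaviour", "multi_language_region"],
    "operations": ["observability", "guardrails", "cost_ceilings", "off_switch"],
    "contract": ["data_export", "ip_ownership", "exit_cost"],
    "team": ["engineer_access", "roadmap_influence", "time_to_first_reply"],
}


def band(total: int) -> str:
    """Map a total out of 75 to the bands in the scorecard."""
    if total >= 60:
        return "Strong shortlist candidate"
    if total >= 45:
        return "Probably workable for non-critical use cases"
    if total >= 30:
        return "Risky"
    return "Skip"


def score_vendor(scores: dict, weak_threshold: int = 2) -> dict:
    """Total the scores and flag weak criteria.

    weak_threshold is an assumption: any criterion scored at or below
    it is reported as weak.
    """
    expected = [c for criteria in AREAS.values() for c in criteria]
    missing = [c for c in expected if c not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for c in expected:
        if not 1 <= scores[c] <= 5:
            raise ValueError(f"{c} must be scored 1-5")
    area_totals = {area: sum(scores[c] for c in criteria)
                   for area, criteria in AREAS.items()}
    total = sum(area_totals.values())
    weak = [c for c in expected if scores[c] <= weak_threshold]
    return {"total": total, "band": band(total),
            "area_totals": area_totals, "weak": weak}


# Hypothetical vendor: solid everywhere except cost ceilings and exit cost.
example = {c: 4 for criteria in AREAS.values() for c in criteria}
example["cost_ceilings"] = 2
example["exit_cost"] = 1
result = score_vendor(example)
print(result["total"], result["band"], result["weak"])
# 55 Probably workable for non-critical use cases ['cost_ceilings', 'exit_cost']
```

The `weak` list matters as much as the total: the hypothetical vendor above lands in a workable band, but both weak criteria are the ones that hurt most once you are locked in.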

Five red flags in any demo

1. "Trust me, the AI works"

No eval numbers, no specific accuracy figures. Production reality will be a coin flip.

2. "We will customise this for you"

Translation: the platform is rigid. Custom work means a six-month consulting engagement on top of the platform price.

3. "Our customers love us" (no specific customers named)

Reference customers should be findable. Specific named customers are normal. Vague "Fortune 500 clients" is not.

4. "Pricing depends on your usage" with no specific numbers

Every credible AI vendor has at least a per-token or per-task cost they will share. Avoiding the question means the price is high enough to be embarrassing.

5. The salesperson cannot answer a technical question

Ask: "How does your AI handle prompt injection?" If the answer is hand-waving, the platform probably does not handle it well either.

The reference-call questions that work

When the vendor offers a reference customer, ask:

1. What is the AI actually doing for you in production?
2. What does the AI do when it gets a case it cannot handle?
3. How many hours per week does your team spend on the AI itself (configuration, review, edge cases)?
4. When did the vendor break a commitment? How was it handled?
5. If you had to make this decision again, would you?

Question 4 is the most revealing. Every vendor breaks a commitment occasionally. How they handle it tells you everything.

How Xwits scores against this

We use this exact scorecard internally to pressure-test our own product. Here is what we would expect a partner to score us at today (honest assessment):

Capability: 4-4.5 average. The XWorks engine is mature; the vertical apps are still adding features.
Operations: 4.5 average. Observability, guardrails, cost ceilings, off-switch all ship in the platform.
Contract: 5 average. You own the code in custom builds. Data export is part of every contract.
Team: 5 average. Founder-led. Engineer-direct. 24-hour reply is the platform default.

Total: ~68-70 out of 75 (capability 20-22.5, operations 18, contract 15, team 15, from the averages above). Where we are weak, namely capability on the newer verticals, we say so. We do not pretend to be at 75.

What this means for you

Print the 15 criteria. Score every vendor before signing.
Bring the reference-call questions to every reference call.
Trust the pattern of weak scores, not just the total.
Read our build-vs-buy framework before you start vendor evaluation — make sure you should buy at all.
For honest pricing context, see "How much does custom AI cost".

Evaluating Xwits? Run this scorecard on us. Book a 30-minute call and we will answer every criterion honestly.
