Model accuracy on a static test set is a vanity number. Production success is decided by a set of operational, business, and human metrics that most teams underweight or ignore. Below are the 30 metrics we instrument on every serious engagement, grouped into engineering, quality and safety, business, and human and process categories, with what each one tells you and, where one applies, the threshold that separates a healthy system from one heading for trouble.
Engineering metrics (the system itself)
1. Accuracy / F1 / AUC (or task-specific)
Standard model-quality measures on a held-out test set. Necessary but never sufficient.
2. Calibration
Does a model that says “80% confident” actually get it right 80% of the time? Miscalibrated models produce confident wrong answers.
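One common way to put a number on calibration is expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket's average confidence with its observed accuracy. The sketch below is a minimal version under assumptions of our own (equal-width bins, a hypothetical function name, confidences in the range 0 to 1); it is not the only way to measure calibration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answer
    correct:     booleans, True where the prediction was right
    n_bins:      number of equal-width confidence bins (an assumption)
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(mean_conf - accuracy)
    return ece
```

A model that says 0.8 and is right 8 times out of 10 scores close to zero; one that says 0.9 and is right half the time scores about 0.4.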
3. Distribution shift score
Statistical distance between training data and production data, recomputed daily. When this score moves more than 15% from its baseline, treat the model as no longer operating in its training distribution.
4. Feature drift, per feature
Drift on each input feature, with thresholds. Most outages start with one feature drifting silently.
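Items 3 and 4 both come down to comparing a training sample with a production sample. The sketch below uses the population stability index (PSI) as the distance, applied per feature. PSI is our choice for illustration (Kolmogorov-Smirnov or Jensen-Shannon distances are equally reasonable), and the function names, bin count, and 0.2 alert threshold are assumptions, not values taken from the list above.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between two numeric samples.

    Bins are equal-width over the range of `expected` (the training sample);
    production values outside that range are clamped into the edge bins.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # constant feature: avoid zero-width bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(idx, n_bins - 1))] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drifted_features(train, prod, threshold=0.2):
    """Names of features whose PSI exceeds the (illustrative) threshold.

    train, prod: dicts mapping feature name -> list of numeric values
    """
    return sorted(name for name in train if psi(train[name], prod[name]) > threshold)
```

Running `drifted_features` on a daily schedule gives a per-feature alert list, which is where a single silently drifting feature would show up first.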
5. Prediction drift
The output distribution itself can shift: the model may still score well on test data while its predictions on production traffic move in ways that signal an upstream change.
6. Latency p50 / p95 / p99
Median is reassuring; p99 is where customers feel it. The gap between p50 and p99 is the first thing a senior engineer should ask about.
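As a rough illustration, the three percentiles and the p50-to-p99 gap can be computed from raw request latencies with the nearest-rank method. The function names and the ratio used to express the gap are our own assumptions; a production system would more likely read these from its metrics backend.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """p50 / p95 / p99 plus the tail ratio p99 / p50."""
    p50 = percentile(latencies_ms, 50)
    p95 = percentile(latencies_ms, 95)
    p99 = percentile(latencies_ms, 99)
    return {
        "p50": p50,
        "p95": p95,
        "p99": p99,
        # A large ratio means the median hides a slow tail.
        "p99_over_p50": p99 / p50 if p50 else float("inf"),
    }
```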
7. Throughput at peak
Real-world load tends to be spiky. Production throughput at the 99th-percentile minute of the day, not the average, is what to plan for.
8. Cost per inference
For LLM systems, tokens per request × token cost. Drives most cost-related incidents.
9. Cost per business outcome
Divide infrastructure cost by the unit of value (per resolved ticket, per loan decision, per converted lead). The metric that maps directly to ROI.
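The arithmetic behind items 8 and 9 is simple enough to sketch. The version below assumes input and output tokens are priced separately per 1,000 tokens, which is a common pricing shape but an assumption here; the function names and any prices passed in are hypothetical.

```python
def cost_per_inference(input_tokens, output_tokens,
                       input_price_per_1k, output_price_per_1k):
    """Token count times token price, with separate input and output rates."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


def cost_per_outcome(total_infrastructure_cost, outcomes):
    """Infrastructure cost divided by units of value (tickets, decisions, leads)."""
    if outcomes <= 0:
        raise ValueError("need at least one business outcome to divide by")
    return total_infrastructure_cost / outcomes
```

For example, if a month of serving costs 1,750 in some currency and the system resolved 500 tickets, `cost_per_outcome(1750.0, 500)` gives 3.5 per resolved ticket, which is the figure to set against the value of a resolved ticket.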
10. Cache hit rate (for RAG / LLM)
Embedding cache, prompt cache, retrieval cache. Hit rates > 30% materially reduce cost and latency.
Quality and safety metrics (LLM- and RAG-specific)
11. Groundedness
For RAG systems: does every claim in the response trace back to retrieved context? Target ≥ 0.90.
12. Context precision
Of retrieved chunks, what fraction were actually relevant. Below 0.7 is a retrieval-pipeline problem.
13. Context recall
Of the chunks the model needed, what fraction were retrieved. Below 0.7 is a chunking or coverage problem.
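Items 12 and 13 are the retrieval versions of precision and recall. A minimal set-based sketch follows, assuming each chunk has a stable ID and that a labeled set of relevant chunk IDs exists for the query. The function name is hypothetical, and some evaluation toolkits use rank-aware variants rather than this plain set overlap.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Set-based context precision and recall for one query.

    retrieved_ids: chunk IDs the retriever returned
    relevant_ids:  chunk IDs a labeler judged necessary to answer the query
    """
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

With four retrieved chunks of which two are relevant, and three relevant chunks in total, precision is 0.5 and recall is about 0.67, so both fall below the 0.7 bar above.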
14. Hallucination rate
Production-sample analysis estimating the fraction of responses with unsupported claims. Target < 2%.
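Because the estimate comes from a sample, the sample size decides whether a sub-2% claim is supportable. The sketch below puts a Wilson score interval around the observed rate; the choice of the Wilson interval, the 95% level (z = 1.96), and the function name are our assumptions rather than part of the metric definition.

```python
import math


def wilson_interval(flagged, sample_size, z=1.96):
    """Wilson score interval for a rate estimated from an audited sample.

    flagged:     responses judged to contain unsupported claims
    sample_size: responses audited
    z:           normal quantile (1.96 is roughly a 95% interval)
    """
    if sample_size <= 0 or not 0 <= flagged <= sample_size:
        raise ValueError("need 0 <= flagged <= sample_size and sample_size > 0")
    p = flagged / sample_size
    z2 = z * z
    denom = 1 + z2 / sample_size
    center = (p + z2 / (2 * sample_size)) / denom
    half = z * math.sqrt(p * (1 - p) / sample_size
                         + z2 / (4 * sample_size ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

Five flagged responses in a sample of 500 is an observed rate of 1%, but the upper bound of the interval is above 2%, so that sample alone does not establish the target. The same 1% rate observed over 5,000 responses does.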
15. Refusal rate
How often the model declines to answer. Should be calibrated: 0% means the model is bluffing; >15% means the user experience is degraded.
16. Jailbreak resistance score
Performance on a maintained suite of jailbreak / prompt-injection probes. Should not regress between releases.
17. PII / secrets leakage rate
Fraction of outputs containing detected PII or credentials. Target zero; alert on any non-zero value.
18. Toxicity / harmful-content rate
Output toxicity score on production samples. The tail of the distribution matters more than the average.
19. Bias / fairness across subgroups
Performance gap across protected attributes where applicable. Closing this gap is often more important than improving overall accuracy.
20. Citation rate (where required)
Fraction of responses with at least one source citation. Should be ≥ 95% for knowledge-grounded systems.
Business metrics (what changed in the world)
21. Primary business KPI delta
The single business metric the model is meant to move, vs control. Without this, you do not have a successful model — you have a successful demo.
22. Cycle time reduction
Time to complete the unit of work, before and after deployment. Direct proxy for capacity created.
23. Cost-per-unit before and after
The financial line item the model was meant to affect, measured month-over-month with a control group where possible.
24. Adoption rate
Of eligible users, what percentage actually use the system in a given week. Below 40% adoption, the model is failing operationally regardless of its accuracy.
25. Override / override-reversal rate
How often users override the model’s suggestion, and how often those overrides are later reversed. Captures trust in the model.
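A small sketch of how the two rates could be computed from a suggestion log. The `Suggestion` record and its field names are hypothetical; the only assumption about the log is that it records whether each suggestion was overridden and whether that override was later undone.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    overridden: bool                 # the user replaced the model's suggestion
    override_reversed: bool = False  # that override was later undone


def override_rates(suggestions):
    """Return (override rate, reversal rate among overrides)."""
    if not suggestions:
        raise ValueError("no suggestions logged")
    overrides = [s for s in suggestions if s.overridden]
    override_rate = len(overrides) / len(suggestions)
    reversal_rate = (
        sum(1 for s in overrides if s.override_reversed) / len(overrides)
        if overrides else 0.0
    )
    return override_rate, reversal_rate
```

The two numbers are most useful read together: the first shows how often users disagree with the model, the second how often that disagreement did not hold up.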
26. Time-to-value for new users
For copilots: the time from first use to first measurable productive outcome. Long values indicate onboarding or UX problems, not model problems.
27. Net Promoter Score for the AI feature
Survey users at a stable cadence. A model with great metrics and a deteriorating NPS is hiding something the metrics miss.
Human and process metrics (how it gets run)
28. Time-to-detect incidents
From a failure occurring in production to the team becoming aware of it. Below 1 hour for high-severity issues is the bar.
29. Time-to-resolve
From detection to remediation. Most failures want a rollback first, debug after.
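Both timing metrics fall out of three timestamps per incident. The sketch below assumes an incident log of (occurred, detected, resolved) tuples and reports medians. The tuple layout, the function names, and the use of the median are our assumptions; the one-hour default mirrors the detection bar stated above.

```python
from datetime import datetime, timedelta
from statistics import median


def incident_times(incidents):
    """Median time-to-detect and time-to-resolve.

    incidents: list of (occurred, detected, resolved) datetime tuples
    """
    if not incidents:
        raise ValueError("no incidents logged")
    detect = [detected - occurred for occurred, detected, _ in incidents]
    resolve = [resolved - detected for _, detected, resolved in incidents]
    return median(detect), median(resolve)


def slow_detections(incidents, bar=timedelta(hours=1)):
    """Incidents whose time-to-detect met or exceeded the bar."""
    return [i for i in incidents if i[1] - i[0] >= bar]
```

Reviewing the output of `slow_detections` for high-severity incidents is usually more informative than the median alone, since one late detection is the case that matters.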
30. Retraining cadence and freshness
How often the model is retrained vs how often new ground-truth data is available. A gap here is a future drift incident waiting to happen.
The dashboard that actually predicts production
Of these 30, a 6-tile executive dashboard tells you 80% of what you need to know:
- Primary business KPI delta (#21): is the model doing what it was bought for?
- Cost per business outcome (#9): at what unit cost?
- Hallucination / error rate (#14, or accuracy (#1) for classical ML systems): how often is it wrong?
- Adoption rate (#24): are users actually using it?
- p95 latency (#6): is the experience acceptable?
- Time-to-detect (#28): when something goes wrong, how fast do we know?
Six numbers a CTO or CIO can scan in 30 seconds. The remaining 24 metrics live one click deeper for the engineering team. Both layers are necessary. Most teams build the deep dashboard and forget the executive one, then wonder why sponsorship erodes.