Law 27 · Evaluation & Measurement
Look at Your Data
The highest-ROI activity in AI is the one teams skip first.

The principle
Error analysis means reading your app's actual traces by hand to find where it fails. It is the single most valuable thing you can do when building with AI, yet teams skip it in favor of dashboards and vanity metrics that climb while users still struggle. You can't write a good eval for a failure mode you've never seen, and you only see failure modes by reading transcripts.
Why it happens
Dashboards show counts; traces show what actually went wrong. The useful loop is simple: read real runs, write open notes without forcing them into categories too early, then cluster those notes into recurring failure modes. Those clusters become your evals. Research calls part of this criteria drift: the act of grading outputs reveals what your criteria should have been. If you choose metrics before reading outputs, you end up measuring what was easy to define, and the numbers can improve while users feel the system getting worse.
Watch for
- A helpfulness or quality dashboard is climbing while user complaints or churn are not improving.
- Your eval categories were defined before anyone read a single real transcript.
- Nobody on the team can name the top three concrete ways the system actually fails in production.
In practice
Instead of reading transcripts, a team buys an eval platform and watches a 'helpfulness score' dashboard climb while users keep churning. The dashboard improved; the product did not, because nobody had read the actual traces and learned that the agent confidently invents return policies. No metric chosen in advance was looking for that failure. Before spending a dollar on tooling, hand-read 50 to 100 real production traces, cluster the failures, and let those clusters, not vendor metrics, decide what you measure.
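Pulling that first reading sample can be sketched in a few lines of Python. This is a minimal illustration, assuming your traces can be listed by ID; the `sample_traces` helper, the `trace-NNNN` ID format, and the pool of 1,000 traces are hypothetical and not tied to any particular logging or eval tool.

```python
import random


def sample_traces(trace_ids, k=50, seed=0):
    """Return up to k trace IDs chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the reading list can be reproduced
    ids = list(trace_ids)
    return rng.sample(ids, min(k, len(ids)))


# Hypothetical IDs standing in for real production traces.
all_ids = [f"trace-{i:04d}" for i in range(1000)]

# The reading list: 50 traces to open and read by hand.
reading_list = sample_traces(all_ids, k=50)
print(len(reading_list))
```

A uniform random draw is only one reasonable choice. Its appeal is that it keeps the sample from being limited to traces someone already flagged, which would hide failure modes nobody has noticed yet.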
Apply it
- Hand-read a sample of real traces, jotting open notes on each failure before counting anything.
- Cluster those notes into recurring failure categories and let the clusters define what you measure.
- Expect your criteria to shift as you read, and revise the eval set instead of freezing it too early.
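The steps above can be sketched as a small Python record plus a tally. This is an illustration under stated assumptions, not a prescribed workflow: the `Note` record, the `rank_failure_modes` helper, the trace IDs, and the "lost context" cluster are all hypothetical. The cluster label is optional because notes are written first and grouped only after reading.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Note:
    trace_id: str
    observation: str                # free-text note written while reading the trace
    cluster: Optional[str] = None   # assigned later, once patterns have emerged


def rank_failure_modes(notes):
    """Count notes per cluster, most frequent first; unclustered notes are skipped."""
    counts = Counter(n.cluster for n in notes if n.cluster is not None)
    return counts.most_common()


# Hypothetical notes; the cluster names were chosen after reading, not before.
notes = [
    Note("trace-0012", "Stated a 90-day return window that does not exist", "invented policy"),
    Note("trace-0047", "Quoted a restocking fee found nowhere in the docs", "invented policy"),
    Note("trace-0103", "Asked again for an order number the user already gave", "lost context"),
    Note("trace-0188", "Something is off here, not sure what yet"),
]

ranked = rank_failure_modes(notes)
print(ranked)
```

The ranked list is a candidate agenda for what to measure first. Renaming or splitting a cluster later is a single-field change on each note, which keeps the set easy to revise as your criteria shift.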
The takeaway
Before you buy an eval platform, hand-read 50 to 100 real traces and group the failures. Let those groups decide what you measure.