Law 27 · Evaluation & Measurement

Look at Your Data

The highest-ROI activity in AI is the one teams skip first.


The principle

Error analysis means reading your app's actual traces by hand to find where it fails. It is the single most valuable thing you can do when building with AI, yet teams skip it in favor of dashboards and vanity metrics that climb while users still struggle. You can't write a good eval for a failure mode you've never seen, and you only see failure modes by reading transcripts.

Why it happens

You cannot evaluate failures you have never looked at. Dashboards show counts, but traces show what actually went wrong. The useful loop is simple: read real runs, write notes without forcing them into categories too early, then cluster those notes into recurring failure modes. Those clusters become your evals. Research calls part of this criteria drift: the act of grading outputs reveals what your criteria should have been. If you choose metrics before reading outputs, the numbers can improve while users still feel the system getting worse.
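
One way to picture that loop in code is below. This is a minimal Python sketch, not a prescribed tool: the TraceNote class, the trace IDs, and the example notes and labels are all hypothetical, and the clustering itself is still done by a person rereading the notes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TraceNote:
    trace_id: str
    note: str               # free-text observation written while reading the trace
    failure_mode: str = ""  # left empty on the first pass, filled in after grouping


def tally_failure_modes(notes):
    """Count how many annotated traces fall into each labeled failure mode."""
    return Counter(n.failure_mode for n in notes if n.failure_mode)


# First pass: open notes, no categories yet (hypothetical examples).
notes = [
    TraceNote("t-001", "Quoted a 90-day return window that does not exist"),
    TraceNote("t-002", "Asked for the order number twice"),
    TraceNote("t-003", "Made up a restocking-fee rule"),
]

# Second pass: a human rereads the notes and assigns cluster labels.
notes[0].failure_mode = "invented policy"
notes[1].failure_mode = "repeated question"
notes[2].failure_mode = "invented policy"

counts = tally_failure_modes(notes)
for mode, count in counts.most_common():
    print(f"{mode}: {count}")
```

Keeping the label field empty on the first pass is deliberate: it mirrors the advice to write notes before forcing them into categories.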

Watch for

In practice

A common version: instead of reading transcripts, a team buys an eval platform and watches a 'helpfulness score' dashboard climb while users keep churning. The dashboard improved; the product did not, because nobody had read the actual traces and learned that the agent confidently invents return policies. Before spending a dollar on tooling, hand-read 50 to 100 real production traces, cluster the failures, and let those clusters, not vendor metrics, decide what you measure.
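
Drawing the 50 to 100 traces at random keeps the reading from skewing toward the most memorable cases. A minimal sketch, assuming traces are already loaded as a list of dicts with an `id` field; the function name, the field name, and the fixed seed are illustrative assumptions, not part of any particular logging tool.

```python
import random


def sample_traces(traces, k=50, seed=0):
    """Return up to k traces chosen uniformly at random.

    A fixed seed makes the sample reproducible, so two reviewers
    can read the same set of traces.
    """
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


# Hypothetical stand-in for real production traces.
all_traces = [{"id": f"t-{i:04d}", "transcript": "..."} for i in range(1000)]

to_read = sample_traces(all_traces, k=50)
print(len(to_read), "traces selected for hand review")
```

If some traffic is rare but important (refund requests, for example), a purely uniform sample may miss it, and sampling each segment separately may be worth considering.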

Apply it

  1. Hand-read a sample of real traces, jotting open notes on each failure before counting anything.
  2. Cluster those notes into recurring failure categories and let the clusters define what you measure.
  3. Expect your criteria to shift as you read, and revise the eval set instead of freezing it too early.
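
Step 3 can be sketched as an eval set that is versioned and allowed to change. The EvalSet class, its method names, and the failure-mode labels here are hypothetical; the point is only that splitting a vague category into sharper ones is a normal edit, not a sign that the first pass was wasted.

```python
from dataclasses import dataclass, field


@dataclass
class EvalSet:
    version: int = 1
    # Maps a failure-mode label to the trace IDs that exemplify it.
    cases: dict = field(default_factory=dict)

    def add_case(self, failure_mode, trace_id):
        self.cases.setdefault(failure_mode, []).append(trace_id)

    def split_mode(self, old, assignments):
        """Replace one failure mode with finer-grained ones.

        `assignments` maps every trace ID filed under `old` to its new
        label; a missing ID raises KeyError so no case is silently dropped.
        """
        for trace_id in self.cases.pop(old):
            self.add_case(assignments[trace_id], trace_id)
        self.version += 1


evals = EvalSet()
for trace_id in ["t-001", "t-002", "t-003"]:
    evals.add_case("wrong answer", trace_id)

# After more reading, "wrong answer" turns out to hide two distinct problems.
evals.split_mode("wrong answer", {
    "t-001": "invented policy",
    "t-002": "outdated pricing",
    "t-003": "invented policy",
})

print(evals.version, evals.cases)
```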

The takeaway

Before you buy an eval platform, hand-read 50 to 100 real traces and group the failures. Let those groups decide what you measure.

