Law 25 · Evaluation & Measurement
Averages Lie
97% overall can hide a 60% segment.

The principle
An aggregate metric is a blended story that smooths over exactly the failures you most need to see. A system at 97% overall can be 99% on the easy cases and 60% on the rare, hard segment where the errors actually cluster. Trust the headline number and you'll automate straight into the cracks it's hiding.
Why it happens
A headline score is a traffic-weighted blend, so a small but important segment can fail badly without moving it much. If 95% of cases are common and easy and score 99%, and the remaining 5% are rare and hard and score 60%, the overall figure still comes out at about 97%. Errors are rarely uniform. They cluster by language, intent, customer type, field, document format, or edge condition. Random samples often contain too few cases from those slices to measure them, because the slices are rare by definition. Disaggregated evaluation exists to stop that blindness. Slice the score, oversample the risky cases, and make the worst segment visible before you automate.
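The blending arithmetic is easy to check by hand. The sketch below is a minimal illustration, not a prescribed tool: the function name `blended_accuracy` and the 95/5 traffic split are assumptions chosen so that the 99% and 60% segments reproduce the 97% headline.

```python
def blended_accuracy(segments):
    """Traffic-weighted average of per-segment accuracy.

    segments: list of (traffic_share, accuracy) pairs; shares must sum to 1.
    """
    total_share = sum(share for share, _ in segments)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * acc for share, acc in segments)


# Assumed mix for illustration: 95% easy cases at 99%, 5% hard cases at 60%.
overall = blended_accuracy([(0.95, 0.99), (0.05, 0.60)])
print(f"overall accuracy: {overall:.4f}")  # overall accuracy: 0.9705
```

The hard segment costs the headline only two points because it carries only 5% of the weight, which is why the overall number cannot be used to infer how that segment is doing.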
Watch for
- You are deciding to ship or automate based on one overall accuracy or pass-rate number.
- Your evaluation set is sampled randomly, so rare high-stakes cases barely appear in it.
- You cannot say how the system performs on your worst segment because you have never measured it separately.
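To see how thin a random evaluation set is on a rare slice, a rough binomial estimate helps. This is a sketch under stated assumptions: the helper names, the 200-example sample, and the 5% slice share are illustrative, and it assumes each example is drawn independently.

```python
from math import comb


def expected_slice_count(sample_size, slice_share):
    """Expected number of examples from a slice in a simple random sample."""
    return sample_size * slice_share


def prob_fewer_than(sample_size, slice_share, k):
    """Probability that a random sample contains fewer than k slice examples.

    Uses the binomial distribution, i.e. assumes independent draws.
    """
    return sum(
        comb(sample_size, i)
        * slice_share**i
        * (1 - slice_share) ** (sample_size - i)
        for i in range(k)
    )


# Assumed numbers: a 200-example random eval set, a slice that is 5% of traffic.
n, share = 200, 0.05
print("expected slice examples:", expected_slice_count(n, share))
print("P(fewer than 5 examples):", prob_fewer_than(n, share, 5))
```

With roughly ten examples in the slice, a single extra error moves the slice score by ten percentage points, so a random draw of this size cannot tell a 60% segment from a 70% one.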
In practice
Your support-triage classifier reports 96% accuracy and the team greenlights auto-routing. Three weeks in, the billing-dispute queue is a disaster, because the model was 99% accurate on the common 'password reset' and 'where is my order' tickets and 58% on the rare refund-dispute segment where mistakes actually cost you customers. The blended number hid the exact slice you most needed to see. Slice the eval by ticket type, intent, and language before you trust it, and oversample the rare high-stakes cases instead of grading on a random draw.
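Slicing the eval is mostly bookkeeping. The following is a minimal sketch, assuming each graded example is a dict with hypothetical `ticket_type`, `predicted`, and `actual` fields; the six records are made-up toy data, not results from any real system.

```python
from collections import defaultdict


def accuracy_by_slice(records, key):
    """Return {slice_value: (accuracy, n)} for the given grouping key."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        slice_value = record[key]
        total[slice_value] += 1
        correct[slice_value] += int(record["predicted"] == record["actual"])
    return {s: (correct[s] / total[s], total[s]) for s in total}


# Toy graded examples (hypothetical field names and labels).
records = [
    {"ticket_type": "password_reset", "predicted": "account", "actual": "account"},
    {"ticket_type": "password_reset", "predicted": "account", "actual": "account"},
    {"ticket_type": "password_reset", "predicted": "account", "actual": "account"},
    {"ticket_type": "password_reset", "predicted": "account", "actual": "account"},
    {"ticket_type": "refund_dispute", "predicted": "billing", "actual": "billing"},
    {"ticket_type": "refund_dispute", "predicted": "general", "actual": "billing"},
]

overall = sum(r["predicted"] == r["actual"] for r in records) / len(records)
print(f"overall: {overall:.2f}")  # overall: 0.83

# Print the worst slice first so it cannot hide behind the average.
by_type = accuracy_by_slice(records, "ticket_type")
for name, (acc, n) in sorted(by_type.items(), key=lambda kv: kv[1][0]):
    print(f"{name}: {acc:.2f} (n={n})")
```

Calling the same function with a different `key`, such as an intent or language field, gives the other breakdowns. Reporting `n` next to each score keeps a two-example slice from being read as a solid measurement.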
Apply it
- Break performance down by type, segment, and field, and require every slice to clear the bar, not just the average.
- Oversample rare and high-stakes cases deliberately instead of relying on a random draw.
- Treat any slice that falls below threshold as a blocker even when the headline number looks healthy.
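The three steps above can be sketched as a stratified draw plus a per-slice gate. Everything here is an assumption for illustration: the function names, the `segment` field, the 95% bar, and the minimum of 100 examples per slice. Treating an under-measured slice as a blocker is a design choice of this sketch, not a rule from the text.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_slice, seed=0):
    """Draw up to per_slice items from every slice, not one random draw overall."""
    rng = random.Random(seed)  # seeded so the eval set is reproducible
    by_slice = defaultdict(list)
    for item in items:
        by_slice[item[key]].append(item)
    sample = []
    for group in by_slice.values():
        sample.extend(rng.sample(group, min(per_slice, len(group))))
    return sample


def blocking_slices(slice_scores, threshold, min_n):
    """Return {slice: reason} for every slice that should block the launch.

    slice_scores maps slice name to (accuracy, n).
    """
    blockers = {}
    for name, (accuracy, n) in slice_scores.items():
        if n < min_n:
            blockers[name] = "too few examples to judge"
        elif accuracy < threshold:
            blockers[name] = "below threshold"
    return blockers


# Hypothetical pool: 950 easy cases and 50 hard ones.
pool = [{"id": i, "segment": "hard" if i < 50 else "easy"} for i in range(1000)]
eval_set = stratified_sample(pool, "segment", per_slice=40)
print(len(eval_set))  # 80

# Hypothetical per-slice scores, shaped like the triage example.
scores = {"password_reset": (0.99, 400), "refund_dispute": (0.58, 200)}
print(blocking_slices(scores, threshold=0.95, min_n=100))
# {'refund_dispute': 'below threshold'}
```

An oversampled eval set no longer mirrors production traffic, so its raw overall accuracy is not a valid headline number. If a headline figure is still wanted, reweight the per-slice scores by their real traffic shares.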
The takeaway
Slice before you trust. Break performance down by type, segment, and field, and make every slice clear the bar before you act on the average. Sample deliberately for the rare cases, not just at random.