Law 26 · Evaluation & Measurement
Vibes Don't Scale
Eyeballing outputs feels like progress until you can't tell if a change helped.

The principle
A common root cause of failed LLM products is the absence of solid evals. Teams ship on vibe checks, iterate blind, and can't tell whether a prompt change improved anything. Manual spot-checking doesn't survive scale or a second engineer. Evals are to AI products what unit tests are to software: the up-front cost that makes every later change cheap and safe.
Why it happens
Vibe checks do not repeat. They tell you whether one person liked a few outputs today, not whether the system improved. Generic similarity metrics rarely capture the product-specific thing you care about, so real progress needs task-specific checks you can rerun. The analogy to unit tests is direct: the up-front cost of an eval harness makes every later prompt, model, or retrieval change safer. Without it, you are iterating on memory and taste. In a non-deterministic system, that usually means trading one unseen failure for another.
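To make the gap between generic similarity and a task-specific check concrete, here is a minimal sketch. The refund scenario, the example strings, and the helper names (similarity, mentions_correct_amount) are illustrative assumptions, not part of any particular product or library. It uses Python's standard difflib as a stand-in for a generic similarity metric.

```python
from difflib import SequenceMatcher

# Hypothetical reference answer for a support-reply task.
REFERENCE = "Refund approved: $40 will be returned to the customer within 5 days."

# Two hypothetical model outputs. The first is wrong on the one fact that matters;
# the second is correct but worded differently.
wrong_amount = "Refund approved: $400 will be returned to the customer within 5 days."
reworded = "The customer gets $40 back in under a week; the refund was approved."


def similarity(a: str, b: str) -> float:
    """Generic character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def mentions_correct_amount(output: str, amount: str = "$40") -> bool:
    """Task-specific check: the exact amount appears as a whole token."""
    tokens = [tok.strip(".,;:!?") for tok in output.split()]
    return amount in tokens


for label, output in [("wrong_amount", wrong_amount), ("reworded", reworded)]:
    print(
        label,
        "similarity:", round(similarity(REFERENCE, output), 2),
        "correct amount:", mentions_correct_amount(output),
    )
```

In this sketch the output with the wrong amount scores higher on similarity than the correct, reworded one, while the task-specific check ranks them the right way round. A real check would depend on what your product actually needs to get right.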
Watch for
- Prompt changes are judged by eyeballing a few outputs in a playground and nodding.
- Nobody can state whether last week's change actually helped, only that it felt better.
- A second person tweaks the prompt and silently regresses cases nobody re-checked.
In practice
Your team iterates on the summarization prompt by eyeballing a few outputs in the playground, nodding, and shipping. It feels productive until a second engineer tweaks the prompt to fix one complaint and silently regresses three things nobody re-checked. Now no one can say whether last week's change actually helped. Vibe checks do not survive a second person or a tenth example. Stand up a tiny eval harness early: every 'that looks wrong' becomes a permanent, re-runnable case, so prompt changes get graded instead of guessed.
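A tiny harness of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the EvalCase structure, the two example cases, and first_sentence_summarizer, which stands in for the real model call so the example runs without any API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    input_text: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_evals(summarize: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case through the summarizer and record pass/fail by name."""
    return {case.name: case.check(summarize(case.input_text)) for case in cases}


# Each case records one past 'that looks wrong' moment (hypothetical examples).
CASES = [
    EvalCase(
        name="keeps_the_deadline",
        input_text="The report is due on Friday. Please include Q3 sales figures and keep it short.",
        check=lambda out: "Friday" in out,
    ),
    EvalCase(
        name="keeps_new_launch_date",
        input_text="Thanks for the update. The launch moved from March 3 to March 10 because QA found a blocker.",
        check=lambda out: "March 10" in out,
    ),
]


def first_sentence_summarizer(text: str) -> str:
    """Stand-in for the real model call: returns only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


results = run_evals(first_sentence_summarizer, CASES)
for name, passed in results.items():
    print("PASS" if passed else "FAIL", name)
```

The stand-in summarizer passes the first case and fails the second, which is the point: the failure is now recorded and will show up on every rerun rather than depending on someone remembering to look.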
Apply it
- Stand up a small re-runnable eval set before scaling, and run it on every prompt or model change.
- Turn every 'that looks wrong' moment into a permanent test case with an expected outcome.
- Prefer task-specific checks over generic similarity scores, since the latter often fail to track real quality.
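Running the eval set on every change pays off most when results are compared case by case. The sketch below assumes each run produces a mapping from case name to pass/fail; the case names and the two result sets are made up for illustration, and find_regressions, find_fixes, and pass_rate are hypothetical helpers.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Cases that passed before the change and do not pass after it."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


def find_fixes(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Cases that failed before the change and pass after it."""
    return sorted(
        name for name, passed in candidate.items()
        if passed and not baseline.get(name, False)
    )


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


# Hypothetical results from running the same eval set before and after a prompt tweak.
baseline = {"keeps_deadline": True, "names_the_owner": True, "stays_short": False}
candidate = {"keeps_deadline": True, "names_the_owner": False, "stays_short": True}

print("pass rate before:", round(pass_rate(baseline), 2))
print("pass rate after: ", round(pass_rate(candidate), 2))
print("regressions:", find_regressions(baseline, candidate))
print("fixes:", find_fixes(baseline, candidate))
```

In this made-up example the overall pass rate is identical before and after, yet one case regressed while another was fixed. Comparing per case, not only in aggregate, is what surfaces the silent regression.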
The takeaway
Build a small eval harness before you scale. Turn every 'that looks wrong' moment into a permanent, re-runnable test case.