Law 26 · Evaluation & Measurement

Vibes Don't Scale

Eyeballing outputs feels like progress until you can't tell if a change helped.

The principle

The common root cause of failed LLM products is the absence of solid evals. Teams ship on vibe checks, iterate blind, and can't tell whether a prompt change improved anything. Manual spot-checking doesn't survive scale or a second engineer. Evals are to AI products what unit tests are to software: the up-front cost that makes every later change cheap and safe.

Why it happens

Vibe checks are not repeatable. They tell you whether one person liked a few outputs today, not whether the system improved. Generic similarity metrics rarely capture the product-specific thing you care about, so real progress needs task-specific checks you can rerun. The analogy to unit tests is direct: the up-front cost of an eval harness makes every later prompt, model, or retrieval change safer. Without it, you are iterating on memory and taste. In a non-deterministic system, that usually means trading one unseen failure for another.

Watch for

Prompt changes are judged by eyeballing a few outputs in a playground and nodding.
Nobody can state whether last week's change actually helped, only that it felt better.
A second person tweaks the prompt and silently regresses cases nobody re-checked.

In practice

Your team iterates on the summarization prompt by eyeballing a few outputs in the playground, nodding, and shipping. It feels productive until a second engineer tweaks the prompt to fix one complaint and silently regresses three things nobody re-checked. Now no one can say whether last week's change actually helped. Vibe checks do not survive a second person or a tenth example. Stand up a tiny eval harness early: every 'that looks wrong' becomes a permanent, re-runnable case, so prompt changes get graded instead of guessed.
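
A tiny harness of the kind described here might look like the sketch below. It is an illustration under stated assumptions, not a reference implementation: summarize_v1 and summarize_v2 are hypothetical stand-ins for two versions of the real prompt-plus-model call, and the case names, inputs, and checks are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One permanent, re-runnable 'that looks wrong' case."""
    name: str
    input_text: str
    check: Callable[[str], bool]  # True when the output is acceptable


def summarize_v1(text: str) -> str:
    # Hypothetical stand-in for the current prompt + model call:
    # returns the first sentence of the input.
    return text.split(".")[0].strip() + "."


def summarize_v2(text: str) -> str:
    # Hypothetical "tweaked" prompt that fixes one complaint (tone)
    # but silently drops the order number.
    return "Summary: the customer was upset."


CASES: List[EvalCase] = [
    EvalCase(
        name="keeps_order_id",
        input_text="Order 1234 shipped late. The customer was upset.",
        check=lambda out: "1234" in out,
    ),
    EvalCase(
        name="stays_short",
        input_text="The meeting covered budget, hiring, and the Q3 roadmap. "
                   "Action items were assigned.",
        check=lambda out: len(out.split()) <= 30,
    ),
]


def run_evals(generate: Callable[[str], str],
              cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case through `generate` and record pass/fail by name."""
    return {case.name: case.check(generate(case.input_text)) for case in cases}


# Grade both prompt versions on the same cases instead of eyeballing them.
for label, fn in [("v1", summarize_v1), ("v2", summarize_v2)]:
    results = run_evals(fn, CASES)
    failed = [name for name, ok in results.items() if not ok]
    print(f"{label}: {sum(results.values())}/{len(results)} passed, failed={failed}")
```

Running the same cases against both versions surfaces the regression in the hypothetical v2 without anyone having to remember to re-check it. In a real setup the stand-in functions would call the model, and because outputs are non-deterministic, each case may need to be run several times.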

Apply it

Stand up a small re-runnable eval set before scaling, and run it on every prompt or model change.
Turn every 'that looks wrong' moment into a permanent test case with an expected outcome.
Prefer task-specific checks over generic similarity scores, since the latter often fail to track real quality.
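
To make the last point concrete, here is a minimal sketch contrasting a generic similarity score with a task-specific check. The assumptions are loud ones: Python's difflib.SequenceMatcher stands in for "a generic similarity metric", and the reference text, candidate outputs, and the rule inside task_check are all invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference answer for one eval case.
REFERENCE = "Refund approved for order 1234 within 30 days."


def similarity(candidate: str, reference: str = REFERENCE) -> float:
    """Generic character-level similarity between 0 and 1 (stand-in metric)."""
    return SequenceMatcher(None, candidate, reference).ratio()


def task_check(candidate: str) -> bool:
    """Task-specific check: right order number, right decision."""
    text = candidate.lower()
    return "1234" in text and "approved" in text and "denied" not in text


# Nearly identical wording, but the key fact is flipped.
wrong_fact = "Refund denied for order 1234 within 30 days."
# Different wording, same facts.
good_paraphrase = "Order 1234: refund approved, 30-day window."

for label, output in [("wrong_fact", wrong_fact),
                      ("good_paraphrase", good_paraphrase)]:
    print(f"{label}: similarity={similarity(output):.2f} "
          f"task_check={task_check(output)}")
```

In this example the output with the flipped fact scores higher on similarity than the correct paraphrase, while the task-specific check ranks them the right way around. Real checks will be more involved, but the shape is the same: encode what the product actually needs, then rerun it on every change.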

The takeaway

Build a small eval harness before you scale. Turn every 'that looks wrong' moment into a permanent, re-runnable test case.
