Law 30 · Evaluation & Measurement
Regress or Repeat
Every fixed bug is a future regression unless it becomes a test.

The principle
LLM systems are non-deterministic and globally coupled, so a prompt tweak that fixes one case can quietly break three others. Rerunning real production examples against a new prompt is the only way to know you didn't break what already worked. Without a regression suite you're stuck in a whack-a-mole loop, rediscovering the same failures release after release.
Why it happens
LLM behavior can vary across runs, even when settings look deterministic, and one prompt change can shift many unrelated cases because they share the same instruction surface. That means a local fix is not proof of global safety. You need to rerun the old cases and observe what changed. Every production failure you fix should become a permanent regression case. Otherwise the same bug returns in a new prompt, model, or retrieval setup, and the team keeps rediscovering old failures as if they were new.
Watch for
- A bug you fixed last release has reappeared because nobody re-ran the old case.
- A prompt tweak aimed at one case silently broke a different, unrelated case.
- You ship prompt or model changes without re-running the previously passing examples.
In practice
A user reports the agent mishandles refunds over $1,000, you tweak the prompt, confirm that one case works, and ship. Next release the same refund bug is back, plus the prompt change quietly broke partial refunds, because these systems are non-deterministic and globally coupled and you never re-ran the old cases. Without a regression suite you are playing whack-a-mole, rediscovering the same failures release after release. Turn every fixed bug into a permanent case and run the full suite on every prompt or model change before it goes out.
Apply it
- Turn every fixed bug into a permanent regression case with its expected output.
- Run the full regression suite on every prompt and model change before shipping.
- Because outputs vary run to run, evaluate over repeated runs rather than trusting a single pass.
The takeaway
Every failure you fix becomes a permanent case in your regression eval. Run the full suite on every prompt or model change before you ship.