Law 37 · Architecture & Operations

Cascade Before You Escalate

Try the cheap model first. Only the hard cases deserve the expensive one.


The principle

Most queries don't need your most powerful model. A cascade sends each request to a cheap model first and to a stronger one only when confidence in the cheap answer is low. Done well, it can match top-tier quality at a fraction of the cost. Per-call prices across models can differ by two orders of magnitude, so paying top-tier rates for every call wastes most of the budget on requests that never needed it.
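
A back-of-the-envelope model makes the economics concrete. This sketch assumes a two-tier cascade in which every request pays for the cheap call, and the escalated fraction also pays for the strong call. It ignores validator and latency costs. The prices and escalation rates are hypothetical placeholders, not measured figures.

```python
def cascade_cost(cheap_cost: float, strong_cost: float, escalation_rate: float) -> float:
    """Expected per-request cost of a two-tier cascade (hypothetical cost model).

    Every request pays for the cheap call. The escalated fraction also pays
    for the strong call, because the cheap attempt was already spent.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    return cheap_cost + escalation_rate * strong_cost


def savings_vs_strong_only(cheap_cost: float, strong_cost: float, escalation_rate: float) -> float:
    """Fraction of spend saved relative to sending every request to the strong model."""
    return 1.0 - cascade_cost(cheap_cost, strong_cost, escalation_rate) / strong_cost


# Hypothetical per-request prices: the strong model costs 100x the cheap one.
CHEAP, STRONG = 0.01, 1.00

for rate in (0.1, 0.2, 0.5, 0.99):
    saved = savings_vs_strong_only(CHEAP, STRONG, rate)
    print(f"escalate {rate:.0%}: save {saved:.1%}")
```

At these made-up prices, a 20% escalation rate saves about 79%. Savings fall linearly as escalation rises, and the cascade only breaks even at 99% escalation. Every escalated request pays twice, so a router that escalates too eagerly erodes the gain quickly.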

Why it happens

Most requests are easier than your hardest benchmark. A cascade exploits that by trying a cheaper model first and escalating only when a router or validator says the answer is not good enough. FrugalGPT (Chen, Zaharia, and Zou, 2023) reported matching GPT-4's performance with up to 98% cost reduction on its benchmarks, and later routing work shows the same basic economics. The hard part is not calling the cheap model; it is knowing when to trust it. Self-reported confidence is a weak signal, so validate the router against your own evals. A cascade saves money only if it escalates the cases that truly need help and lets the rest through.
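
One way to validate a router against your own evals is to run the cheap model over a labeled set and record whether each cheap answer was actually correct. You can then compare that with what the deferral rule decided. The sketch below assumes you already have those two booleans per example. The record and report types are hypothetical. It reports the two failure modes that matter: hard cases the router let through, and easy cases it escalated for nothing.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    cheap_correct: bool  # was the cheap model's answer right? (from your labels)
    escalated: bool      # did the deferral rule send this case to the strong model?


@dataclass
class RouterReport:
    escalation_rate: float   # share of traffic that pays for the strong model
    missed_hard_rate: float  # share of cheap failures the router let through
    wasted_rate: float       # share of escalations where the cheap answer was already right


def audit_router(records: list[EvalRecord]) -> RouterReport:
    """Score a deferral rule against labeled outcomes instead of model self-confidence."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one labeled record")
    escalated = [r for r in records if r.escalated]
    cheap_failures = [r for r in records if not r.cheap_correct]
    missed = sum(1 for r in cheap_failures if not r.escalated)
    wasted = sum(1 for r in escalated if r.cheap_correct)
    return RouterReport(
        escalation_rate=len(escalated) / n,
        missed_hard_rate=missed / len(cheap_failures) if cheap_failures else 0.0,
        wasted_rate=wasted / len(escalated) if escalated else 0.0,
    )


# Toy labeled set: 6 easy cases kept on the cheap model, 1 needless escalation,
# 2 correct escalations, and 1 hard case the router missed.
records = (
    [EvalRecord(cheap_correct=True, escalated=False)] * 6
    + [EvalRecord(cheap_correct=True, escalated=True)]
    + [EvalRecord(cheap_correct=False, escalated=True)] * 2
    + [EvalRecord(cheap_correct=False, escalated=False)]
)
print(audit_router(records))
```

A high missed-hard rate means the cascade is quietly shipping wrong answers. A high wasted rate means it is paying strong-model prices for easy work. Both numbers come from your labels, not from the model grading itself.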

Watch for

In practice

Every call in your pipeline hits top-tier pricing, including the 80% of requests that are simple intent classification a small model handles reliably. You are paying rates up to 100x higher for work a cheap model clears with room to spare. Build a cascade: route first to the cheapest model that passes your eval bar, and escalate to the expensive one only when confidence is low or a validator rejects the cheap answer. Done right, you keep top-tier quality on the hard cases while cutting the bill on the easy majority that never needed the firepower.
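
The cascade described above can be sketched as a loop over tiers ordered from cheapest to most expensive. This is a minimal sketch, assuming each model is a plain callable from prompt to answer and each tier has its own validator. A validator might be a schema check, a rule, or a learned router's accept decision. The `Tier` type, the stub models, and the `is_known_intent` validator are hypothetical stand-ins for your own clients and checks, not any specific library's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tier:
    name: str
    model: Callable[[str], str]         # prompt -> answer (your client call goes here)
    accept: Callable[[str, str], bool]  # (prompt, answer) -> is this answer good enough?


def run_cascade(prompt: str, tiers: list[Tier]) -> tuple[str, str]:
    """Try tiers from cheapest to most expensive; return (tier name, answer).

    The last tier's answer is returned without validation, because there is
    nothing stronger left to escalate to.
    """
    if not tiers:
        raise ValueError("need at least one tier")
    for tier in tiers[:-1]:
        answer = tier.model(prompt)
        if tier.accept(prompt, answer):
            return tier.name, answer
        # Validator rejected this answer: fall through to the next, stronger tier.
    last = tiers[-1]
    return last.name, last.model(prompt)


# Hypothetical stubs for illustration only.
KNOWN_INTENTS = {"billing", "cancel", "shipping"}


def small_model(prompt: str) -> str:
    # Pretend the small model only recognizes a few keyword intents.
    for intent in KNOWN_INTENTS:
        if intent in prompt.lower():
            return intent
    return "unknown"


def large_model(prompt: str) -> str:
    return "complaint_escalation"


def is_known_intent(prompt: str, answer: str) -> bool:
    return answer in KNOWN_INTENTS


tiers = [
    Tier("small", small_model, is_known_intent),
    Tier("large", large_model, lambda p, a: True),
]

print(run_cascade("Where is my shipping confirmation?", tiers))
print(run_cascade("Your agent was rude and I want a manager", tiers))
```

The design choice to notice is that escalation is decided by `accept`, which you can test and audit, rather than by the model grading its own answer. The first prompt stays on the small tier; the second is rejected by the validator and escalates.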

Apply it

  1. Answer first with the cheapest model that clears your eval bar, and escalate only on failed or low-signal cases.
  2. Build a deferral check (a validator or learned router) rather than trusting the model's self-reported confidence.
  3. Validate the cascade against a labeled eval set to confirm escalated cases are the ones that actually needed the strong model.
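
Steps 1 and 3 can be combined into a threshold sweep over a labeled eval set. This sketch makes several assumptions:

* A hypothetical learned router outputs a score, where higher means the cheap answer is more likely wrong.
* Each example carries labels for whether the cheap and strong models each answered correctly.
* Per-call costs are hypothetical.

The sketch picks the lowest-cost threshold whose cascade accuracy still clears your bar.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    router_score: float  # hypothetical router output; higher = cheap answer riskier
    cheap_correct: bool
    strong_correct: bool


def evaluate(examples: list[Example], threshold: float,
             cheap_cost: float, strong_cost: float) -> tuple[float, float]:
    """Return (accuracy, mean cost per request) when escalating at score >= threshold."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    cost = 0.0
    for ex in examples:
        cost += cheap_cost  # the cheap call always runs first
        if ex.router_score >= threshold:
            cost += strong_cost  # escalated: graded on the strong model's answer
            correct += int(ex.strong_correct)
        else:
            correct += int(ex.cheap_correct)
    n = len(examples)
    return correct / n, cost / n


def pick_threshold(examples: list[Example], accuracy_bar: float,
                   cheap_cost: float, strong_cost: float) -> Optional[tuple[float, float, float]]:
    """Cheapest (threshold, accuracy, cost) that clears the bar, or None if none does."""
    # Candidates: every observed score, plus infinity (never escalate).
    candidates = sorted({ex.router_score for ex in examples}) + [float("inf")]
    best = None
    for t in candidates:
        acc, cost = evaluate(examples, t, cheap_cost, strong_cost)
        if acc >= accuracy_bar and (best is None or cost < best[2]):
            best = (t, acc, cost)
    return best


# Toy eval set: two easy cases, two hard ones the strong model gets right.
examples = [
    Example(0.1, cheap_correct=True, strong_correct=True),
    Example(0.2, cheap_correct=True, strong_correct=True),
    Example(0.6, cheap_correct=False, strong_correct=True),
    Example(0.9, cheap_correct=False, strong_correct=True),
]
print(pick_threshold(examples, accuracy_bar=1.0, cheap_cost=0.01, strong_cost=1.0))
```

Accuracy here assumes escalated cases are judged on the strong model's answer alone. Because the sweep tunes the threshold on labeled data, check the chosen operating point on a held-out split before trusting the reported accuracy.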

The takeaway

Build a cascade: answer with the cheapest model that clears your eval bar, and escalate only on the low-confidence or failed cases.

Sources and further reading
