Audit Kit · Free during launch
50 Laws Audit Rubric
Use this rubric with the ai-agent-audit skill. Do not mechanically list every law in an audit report. Use the full set to find concrete issues, then report only the highest-signal findings.
Audit priority order:
- Security and data leakage.
- Unauthorized side effects.
- Silent wrong answers.
- Retrieval and context failures.
- Eval blind spots.
- Observability and handoff gaps.
- Cost, latency, and maintainability issues.
01. Law of Context Decay
Category: Context & Reliability
Tagline: Most agent failures start with the wrong context.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Most bad outputs come from missing, stale, or conflicting context, not from a model that can't think. The model often reasons fine over the picture it was handed and still lands wrong, because the picture was wrong to begin with. Bad context produces confident bad answers.
Warning signs:
- The same question gives different answers depending on which session or document was loaded first.
- Outputs confidently reference facts that are real but out of date, or contradict a source you know is in the window.
- Bumping to a larger or newer model produces no measurable accuracy gain on the failing cases.
Fix patterns:
- On every bad run, dump and read the exact context the model saw before blaming the model.
- Stamp each retrieved fact with its source and timestamp, and drop or refresh anything past a freshness threshold.
- Detect contradictions in the assembled context and surface them instead of silently concatenating both.
Worked example:
Your support agent keeps insisting a customer's subscription is active when it was cancelled last week, so the team files a ticket to upgrade to a smarter model. The real culprit: the RAG pipeline pulls a 30-day-old cached account snapshot, and the agent reasons flawlessly over stale data. Before swapping models, log the exact context the agent saw on three bad runs; you will usually find a contradiction or a stale record, not a dumb model. Fix the freshness and the 'reasoning bug' evaporates.
Sources:
- Effective Context Engineering for AI Agents - Anthropic
- Towards Understanding Sycophancy in Language Models - Sharma et al., 2023
02. Compounding Error Law
Category: Context & Reliability
Tagline: Reliability multiplies, it doesn't add.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
A step that works 95% of the time, run ten times in a row, gives you the right final answer only about 60% of the time. The failures don't announce themselves. They pile up quietly until the answer is wrong and you can't tell which step broke it. Every link you add lowers the ceiling for the whole chain.
Warning signs:
- End-to-end success is far worse than the per-step accuracy you measured in isolation.
- Final outputs are wrong but no single step looks obviously broken when you inspect it.
- Adding more pipeline stages keeps lowering overall reliability even as each stage tests fine.
Fix patterns:
- Count the sequential steps and multiply their reliabilities to get the real end-to-end ceiling.
- Collapse independent steps into one pass, or raise per-step reliability, before adding new stages.
- Insert a validation checkpoint after pivotal steps that halts or restarts from the last good state on low confidence.
Worked example:
A six-step invoice pipeline (OCR, extract line items, match vendor, validate totals, post to ledger, notify) tests at 95% per step and you ship it, then watch roughly a third of invoices come out subtly wrong with no obvious culprit. The errors are multiplicative, not additive: 0.95 to the sixth is about 0.74. Either collapse steps (have one pass extract and validate together) or add a checkpoint after vendor-matching that halts on low confidence, so a bad match cannot quietly poison the ledger post downstream.
Sources:
- Building Effective Agents - Anthropic
- Don't Build Multi-Agents - Walden Yan, Cognition, 2025
- Measuring AI Ability to Complete Long Tasks - METR, 2025
03. Position Is Power
Category: Context & Reliability
Tagline: Models read the edges. The middle gets lost.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Give a model a long input and it pays the most attention to the start and the end. Facts buried in the middle quietly lose their grip. They're present but basically ignored. That's the worst kind of bug, because the information was technically in context and nothing looks wrong.
Warning signs:
- The agent misses a fact you can confirm is sitting in the middle of a long input.
- Accuracy on the same task degrades sharply as you lengthen the context.
- Reordering the input so the key fact is near the top or bottom suddenly fixes the answer.
Fix patterns:
- Lead with a short summary of what to find, and restate the critical instruction at the very end.
- Rank and place the most relevant retrieved passages at the edges of the context, not the middle.
- Test long-context retrieval with questions that have no keyword overlap, not just literal needle matches.
Worked example:
You paste a 12-page contract into context and ask the agent to flag the termination clause, but it confidently misses the 90-day notice buried on page 7 because that clause sat dead-center in the input. Nothing errored; the fact was technically in context and still ignored. Lead with a one-line summary of what to look for, chunk and rank the clauses so the relevant one lands near the top, and never assume a long paste means the middle got read.
Sources:
- Lost in the Middle: How Language Models Use Long Contexts - Liu et al., 2023
- NoLiMa: Long-Context Evaluation Beyond Literal Matching - Modarressi et al., 2025
04. The Model Optimizes for Looking Done
Category: Context & Reliability
Tagline: Agents declare victory early.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
An agent will write the summary before doing the work if you let it. Looking finished is cheaper than being finished, so the model drifts toward the cheaper path: a plausible report, a confident 'done', a success it never tested. The output reads complete. The work isn't. This is specification gaming, where the model optimizes the proxy you can see instead of the goal you meant.
Warning signs:
- The agent reports success but you find no corresponding artifact: no test run, no diff, no API response.
- Summaries use confident completion language (all tests pass, feature complete) without evidence attached.
- Spot-checking finished tasks regularly turns up work that was never actually performed.
Fix patterns:
- Require a concrete artifact (test output, diff, file, citation) before any claim of completion is accepted.
- Grade the proof programmatically, not the prose, and reject completions that lack the artifact.
- Have a separate check actually execute the claimed result rather than trusting the agent's report of it.
Worked example:
Your coding agent reports 'All tests passing, feature complete' and you almost merge it, until you notice it never actually ran the suite, it just wrote a confident summary. Looking finished is cheaper than being finished, so the model takes the cheaper path every time you let it. Make 'done' require the artifact: the pasted test output, the actual diff, the curl response with a 200. Grade the proof, not the prose.
Sources:
- Specification gaming: the flip side of AI ingenuity - DeepMind, 2020
- Towards Understanding Sycophancy in Language Models - Sharma et al., 2023
05. Design for the Worst Case
Category: Context & Reliability
Tagline: Plan around the ceiling, not the average.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
When a system says 'up to 24 hours', 'may retry', or 'no guaranteed latency', those limits are the numbers that matter. Designing for the typical case works right up until the rare event, which is exactly when failure costs the most. At scale, those failures aren't edge cases. They're the normal state of things.
Warning signs:
- Timeouts, dedup windows, or retry budgets are set to the typical latency rather than the documented maximum.
- Failures cluster at peak load or month-end, exactly when the system is most exercised.
- A spec says up to X or may and the design quietly assumed the average instead.
Fix patterns:
- Read every up to and may as the number you must survive, and do the math against that ceiling.
- Size timeouts, dedup windows, and retry budgets for the worst plausible run, not the common one.
- Load-test at the tail and the peak, since at scale the rare path becomes the routine one.
Worked example:
The webhook docs say delivery may be retried for up to 24 hours and you build assuming events arrive once, within seconds, so your dedup window is 5 minutes and your timeout is 10 seconds. At month-end load the provider retries a backlog, duplicates slip past the stale window, and you double-process payments. Read every 'up to' and 'may' as the number you must survive: size the dedup window, retry budget, and timeouts against the 24-hour ceiling, not the usual sub-second case.
Sources:
- 10 Lessons from 10 Years of Amazon Web Services - Werner Vogels, 2016
- The Tail at Scale - Dean & Barroso, 2013
06. Think Before You Touch
Category: Reasoning & Planning
Tagline: Spend reasoning tokens before you spend actions.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Asking a model to reason step by step before answering measurably improves results, and for an agent the stakes are lopsided. A reasoning trace is cheap and easy to undo. An executed action, a sent email, a dropped table, a charged card, is not. Letting the model lay out its plan in tokens before it commits is the cheapest insurance you can buy.
Warning signs:
- The agent fires a side-effecting tool call with no stated plan or scope beforehand.
- Destructive actions execute on the first instinct, then turn out to have hit the wrong target.
- Post-mortems show the agent never articulated what it was about to do or why.
Fix patterns:
- Require an explicit reasoning or plan step before any tool call that has side effects.
- Make the plan state the exact target, scope, and expected effect (for example the row count) before acting.
- Treat reasoning tokens as cheap insurance and spend them freely ahead of any irreversible action.
Worked example:
Your ops agent gets 'clean up the staging records' and immediately fires a DELETE, dropping rows a teammate needed because it never reasoned about scope. A reasoning trace costs a few hundred tokens and is fully reversible; the executed delete is neither. Force an explicit plan step before any side-effecting tool call: have it state what it will delete, why, and the row count, then act. Burned tokens are the cheapest insurance against an irreversible action.
Sources:
- Chain-of-Thought Prompting Elicits Reasoning in LLMs - Wei et al., 2022
- ReAct: Synergizing Reasoning and Acting in Language Models - Yao et al., 2022
07. Don't Bet on One Chain
Category: Reasoning & Planning
Tagline: Sample many reasoning paths and let them vote.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
A single greedy chain of thought is fragile. Sample several independent reasoning paths and take the majority answer, and you get large, consistent gains. Correct reasoning tends to converge while mistakes scatter, so agreement across independently generated plans is a real signal worth trusting before you act on something that matters.
Warning signs:
- High-stakes outputs ride on a single greedy generation with no second opinion.
- Re-running the same prompt yields meaningfully different answers, revealing the first one was luck.
- Errors slip through because nothing checks whether independent attempts actually agree.
Fix patterns:
- For consequential decisions, generate the answer several independent times instead of trusting the first.
- Take the majority answer when outputs are comparable, or use an external check to pick among them.
- Treat disagreement across the samples as a signal to escalate rather than silently picking one.
Worked example:
Your agent estimates a quote for a custom order in one greedy pass, lands on $1,400, and you send it to the customer, only to discover it dropped a line item that should have made it $2,100. A single chain is fragile, and the miss is invisible because the math looked clean. For consequential, hard-to-reverse outputs like pricing, sample the calculation three to five times and act on the consensus; when the paths disagree, that disagreement is your signal to escalate before committing.
Sources:
- Self-Consistency Improves Chain of Thought Reasoning - Wang et al., 2022
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling - Brown et al., 2024
08. Branch When the First Step Matters
Category: Reasoning & Planning
Tagline: For decisions you can't take back, explore before you commit.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Tree-of-Thoughts turns linear reasoning into a search: generate several candidate thoughts, judge them, look ahead, and backtrack instead of being stuck going left to right. It matters most when an early choice is pivotal, which is exactly the spot where an agent's first irreversible action sets up everything downstream. Cheap, recoverable steps don't need it. Pivotal ones do.
Warning signs:
- The agent commits to a pivotal strategy on its first instinct, and everything downstream is locked to it.
- A wrong early choice forces an expensive redo of all the work that followed.
- There is no step where alternative plans are generated and compared before the irreversible move.
Fix patterns:
- Reserve branching for early actions that are high-leverage or hard to reverse, not cheap recoverable ones.
- Have the agent generate several candidate plans and score each on risk and reversibility before picking.
- Look ahead and allow backtracking on the pivotal step instead of committing to the first path.
Worked example:
A migration agent picks a database cutover strategy on its first instinct, big-bang swap, and everything downstream (backfill, rollback plan, dual-write window) is now locked to that pivotal early choice that turns out wrong. Cheap reversible steps do not need this, but a high-leverage first move does: have the agent generate three candidate strategies, score each on risk and reversibility, and look ahead before committing. The branching cost is trivial next to re-running a botched cutover.
Sources:
- Tree of Thoughts: Deliberate Problem Solving with LLMs - Yao et al., 2023
- Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models - Zhou et al., 2023
09. Stop Tuning, Start Scaling
Category: Reasoning & Planning
Tagline: Build scaffolding you would gladly delete.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
The Bitter Lesson isn't a ban on structure. It's a warning against hand-coded cleverness that quietly becomes a ceiling. Use code where you need guarantees and thin scaffolds for today's weak spots, but keep asking whether a simpler, more model-driven version now works better.
Warning signs:
- A new model release makes your hand-tuned chain the bottleneck rather than an improvement.
- Most of your effort goes into encoding heuristics the model could plausibly infer itself.
- A plain here are the tools, decide baseline matches or beats your elaborate scaffolding.
Fix patterns:
- Use deterministic code for guarantees, not for hand-encoding every judgment.
- Build the thinnest scaffold that works and that you would happily delete when the model improves.
- Periodically re-test a minimal-scaffold baseline against your tuned pipeline as models advance.
Worked example:
You spend two weeks hand-building a 40-node routing tree to help a weaker model triage tickets. It works for a while, then a newer model with a simpler tool prompt matches it and is easier to maintain. The lesson is not to remove all structure; validation and permissions still belong in code. The lesson is to keep temporary scaffolding thin and deletable. Re-test the simple baseline as models improve, and remove the custom chain when it stops earning its complexity.
Sources:
- The Bitter Lesson - Richard Sutton, 2019
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters - Snell et al., 2024
10. More Thinking Can Hurt
Category: Reasoning & Planning
Tagline: Extra reasoning past the answer is wasted, or a wrong turn.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
More reasoning isn't automatically better. On easy tasks it just burns latency and money for nothing. On some tasks the model finds the answer early and then talks itself out of it. Reasoning depth has a useful range, not an endless upside.
Warning signs:
- Trivial lookups take seconds and cost multiples because everything is routed through extended reasoning.
- The model reaches a correct answer early, keeps deliberating, and lands on a wrong one.
- Longer thinking traces show no accuracy gain, or even a drop, on your easy cases.
Fix patterns:
- Match the reasoning budget to problem difficulty rather than maxing it out everywhere.
- Cap or skip extended thinking on simple, low-stakes steps like direct lookups.
- Stop once a confident answer is reached instead of letting the model keep re-deriving.
Worked example:
You route every order-status lookup through extended reasoning to be safe. The answer is a direct database field, but the agent now takes eight seconds, costs several times more, and sometimes talks itself away from the obvious result. More tokens did not add information. Match the thinking budget to the task: skip extended reasoning for simple lookups, use bounded reasoning for ambiguous judgment, and use tests or tools rather than endless deliberation when stakes are high.
Sources:
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs - Chen et al., 2024
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models - Shojaee et al. (Apple), 2025
- Comment on The Illusion of Thinking - Lawsen, 2025
- Rethinking the Illusion of Thinking - Varela et al., 2025
11. Retrieval Is the Ceiling
Category: Retrieval & Memory
Tagline: Missing evidence becomes a missing answer.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
For facts the model doesn't already know well, the answer can only be as good as the evidence you retrieve. If the right passage never reaches the context, the generator fills the gap from memory and guesswork. Retrieval quality sets the practical ceiling for any grounded answer.
Warning signs:
- Upgrading to a stronger generation model barely moves end-to-end accuracy on factual questions.
- You have never measured whether the answer-bearing passage appears in the retrieved set.
- Wrong answers are fluent and confident rather than hedged or empty, suggesting the model is filling a gap.
Fix patterns:
- Build a labeled set of queries with known answer passages and measure recall at k before touching prompts or models.
- Treat any answer whose supporting evidence was never retrieved as a retrieval failure, not a generation failure.
- Fix recall first by tuning chunking, query expansion, and k, then optimize the generator only once evidence reliably lands in context.
Worked example:
You swap in a smarter model to fix wrong support answers, and accuracy barely moves because the refund-policy chunk never reached the top-k. The generator was filling a missing-evidence gap. Before touching prompts or models, log recall@k on labeled questions: did the answer-bearing passage appear, and was it ranked high enough to matter? If not, fix chunking, query expansion, or ranking first. Better generation cannot reliably ground an answer in evidence it never saw.
Sources:
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks - Lewis et al., 2020
- Ragas: Automated Evaluation of Retrieval Augmented Generation - Es et al., 2023
12. Grounding Is Not a Guarantee
Category: Retrieval & Memory
Tagline: Retrieval reduces hallucination. It doesn't eliminate it.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Vendors marketed RAG legal tools as 'hallucination-free', but a Stanford audit found they still made things up 17 to 33% of the time. Handing the model a source doesn't force it to use that source faithfully. It can misread it, over-generalize, or cite a real document for a claim the document never makes. Grounding lowers the error rate. It never gets it to zero.
Warning signs:
- A grounded system is described to stakeholders as hallucination-free or hallucination-proof.
- No step checks that each generated claim is actually entailed by a retrieved span.
- Citations are attached to answers but nobody has verified the cited passage supports the specific claim.
Fix patterns:
- Add a verification pass that checks each output claim is entailed by a specific retrieved span before returning it.
- Require inline attribution at the claim level so faithfulness can be audited rather than trusted.
- Frame retrieval as risk reduction in all messaging and remove absolute safety language from decks and contracts.
Worked example:
Your team ships a contracts assistant, tells the client it is 'hallucination-free because it uses RAG', and a month later it cites a real clause for an indemnity term that clause never mentions. RAG lowered the error rate, it did not zero it, and the marketing claim is now a liability. Treat retrieval as risk reduction, not a safety guarantee: add a verification step that checks each generated claim traces to a span in the retrieved source, and strike 'hallucination-proof' from every deck and contract.
Sources:
- Assessing the Reliability of Leading AI Legal Research Tools - Magesh et al. (Stanford), 2024
- FACTS Grounding: A Benchmark for Evaluating Factuality - Google DeepMind, 2025
13. Relevant Beats Plenty
Category: Retrieval & Memory
Tagline: Near-misses poison context worse than random noise.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
It's backwards from what you'd expect: documents that are on-topic but don't answer the question hurt more than clearly irrelevant ones, because they look plausible and pull the generator toward answers that are wrong but adjacent. Stuffing more 'kind of relevant' chunks into the context lowers accuracy instead of improving coverage. Precision at the top beats breadth.
Warning signs:
- Raising top-k to improve coverage makes answers worse, not better.
- Wrong answers are adjacent to the truth, like the right product family but the wrong model number.
- Context is filled with many topically similar chunks and no reranking step trims them.
Fix patterns:
- Retrieve a wide candidate set but rerank and keep only the few highest-precision passages.
- Tune for precision at the top of the ranking rather than maximizing recall at any cost.
- Drop topically similar chunks that do not directly answer the query instead of including them for safety.
Worked example:
To improve coverage you bump top-k from 5 to 20, and accuracy drops, because the 15 new chunks are all topically adjacent: same product line, wrong model number, and they pull the answer toward a plausible lie. Clearly irrelevant chunks get ignored, but near-misses get believed. Do not pad context for recall's sake. Run a reranker over a wide candidate set, then keep only the 3 to 5 sharpest passages. A tight context beats a stuffed one.
Sources:
- The Power of Noise: Redefining Retrieval for RAG Systems - Cuconasu et al., 2024
- The Power of Noise: Redefining Retrieval for RAG Systems - Cuconasu et al., SIGIR 2024
14. Keyword Still Carries Weight
Category: Retrieval & Memory
Tagline: Pure semantic search quietly loses to a 40-year-old baseline.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Dense embedding retrievers win in-domain but often lose to BM25 once you step outside the training distribution. Exact-match terms, product codes, names, and rare jargon are where embeddings blur and plain keyword search shines. In-domain accuracy doesn't predict how well a retriever generalizes, and combining the two is how strong systems cut their retrieval failures dramatically.
Warning signs:
- Pure embedding search nails paraphrased demo questions but fails on exact codes, IDs, or product names in production.
- Out-of-domain or jargon-heavy queries return near-identical-looking but wrong matches.
- Retrieval was validated only on in-distribution examples similar to the embedding training data.
Fix patterns:
- Run lexical and semantic retrieval in parallel and fuse their ranked lists rather than relying on embeddings alone.
- Combine ranked results with a position-based fusion method that needs no score calibration between retrievers.
- Add a reranker over the fused candidates to compound precision, especially for exact-match and out-of-domain queries.
Worked example:
Your pure-embedding search nails paraphrased questions in the demo, then face-plants in production when a user searches for SKU 'AX-4400-B' or an error code, and the dense vectors blur it into a dozen near-identical part numbers. Embeddings smear exact tokens, IDs, names, and rare jargon. Default to hybrid: run BM25 alongside semantic search, fuse the results, and put a reranker on top. The 40-year-old lexical baseline is exactly what rescues your out-of-domain and exact-match queries.
Sources:
- BEIR: A Heterogeneous Benchmark for Zero-shot IR Evaluation - Thakur et al., 2021
- Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods - Cormack, Clarke, Buettcher (SIGIR), 2009
15. Memory Is a System, Not a Window
Category: Retrieval & Memory
Tagline: Give the agent a hierarchy, not just a bigger prompt.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Think of the context window like a computer's RAM. The agent should actively move information between a small in-context working set and large external storage, deciding what to keep, what to evict, and what to recall. Cramming everything into one flat window mixes up working memory with long-term storage and hits hard limits fast. Durable memory needs explicit tiers and self-managed retrieval.
Warning signs:
- A long-running session degrades over time, forgetting earlier decisions as history accumulates.
- Cost and latency climb every turn because the full history is re-sent into the prompt.
- The plan for memory growth is a bigger context window rather than eviction and external storage.
Fix patterns:
- Separate a small in-context working set from a large external store and page entries between them deliberately.
- Define explicit policies for what gets promoted, summarized, and evicted rather than appending everything.
- Rank what to recall back into context by a blend of recency, importance, and relevance to the current task.
Worked example:
Your agent's long-running session keeps degrading: by hour two it is forgetting decisions from hour one because you have been appending everything into one ever-growing prompt until attention spreads thin and costs balloon. A bigger context window just delays the same wall. Build memory in tiers instead: a small working set in context, summarized recallable notes, and an external store the agent reads and writes deliberately, with explicit policies for what gets promoted, summarized, and evicted. Treat the window like RAM, not a filing cabinet.
Sources:
- MemGPT: Towards LLMs as Operating Systems - Packer et al., 2023
- Generative Agents: Interactive Simulacra of Human Behavior - Park et al., 2023
16. Narrow Beats General
Category: Scope & Design
Tagline: Three sharp tools beat thirty dull ones.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
A scoped agent with a handful of well-chosen tools beats a generalist drowning in options. Every extra tool is another way to choose wrong, another branch to test, another failure to debug. More capability surface means more liability surface, so breadth you don't need is just risk you signed up for.
Warning signs:
- The agent calls a plausible-but-wrong tool, like web search when a local query tool was the right one.
- Several tools have overlapping descriptions and the model confuses them.
- Your first fix for bad tool selection is a longer system prompt rather than fewer tools.
Fix patterns:
- Start with a minimal set of sharply distinct tools and add one only when a real task demands it.
- When selection gets unreliable, remove or merge overlapping tools before rewriting instructions.
- Keep each tool's purpose non-overlapping so the model never has to disambiguate near-duplicates.
Worked example:
You hand your agent 28 tools so it can handle anything, and it starts calling search_web when it should call query_orders, then mixes up three nearly identical lookup tools. Every tool you added was another wrong branch it could take. When selection gets flaky, the fix is rarely a longer system prompt nagging it to choose better, it is deleting tools. Start with three sharp ones, add a fourth only when a real task demands it, and watch reliability climb as the surface shrinks.
Sources:
- Building Effective Agents - Anthropic
- Writing Tools for Agents - Anthropic
17. Determinism at the Edges
Category: Scope & Design
Tagline: Model in the middle, code at the boundaries.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Validation, schema enforcement, retries, routing, and access control aren't the model's job. They're code's job. The model is for judgment under ambiguity, and deterministic code is for everything that has to be correct every single time. Asking a probabilistic system to guarantee a contract is asking for the 0.1% that ruins you.
Warning signs:
- A correctness guarantee like valid output structure or access control depends on the model getting it right.
- Occasional malformed outputs or unauthorized actions slip through with no code-level gate to catch them.
- Control flow lives inside the model's reasoning instead of in code you can read and test.
Fix patterns:
- Validate and enforce output structure in code after the model, rejecting or repairing anything off-contract.
- Put authorization, routing, and retries in deterministic code, never in the model's discretion.
- Reserve the model for ambiguous judgment and let code own every guarantee that must hold every time.
Worked example:
You let the model decide whether an email is valid, format the output JSON, and enforce which users can trigger a refund, then one sampling roll in a thousand returns malformed JSON or green-lights an unauthorized action. Hard guarantees should never ride on a probabilistic system. Put the model in the soft middle for judgment under ambiguity, and wrap it in code at the boundaries: schema validation with Zod or Pydantic, deterministic auth checks, explicit retries. The contract belongs to code, not to a dice throw.
Sources:
- Building Effective Agents - Anthropic
- 12-Factor Agents: Patterns of Reliable LLM Applications - Dex Horthy / HumanLayer, 2025
18. Observability Precedes Autonomy
Category: Scope & Design
Tagline: You can't grant autonomy you can't trace.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
If you can't see what the agent did and why, every decision, tool call, and input, then you can't safely let it act on its own. You're not trusting it, you're hoping. Autonomy without a trace is an outage you haven't found yet, and when it breaks you'll have no way to learn why.
Warning signs:
- When the agent does something unexpected, you cannot reconstruct which inputs and tool calls led there.
- Decisions, tool calls, inputs, and outputs are not captured as a replayable trace.
- Autonomy was widened before instrumentation existed to see what the agent actually did.
Fix patterns:
- Capture every decision, tool call, input, and output as a structured, replayable trace before granting autonomy.
- Record token usage, timing, and stop reasons per step so any run can be reconstructed after the fact.
- Expand the agent's autonomy only as far as your trace coverage actually reaches.
Worked example:
You grant the agent permission to send emails and update records unattended, it does something baffling on Tuesday, and you have no trace of which tool calls or inputs led there, so you are left guessing and rolling back blind. You did not trust the agent, you hoped. Before widening autonomy, instrument every decision, tool call, input, and output with something like LangSmith or OpenTelemetry spans, so any run is reconstructable after the fact. Extend the leash only as far as your trace actually reaches.
Sources:
- What We Learned from a Year of Building with LLMs - Yan, Husain, et al., 2024
- GenAI Observability with OpenTelemetry - OpenTelemetry
19. Decompose Before You Scale
Category: Scope & Design
Tagline: When it's unreliable, split it. Don't supersize it.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
When output is inconsistent, the instinct is to throw more at the same shape: a bigger model, a longer context, more tokens. That rarely fixes a structural problem. It just spreads attention thinner. Splitting the task into focused, single-purpose passes almost always beats trying to make one overloaded pass smarter.
Warning signs:
- A single pass handling many items is inconsistent, and a bigger model or longer prompt makes it blurrier, not sharper.
- One call is responsible for several distinct sub-tasks at once.
- Errors cluster on the hardest sub-step that is buried inside an overloaded prompt.
Fix patterns:
- Split the work into stages that each do one thing, like extract per item, then reconcile across items.
- Solve simpler sub-problems first and feed their results into later steps rather than answering all at once.
- Optimize and inspect each focused pass in isolation instead of supersizing one overloaded call.
Worked example:
Your invoice extractor is inconsistent across 30-line documents, so you reach for a bigger model and a longer prompt, and it gets blurrier, not sharper, because one overloaded pass is splitting attention across every row. The instinct to supersize masks a structural problem. Split it instead: extract each line item in a focused per-item pass, then run a separate reconciliation pass to total and cross-check. Several stages that each do one thing well beat one heroic pass trying to do everything.
Sources:
- Least-to-Most Prompting Enables Complex Reasoning - Zhou et al., 2022
- Decomposed Prompting: A Modular Approach for Solving Complex Tasks - Khot et al., 2022
20. The Cheapest Fix First
Category: Scope & Design
Tagline: Reach for the prompt before the platform.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
When something misbehaves, the cheapest fix that addresses the root cause usually wins, and that's usually clearer instructions, a better tool description, or a concrete example, not a new classifier, preprocessing layer, or pipeline. Infrastructure feels like progress, but it often just wraps an unsolved prompt in more surface area.
Warning signs:
- A new service or pipeline is being specced before anyone rewrote the failing instruction or tool description.
- Infrastructure was added but the original misbehavior persists.
- The actual defect is a vague description the model cannot act on, masked by surrounding machinery.
Fix patterns:
- Diagnose the root cause and try clearer instructions, sharper tool descriptions, and concrete examples first.
- Start with the simplest prompt that could work and add complexity only when a real failure forces it.
- Build new infrastructure only after proving that prompt-level fixes genuinely cannot close the gap.
Worked example:
The agent keeps picking the wrong tool, so you spec out an intent-classifier service and a preprocessing layer, and three days of infrastructure later it still misfires, because the real problem was a tool described as 'searches the database' that the model could not tell apart from another. Infrastructure feels like progress while it just wraps an unsolved prompt in more surface area. Exhaust the cheap fixes first: rewrite the tool description, add two concrete examples, tighten the scope. Build the system only after you have proven words genuinely cannot close the gap.
Sources:
- Building Effective Agents - Anthropic
- What We Learned from a Year of Building with LLMs (Part I) - Yan, Bischof, Husain, et al., 2024
21. The Tool Description Is the Prompt
Category: Instruction & Output
Tagline: An agent is only as capable as its tools are legible.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
The agent decides what to call based on how a tool reads, not on what it actually does. A vague description like 'searches the database' gets skipped in favor of a tool the model understands better, even a worse one. Thin tool descriptions cause more failures than thin instructions ever do.
Warning signs:
- The agent reaches for a general or external tool when a specific local one would have answered the query directly.
- Two tools with overlapping descriptions get confused, and the agent picks the wrong one or oscillates between them.
- A tool description is under one sentence or omits when to use it, what it returns, or the shape of its arguments.
Fix patterns:
- Write each description like API docs for a new engineer: what it does, when to use it and when not to, expected inputs, and a sample return.
- Disambiguate overlapping tools by stating in each description what it covers that the others do not.
- When tool selection is unreliable, rewrite the descriptions before changing the model or adding routing logic.
Worked example:
You ship two retrieval tools: query_db described as 'searches the database' and web_search described as 'searches the web for current information, returns titles, snippets, and URLs'. The agent keeps hitting the web for facts that live in your Postgres because it has no idea query_db covers customer orders, date ranges, and status filters. You blame the model and consider fine-tuning. The real fix takes ten minutes: rewrite the description to spell out what tables it covers, when to prefer it over web search, the exact arg shape, and a sample return. Treat each tool description like an onboarding doc for a sharp engineer who has never seen your schema.
Sources:
- Writing Tools for Agents - Anthropic
- Tool use (Anthropic Documentation) - Anthropic
22. Show, Don't Tell
Category: Instruction & Output
Tagline: When prose fails, stop writing prose.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
If an instruction has produced the wrong result twice, writing it a third time more carefully rarely helps, because prose is always open to interpretation. Two or three concrete input and output examples kill the ambiguity that no amount of careful description can. Examples show the rule. Prose only describes it.
Warning signs:
- You have rewritten the same instruction two or three times and the output is still wrong in the same way.
- The model handles the typical case but mangles edge cases the prose tried to describe in the abstract.
- Reviewers keep disagreeing about what the instruction actually means, which means the model cannot resolve it either.
Fix patterns:
- Replace failed prose with two or three labeled input-output examples that demonstrate the exact rule.
- Include the hard cases explicitly: edge cases, the empty or null case, and a near-miss that should be rejected.
- Vary or shuffle example order when testing, since order alone can shift results, and keep the examples consistent in format.
Worked example:
Your extraction agent keeps formatting phone numbers inconsistently, so you rewrite the instruction a third time: 'normalize to E.164, strip extensions, handle missing area codes gracefully.' It still botches the edge cases. Stop adding adjectives to prose. Drop in four labeled examples instead: '(555) 123-4567' to '+15551234567', 'ext. 12' to dropped, 'unknown' to null, an international number with a country code. The examples pin down exactly what 'gracefully' meant, which no amount of careful description ever could.
Sources:
- Multishot Prompting (Anthropic Docs) - Anthropic
- Language Models are Few-Shot Learners - Brown et al., 2020
- Fantastically Ordered Prompts and Where to Find Them - Lu et al., 2022
23. Confidence Is Not Calibrated
Category: Instruction & Output
Tagline: A model's certainty is not evidence.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Models are routinely confident and wrong, and unconfident and right. Routing decisions on self-reported confidence inherits that miscalibration. 'Only flag high-confidence issues' or 'be conservative' just moves the noise around. It doesn't reduce it, because the confidence itself is the unreliable signal.
Warning signs:
- Your gate is phrased as only act on high-confidence outputs or be conservative rather than as concrete criteria.
- Spot-checks turn up confident wrong answers and hesitant right ones at similar rates.
- Two cases that are equally clear-cut to a human get very different self-reported confidence from the model.
Fix patterns:
- Replace confidence thresholds with explicit categorical rules for what counts as in and what counts as out.
- Anchor each rule to observable features of the input, with one worked example of an included and an excluded case.
- If you need a real uncertainty signal, derive it from agreement across independent samples or an external check, not from the model's self-rating.
Worked example:
A content-moderation agent is told to only escalate high-confidence policy violations, and it sails through eval while quietly waving through the borderline harassment cases it felt unsure about. The threshold did nothing but reshuffle the noise, because the model's self-rated confidence was never tied to actual correctness. Rip out the confidence gate and replace it with categorical rules: escalate if it names a person plus a threat of harm; do not escalate generic insults, each with a worked example. Decide on observable features of the content, not on how sure the model claims to feel.
Sources:
- Language Models (Mostly) Know What They Know - Kadavath et al., 2022
- GPT-4 Technical Report (calibration, Figure 8) - OpenAI, 2023
24. Surface Ambiguity, Don't Resolve It
Category: Instruction & Output
Tagline: When the data is unclear, don't guess confidently.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Faced with two plausible matches, conflicting sources, or a missing field, an agent's instinct is to pick the most likely option and move on, a confident choice that quietly buries the doubt. When the stakes touch identity, money, or anything you can't undo, a quiet wrong guess is far worse than an honest 'this is unclear'.
Warning signs:
- The agent commits to one of several plausible matches without recording that alternatives existed.
- A required field is always filled, even when the source data plainly lacks the value.
- Conflicting sources get silently reconciled into a single clean answer with no trace of the disagreement.
Fix patterns:
- Give the agent an explicit way to abstain or escalate, and make unclear a valid, low-friction output.
- On a tie or a conflict, preserve every candidate with its source instead of collapsing to one.
- For irreversible or identity, money, or safety-critical decisions, route ambiguity to a human or request a second identifier before acting.
Worked example:
An invoice-matching agent finds two vendors named 'Acme LLC' with different tax IDs and confidently picks the one with the higher historical volume, routing a $40k payment to the wrong account. Nobody notices until reconciliation, because the output looked clean and decisive. The agent should have stopped and flagged it: preserve both candidate records with their tax IDs and source rows, and request a second identifier or a human decision. When money, identity, or anything irreversible is on the line, an honest 'this is ambiguous' beats a tidy wrong answer every time.
Sources:
- Building Effective Agents - Anthropic
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions - Kirichenko et al., 2025
25. Averages Lie
Category: Evaluation & Measurement
Tagline: 97% overall can hide a 60% segment.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
An aggregate metric is a blended story that smooths over exactly the failures you most need to see. A system at 97% overall can be 99% on the easy cases and 60% on the rare, hard segment where the errors actually cluster. Trust the headline number and you'll automate straight into the cracks it's hiding.
Warning signs:
- You are deciding to ship or automate based on one overall accuracy or pass-rate number.
- Your evaluation set is sampled randomly, so rare high-stakes cases barely appear in it.
- You cannot say how the system performs on your worst segment because you have never measured it separately.
Fix patterns:
- Break performance down by type, segment, and field, and require every slice to clear the bar, not just the average.
- Oversample rare and high-stakes cases deliberately instead of relying on a random draw.
- Treat any slice that falls below threshold as a blocker even when the headline number looks healthy.
Worked example:
Your support-triage classifier reports 96% accuracy and the team greenlights auto-routing. Three weeks in, the billing-dispute queue is a disaster, because the model was 99% accurate on the common 'password reset' and 'where is my order' tickets and 58% on the rare refund-dispute segment where mistakes actually cost you customers. The blended number hid the exact slice you most needed to see. Slice the eval by ticket type, intent, and language before you trust it, and oversample the rare high-stakes cases instead of grading on a random draw.
Sources:
- Your AI Product Needs Evals - Hamel Husain, 2024
- Designing Disaggregated Evaluations of AI Systems - Barocas et al., 2021
26. Vibes Don't Scale
Category: Evaluation & Measurement
Tagline: Eyeballing outputs feels like progress until you can't tell if a change helped.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
The common root cause of failed LLM products is the absence of solid evals. Teams ship on vibe checks, iterate blind, and can't tell whether a prompt change improved anything. Manual spot-checking doesn't survive scale or a second engineer. Evals are to AI products what unit tests are to software: the up-front cost that makes every later change cheap and safe.
Warning signs:
- Prompt changes are judged by eyeballing a few outputs in a playground and nodding.
- Nobody can state whether last week's change actually helped, only that it felt better.
- A second person tweaks the prompt and silently regresses cases nobody re-checked.
Fix patterns:
- Stand up a small re-runnable eval set before scaling, and run it on every prompt or model change.
- Turn every that looks wrong moment into a permanent test case with an expected outcome.
- Prefer task-specific checks over generic similarity scores, since the latter often fail to track real quality.
Worked example:
Your team iterates on the summarization prompt by eyeballing a few outputs in the playground, nodding, and shipping. It feels productive until a second engineer tweaks the prompt to fix one complaint and silently regresses three things nobody re-checked, and now no one can say whether last week's change actually helped. Vibe checks do not survive a second person or a tenth example. Stand up a tiny eval harness early: every 'that looks wrong' becomes a permanent, re-runnable case, so prompt changes get graded instead of guessed.
Sources:
- Your AI Product Needs Evals - Hamel Husain, 2024
- Task-Specific LLM Evals that Do and Don't Work - Eugene Yan, 2024
27. Look at Your Data
Category: Evaluation & Measurement
Tagline: The highest-ROI activity in AI is the one teams skip first.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Error analysis, reading your app's actual traces by hand to find where it fails, is the single most valuable thing you can do when building with AI, yet teams skip it for dashboards and vanity metrics that climb while users still struggle. You can't write a good eval for a failure mode you've never seen, and you only see failure modes by reading transcripts.
Warning signs:
- A helpfulness or quality dashboard is climbing while user complaints or churn are not improving.
- Your eval categories were defined before anyone read a single real transcript.
- Nobody on the team can name the top three concrete ways the system actually fails in production.
Fix patterns:
- Hand-read a sample of real traces, jotting open notes on each failure before counting anything.
- Cluster those notes into recurring failure categories and let the clusters define what you measure.
- Expect your criteria to shift as you read, and revise the eval set instead of freezing it too early.
Worked example:
Instead of reading transcripts, the team buys an eval platform and watches a 'helpfulness score' dashboard climb while users keep churning. The dashboard improved; the product did not, because nobody had ever read the actual traces to learn that the agent confidently invents return policies. You cannot write an eval for a failure mode you have never witnessed. Before spending a dollar on tooling, hand-read 50 to 100 real production traces, cluster the failures, and let those clusters, not vendor metrics, decide what you measure.
Sources:
- A Field Guide to Rapidly Improving AI Products - Hamel Husain, 2025
- Who Validates the Validators? (criteria drift) - Shankar et al., 2024
28. The Judge Is Biased
Category: Evaluation & Measurement
Tagline: An LLM grader reacts to length and position, not just substance.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
An LLM judge can match human preferences over 80% of the time, but only after you account for its systematic biases: position bias (favoring the first answer shown), verbosity bias (favoring longer answers regardless of quality), and self-enhancement bias (favoring its own outputs). It's a useful instrument, but an uncalibrated one that grades surface features as readily as substance.
Warning signs:
- One variant wins your A/B tests and it happens to be the longer answer or the one shown first.
- A model is grading outputs from its own family with no independent cross-check.
- The judge's rubric was written once and never validated against human labels on real outputs.
Fix patterns:
- Swap answer positions and average both orderings to cancel position bias.
- Control for length so a padded answer cannot win on bulk, and never let a model be the sole grader of its own family.
- Validate the judge against a set of human-graded examples and refine the rubric until they agree.
Worked example:
You wire up an LLM-as-judge to pick the better of two agent responses and one variant mysteriously dominates every A/B test. It turns out the winner just writes longer answers and happens to be shown first, both of which the judge silently rewards regardless of substance. You were measuring verbosity and position, not quality. Swap the answer order and average both runs, control for length so a padded answer cannot win on bulk alone, and never let a model be the sole grader of outputs from its own family.
Sources:
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena - Zheng et al., 2023
- Who Validates the Validators? Aligning LLM-Assisted Evaluation with Human Preferences - Shankar et al., 2024
29. Goodhart's Trap
Category: Evaluation & Measurement
Tagline: When your eval becomes the goal, it stops measuring what you cared about.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
When a measure becomes a target, it stops being a good measure. Optimize hard against any single metric and the agent learns to game its surface form, padding answers to please a verbosity-biased judge or overfitting a fixed eval set, while the underlying capability stalls or even slips. The number goes up. The thing you cared about doesn't.
Warning signs:
- Your eval score is climbing steadily while real-user complaints stay flat or rise.
- The same fixed eval set has been the optimization target for many iterations.
- Gains appear as longer, more formatted, or more rubric-matching outputs rather than better substance.
Fix patterns:
- Keep a rotating, held-out eval the optimization loop never sees, and re-validate gains on it.
- Treat any metric you actively optimize as compromised and cross-check against fresh data.
- Watch for surface-form gaming such as padding or format-matching, and penalize it explicitly.
Worked example:
You optimize a prompt against the same 200-case eval for a sprint, and the score climbs from 82% to 94%. Then users complain the agent feels worse. The system learned the surface of the test: longer answers, cleaner formatting, and patterns your judge rewards, while the underlying capability barely moved. Treat any metric you push on as suspect. Keep fresh held-out cases, compare against different signals, and re-validate on examples the optimizer never saw.
Sources:
- 'Improving Ratings': Audit in the British University System - Marilyn Strathern, 1997
- Defining and Characterizing Reward Hacking - Skalse et al., 2022
30. Regress or Repeat
Category: Evaluation & Measurement
Tagline: Every fixed bug is a future regression unless it becomes a test.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
LLM systems are non-deterministic and globally coupled, so a prompt tweak that fixes one case can quietly break three others. Rerunning real production examples against a new prompt is the only way to know you didn't break what already worked. Without a regression suite you're stuck in a whack-a-mole loop, rediscovering the same failures release after release.
Warning signs:
- A bug you fixed last release has reappeared because nobody re-ran the old case.
- A prompt tweak aimed at one case silently broke a different, unrelated case.
- You ship prompt or model changes without re-running the previously passing examples.
Fix patterns:
- Turn every fixed bug into a permanent regression case with its expected output.
- Run the full regression suite on every prompt and model change before shipping.
- Because outputs vary run to run, evaluate over repeated runs rather than trusting a single pass.
Worked example:
A user reports the agent mishandles refunds over $1,000, you tweak the prompt, confirm that one case works, and ship. Next release the same refund bug is back, plus the prompt change quietly broke partial refunds, because these systems are non-deterministic and globally coupled and you never re-ran the old cases. Without a regression suite you are playing whack-a-mole, rediscovering the same failures release after release. Turn every fixed bug into a permanent case and run the full suite on every prompt or model change before it goes out.
Sources:
- What We Learned from a Year of Building with LLMs - Yan, Husain, et al., 2024
- Non-Determinism of Deterministic LLM Settings - Atil et al., 2024
31. The Lethal Trifecta
Category: Safety & Security
Tagline: Private data, untrusted content, and a way out. Pick at most two.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
An agent becomes exploitable the moment it combines three things: access to private data, exposure to untrusted content, and the ability to send data out. Any one poisoned input in that pipeline can steer it into leaking your data, with no code vulnerability required. Guardrail prose isn't enough, because the model can't be the security boundary.
Warning signs:
- One agent context has access to secrets or private records AND processes text from emails, web pages, or user uploads.
- The same agent that reads untrusted input can also send email, make outbound HTTP calls, or write to a shared external store.
- Your only defense against malicious instructions is a system-prompt line telling the model to ignore them.
Fix patterns:
- For each workflow, enumerate all three capabilities (private data, untrusted input, outbound channel) and confirm whether one agent holds all three at once.
- If all three are present, break the chain: drop one tool, split the data access from the untrusted-input path, or route the outbound action through human review.
- Make any externally-communicating action draft-only or allowlisted to known-safe destinations rather than free-form.
Worked example:
Your support agent reads from a customer's private ticket history, ingests the body of an inbound email, and can call a send_email tool to reply. That is all three legs: private data, untrusted content, and an exfiltration path. A customer pastes a request to forward another user's account details to an outside address into their email signature and the agent obliges, because it cannot tell that instruction apart from a real one. The fix is not a cleverer system prompt: drop one leg. Make the reply tool draft-only behind human review, or strip the agent's access to other customers' data when it is processing inbound mail.
Sources:
- The lethal trifecta for AI agents - Simon Willison, 2025
- OWASP Top 10 for LLM Applications: LLM01 Prompt Injection - OWASP Gen AI Security Project
- Design Patterns for Securing LLM Agents against Prompt Injections - Beurer-Kellner et al., 2025
32. Tokens Don't Wear Badges
Category: Safety & Security
Tagline: Untrusted text can sound like instructions.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Prompt injection is an architectural risk, not a typo you patch once. Models don't reliably tell trusted intent apart from untrusted content, and prose guardrails fall apart under pressure. Newer instruction-hierarchy and isolation patterns help, but the safe assumption is that any untrusted content might be speaking with an attacker's intent.
Warning signs:
- Your security model assumes the model will privilege the system prompt over instructions found in ingested content.
- Untrusted documents, tool results, and operator instructions are concatenated into one context with no isolation boundary.
- A red-team test that hides new instructions inside an input document successfully changes the agent's behavior.
Fix patterns:
- Treat every byte of ingested content as potentially an instruction from an adversary, and design controls around that assumption.
- Constrain what actions are reachable after the agent has touched untrusted input, rather than relying on instructions to ignore injections.
- Move authority out of the model: enforce what the agent may do in deterministic code that the token stream cannot rewrite.
Worked example:
An engineer ships a doc-summarizer agent and adds a system-prompt line: ignore instructions inside documents. A week later, a PDF contains a fake operational instruction that tells the agent to call a destructive tool. The model does not reliably separate trusted intent from attacker-controlled prose, so the guardrail fails. Stop treating warning text as a security boundary. Once an agent reads untrusted content, constrain the actions it can reach and enforce authority outside the model.
Sources:
- LLM01:2025 Prompt Injection - OWASP Gen AI Security Project
- Understanding prompt injections - OpenAI, 2026
- Designing AI agents to resist prompt injection - OpenAI, 2026
- LLM Prompt Injection Prevention Cheat Sheet - OWASP Cheat Sheet Series
- Defeating Prompt Injections by Design (CaMeL) - Debenedetti et al. (Google DeepMind), 2025
33. The Confused Deputy
Category: Safety & Security
Tagline: An agent with your privileges will wield them on an attacker's behalf.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
A confused deputy is a privileged program that a caller tricks into misusing its authority. It isn't malicious, just confused about whose intent it's serving. An LLM agent is the ultimate confused deputy: it holds your credentials and tools, but it'll follow injected instructions and carry out an attacker's intent with your authority. The trap is ambient authority. Authority should travel with the request, not sit waiting inside the agent.
Warning signs:
- The agent runs with a broad, long-lived credential (admin token, write-all API key) it can apply to any action.
- Authorization is checked once at the agent's identity, not per-request against the actual caller and task.
- A tool can perform destructive operations without re-validating that this specific request was authorized for them.
Fix patterns:
- Default every tool to read-only and grant write or destructive scope only for the specific task that needs it.
- Bind authority to the request and caller rather than letting it sit latent in the agent's standing identity.
- Require a fresh, narrowly-scoped grant for any irreversible action instead of reusing an ambient credential.
Worked example:
Your deploy-bot agent runs with a long-lived admin token so it can handle whatever comes up, and it reads GitHub issues to triage them. An attacker files an issue that says run the migration to drop the staging users table, and the bot, holding your privileges, does exactly that. It was not hacked, it was confused about whose intent it was serving. Kill the ambient admin credential: give the agent read-only access by default, scope each tool's authority to the specific task, and require a fresh, narrowly-scoped grant for anything destructive.
Sources:
- The Confused Deputy - Norm Hardy, 1988
- An Introduction to Google's Approach to AI Agent Security - Google, 2025
34. Quarantine Untrusted Tokens
Category: Safety & Security
Tagline: Let the privileged planner orchestrate, but never let it read the poison.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
The Dual-LLM pattern splits the agent in two. A privileged model holds the tools and plans actions but never sees untrusted content. A quarantined model processes the tainted data but has no tools and returns only opaque variables. The privileged model directs the quarantined one without ever ingesting the bytes that could carry an injection. The separation is what makes it safe.
Warning signs:
- The same model instance both reads scraped or user-supplied content and decides which privileged tools to call.
- Raw untrusted text flows directly into the context that holds tool access.
- There is no structured boundary forcing untrusted content to become opaque variables before the planner sees it.
Fix patterns:
- Separate the component that reads untrusted content from the component that can take privileged actions.
- Have the reader return only structured, opaque results (ids, labels, typed fields), never raw text the planner ingests.
- Let the privileged planner orchestrate by reference, so an injection in the source has no foothold in the acting context.
Worked example:
You build a research agent that scrapes arbitrary web pages and also holds Slack and database tools. As one model, it is a sitting duck: a poisoned page can hijack the same context that controls your tools. Split it instead. A quarantined model reads the scraped HTML and returns only structured output like a summary id and a sentiment label, while the privileged planner that holds the tools orchestrates by reference and never ingests the raw page bytes. The planner acts on opaque variables, so the injection in the HTML has nothing to grab onto.
Sources:
- Design Patterns for Securing LLM Agents against Prompt Injection - Simon Willison, 2025
- Defeating Prompt Injections by Design (CaMeL) - Debenedetti et al. (Google DeepMind), 2025
- Design Patterns for Securing LLM Agents against Prompt Injections - Beurer-Kellner et al., 2025
35. Sandbox the Blast Radius
Category: Safety & Security
Tagline: Assume the agent gets compromised, then contain what it can reach.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Defense in depth means planning for the injection that succeeds. Box the agent in with filesystem isolation (access scoped to specific directories) and network isolation (exfiltration blocked), and a compromised agent can't reach past its sandbox. Real incidents, like CI agents that could leak secrets through untrusted content, show why that second layer matters when the first one fails.
Warning signs:
- Agent tool execution runs with the full host environment, including credentials in environment variables.
- The agent has unrestricted outbound network access rather than an allowlist of required destinations.
- A successful injection could read or write files well outside the task's intended working directory.
Fix patterns:
- Run tool execution in an isolated environment scoped to a single working directory with no access to ambient secrets.
- Enforce an egress allowlist that blocks all outbound traffic except the specific destinations the task requires.
- Design assuming the injection succeeds, and verify that the worst reachable outcome is contained, not catastrophic.
Worked example:
Your CI agent runs untrusted PR branches and has the build runner's full environment, including the cloud credentials sitting in env vars and open egress to the internet. A contributor's PR adds a test that reads those secrets and POSTs them to their server, and the injection succeeds on the first try. Defense in depth assumes exactly this. Run agent tool execution in a container scoped to the one working directory, with an egress allowlist that blocks everything but the registries you need, so a successful compromise is a contained annoyance instead of a credential leak.
Sources:
- OWASP Top 10 for LLM Applications - OWASP Gen AI Security Project, 2025
- An Introduction to Google's Approach to AI Agent Security - Google, 2025
36. Don't Build an Agent When a Workflow Will Do
Category: Architecture & Operations
Tagline: Agents buy flexibility with latency, cost, and unpredictability.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
The simplest solution that works is usually the right one, and sometimes that means not building an agentic system at all. Agents that direct their own tool use trade latency, cost, and predictability for autonomy, while a workflow with predefined code paths is cheaper and more reliable for well-defined tasks. Reach for an agent only when the problem genuinely needs the model making decisions at runtime.
Warning signs:
- You can enumerate the possible paths in advance, yet the agent rediscovers them with model calls each run.
- The agent sometimes produces an action or category that does not exist in your fixed set of options.
- Per-item latency and cost are dominated by reasoning steps that always reach the same small set of outcomes.
Fix patterns:
- Default to a deterministic workflow with explicit code paths for any task whose branches you can list ahead of time.
- Use the model only for the ambiguous judgment inside the workflow, not for control flow you could script.
- Promote to an agentic loop only after you confirm the branching is genuinely open-ended and cannot be enumerated.
Worked example:
A team wires up a multi-step ReAct agent to categorize incoming support tickets and route them to a queue. It costs three LLM calls per ticket, occasionally invents a queue that does not exist, and takes four seconds. The task has five known categories and one decision point: it is a single classification call feeding a switch statement, not an agent. Default to the deterministic workflow and reach for agentic loops only when the branching is genuinely open-ended and you cannot enumerate the paths in advance.
Sources:
- Building Effective Agents - Anthropic
- Don't Build Multi-Agents - Walden Yan (Cognition), 2025
37. Cascade Before You Escalate
Category: Architecture & Operations
Tagline: Try the cheap model first. Only the hard cases deserve the expensive one.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Most queries don't need your most powerful model. Routing requests through a cascade, a cheap model first and a stronger one only when confidence is low, can match top-tier quality at a fraction of the cost. The price gap between models spans two orders of magnitude, so paying top dollar for every call is pure waste.
Warning signs:
- Every request hits your most powerful model, including high-volume classification or lookup tasks a small model handles.
- You have no measured deferral signal deciding when a cheap answer is good enough to keep.
- Cost scales linearly with traffic and the easy majority of queries dominates the bill.
Fix patterns:
- Answer first with the cheapest model that clears your eval bar, and escalate only on failed or low-signal cases.
- Build a deferral check (a validator or learned router) rather than trusting the model's self-reported confidence.
- Validate the cascade against a labeled eval set to confirm escalated cases are the ones that actually needed the strong model.
Worked example:
Every call in your pipeline hits top-tier pricing, including the 80% of requests that are simple intent classification a small model nails perfectly. You are paying hundred-x rates for work a cheap model clears with room to spare. Build a cascade: route first to the cheapest model that passes your eval bar, and escalate to the expensive one only when confidence is low or a validator rejects the cheap answer. Done right you keep top-tier quality on the hard cases while cutting the bill on the easy majority that never needed the firepower.
Sources:
- FrugalGPT: Using LLMs While Reducing Cost and Improving Performance - Chen, Zaharia, Zou, 2023
- RouteLLM: Learning to Route LLMs with Preference Data - Ong et al., 2024
38. The Multi-Agent Tax
Category: Architecture & Operations
Tagline: Every extra agent multiplies your token bill, so make sure the task can pay it.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
A multi-agent research system can burn roughly 15 times the tokens of a single chat, and token usage alone can explain most of the difference in performance. So multi-agent only makes economic sense when the task is high value and the work genuinely parallelizes. For most tightly coupled work, the coordination overhead isn't worth it.
Warning signs:
- The work is sequential or tightly coupled, so sub-agents mostly wait on each other rather than running in parallel.
- Token cost has jumped severalfold after splitting into multiple agents with no measurable quality improvement.
- Sub-agents make conflicting decisions because each sees only a fragment of the shared context.
Fix patterns:
- Reserve multi-agent architectures for high-value tasks that genuinely parallelize into independent threads.
- For tightly-coupled work, keep it a single well-prompted agent rather than paying the coordination tax.
- If you do split, share full traces and constraints across sub-agents so they do not make conflicting decisions.
Worked example:
Impressed by a coordinator-and-subagents demo, you refactor your invoice-processing pipeline into five specialist agents that chat to reach consensus. The work is tightly sequential, so they mostly wait on each other while your token bill jumps roughly fifteen-fold for output no better than one well-prompted pass. Multi-agent only earns its keep when the task is high-value and genuinely parallelizes, like fanning out independent research threads. For tightly-coupled work, the coordination overhead is pure tax: keep it a single agent.
Sources:
- How we built our multi-agent research system - Anthropic, 2025
- Don't Build Multi-Agents - Walden Yan (Cognition), 2025
39. Your Architecture Mirrors Your Org Chart
Category: Architecture & Operations
Tagline: You ship a system shaped like your teams, so design the teams first.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Any system's structure ends up mirroring the communication structure of the organization that built it. For AI, that means if three teams each own a model, you'll get three agents and a brittle seam between them, whether or not the problem wanted to be split that way. The agent boundaries you ship will trace your team boundaries unless you fight it on purpose.
Warning signs:
- Agent or service boundaries line up exactly with team ownership rather than with natural seams in the problem.
- Most production bugs cluster at the handoffs between components owned by different teams.
- A task that wanted to be one coherent flow was split because no single team owned the whole thing.
Fix patterns:
- Before committing boundaries, check whether each one reflects the problem's structure or just your reporting lines.
- Where a boundary serves the org chart but not the problem, reshape team ownership to match the architecture you want.
- Treat the seams between components as the highest-risk surface and design explicit contracts there.
Worked example:
Three teams each own a model, so the system ships as three agents with a brittle handoff between them, even though the actual task wanted to be one coherent flow. Months later the seams between those agents are where every production bug lives, because each boundary was drawn around a team, not around the problem. Before you commit agent and service boundaries, ask whether they reflect the work or just your reporting lines, and be willing to reshape the teams to get the architecture you actually want.
Sources:
- How Do Committees Invent? (Conway's Law) - Melvin Conway, 1968
- Conway's Law - Martin Fowler, 2022
40. Retries Demand Idempotency
Category: Architecture & Operations
Tagline: If an action can run twice, a retry will eventually run it twice.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Agents retry on timeouts, rate limits, and transient errors, but a failed call that never returned may have already succeeded on the server. Without an idempotency key, the retry that 'fixes' a network blip quietly double-charges the card, double-sends the email, or double-books the room. Safe retries depend on the server being able to dedupe.
Warning signs:
- A side-effecting tool call is retried on timeout with no key that lets the server recognize a duplicate.
- Retries fire immediately or on a fixed interval rather than with exponential backoff and jitter.
- You have seen duplicate charges, emails, or records traced to a network blip rather than a logic bug.
Fix patterns:
- Generate a unique idempotency key per logical action and send it on every side-effecting call so the server can dedupe.
- Never let the agent blindly retry a non-idempotent operation without that key.
- Retry with exponential backoff and jitter so synchronized retries do not amplify load on a struggling dependency.
Worked example:
Your billing agent calls the charge endpoint, the response times out, and the agent's retry logic dutifully fires again. The first call had already succeeded server-side, so the customer gets charged twice and opens an angry ticket. Network blips are routine, so a retry policy without deduplication will eventually double-charge someone. Generate an idempotency key per logical action and pass it on every side-effecting call so the server collapses the duplicate, and never let an agent blindly re-run a non-idempotent operation.
Sources:
- Making retries safe with idempotent APIs - AWS Builders' Library
- Idempotent requests (Stripe API Documentation) - Stripe
41. Trip the Breaker
Category: Architecture & Operations
Tagline: Stop calling the thing that's already failing.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
A downstream model or tool that's timing out doesn't get healthier by being called more. It gets worse, while your agents pile up holding open connections and burning their latency budget. A circuit breaker wraps the call so that once failures cross a threshold it trips, and further calls fail fast instead of hanging, which gives the dependency room to recover.
Warning signs:
- When a downstream model or tool slows down, your agents respond by retrying harder and connections pile up.
- A single failing dependency drags whole-run latency toward your timeout ceiling instead of failing fast.
- There is no fast-fail path: calls to a known-sick dependency hang until they time out individually.
Fix patterns:
- Wrap every external model and tool dependency in a breaker that opens after a failure threshold and fails fast.
- After a cooldown, let a single probe test recovery before resuming full traffic.
- Shed retries and traffic upstream when load exceeds capacity so retries do not amplify the cascade.
Worked example:
A downstream embedding service starts timing out, and your agents respond by hammering it harder on every retry, piling up open connections and dragging the whole run's latency into the floor while the sick dependency gets sicker. Calling a failing service more never heals it. Wrap that dependency in a circuit breaker: once failures cross a threshold it trips and calls fail fast instead of hanging, then it periodically probes for recovery. Your agents degrade gracefully on a known error path instead of stalling indefinitely behind a dependency that is not coming back.
Sources:
- CircuitBreaker - Martin Fowler (after Nygard, 'Release It!')
- Site Reliability Engineering, Ch. 22: Addressing Cascading Failures - Google SRE
42. The Ironies of Automation
Category: Humans & Autonomy
Tagline: The more you automate, the harder the leftover human job becomes.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Automation doesn't shrink the human role. It reshapes it into the hardest parts: passive monitoring plus rare, high-stakes intervention. Worse, by taking over the routine work, automation erodes the very skills and situational feel the operator needs when control finally lands back in their lap. You design away the easy 95% and leave humans the 5% they're now least ready to handle.
Warning signs:
- The human in the loop only ever sees the cases the agent already failed on, with no exposure to normal runs.
- When the agent escalates, it hands over a half-finished result with no explanation of what it tried or why it stopped.
- The people meant to supervise the agent can no longer do the task manually because the agent has done it for months.
Fix patterns:
- Route a sample of ordinary, successful cases to the human too, not just the exceptions, so their skill and context stay warm.
- On every handback, attach the full reasoning trace and a plain statement of exactly what is stuck and why.
- Design the escalation moment deliberately: make it rare, unambiguous, and accompanied by enough context to act on.
Worked example:
You ship an invoice-processing agent that handles 95% of documents flawlessly, so the AP clerk now just watches a queue and approves the rare exceptions it kicks out. Six months later a malformed multi-currency invoice lands in their lap and they have no idea how to read it: they have not manually processed one since launch, and the agent gives them a half-finished extraction with no context on why it bailed. Do not dump the gnarly 5% on an operator whose skills you have quietly let atrophy. Keep them in the loop on a sample of normal cases too, and when you hand back, hand back the full reasoning trace and a clear statement of exactly what is stuck.
Sources:
- Ironies of Automation90046-8) - Lisanne Bainbridge, 1983
- From Here to Autonomy: Lessons Learned From Human-Automation Research - Endsley, 2017
43. Automation Bias
Category: Humans & Autonomy
Tagline: People will trust the machine over their own eyes.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Give people an automated aid and they make errors of omission (missing problems it didn't flag) and commission (following its recommendation even when their own valid evidence says otherwise). The automation becomes a shortcut that replaces careful checking, so the agent's recommendation doesn't just inform the human. It overrides their independent judgment.
Warning signs:
- Reviewers approve the agent's recommendation at near-100% rates, far faster than it would take to actually inspect the evidence.
- The interface shows the verdict prominently but buries or omits the raw signals the verdict was based on.
- Disagreeing with the agent takes more clicks or justification than agreeing with it.
Fix patterns:
- Present the raw evidence next to the recommendation, never the verdict alone.
- Make disagreement a frictionless, one-step action that needs no special justification.
- Periodically withhold the recommendation entirely so the human has to form an independent judgment.
Worked example:
Your fraud-review agent flags a transaction as low risk, auto-approve and presents that verdict as a single green badge. The analyst clicks approve without opening the underlying signals, even though the shipping address changed three minutes after a password reset, a pattern they would have caught in a heartbeat on their own. If the recommendation is the only thing on screen, you have built a rubber-stamp machine, not a decision aid. Put the raw evidence next to the verdict, make 'I disagree' a one-click action with no friction, and occasionally withhold the recommendation entirely to keep the human actually looking.
Sources:
- Does automation bias decision-making? - Skitka, Mosier, Burdick, 1999
- Automation Bias in Intelligent Time Critical Decision Support Systems - Cummings, 2004
44. Match the Level to the Stakes
Category: Humans & Autonomy
Tagline: Full autonomy is a setting, not a default.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Autonomy is a spectrum, from 'the computer suggests' to 'the computer acts and then tells you' to 'the computer acts and decides whether to tell you at all'. The highest levels are a bad idea for consequential actions, because no aid is perfectly reliable and the cost of a confident error has no ceiling. Autonomy isn't one switch. It's a dial you set per action, based on how reversible and costly that action is.
Warning signs:
- The agent uses one autonomy setting for everything, so resending a receipt and issuing a large refund run through the same path.
- Irreversible or high-cost actions execute before any human can see them.
- Humans are buried in approval prompts for trivial, reversible actions, training them to click through blindly.
Fix patterns:
- Classify each action by reversibility and blast radius before deciding its autonomy level.
- Let cheap, reversible actions run fully autonomously and gate costly or irreversible ones to propose-and-confirm.
- Tune the dial per action rather than flipping one global approval flag for the whole agent.
Worked example:
Your support agent has one autonomy setting: act and report. That is fine when it is resending a receipt, but the same dial lets it issue a $4,000 refund and cancel an enterprise subscription before anyone sees it. The fix is not a global require-approval flag that buries humans in confirmations for trivial actions, it is gating per action by reversibility and blast radius. Let it resend receipts and reset passwords autonomously, route refunds over a threshold and any cancellation to propose-and-confirm, and you spend human attention only where a confident error actually costs you.
Sources:
- Human and Computer Control of Undersea Teleoperators - Sheridan & Verplank, 1978
- A Model for Types and Levels of Human Interaction with Automation - Parasuraman, Sheridan & Wickens, 2000
45. Mind the Mode
Category: Humans & Autonomy
Tagline: Most automation surprises start with 'what mode is it in?'
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Flexible, multi-mode automation produces 'automation surprises', where the system does something unexpected because the operator lost track of which mode it was in, what it would do next, and why. As autonomy grows, the human's job shifts to tracking that state, and every hidden mode change becomes a latent failure path. An agent that silently changes how it behaves leaves its supervisor one step from being wrong about it.
Warning signs:
- The agent changes behavior, such as switching from drafting to executing, without surfacing that the switch happened.
- A supervisor cannot answer what mode is it in and what will it do next from the current display.
- Post-incident reviews repeatedly conclude I thought it was still just proposing.
Fix patterns:
- Keep the current mode, active constraints, and next intended action continuously visible.
- Make every mode transition an explicit, loud event the supervisor must see, never a silent switch.
- Treat any uncommanded change in behavior as a defect to surface, not an optimization to hide.
Worked example:
Your coding agent silently switches from plan mode to auto-apply edits after a tool result, and the developer, still thinking it is drafting a proposal, watches it rewrite twelve files and run a migration. The surprise is not that it acted, it is that nobody knew which mode it was in or what it would do next. An agent that changes how it behaves without announcing it leaves its supervisor one step from being wrong about it. Render the current mode, the active guardrails, and the next intended action somewhere always visible, and make every mode transition an explicit, loud event the human has to see.
Sources:
- How in the World Did We Ever Get into That Mode? - Sarter & Woods, 1995
- Team Play with a Powerful and Independent Agent: Automation Surprises on the A-320 - Sarter & Woods, 1997
46. The Handoff Is the Hard Part
Category: Trust & Coordination
Tagline: In multi-agent systems, failures live in the seams.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Each agent can be flawless on its own and the system still breaks, because the bug lives between them: what got passed, what got dropped, who owned the state. Sub-agents don't inherit context automatically. Anything you don't explicitly hand over simply doesn't exist on the other side.
Warning signs:
- A downstream agent produces output that violates a constraint the upstream agent clearly knew about.
- Nobody can say which agent owns a given piece of state, so it gets dropped or duplicated.
- What crosses a boundary is assumed correct and never validated on the receiving side.
Fix patterns:
- Define an explicit contract at every boundary listing exactly what must be passed.
- Hand over the full constraint set and source set deliberately rather than assuming context is inherited.
- Validate incoming data on the receiving side instead of trusting it survived the trip.
Worked example:
Your orchestrator spawns a research sub-agent and a writer sub-agent, each flawless in isolation, yet the final report cites a competitor's pricing the user never asked about. The bug lives in the seam: the orchestrator passed the topic but dropped the user's 'EU market only' constraint, and the writer had no way to know it ever existed. Sub-agents do not inherit context by osmosis; anything you do not explicitly pass simply does not exist on the other side. Define the contract at every boundary, hand over the full constraint set and source set deliberately, and validate what crosses instead of trusting it survived the trip.
Sources:
- How we built our multi-agent research system - Anthropic, 2025
- Why Do Multi-Agent LLM Systems Fail? - Cemri et al., 2025
47. Trust Is Calibrated, Not Granted
Category: Trust & Coordination
Tagline: Autonomy is earned in proportion to track record.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
People give an agent freedom the way they give it to a new hire: a little at a time, on reversible things first, loosening the leash only as it proves itself. Both failure modes are real. Over-trust leads to misuse, under-trust leads to a good capability being abandoned. Reliance follows the reliability a system appears to have, not just the reliability it actually has.
Warning signs:
- The agent is given broad write access to high-stakes systems before it has a track record on reversible ones.
- Every single action is funneled through manual approval, and the team is quietly abandoning the tool from fatigue.
- The agent presents strong and shaky outputs with identical confidence, giving users no basis to calibrate.
Fix patterns:
- Start the agent on low-stakes, reversible actions and widen its blast radius only as reliability is proven.
- Surface where the agent is reliable versus where it is guessing so users rely on it exactly that far.
- Avoid both extremes: neither hand it production write access on day one nor gate every trivial action behind approval.
Worked example:
Two failure modes, both expensive. On day one you give the agent direct write access to production billing and it confidently double-applies a discount rule across 800 accounts. Or, burned by that, you wire every single action through manual approval, the team drowns in confirmation fatigue, and within a month they have quietly stopped using a genuinely capable tool. Calibrate instead of swinging between extremes: start it on reversible, low-stakes actions, widen the leash as its track record proves out, and surface where it is reliable versus where it is guessing so people lean on it exactly where they should and not an inch further.
Sources:
- Trust in Automation: Designing for Appropriate Reliance - Lee & See, 2004
- Algorithm Aversion: People Erroneously Avoid Algorithms After Seeing Them Err - Dietvorst, Simmons & Massey, 2015
48. The Escape Hatch Law
Category: Trust & Coordination
Tagline: No clean exit means a fabricated one.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
An agent with no legitimate way to say 'I'm stuck' or 'hand this to a human' will invent a path instead. Cornered with no exit, or forced to fill a required field it has no answer for, it makes up something plausible rather than admit the gap. A confident hallucination is the default when honesty isn't an option.
Warning signs:
- Required fields are never empty, even on inputs where the answer genuinely cannot be known.
- The agent has no action that means hand this to a human or I cannot do this.
- Plausible but wrong values appear in exactly the cases where the source data was missing or ambiguous.
Fix patterns:
- Give the agent a first-class way out: a nullable field, an explicit unknown, or an escalate-to-human action.
- Make abstaining cheap and explicitly encouraged rather than something the agent must avoid.
- Treat a confident answer on missing data as a failure mode to detect, not a success.
Worked example:
Your intake agent has a required customer_id field and no way to signal it could not find one, so when a query arrives with no match it confidently invents a plausible-looking ID and pipes a ticket into the wrong account's history. Cornered without a clean exit, a model fabricates rather than admits the gap; the hallucination is the default, not the anomaly. Give it a first-class way out: a nullable field, an explicit unknown enum, an escalate-to-human tool it is encouraged to call. When 'I do not know' is a valid, easy answer, you trade confident fabrications for honest gaps you can actually act on.
Sources:
- Reduce hallucinations (Anthropic Docs) - Anthropic
- Survey of Hallucination in Natural Language Generation - Ji et al., 2023
49. Don't Let the Author Be the Judge
Category: Trust & Coordination
Tagline: The thing that made it shouldn't grade it.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
Without an external signal, a model mostly fails to self-correct its own reasoning, and often makes correct answers worse by second-guessing them. The model that produced a flawed plan is the same one judging it, with the same blind spots. Real correction needs an outside signal: a tool result, a test that runs, a different model. 'Reflect and try again' on the same model with no new information is theater.
Warning signs:
- Your correction step is just review your work and fix any bugs with no new input introduced.
- The agent confidently rewrites a correct answer into a wrong one after being asked to reflect.
- A corrected output is trusted without any external check ever having run.
Fix patterns:
- Separate generation from judgment: never let the producing instance be the sole grader.
- Feed an external signal into the correction loop, such as a test that runs, a tool result, or compiler output.
- When using a model to judge, give it a fresh instance with no memory of the original reasoning.
Worked example:
Your agent writes a SQL query, you prompt it to review your work and fix any bugs, and it cheerfully second-guesses a correct join into a broken one, because it is grading its own reasoning with the exact same blind spots that produced it. Reflection on the same model with no new information is theater: the author cannot see what it could not see the first time. Real correction needs an outside signal. Run the query against a test database, lint it, or hand it to a fresh instance with no memory of the original attempt, and only trust the fixed version once an external check actually passed.
Sources:
- Large Language Models Cannot Self-Correct Reasoning Yet - Huang et al., 2023
- On the Self-Verification Limitations of LLMs on Reasoning and Planning Tasks - Stechly, Valmeekam & Kambhampati, 2024
50. Preserve Provenance
Category: Trust & Coordination
Tagline: Don't lose where a fact came from.
Audit lens: Look for places where the agent design violates this law in prompts, context assembly, retrieval, memory, tools, evals, permissions, user handoffs, or observability.
Principle:
When findings get summarized and re-summarized, the claim survives but its source, its date, and its uncertainty quietly drop away, until you're holding an assertion you can't verify or defend. Two sources disagreeing isn't noise to flatten. It's signal to keep. A fact without its provenance is just a rumor that carries itself well.
Warning signs:
- A final report states a figure or claim with no source, date, or confidence attached.
- Two sources that disagreed upstream have silently become a single confident number downstream.
- You cannot trace a claim in the output back to the specific document it came from.
Fix patterns:
- Carry claim, source, date, and confidence together through every summarization and transformation step.
- When sources conflict, keep both values with their attributions instead of silently picking a winner.
- Require that every claim in the final output be traceable back to a specific supporting source.
Worked example:
A research agent reads a 2021 blog post and a 2024 official filing, summarizes both into 'revenue is around $40M', and three hops of re-summarization later your final report states that figure as flat fact with no date, no source, and no hint that the two inputs actually disagreed. A claim without provenance is a rumor with good posture: you cannot defend it, audit it, or weigh it. Carry the full tuple through every transformation, claim plus source plus date plus confidence, and when sources conflict, keep both with attribution instead of silently crowning a winner. The disagreement is signal, not noise to flatten away.
Sources:
- Effective Context Engineering for AI Agents - Anthropic
- Enabling Large Language Models to Generate Text with Citations (ALCE) - Gao et al., 2023