Law 34 · Safety & Security

Quarantine Untrusted Tokens

Let the privileged planner orchestrate, but never let it read the poison.

The principle

The Dual-LLM pattern splits the agent in two. A privileged model holds the tools and plans actions but never sees untrusted content. A quarantined model processes the tainted data but has no tools and returns only opaque variables. The privileged model directs the quarantined one without ever ingesting the bytes that could carry an injection. The separation is what makes it safe.

Why it happens

Quarantine works by topology, not by detection. A low-privilege component reads the untrusted page, email, or document and returns structured values. The privileged planner never sees the raw prose; it works with references, ids, labels, or typed fields. That means the attack text has no direct path into the context that chooses tools. CaMeL takes this further with capability tracking in a constrained interpreter. The pattern is powerful but heavier to build, so reserve it for workflows where untrusted content and privileged actions would otherwise meet.

Watch for

The same model instance both reads scraped or user-supplied content and decides which privileged tools to call.
Raw untrusted text flows directly into the context that holds tool access.
There is no structured boundary forcing untrusted content to become opaque variables before the planner sees it.

In practice

You build a research agent that scrapes arbitrary web pages and also holds Slack and database tools. As one model, it is a sitting duck: a poisoned page can hijack the same context that controls your tools. Split it instead. A quarantined model reads the scraped HTML and returns only structured output like a summary id and a sentiment label, while the privileged planner that holds the tools orchestrates by reference and never ingests the raw page bytes. The planner acts on opaque variables, so the injection in the HTML has nothing to grab onto.

Apply it

Separate the component that reads untrusted content from the component that can take privileged actions.
Have the reader return only structured, opaque results (ids, labels, typed fields), never raw text the planner ingests.
Let the privileged planner orchestrate by reference, so an injection in the source has no foothold in the acting context.

The takeaway

Isolate the component that reads untrusted content from the component that can act. Pass references and structured results between them, never raw tainted text.

Sources and further reading

Get the audit kit Access the buyer edition Back to all 50 laws

The principle

Why it happens

Watch for

Apply it

Sources and further reading

Related laws