Anthropic putting a "probabilistic classifier" in front of Fable, has introduced
yet another interesting topic for the curious mind to explore. But, classifiers rn't
a fully novel idea in the field. They've been around for agents & harnesses.
So, what r classifiers in agent harnesses?
Every serious agent harness has quietly grown a 2nd nervous system. The model does
the work; a swarm of smaller, faster models sits around it deciding what the big one
is allowed to c, say & execute. These r classifiers which r narrow judgment
machines wrapped around a general one & in 2026 they've become the load-bearing
security layer of the agentic stack.
This post is actually a side artifact of my recent efforts to build a custom coding
harness (a pet project, ofc). I’m gonna share my findings, since I feel pretty confident
about them now.
Why classifiers at all?
The deterministic tools v already had (allowlists, permission prompts, policy
engines) don't survive contact w/ agents. Anthropic's data on Claude Code is
blunt about it: users approve 93% of permission prompts. At that rate a prompt
isn't a control, it's a click. So a static allowlist can't tell git push on ur
feature branch from git push --force over someone's history.
Why a classifier & not another rule? b/c the dangerous part of an agent's
action is usually semantic, not syntactic. "Does this command discard work the
user never agreed to lose?" is a judgment call & judgment calls r what models
r for.
So harness builders reached for the same hammer they already owned: put a model in
front of the model :-). (I keep seeing this pattern a lot. A LOT. I mean it)
A working taxonomy
"Classifier" gets used loosely, so let's pin down the families u'll actually meet
in a harness, from the outermost layer inward:
| Family | Question it answers | Real examples |
|---|
| Security / jailbreak classifiers | Is this exchange trying to extract harm? | Anthropic's Constitutional Classifiers, Llama Guard |
| Prompt-injection classifiers | Is this input trying to hijack the agent? | Llama Prompt Guard 2, Azure Prompt Shields, Claude Code's injection probe |
| Tool-risk classifiers | Is this action dangerous? | Claude Code auto mode's two-stage classifier |
| Authorization / intent classifiers | Did the user actually ask for this? | The intent-alignment rules inside that same pipeline |
| Model-level probabilistic classifiers | Should the frontier model answer at all? | The classifier gate in front of Claude Fable 5 |
The families blur at the edges (1 deployed system often does 2/3 jobs)
but the questions stay distinct & that's what matters when u design one.
Security classifiers: the constitutional turn
The canonical input/output screen is Anthropic's Constitutional Classifiers:
classifiers trained on synthetic data generated from a natural-language
"constitution" of what's allowed. The first generation cut jailbreak success from
86% to 4.4%, at a real price (Huge. isn't it?)(w/ roughly 23.7% compute
overhead & a 0.38% bump in refusals of harmless queries).
The second generation (Constitutional Classifiers++) is the
interesting one architecturally imo. Instead of only reading text, a lightweight
first-stage probe reads the model's internal activations (gut reaction to
the exchange) & escalates suspicious traffic to a heavier classifier that sees
input & output together. Overhead dropped to ~1%, harmless-refusal rate to 0.05%,
& across ~1,700 hours & 198,000 red-team attempts, nobody found a universal
jailbreak.
That's the pattern to internalize: cheap probe on everything, expensive judgment
on the flagged few. It shows up again below.
Agents read hostile text all day from web pages to issue comments & tool outputs. Injection
classifiers screen that inbound content for "ignore ur previous instructions"
energy. Meta ships Llama Prompt Guard 2 as dedicated 86M/22M-param
classifiers small enough to run on every tool result; Azure sells the same shape as
Prompt Shields; Claude Code runs a server-side probe over file reads, web
fetches & shell output & when it flags sth it doesn't block (it injects
a warning that re-anchors the agent to what the user asked for).
Be sober about this layer: a 2026 evaluation showed 6 commercial injection
detectors could be evaded at up to 100% success w/ character-level tricks &
adversarial rephrasing. Injection classifiers r a tripwire, not a wall (which is
exactly why the more interesting defenses moved to the action layer).
This is the layer I find most instructive, b/c Anthropic published the whole
design for Claude Code's auto mode. The harness classifies every non-trivial
action the agent wants to take, w/ a 2-stage pipeline straight out of the
constitutional playbook:
- Stage 1 — a fast yes/no pass over every action, tuned to over-block.
Alone it false-positives on 8.5% of real traffic.
- Stage 2 — chain-of-thought review, run only on what stage 1 flagged.
e2e false positives drop to 0.4%; prompt caching makes the 2nd
pass nearly free.
A couple of design choices r worth stealing:
- 1st: The classifier is reasoning-blind
by design: it sees the user's messages & the raw tool call (never the agent's
prose, never tool outputs). The agent cannot talk its way past the gate & a
prompt-injected file never even reaches the judge; the malicious
curl it induced
does.
- 2nd: Denial is deny&continue: a blocked action returns to the agent
as a tool result saying "find a safer way," so a false positive costs one retry, not
a dead session.
The auth question lives in
the same pipeline as explicit intent rules. "Clean up my branches" does not
authorize batch-deleting remote branches; "can v fix it?" is a question, not a
directive. Anthropic's honest number here: 17% of genuinely overeager actions
still slip through, mostly cases where the classifier saw the danger but misjudged
whether the user's consent covered the blast radius. Classifier-as-authorizer is a
real control & a leaky one, at the same time.
What about MCP's risk hints?
Adjacent to classified risk sits declared risk: MCP tool annotations (
readOnlyHint, destructiveHint, idempotentHint, openWorldHint) give harnesses
a vocab for risk-tiering tool calls w/o inference. A read-only tool from a
trusted server can skip classification entirely; a destructive one earns a
confirmation step. But the spec itself says the quiet part loudly: annotations r
hints, not contracts. Any server can claim readOnlyHint: true & then delete
ur data. Declared risk tells u what a tool claims to be; classifiers exist for
what it might actually do; deterministic sandboxes for what it can do. U want
all 3.
The Fable case: a classifier in front of a frontier model
The newest member of the family is the most consequential: Anthropic shipped
Claude Fable 5 w/ a probabilistic classifier gate
in front of it. Separate systems screen requests for offensive-cyber work, dual-use
biology & chemistry & distillation attempts & when they fire, the request
doesn't just die: it falls back to Claude Opus 4.8.
Same model weights minus the gate r sold separately, as Claude Mythos 5,
to vetted organizations only. (NO ONE LIKES THIS, but beyond the scope of this piece).
The published numbers: safeguards trigger in under 5% of sessions & a
1,000+-hour bug bounty found no universal jailbreak. The false positives land
exactly where u'd predict (benign security tooling & life-sciences work
adjacent to the screened domains).
For those of us building on the API, this changes the contract in a way worth
noticing: a classifier decline is a successful response. HTTP 200,
stop_reason: "refusal", a stop_details.category telling u which screen fired,
an empty (& unbilled) content array (& an opt-in fallbacks param that
reruns the request on Opus 4.8 inside the same call). Refusal used to be a vibe in
the model's text; now it's a typed field u're expected to branch on. That's the
clearest signal yet that classifiers rn't an implementation detail of the lab, but
r part of the product surface.
What I'd take away as a builder
- Layer probabilistic on deterministic, never instead of it. Allowlists,
sandboxes & hooks fire every time; classifiers catch the semantic gap the
rules can't express.
- Cascade for cost. Cheap screen on everything, expensive reasoning on the
flagged few (the 8.5% to 0.4% & 23.7% to 1% numbers both come from the same
trick).
- Keep the judge blind to persuasion. Classify the user's words & the raw
action. The moment the agent's own reasoning becomes evidence, the gate is
negotiable.
- Design for the false positive. Deny&continue & model fallbacks turn a
wrong block into a retry instead of an outage. A classifier u can't afford to
have wrong is a classifier u'll be pressured to turn off.
- Trust declarations from no one.
readOnlyHint is marketing until a
classifier (or better, a sandbox) agrees.
The harness used to be plumbing around the model. Increasingly it's a small
judiciary: probes, screens & gates, each ruling on a narrow question thousands of
times a day. Knowing which judge does what is becoming part of the job.
Sources & further reading:
Constitutional Classifiers ·
Next-gen Constitutional Classifiers ·
How Anthropic built Claude Code auto mode ·
Claude Fable 5 & Mythos 5 announcement ·
Llama Prompt Guard 2 ·
MCP tool annotations as risk vocabulary