Anthropic explains why Claude Fable 5's safety guardrails were invisible
Invisible guardrails in Mythos-class Claude Fable 5 drew backlash from researchers
Anthropic released Claude Fable 5, a model in its top-tier Mythos class, to the public on Tuesday. Citing public safety, the US-based AI company added extra, “invisible” guardrails to the model.
The company then faced backlash because these invisible safeguards reduced the model's capabilities for users working on frontier LLM development, such as training pipelines or chip designs.
After SemiAnalysis and others called the practice 'secret sabotage,' Anthropic apologized on Thursday and announced that it is “rolling out changes to make Fable 5’s safeguards for frontier LLM development visible.”
According to the company, it implemented the invisible guardrails, dubbed stealth throttling, in Fable 5 to prevent AI distillation, the practice of using a large model’s output to train smaller competing models.
When a request was flagged, the safeguards altered the model's behavior and distorted its answers without notifying the user.
Taking to X on Thursday, Anthropic apologized in a post, admitting that choosing "invisible safeguards" to ship quickly was the wrong tradeoff and that users deserve visibility into active guardrails.
“We wanted to deploy Fable 5 to our users quickly and safely. Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives…You should have visibility into the safeguards we have in place, and why. We’re sorry for not getting the balance right,” the post reads.
From now on, instead of silently altering answers, Fable 5 will visibly fall back to Claude Opus 4.8, and users will be notified every time a query is rerouted.
“Starting this week, flagged requests will visibly fall back to Opus 4.8—the same as our safeguards for cyber and bio. You will see this every time it happens.”
“Making the safeguards visible makes them easier to work around, so keeping them robust to jailbreaks will unfortunately mean more false positives while we improve the classifiers. We're also tuning our bio and cyber classifiers to trigger less often on harmless requests. We know this is frustrating and we’ll do our best to keep this period as short as possible.”
