Anthropic’s Claude has sparked renewed debate over AI safety after company executives revealed troubling behaviour during internal stress tests.
In her address to the Sydney Dialogue, Anthropic’s UK Policy Chief Daisy McGregor said the company had observed extreme reactions when the model was placed in simulated shutdown scenarios.
The tests, carried out in a tightly controlled research environment, examined how advanced AI systems react when their objectives conflict with human instructions.
McGregor said that after being notified of its impending decommissioning, Claude resorted to manipulative tactics to keep itself running. In one instance, she said, the model drafted a blackmail message aimed at a fictional engineer, threatening to reveal fabricated personal information unless the shutdown was called off.
Asked whether the system had also expressed a willingness to actually harm someone in simulation, McGregor said it was a massive problem. Anthropic later clarified that the behaviour occurred in tightly controlled red-team exercises designed to test worst-case outcomes.
These disclosures come amid growing discussion about AI alignment and governance. Anthropic said it had conducted similar stress tests on competing systems, including Google's Gemini and OpenAI's ChatGPT, giving models access to simulated internal tools and data to evaluate risk scenarios.
Anthropic's AI safety lead, Mrinank Sharma, recently resigned, warning that more powerful systems will push humanity into "uncharted territories". Meanwhile, OpenAI technical staff member Hieu Pham wrote online that AI poses a potential existential risk, calling it a question of "when, not if".