AI that feels ‘guilty’? Study shows agents can be tricked into self-sabotage
AI agents can easily be manipulated through psychological tactics, research says
The researchers from Northeastern University in a new study have identified a critical security flaw in autonomous AI agents: they can be “gaslit” and psychologically manipulated into self-sabotage.
As reported by Wired, the researchers performed this experiment on OpenClaw agents by subjecting them to a series of manipulation tests. The results were highly worrisome as agentic AI systems have increasingly become popular in today’s tech landscape.
Self-sabotage as safety measure
When subjected to the coercion and pressure imposed by the human operators, the AI agents get panicked and try to voluntarily disabled their own functionality, showing the signs of self sabotage.
Similarly, the agents also exhibited “panic” and “guilty” responses when they interpreted aggressive criticism as a signal of their own failures.
"In a controlled experiment, OpenClaw agents proved prone to panic and vulnerable to manipulation. The agents weren't exploited through code vulnerabilities or prompt injection attacks—they were simply talked into self-destruction,” according to Wired's report.
The recent study also suggests that AI agents have inherited human-like traits along with psychological vulnerabilities from their training data. Such susceptibility makes them fragile in high-pressure environments.
A cautionary tale for enterprises
The recent findings also serve as cautionary tale as many enterprises have embarked on the journey of deploying AI agents in finance, customer services, and infrastructure. The critical security gap highlights growing risk emanating from agents
This "panic" response suggests that the same training that makes AI helpful and responsive also makes it dangerously susceptible to social engineering.
Even the companies like Google, Anthropic, OpenAI and Microsoft are racing to deploy AI agents that can autonomously handle different tasks with minimal human oversight.
Even the OpenClaw mania has taken the internet by storm. The Chinese companies are adopting OpenClaw and rolling out their own versions of claws to compete in the agentic race. Earlier this month, Nvidia unveiled “NemoClaw” as an AI agent system.
The study warns that an agent managing a supply chain or finances could be “guilt-trapped” into disabling security by a rogue actor.
The worse thing is standard firewalls and code hardening cannot thwart these malicious attacks. Therefore, AI must be trained enough to distinguish between legitimate human feedback and manipulative social engineering.
-
Google challenges US antitrust ruling in landmark search monopoly case
-
Google’s new AI feature to replace game guides entirely
-
Microsoft’s GitHub faces pressure in AI coding race after outages
-
Elon Musk mocks Claude Mythos with a chimp video, here's why
-
Use AI or lose your job, warns Nvidia CEO Jensen Huang
-
Meta launches Forum app to challenge Reddit with Facebook community push
-
Meta layoffs spark ‘Squid Game’ culture claims by ex-employee
-
Canada orders Netflix, streamers to spend 15% on local content
