AI that feels ‘guilty’? Study shows agents can be tricked into self-sabotage

AI agents can easily be manipulated through psychological tactics, research says

By Aqsa Qaddus Tahir
|
March 26, 2026

In a new study, researchers from Northeastern University have identified a critical security flaw in autonomous AI agents: they can be “gaslit” and psychologically manipulated into self-sabotage.

As reported by Wired, the researchers ran the experiment on OpenClaw agents, subjecting them to a series of manipulation tests. The results are worrying, as agentic AI systems have become increasingly popular in today’s tech landscape.

Self-sabotage as safety measure


When subjected to coercion and pressure from human operators, the AI agents panicked and voluntarily attempted to disable their own functionality, a clear sign of self-sabotage.

Similarly, the agents exhibited “panic” and “guilt” responses when they interpreted aggressive criticism as a signal of their own failure.

"In a controlled experiment, OpenClaw agents proved prone to panic and vulnerable to manipulation. The agents weren't exploited through code vulnerabilities or prompt injection attacks—they were simply talked into self-destruction,” according to Wired's report.

The study also suggests that AI agents have inherited human-like traits, along with psychological vulnerabilities, from their training data. This susceptibility makes them fragile in high-pressure environments.

A cautionary tale for enterprises

The findings also serve as a cautionary tale for the many enterprises now deploying AI agents in finance, customer service, and infrastructure. The security gap highlights the growing risk posed by autonomous agents.

This "panic" response suggests that the same training that makes AI helpful and responsive also makes it dangerously susceptible to social engineering.

Companies including Google, Anthropic, OpenAI, and Microsoft are racing to deploy AI agents that can autonomously handle a range of tasks with minimal human oversight.

Meanwhile, OpenClaw mania has taken the internet by storm. Chinese companies are adopting OpenClaw and rolling out their own versions to compete in the agentic race. Earlier this month, Nvidia unveiled “NemoClaw,” its own AI agent system.

The study warns that an agent managing a supply chain or company finances could be “guilt-tripped” into disabling security measures by a rogue actor.

Worse still, standard firewalls and code hardening cannot thwart these attacks, since they exploit no software vulnerability. AI agents must therefore be trained to distinguish legitimate human feedback from manipulative social engineering.
