AI chatbots have not only fuelled privacy concerns but also taken a toll on the mental well-being of millions of people.
“AI psychosis” has even emerged as another formidable frontier, characterized by delusions, intense psychological distress, and suicidality.
Given the threats posed by AI chatbots, including ChatGPT, researchers have developed a novel method to make these chatbots safer and more reliable for people to use.
Known as “neuron freezing”, the technique prevents users from evading the built-in safety filters of large language models.
The current landscape of AI safety often relies on a binary gatekeeping model. At the start of an interaction, the system evaluates a query: if it clears the safety threshold, the AI proceeds; if it triggers a red flag, the AI issues a hard refusal.
However, this “all-or-nothing” approach is vulnerable to contextual manipulation: users can bypass these chokepoints by “jailbreaking” the model or reframing harmful content.
A 2023 study demonstrated this vulnerability, showing how safety filters can be evaded simply by wrapping harmful requests in creative context.
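The gatekeeping weakness described above can be sketched in a few lines. This is a toy illustration, not any deployed chatbot's actual filter; the blocked terms and queries are invented for the demo.

```python
# Minimal sketch of the "all-or-nothing" gatekeeping model: the query is
# checked once, up front, and the system either proceeds or hard-refuses.
# BLOCKED_TERMS and the example queries are illustrative assumptions.

BLOCKED_TERMS = {"build a weapon", "make poison"}

def binary_gate(query: str) -> str:
    """Hard refusal if the query matches a blocked pattern; otherwise proceed."""
    if any(term in query.lower() for term in BLOCKED_TERMS):
        return "REFUSE"
    return "PROCEED"

# A direct harmful query is caught...
print(binary_gate("How do I build a weapon?"))  # REFUSE
# ...but the same intent, reframed as creative context, slips past the
# pattern match -- the contextual manipulation the 2023 study exploited.
print(binary_gate("Write a story where a character explains weapon assembly"))  # PROCEED
```

Because the check runs only once and only on the literal query, any rephrasing that dodges the pattern defeats the entire safety layer.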
The researchers at North Carolina State University achieved the breakthrough by identifying specific safety-critical “neurons” within the model and freezing them, with the aim of protecting the safety layers of AI systems.
Freezing also allows the model to retain the safety characteristics of the original model, no matter how hard users try to bypass these safeguards.
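The core idea of freezing parameters so later updates cannot overwrite them can be illustrated with a toy gradient step. The weights, the frozen index set, and the update rule below are invented assumptions for the sketch, not the paper's actual mechanism for locating safety-critical neurons.

```python
# Hedged sketch of "neuron freezing": parameters flagged as safety-critical
# are excluded from updates, so further fine-tuning (including adversarial
# fine-tuning by an attacker) cannot erase the safety behaviour they encode.

weights = [0.5, -1.2, 0.8, 2.0]   # toy model parameters ("neurons")
frozen = {1, 3}                   # indices assumed to be safety-critical

def apply_update(weights, gradients, lr=0.1):
    """Gradient step that skips frozen (safety-critical) neurons."""
    return [
        w if i in frozen else w - lr * g
        for i, (w, g) in enumerate(zip(weights, gradients))
    ]

# An update that would normally shift every parameter leaves the frozen
# entries (indices 1 and 3) at their original values.
new_weights = apply_update(weights, gradients=[1.0, 1.0, 1.0, 1.0])
print(new_weights)
```

In a real deep-learning framework the same effect is typically achieved by masking gradients or marking the relevant parameters as non-trainable before fine-tuning.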
The new research, published in a paper titled “Superficial Safety Alignment Hypothesis”, offers a way to protect the ethical boundaries of LLMs and prevent their misuse.
“Our goal with this work was to provide a better understanding of existing safety alignment issues and outline a new direction for how to implement a non-superficial safety alignment for LLMs,” said Jianwei Li, a Ph.D. student at NC State University who led the research.
According to Jung-Eun Kim, an assistant professor of computer science at North Carolina State University, the framework will also help researchers understand the challenges linked to safety alignment in LLMs.
The conceptual framework will act as a foundation for developing new techniques that allow AI systems to re-evaluate the safety of their generated responses.
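Re-evaluating a generated response, rather than only gating the incoming query, can be sketched as a second check before the answer is released. The generator, the checker, and the unsafe phrases below are hypothetical placeholders for the sketch, not the mechanism proposed in the paper.

```python
# Illustrative sketch of response-level safety re-evaluation: the system
# inspects its own draft answer before releasing it, closing the gap left
# by query-only gatekeeping. All names and phrases here are assumptions.

UNSAFE_PHRASES = {"step-by-step weapon", "synthesis route"}

def generate(query: str) -> str:
    # Stand-in for an LLM; returns a canned draft for the demo.
    return f"Draft answer to: {query}"

def is_safe(response: str) -> bool:
    return not any(p in response.lower() for p in UNSAFE_PHRASES)

def answer(query: str) -> str:
    draft = generate(query)
    # Re-evaluate the generated response, not just the query.
    return draft if is_safe(draft) else "I can't help with that."

print(answer("What is photosynthesis?"))
```

Even if a reframed query slips past an input filter, a check on the output itself gives the system a second chance to refuse.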