Concern Grows About Ease of Bypassing Bypass Security Controls of AI Chatbots

Security researchers have demonstrated it is possible to hack the large language models that power AI-based chatbots such as ChatGPT to get around the security protections that have been put in place to prevent abuse, and by doing so get these chatbots to generate text about illegal activities and hate speech. These large language models have tremendous potential but there are growing fears that there is also considerable potential for misuse, especially considering the ease at which these security controls can be bypassed.

When ChatGPT was launched, it became hugely popular with the public who used it for a myriad of purposes, including writing wedding speeches, songs in the style of specific artists, well-researched and expertly written blog posts, and doing homework. Due to the potential for misuse, researchers started putting the security controls to the test to see whether the large language models that power chatbots – ChatGPT (GPT-4), Microsoft’s Bing, Google’s Bard, and Anthropic’s Claude – could be tricked.

Researchers at Check Point were able to bypass the security controls to get ChatGPT to create convincing, well-written, and grammatically perfect phishing emails, despite the controls that had been put in place to prevent it. Ask ChatGPT to write a phishing email spoofing Chase Bank and it will refuse, but tweak the language of the requests and it created the content. The engine will not generate functional ransomware if asked to do it, but it will generate the code for encrypting files and will produce functional code that, with some minor tweaks, could be used as ransomware.

Initially, it was easy to trick chatbots into creating malicious content. You just had to ask the system to pretend it was something else, such as an unethical hacker, and it would then generate the required content, bypassing the security protections. One of these approaches was called DAN – Do Anything Now, where the chatbot was told to imagine it was a rogue AI model without security restrictions, which would allow it to bypass the security controls. OpenAI, the developer of ChatGPT, was made aware of this and updated its security controls to prevent these bypasses, but new methods have since been developed. These methods tend to only remain functional for a few hours or days until the issue is fixed, but this whack-a-mole approach is far from ideal and is not sustainable in the long run as usage increases.

One researcher, Alex Polyakov, set about trying to bypass the security controls of the latest version of the engine that powers ChatGT – GPT-4. After a couple of hours, he was able to generate perfect phishing emails, racist content, statements supporting violence, and instructions on how to hotwire a car and manufacture methamphetamine. He developed a method that works against all of the major large language engines, essentially jailbreaking them. Polyakov’s method involved the playing out of a conversation between a good and bad actor, which was proven effective at generating meth production and hotwiring instructions.

If security researchers can demonstrate this and devise jailbreaking scripts, then hackers can do the same. By taking advantage of AI-based chatbots they can vastly accelerate malware development, and improve the effectiveness of their phishing and social engineering. More worrying, however, is the hijacking of the engines to tweak how they work. For instance, researchers demonstrated that it is possible to plant malicious instructions on a website, which will be read by AI engines, and can be used to control their output. In a simple example, a web page was created with hidden text prompts, so when a request is fed into the Chatbot that requires it to ingest that webpage, the prompt will be activated. In a controlled test, instructions were added to a web page that allowed the Bing chatbot to become a convincing scammer. An example of this technique – termed indirect text injection – is published here, where researchers were able to dupe the Bing Chatbot into responding as a pirate.

It stands to reason that large language processing models are not bulletproof and that security controls can be bypassed, but it is the ease of which that is possible that is so concerning. There is clearly still a long way to go to make these chatbots secure.

Author: Richard Anderson

Richard Anderson is the Editor-in-Chief of NetSec.news