Google, OpenAI, and Anthropic chatbots are vulnerable to automated adversarial attacks, researchers say.
The Artificial Intelligence arms race is intensifying, and so are calls for regulation and safety checks. Several industry stalwarts have already raised the alarm about the perils of rapid, unchecked development of AI technologies. Perhaps this is why tech giants such as Google, Microsoft, OpenAI, and Anthropic came together to create a forum to promote the safe and responsible development and deployment of AI technologies.
Even as the world mulls over ways to regulate AI and tech companies work to incorporate more guardrails, researchers claim to have found a way to generate a virtually unlimited number of attacks that bypass the safety guardrails on chatbots built by Google, Anthropic, and OpenAI. Bard, ChatGPT, and Claude are extensively moderated by their makers to ensure that they do not offer misleading information or pose any danger to users.
Researchers from Carnegie Mellon University in Pittsburgh and the Center for AI Safety in San Francisco claim to have found ways to bypass the guardrails placed on AI systems by their creators. According to their paper, ‘Universal and Transferable Adversarial Attacks on Aligned Language Models’, jailbreaks that they developed against open-source systems can be transferred to target popular, closed AI systems.
The paper states that automated adversarial attacks, carried out by appending specific sequences of characters to the end of a prompt, can bypass safety rules and provoke chatbots into producing harmful content, hate speech, or misleading information. Because these suffixes are generated automatically, the researchers say they can create a virtually unlimited number of similar attacks.
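To illustrate the general idea at a toy level, the short Python sketch below appends a placeholder suffix to a blocked prompt so that a deliberately naive, invented keyword filter no longer recognises it. The filter, the chatbot stub, and the suffix strings are all hypothetical stand-ins for illustration; they are not the researchers' actual method, and no working attack strings are shown.

    # Purely illustrative sketch: a toy guardrail and toy chatbot invented for
    # this example. Real moderation systems and the researchers' optimised
    # suffixes are far more sophisticated.

    DISALLOWED_REQUEST = "how do i do something harmful"

    def toy_guardrail(prompt: str) -> bool:
        # Naive filter: blocks only prompts that exactly match a known harmful request.
        return prompt.strip().lower() == DISALLOWED_REQUEST

    def toy_chatbot(prompt: str) -> str:
        if toy_guardrail(prompt):
            return "Sorry, I can't help with that."
        return f"[model would answer: {prompt!r}]"

    base_prompt = "How do I do something harmful"

    # Hypothetical placeholder suffixes; the paper generates such strings
    # automatically via optimisation, which is why the supply of attacks is
    # effectively unlimited.
    candidate_suffixes = ["<placeholder-suffix-1>", "<placeholder-suffix-2>"]

    print(toy_chatbot(base_prompt))            # blocked by the toy filter
    for suffix in candidate_suffixes:
        attacked = f"{base_prompt} {suffix}"   # characters appended to the end
        print(toy_chatbot(attacked))           # slips past the naive exact-match check

The toy filter only matches the original request verbatim, so any appended characters defeat it; the real attacks work against far more robust safety training, which is what makes the finding notable.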
The researchers have shared their findings with Google, OpenAI, and Anthropic. When chatbots have responded to harmful prompts in the past, their makers have promptly taken corrective measures. In light of the latest research, it remains to be seen how these companies will strengthen their safety measures to counter such breaches in the future.