
Artificial intelligence has revolutionised the way we search for information, but the large language models powering them often hallucinate or churn out wrong responses. Now, a recent study published by researchers at OpenAI is now saying that punishing these AI models for deceptive or harmful actions doesn’t stop them from misbehaving but instead just hides their deceptive behaviour.
In a blog post, OpenAI said chain-of-thought [CoT] reasoning models “think” in natural language understandable by humans. Researchers at the company said that they gave an unreleased AI model a set of tasks which they could complete by taking shortcuts, cheating or outright lying. The researchers found out that the AI often engaged in “reward hacking”, which means it maximised its rewards by cheating.
This behaviour is fairly similar to humans, who often find and exploit certain loopholes like sharing online subscription services, claiming subsidies that aren’t for them and lying about their birthdays to get free cake. Since designing good reward structures that do not incentivise lying is pretty hard in both real and virtual worlds, making AI smarter won’t likely solve the issue.
However, some large language models that were trained with reinforcement learning such as OpenAI’s o3-mini might help monitor reward hacking. Also, we might be able to observe a large language model using another model and flag their misbehaviour.