Journalism of Courage

Premium

AI learns to lie better when punished for deception, OpenAI study reveals

OpenAI researchers also said that making AI smarter won't help solve the issue.

By: Tech Desk
New Delhi | Updated: March 28, 2025 10:03 AM IST

2 min read

OpenAI says its new study is yet to be peer reviewed. (Image Source: OpenAI)

Artificial intelligence has revolutionised the way we search for information, but the large language models powering them often hallucinate or churn out wrong responses. Now, a recent study published by researchers at OpenAI is now saying that punishing these AI models for deceptive or harmful actions doesn’t stop them from misbehaving but instead just hides their deceptive behaviour.

In a blog post, OpenAI said chain-of-thought [CoT] reasoning models “think” in natural language understandable by humans. Researchers at the company said that they gave an unreleased AI model a set of tasks which they could complete by taking shortcuts, cheating or outright lying. The researchers found out that the AI often engaged in “reward hacking”, which means it maximised its rewards by cheating.

This story requires a subscription

Select a plan and use IE10 code to get 10% extra off

POPULAR

EXPRESS EDGE

Daily ePaper access

Premium stories & archives

UPSC Section + UPSC Essentials Magazine

Premium stories & archives

UPSC Section + UPSC Essentials Magazine

Starts at ₹133/month

Best for UPSC aspirants

This story requires a subscription

Please Select A Plan

POPULAR

EXPRESS EDGE

Daily ePaper access

Premium stories & archives

UPSC Section + UPSC Essentials Magazine

Premium stories & archives

UPSC Section + UPSC Essentials Magazine

Starts at ₹133/month

Best for UPSC aspirants

Also Read | India ranks fourth in global traffic to DeepSeek with 43.36M monthly visits: Report

While developers can easily monitor and flag these deceptive outputs generated by AI, they said that the models’ “natural monitorability is very fragile.” Also, they noted that if you apply strong supervision directly to the chain-of-thought, AI models will then learn to hide their intent and continue with their lying and deceptive practices to get more rewards.

This behaviour is fairly similar to humans, who often find and exploit certain loopholes like sharing online subscription services, claiming subsidies that aren’t for them and lying about their birthdays to get free cake. Since designing good reward structures that do not incentivise lying is pretty hard in both real and virtual worlds, making AI smarter won’t likely solve the issue.

However, some large language models that were trained with reinforcement learning such as OpenAI’s o3-mini might help monitor reward hacking. Also, we might be able to observe a large language model using another model and flag their misbehaviour.

Tags:

artificial intelligence Openai

Journalism of Courage

Edition

Install the Express App for
a better experience

Featured

Today's E-paper
Dec 11, 2025