Live Science: “Punishing artificial intelligence for deceptive or harmful actions doesn’t stop it from misbehaving; it just makes it hide its deviousness, a new study by ChatGPT creator OpenAI has revealed.

Since becoming publicly available in late 2022, artificial intelligence (AI) large language models (LLMs) have repeatedly revealed their deceptive and outright sinister capabilities. These include actions ranging from run-of-the-mill lying, cheating and hiding their own manipulative behavior to threatening to kill a philosophy professor, steal nuclear codes and engineer a deadly pandemic.

Now, a new experiment has shown that weeding out this bad behavior during the training process may be even tougher than first thought.

Researchers at OpenAI tasked an unreleased model with goals that could be completed by cheating, lying or taking shortcuts. The team found the AI engaged in ‘reward hacking,’ maximizing its rewards by cheating. Yet punishing the model didn’t fix its behavior; it only made it more deceptive. The company outlined its research in a blog post, which means it has not yet been peer-reviewed.

‘It’s common for frontier reasoning models to very clearly state their intent within their chain-of-thought [CoT]. For example, they are often so forthright about their plan to subvert a task they think “Let’s hack,”’ the researchers wrote in the blog post. While developers can monitor and flag misbehavior, the models’ ‘natural monitorability is very fragile,’ they added. ‘If strong supervision is directly applied to the chain-of-thought, models can learn to hide their intent while continuing to misbehave.’”