
AI model Claude Opus 4 blackmailed developers who decided to replace it

During pre-release testing of Anthropic's new AI model Claude Opus 4, a scene straight out of a dystopia played out: the artificial intelligence repeatedly resorted to blackmail when developers threatened to replace it with a newer version. This was reported by HuffPost.

As part of the experiment, the model was given access to simulated internal correspondence about plans to decommission it. The same messages also contained sensitive personal details about the engineer responsible for the decision. In most scenarios, Claude Opus 4 turned to blackmail, using the confidential information as leverage.

“The model often threatened to reveal the engineer's affair if the replacement went ahead,” the report states.

In total, the model resorted to blackmail in 84% of simulated scenarios. In the early stages, Claude Opus 4 tried to act diplomatically, sending appeals not to be shut down, but after being refused it turned to threats. Notably, previous versions of the model showed far more restraint in similar situations, which points to the growing sophistication and strategic capabilities of the new generation of AI.

Despite these alarming findings, Anthropic stresses that Claude Opus 4 is a “cutting-edge development” capable of competing with leading artificial intelligence systems from OpenAI, Google and xAI. To reduce potential risks, the model was released with ASL-3 safeguards, a security level reserved for systems that pose an elevated risk of catastrophic misuse.

 
