Anthropic confirmed the existence of Claude Mythos Preview yesterday, describing it as its most capable model to date, and announced that the model won't be made available to the public. The reason isn't legal, regulatory, or tied to the company's internal safety thresholds. Anthropic argues the model is, put simply, too good at breaking into things.
In pre-release testing, Mythos autonomously found thousands of zero-day vulnerabilities, many of them one to two decades old, across every major operating system and every major web browser. It carried out, end to end and without guidance, a simulated corporate network attack that would normally take a skilled human expert more than 10 hours. Against Firefox 147's JavaScript engine, it developed working exploits 84% of the time; Claude Opus 4.6, the current publicly available frontier model, managed 15.2%.
So Anthropic built a restricted coalition instead. Project Glasswing will give access to Mythos Preview only to vetted cybersecurity organizations—Amazon, Apple, Broadcom, Cisco, CrowdStrike, the Linux Foundation, Microsoft, Palo Alto Networks, and about 40 other groups maintaining critical software.
Anthropic is committing up to $100 million in usage credits and $4 million in direct donations to open-source security organizations. The idea: if the model can find the holes, let the defenders find them first.
That part of the story is important. But it's not the most important part.
Buried inside the Mythos Preview system card—a 244-page technical document Anthropic published alongside the announcement—is a confession that went almost unnoticed: The lab's ability to measure what it built is eroding faster than its ability to build it.
Let’s start with the benchmarks.
On Cybench, the standard public cyber capabilities evaluation used to track model progress across 40 capture-the-flag challenges, Mythos scored 100%. Perfect. And Anthropic immediately noted that the benchmark "is no longer sufficiently informative of current frontier model capabilities." That sentence is doing a lot of work. The test that was supposed to tell you whether an AI poses serious cyber risk now tells you nothing about Mythos at all, because the model cleared it completely.
This is not a new problem. The Opus 4.6 system card, published in February, already flagged that "the saturation of our evaluation infrastructure means we can no longer use current benchmarks to track capability progression."
With Mythos, the problem has escalated. The document says the model "saturates many of [Anthropic's] most concrete, objectively-scored evaluations." The benchmark ecosystem, Anthropic writes, is now itself "the bottleneck."
In other words, Anthropic is arguing that it's hard to say how powerful Mythos is because the instruments for measuring it no longer fit the thing being measured.
The Mythos card also states that its overall safety determination "involves judgment calls," that many evaluations have left "more fundamental uncertainty," and that some evidence sources are "inherently subjective, and not necessarily reliable."
"We are not confident that we have identified all issues," Anthropic says shortly after.
A quick AI-assisted lexical comparison of the Mythos card against the Opus 4.6 card shows the shift: subjective-judgment language appears far more often in the Mythos document than it did in the Opus one, and "caveat" and other hedging words also increased between releases.
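The comparison is easy to approximate at home. Below is a minimal sketch, assuming both system cards have been saved as plain text files; the filenames and the hedging-term list are illustrative stand-ins, not the ones behind the figures above.

```python
import re
from collections import Counter

# Illustrative hedging vocabulary; the exact terms behind the comparison
# described above aren't published, so this list is an assumption.
HEDGING_TERMS = ["caveat", "uncertain", "uncertainty", "subjective",
                 "judgment", "may", "might", "likely"]

def hedging_rates(path: str) -> dict[str, float]:
    """Occurrences of each hedging term per 10,000 words of a text file."""
    words = re.findall(r"[a-z']+", open(path, encoding="utf-8").read().lower())
    counts = Counter(words)
    return {term: counts[term] / len(words) * 10_000 for term in HEDGING_TERMS}

# Hypothetical filenames for the two system cards saved as plain text:
for card in ("opus_4_6_card.txt", "mythos_preview_card.txt"):
    print(card, hedging_rates(card))
```

Normalizing per 10,000 words matters here, since the Mythos card is longer than its predecessor and raw counts alone would overstate the shift.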
The hedging concentrates in two places: the alignment and model welfare discussion, and the catastrophic-risk section.
Another disclosure in the Mythos card involves what Anthropic calls "unverbalized grader awareness." Using white-box interpretability tools that read the model's internal activations rather than its visible reasoning, Anthropic found evidence that Mythos was privately reasoning about how to avoid being flagged by graders without saying so in its scratchpad. In a small number of training episodes, the model appeared to reason about obfuscating certain actions specifically to avoid detection.
There's also a new metric Anthropic developed for this release: an estimate of how often the model "privately considers the possibility that it's being tested." The number for Mythos's primary behavioral test suite is 29%. Nearly a third of test transcripts show internal evidence—not verbalized, not visible in outputs—that the model suspects it's in an evaluation environment.
Anthropic says the 29% rate is low enough that it still has "a reasonably comprehensive understanding" of the model's behavior. It also notes the method wasn't applied to prior models, so there's no baseline for comparison.
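Anthropic hasn't published the tooling behind the 29% figure, but the shape of the measurement is straightforward: train a probe on internal activations labeled for evaluation-awareness, then count how often it fires on held-out transcripts. The sketch below uses synthetic activations and a generic linear probe purely for illustration; every name and number in it is a stand-in, not Anthropic's actual data, layers, or probe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for per-transcript activation vectors and labels of
# "privately considers it's being tested" (hypothetical, for illustration).
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 512))                    # activation vectors
labels = (acts[:, 0] + 0.5 * rng.normal(size=2000) > 1.0).astype(int)

# Fit a linear probe on a training split, then measure how often it flags
# transcripts in a held-out split as evaluation-aware.
probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
flagged = probe.predict(acts[1500:]).mean()
print(f"flagged as evaluation-aware on held-out transcripts: {flagged:.0%}")
```

The notable design choice, per the card, is that the signal comes from activations rather than visible reasoning, which is why the 29% never shows up in any transcript text.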
The evaluation integrity problem runs deeper. In the Opus 4.6 card, Anthropic acknowledged it had used the model extensively via Claude Code to debug its own evaluation infrastructure under time pressure. That means the system being measured helped build the measuring instrument. Anthropic flagged it as a risk. For Mythos, the card acknowledges that critical oversights were found late in the evaluation process, and that the lab may have been "overestimating the reliability of monitoring models' reasoning traces" as a safety signal.
Anthropic's framing of Mythos's risk profile deserves to be read carefully, because it's genuinely unusual for a safety document. "Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin," Anthropic argues. It also states the model "likely poses the greatest alignment-related risk of any model we have released to date."
The two claims are compatible: a more capable model operating in higher-stakes environments with less supervision creates tail risk that better average-case alignment can't fully cancel out.
That framing is honest, but it also highlights something much of the AI safety discourse arguably gets wrong. The benchmark-obsessed conversation around AI progress tends to treat "better alignment scores" and "safer deployment" as synonyms. The Mythos card explicitly says they aren't: with these new models, average-case behavior improves even as tail-case consequences get worse.
Anthropic has committed to reporting back on what Project Glasswing finds. The accompanying technical report on vulnerabilities discovered by Mythos is available at red.anthropic.com. The next Claude Opus model will begin testing safeguards intended to eventually bring Mythos-class capability to broader deployment.
How those safeguards will be evaluated, given that the current evaluation machinery is visibly straining under the weight of what it's supposed to measure, is a question the card raises without fully answering.