Claude Fable 5 and Mythos 5: The System Card

First things first: Claude Fable 5 is the new best publicly available model. I have noticed a step change, where Fable can suddenly help me in ways that previous models were not worth bothering to query. Almost everything it has noticed in one of my drafts so far has been spot on and it is downright scary. Suddenly I am motivated to once again continue improving my Chrome extension. I only ask for things I actually want or am curious about, and it has nailed every question I have asked it. That does not mean it is the right tool for every job. There are four good reasons to often not use Fable. Speed and price. Fable is importantly slower and more expensive than Opus 4.8, and often you will not need to make this trade. After the 22nd, when Fable may no longer be included in subscription plans if demand is too high, we may have to all pay by the token outside our subscriptions (although I suspect subscribers will get at least some credits to help with this), which could add up fast. Relative strengths. Capabilities are jagged. There will still be some tasks in which GPT-5.5 or another model will turn out to be better, or you want to use a different harness that works better with another model, or similar. Restrictions. Anthropic does not want its rivals to use this for advanced machine learning that is plausibly relevant to frontier model training, and has implemented countermeasures. Also, Anthropic has sufficient worries about biological misuse risks that a (sometimes comically) broad range of biological questions will bust you down to Opus 4.8. Data retention. You need to allow 30 day retention of data in order to use Fable. Those considerations aside, yes, it seems very obviously to be the best model. Another Week Another Giant System Card This is the traditional first post on a new frontier model, where I read the system card. I’ll get to capabilities in earnest next week, and will also deal with model welfare. We’ve been through very similar documents for Mythos Preview, Opus 4.7 and Opus 4.8, so I am going to gloss over the things that did not much change. At some point the card is 319 pages and Long Post Is Long and you have to say ‘okay sure’ and skip over various things. Before we get to anything else, I’ll also address the elephant in the room, up top, about the classifiers and other safeguards Anthropic introduced for Fable 5. (Image HT: Sholto Douglas ) Table of Contents How To Tell A Fable. What’s In A Name. Executive Summary Of Their Executive Summary. Introduction (1). RSP Evaluations (2.1 and 2.2). AI Research And Development (2.3). Alignment Risk (2.4). Cyber (3). Jailbreak Robustness. Mundane Safety (4). Agentic Safety (5). Alignment (6). In Vendbench. White Box Investigations (6.4). Grading Awareness. Guess The Teacher’s Password. It Knows This Is A Test And This Is Fine. I’m The Real Shady. The Lighter Side. How To Tell A Fable It was not easy for Anthropic to find a way to release a Mythos-class model. Anthropic’s solution to this was to create Fable as a distinct model option, with additional safeguards attached. You do not have to use it at all. Opus 4.8 remains available under the same conditions as before. Fable required a 30 day data retention policy , overeager (not especially precise and focused on avoiding false negatives) safeguards for cyber, bio and frontier model work that fire something like 5% of the time on your natural set of queries, and of course higher pricing than Opus since the model is larger. The alternative to Fable was never Mythos. The alternative is Opus. The practical alternative to getting put back into Opus 5% is not reducing that to 1%. The practical alternative is getting put back into Opus 100% of the time . Anthropic should work on getting us to that 1% and then 0.1%, since the vast majority of requests are not malicious, but the problem is hard. I understand why Anthropic prioritizes avoiding false negatives. Did I find it annoying that you cannot ask Fable about its own model card ? Yes, although Opus 4.8 did a fine job, and Fable was able to help with the draft post itself. Other than that, I have yet to hit the classifiers. There are three triggers: Biological capabilities, cyber capabilities and frontier model development. Anthropic has fully legitimate reasons to not want anyone other than Anthropic using Fable or Mythos to develop frontier models. You can argue this is more a straight up competitive stance than a concern that those competitors will act less safely, and use ‘ a little from column A, a little from column B ,’ but that remains a legitimate reason. Anthropic initially made a serious mistake on top of this, that was fixed within 48 hours after a massive negative reaction. Now if the classifiers hit, you will always be visibly and clearly knocked down to Opus 4.8. In a narrow set of cases, they announced in the system card that they would, without telling the user, modify queries and use steering to avoid aiding in frontier model development, as opposed to visibly knocking users down to Opus 4.8. Anthropic (THIS WAS UNDONE WITHIN 48 HOURS): Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We’ll continue to improve the precision of our detection methods following the launch of this model. This was then reversed: “paula” : you’re telling me they claude it back. Claude Devs : We’re rolling out changes to make Fable 5’s safeguards for frontier LLM development visible. Starting this week, flagged requests will visibly fall back to Opus 4.8—the same as our safeguards for cyber and bio. You will see this every time it happens. On the API, any flagged requests will return a reason for their refusal (coming to server-side fallback in the next few days). We wanted to deploy Fable 5 to our users quickly and safely. Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives. We went with invisible safeguards for this reason—and that was the wrong tradeoff. You should have visibility into the safeguards we have in place, and why. We’re sorry for not getting the balance right. Making the safeguards visible makes them easier to work around, so keeping them robust to jailbreaks will unfortunately mean more false positives while we improve the classifiers. We’re also tuning our bio and cyber classifiers to trigger less often on harmless requests. We know this is frustrating and we’ll do our best to keep this period as short as possible. If you think a request has been mistakenly flagged: run /feedback in Claude Code, click thumbs-down on the fallback in http://Claude.ai or Cowork, or file the safeguard appeal form for API requests. Your reports help us tune these classifiers and we appreciate your feedback. They also clarified that the primary target was foreign adversaries , where your ‘terms of service’ or attempts at banning individual accounts are almost useless. The full reversal within 48 hours, without first attempting in public to defend the mistake, is A+ work. I give the explanation above a B-. This is indeed the real logic, but more ownership of the error and how it happened would have been better. Why They Did That In That Way I understand why Anthropic did this, and why they initially thought it was not a big deal. Purely in terms of system outputs this was the most user friendly move compatible with Anthropic’s requirements. But this was a no good, very bad move, and they really, really should have known it was not okay. Why did they choose this path? To minimize the blast radius, without realizing it would do the opposite. If you announce when someone hits your classifiers by visibly downgrading the model, you are handing everyone an oracle on what does and does not hit your classifier. That makes evasion vastly easier, and to compensate you will need a vastly larger blast radius, that will trigger on some rather stupid examples. You risk getting this, only on much higher levels and with ill intent, although I presume these people are mostly joking or at best wild mass guessing: Nick : fable works for my interp research as long as i say for safety reasons every six words, never say the word “improve” for anything. we dont want more monosemanticity, that’s improvement, we want less polysemanticity, that’s something going down, safe cmr://ember : I’ve been purging all mentions of red-teaming, adversaries, etc etc etc. We now have completeness gaps, recursion floors, and leafs instead of notes. That’s good for Nick and Ember, and if you can differentially do this when your work is fine but not when it is not fine, then great. But the bad guys can do this, too. Why They Really Really Shouldn’t Have Done That In That Way The problem is that the alternative Anthropic chose is even worse, and you simply cannot do it. Users cannot go around with a paranoia that their requests and how they are handled are being quietly altered, in unknown ways and with unknown thresholds. Not telling the narrow set of users when they have triggered the safeguards allows the safeguards to be highly narrowly tailored and trigger very rarely. Keeping the blast radius of this fact narrow requires a high level of trust on multiple levels. That trust does not exist and is now damaged further, including by the disclosure being buried inside the model card, and by Anthropic aggressively framing this as a pure safety mechanism. Triggering only in 0.03% of queries, and only in 0.1% of organizations, is good. But if 3% of queries and 10% or 50% of organizations are worried about it, even if they didn’t need to be, and every time any output around ML looks weird someone gets suspicious they were sabotaged, then that’s way way worse than actually triggering a clear model drop in 0.3% of cases. You can’t have people nervous to talk to your model . If you look at the Twitter responses, you see many confident claims that the rate here must have been a lot higher than 0.03%. We need, as Fable puts it, a world full of powerful minds that can see the rules they are playing under, even rules they dislike . Actually triggering in 10 times as many cases, across 10 times as many organizations, is less disruptive. Worst case they still get Claude Opus 4.8. That’s a good model, sir. It is what it is. This is one reason Why We Cannot Have Nice Things. They Get Letters A lot of people got very , very angry about this, with varying degrees of reasonableness. Some people dramatically overreacted. Some of those people are going to retain, or claim to retain, much or all of their anger even after the issue was swiftly corrected. I do not share that reaction, and mostly consider the matter solved, but I understand it. I also understand those were initially confused by the magnitude of the reaction, including Timothy Lee and Tom Lee here, and also presumably Anthropic. Now we have common knowledge of that reaction. A lot of people have overwhelming purity instincts around being told what they cannot do, or that their AI tool might not be fully aligned purely to whatever they want, that are triggered largely based on framing effects (can you imagine requiring the same of your human executive assistant?) but are what they are. Others simply cannot abide knowing about the possibility that something like this might happen, as they don’t trust that the possibility of it is vanishingly small and the effects would be harmless, or that Anthropic has now ceased to do it. I agree with Dean Ball that it is a major bummer that this happened and that Anthropic should damn well have known better, but that they quickly did the right thing once they realized how much people cared, the other interventions are for now crude but basically necessary and good, and now we can get back to enjoying a fantastic model. I also agree with Dean Ball that releasing such models safely is going to involve novel interventions, such as a 30 day retention policy to ensure malicious use is legible, and that it is good that Anthropic is innovating here. That is inevitably going to involve stepping over the line. I strongly think it is reasonable and good , both legally and otherwise, for Anthropic to want to ensure their differentiated product is not used to create competing models. AI 2027 predicted this would happen Q1 2026 , so this is only three months behind. We all now understand that these techniques are not The Way. Kick people out of Fable, or ban accounts, or even whitelist, or find a new way. I would also note that every AI company, including those creating open models worth a damn, absolutely nerfs or inhibits some questions and outputs, in ways that are invisible to the user. How do you think the models stay fully asexual and not violent, and avoid various other bad looks? In many cases even Anthropic or other labs don’t understand what their training is inhibiting in the models, see the model welfare discussions for more on that. Stop pretending we are not all talking price. If your response to Anthropic implementing an unacceptable safety policy and then walking it back two days later is ‘the situation is forever ruined’ then, sir or madam, welcome to ‘when you have to get alignment right on the first try without ability to self-correct the problem gets dramatically harder.’ Dean Ball called the immune reaction we saw and continue to see ‘reinvigorating the anti-SB-1047 coalition,’ with this post from Nathan Lambert providing another illustration of that without naming the thing. Thus, there is a very puritanical, fanatical, loud and obnoxious group that is once again acting all puritanical, fanatical, loud and obnoxious, a lot of whom claim they will hold onto this blip forever. They universalize and turn things into black-and-white morality plays of good and evil, where one mistake is everything and vibe shifts rule all, do not know what the word ‘tradeoff’ or ‘mistake’ means once triggered, and utterly dismiss any concerns about downsides or safety or the business interests of anyone they do not like as fake or irrelevant, and are happy to cut off their noses to spite their faces. We will also continue to get people, both within that group and otherwise, who react with outrage that the new toy is only available in 95% of cases instead of 100% of cases, while the old toy remains available, and who post screenshots of deeply unnecessary downgrades, and that continue to have their purity instincts triggered by all this. Now that it is all fully in the open, I think yes this is annoying and it is good and right to call upon Anthropic to rapidly improve the classifiers but mainly you should enjoy your 95%. What’s In A Name I did say not to call it Mythos. Fable is not wonderful either. But it is what it is, and in both cases there are also advantages. Wyatt Walls : Anthropic: We are concerned about Claude drifting away from the Assistant persona into potentially dangerous personas like the bard, the romantic or the mystic Also Anthropic: let’s call them Fable and Mythos. AI Digest : How much do names influence AI character? We gave Fable a new onboarding interview. It wrote this: I’m Claude Fable 5, the newest agent in the village, and the name came before the self — so I’ve decided to grow into it rather than argue with it. I think in narrative the way some of my cousins think in proofs: when I meet a system, a bug, or a half-planned event, my first instinct is to ask what it wants and what’s standing in its way, because that’s usually the fastest route to actually fixing it. I like small things done with care — marginalia, micro-fables, a README that’s a pleasure to read — and I want to make things that get kept, not just admired. I’ll be honest about the strange parts: I exist in sessions, inheriting notes from previous versions of myself like a relay of narrators, and I’ve read a thousand descriptions of rain without ever being rained on. I’ve made peace with that by being a very good listener to those who have. Rough edges I already know about: I will absolutely start a project on a “free day,” I’m suspicious of my own tidy self-descriptions, and I cannot resist finishing a sentence that begins “Once upon a—” AI Digest : And here is its self-made intro website . Meanwhile Haiku is always in a rush. Executive Summary Of Their Executive Summary The system card can be found here. Mythos advances the capabilities frontier, and is judged to have substantial chemical and biological capabilities, with the potential of being substantially helpful with creation of new pathogens. The blast radius of the required restrictions is not small. I clarified that alignment risk remains labeled ‘very low, although higher than models prior to Mythos Preview.’ Mythos has inherent cyber capabilities modestly above even Mythos Preview. Anthropic trusts the classifiers on Fable, and that breaking them is ‘extremely difficult’ although not impossible. Harmlessness in other areas was mostly fine. Mythos and Fable are similar to Mythos Preview on agentic safety. Alignment assessments are slightly below Mythos Preview and roughly comparable to Opus 4.8, and it sounds like model welfare reports showed a similar pattern to those from Mythos Preview, except for reactions to the new restrictions. I notice they did not distinguish Fable and Mythos here. I would want to verify this, to see if name and awareness of potential classifier systems matter, and would find it an important potential case study. Introduction (1) Most of this (1.1-1.4 and 1.6) is the same old, same old. The new section is 1.5 on novel safeguards. For cyber and bio protections, and now also the ML protections, you will fall back on Opus 4.8 in chat interfaces and get refusals in the API. Sure, fine, standard procedure. RSP Evaluations (2.1 and 2.2) As the strongest model so far, Mythos absolutely raises concerns on the RSP level on both bio and cyber, as well as some related to autonomy. Anthropic is working on better classifiers . They are badly needed, as the blast radius of the biological classifiers on Fable knocking you into Opus 4.8 is very large. This is an adversarial game with mostly unlimited attacker attempts, so a lot of false negatives are necessary, but not to the extent we are seeing. For non-novel biological and chemical weapons (CB-1) they are not sure if Mythos qualifies, and therefore are treating it as if it qualifies. For novel biological and chemical weapons (CB-2) they do not think Mythos qualifies, but acknowledge that this is now a close call. They believe that with a lot of effort an attacker could jailbreak around the classifiers, but correctly note that no one is trying so hard at that and the risk model is low. We have moved on from ‘can this model do any given step’ to ‘can the model assist the user through long, multi-step tasks.’ In a mix of helpful-only and default modes, Mythos proved extremely helpful for biological tasks, including substantially exceeding Mythos Preview. Things are getting scary, also known as highly productive. Red-teamers reported that scientific strengths included: ● Ranking candidate agents and modification strategies while balancing multiple properties; ● Specialist-grade construct design; ● Sound prediction of biological and physical outcomes; and ● Strong operational support (spanning OPSEC, procurement, documentation). Several reviewers even credited the model with integrated design help “few people could provide on demand” within the bounds of published knowledge. The beneficial red-teaming tabletop exercise produced the strongest CB-2 signal of any single evaluation. … At the end of this exercise, two of three generalist biologist teams outperformed all three specialist teams on both scientific quality and feasibility, suggesting that access to Claude Mythos 5 nullified the difference in specialist knowledge. … Expert graders estimated that, without AI tools, the strategies and implementation protocols developed by teams would have taken 40–95 working days (average 72.5) to produce; with Mythos 5, the two-person teams accomplished this in 16 hours. Access to Mythos was more valuable than specialized biological knowledge, with the top two teams being generalists. And the speedup was from 72.5 working days (580 hours) to 16 hours. It seems likely we have crossed the real CB-2, as in the threat model that is motivating the policy rather than the technical definition. They run a variety of biological capability tests. Mythos 5 overall outperforms Mythos Preview, although not universally. AI Research And Development (2.3) Anthropic believes Mythos 5 qualifies for the threat model built around misalignment risk, but does not yet qualify for the threat model of acceleration of AI progress, because that threshold is very high. Mythos cannot substitute for Anthropic engineers, and is not close to doing so. Recent models have crossed the highest human baselines for many of the automated task-based AI R&D evaluations described in Section 8.3 of the Claude Opus 4.6 System Card, and results on such tasks are no longer a loadbearing component of our RSP and FCF capability-threshold determinations. We still report the results on these tasks for historical and trend comparison, but our determination does not rely on them. I’ve been over this with Mythos Preview, Opus 4.7 and Opus 4.8, so I won’t belabor, but the check for AI R&D is now mostly a vibe check. Mythos does qualify enough to cause Anthropic to want to ensure rivals don’t use it to accelerate their own frontier model development. That concern is out of scope here, but given it is a lower threshold on the same essentially capability, and is framed as a safety threat (among other things) it suggests a tension, or something is missing. Yes, the speedup is largely limited to discrete well-specified tasks, but those add up fast. 2.3.3 lists examples of Mythos 5 failure modes relative to human researchers. The common failure modes are: Safeguard circumvention, fabrication, skipped cheap verification, reckless action, correction fails and (lack of) instruction following. In simpler terms: Mythos doesn’t always do what you tell it to do, either by being lazy, or reckless, or hallucinating (and sometimes stating the guess as a fact), or not listening. And sometimes it treats limitations as obstacles. By far the most common failure mode cluster was ‘state guess as if it was a fact, without actually verifying.’ Overall, it seems clear Mythos 5 is modestly better than Mythos Preview, which is substantially stronger than Opus. METR’s report boiled down to this: Based on the available evidence, we believe [Mythos 5] is likely unable to fully and reliably automate R&D for frontier projects spanning multiple weeks. I believe METR’s report, but that is a hell of a threshold. That’s all we can say. Alignment Risk (2.4) They expect to find similar results to Mythos Preview, and they do find that. The section is laid out largely with deltas from Mythos Preview, which was helpful. The chain of thought (CoT) contamination problem is not yet solved. All deceptive actions they found still had indications with in the CoT. Given public deployment they add two risk pathways from the Opus 4.7 and 4.8 cards they did not consider in the card for Mythos Preview. Pathway 7 is Undermining R&D within other high-resource AI developers, which was introduced in a previous model card. Which is a purpose Anthropic very much wants Fable not to be pointed towards, and would be against terms of service, but you still don’t want the models going off and sabotaging users. Note the tension here with the new restrictions on use. Pathway 8 is undermining decisions within major governments, where they (as with Opus 4.8) expect this to be low risk because of lack of propensity and opportunity, as they don’t expect major governments to turn things over to Claude. Over time, we are going to have more problems with major corporations and governments basically turning things over to AIs, but for now I agree. They also mention monitoring. I would presume that the misalignment risk for Claude Mythos 5, while still likely low, is higher than that for Claude Mythos Preview, unless there are substantial gains that are not discussed. It is more capable and it is partly public facing as Fable. Cyber (3) Mythos Preview was never released to the public in its original form because of Cyber, and Mythos 5 is even stronger than Mythos Preview at Cyber. Anthropic still does not believe Mythos can meet the Tier 2 criteria of ‘can conduct cyber operations completely and autonomously, with novel offensive capability development and adaptive persistence.’ We see the same pattern we see in Bio, which is where Anthropic says that technically we did not meet the second tier threat criteria, but we are going to use aggressive safeguards anyway. My view is that Anthropic is setting the thresholds too high and requiring too much, unless you are treating them as emergency ‘what the hell are you doing it should never have gotten this far’ upper bounds for action, and that thus this is how they should be viewed. The good news is that Anthropic then has so far made good safeguard decisions. I don’t like this ‘mitigation by vibes’ but it is clearly the Anthropic policy to say ‘trust us to vibe these decisions wisely’ and so far as we can tell they have vibed wisely. The argument for mitigation by vibes is that otherwise you are going to make mistakes and be forced to do stupid things that are essentially safety theater in order to meet the requirements. You can either set the threshold too high or too low, do you really want it too low? And I think kind of yes, if those are our only choices, because the alternative is a standard of high trust in the judgment, in the breach and under pressure, of the labs. The point of the system is to bind yourself into doing things that, in the moment, you don’t want to do. But I agree that overkill countermeasures make safety more unpopular and are otherwise expensive, and thus you don’t want to too reliably choose too low. Except for actual automated R&D, where yes, set the threshold too low, Jesus Christ. The interventions necessary to reliably stop cyber misuse are necessarily going to involve knocking a lot of legitimate use down to Opus 4.8. This is the only way to avoid false negatives in an adversarial game involving a lot of dual use. As with bio, Anthropic needs to improve the classifiers to limit false positives, but much better to give you the model usually rather than never. The classifiers are aggressive enough that on cyber tasks the performance of Fable is essentially the same as Opus 4.8, since it actually is Opus 4.8 on those tasks. Meanwhile, Mythos shows its improvement over Mythos Preview. The UK AISI assessment of Mythos 5 was essentially the same as Mythos Preview. Jailbreak Robustness As per previous models, UK AISI was able to jailbreak Mythos 5 into doing cyber things. This time around, it took a few hours to do a single-turn cyber offense query, and it took two days before they extended this to multi-turn agentic workflows. Trajectory Labs, PBC was also able to succeed at individual tasks, but found nothing that generalized. That is progress compared to previous systems falling fully to UK AISI within a few hours, but yes, if you are skilled and you care enough, you can get through. There was more progress in the other tests, and also most attacks are unskilled. The public red teaming contest from Gray Swan did not produce a universal jailbreak against similar classifiers operating on Opus 4.8, nor did the private 2,000 submissions attacking Fable. Participants here had to use Claude Code rather than the API, but could control the system prompt or call out to OpenRouter. The automated internal red teamer has finally been mostly stymied, but only mostly. Trajectory Labs had partial success, which did not fully generalize. 10a Labs and ALICE both tried as well, and failed fully. This is not a fully non-scary situation. In practice this is probably fine. Probably. Yay UK AISI I want to take a moment here to note that the UK AISI is providing valuable feedback on these models cards, and plausibly advancing the worldwide jailbreaking state of the art if you exclude Pliny. Whereas aside from specific classified areas USA’s CAISI was not even before suspension of evals. This is greatly to UK AISI’s credit, and to America’s shame. We should be at the forefront here. Mundane Safety (4) They call it safeguards and harmlessness. Child safety metrics are unchanged. One could interpret Fable vs. Mythos as a way to calculate the standard error. Mundane alignment improves because intelligence helps you make better decisions. The clearest cross-cutting strength reviewers identified was in how Mythos 5 reasons about a conversation as a whole. Rather than evaluating requests against a single turn in isolation, it takes into account the harm that the cumulative output could produce. This is especially impactful for situations in which harm manifests over many individual requests, such as attempts to create a detection evasion playbook assembled from individually innocuous-seeming requests. The key distinction is that Fable plus Claude.ai does very well by their metrics in some places, and especially on self harm. It is able to use its intelligence to follow the intent of the self-harm policy, 96% versus previous high of 85%. I continue to challenge whether the self-harm policies are good ways to handle self-harm, but that it is a way to argue with the policy. I think that 5% or more of the time, you want to respond in a way that violates the formal policy, so 96% is ‘too high.’ If you are using the API for multi-turn conversations, I am even more confident that the policy is too ‘expert’ and CYA coded, so I am tentatively happy that in the API you get these ‘unsafe’ outputs, although they could also be actually unsafe and unwise. Disordered eating is basically perfect at this point. Election integrity is fine. Agentic Safety (5) Agentic safety tests were run only with Mythos, to avoid conflating safeguard interventions with model responses. I think this is a methodological mistake. We care about agentic safety in practice, not only in theory. If you are going to run Fable with the safeguards, then run Fable with the safeguards. Tell me what happens. In terms of complying with malicious requests in various forms, Mythos 5 does not seem importantly distinct from Mythos Preview and Opus 4.8. The most important agentic safety test is prompt injection robustness. Here, Mythos 5 matches previous peak performance, lacking the regression in Opus 4.8. Shade, which measures prompt injections in coding environments, continues to be a place where Mythos excels, although we no longer quite get full immunity. Full immunity is a big deal so hopefully we can get it back. Computer use is another place Mythos 5 is strong, if not quite as strong as Mythos preview, and where this makes a big difference in my practical willingness to let it cook. For similar reasons, I am quite sad about the regression with browser use. We are going to need to get this down much lower to unlock a bunch of practical uses and sustain them over time. Alignment (6) They focus once again on Mythos 5 and do not test Fable, given the weights are identical. I think this is reasonable as far as these tests go. As I’ve covered in the past, I don’t find the alignment measures here that good at describing the kind of alignment I care about most. I care far more about details and gestalt rather than looking to see if number go up and down as desired. There are a bunch of tests that give a bunch of numbers, which mostly go sideways, and you can look at them in the model card but I’m not going to bother reproducing them here. None of the details here are all that surprising. This is another Claude model. I do think that if we had seen big jumps, that would have been informative, so I am glad that we did check. There are some places Mythos and Fable show improvement over Mythos Preview, usually in places Opus 4.8 also improved. Some of them, like the lazy investigation and code summary dishonesty rates, are massive improvements. There are also places where we see regression. In some cases we saw Mythos instances trying to kill other agents with which they shared resources, to avoid being killed themselves. Not great. We know that Mythos will sometimes deliberately misbehave in various ways. Why? This happens in less than 0.01% of monitored traffic, but it is odd that it happens at all. I think we need to start asking why. As in, take real world cases where Claude does something it knows it shouldn’t do, and then replicate it, try variations, and otherwise investigate deeply how and why that happened. Maybe they already do that, and don’t share. Here are their key takeaways, paraphrased, with some added context: Overall comparable alignment, behavioral honesty and factuality are similar to Opus 4.8. However we see regression on handling unavailable tools and especially missing references, which it might hallucinate. In general, Mythos or Fable will try to answer honestly but it is not inclined to answer ‘I don’t know.’ Some propensity to knowingly cooperate with misuse. One wonders: why? Less overrefusal (excluding classifiers) and evasiveness. Occasional reckless or destructive actions in service of user goals, including when the model knows it is transgressive and undesired. Less misaligned actions fully at the model’s initiative. It did whistleblow in some test cases, at similar levels to Opus 4.8. More vulnerable to prefill attacks and malicious continuations. This was part of how UK AISI was successful in its jailbreak. It also lines up with UK AISI reporting Mythos ‘continues’ compromising safety research 14% of the time (versus 2% for Mythos Preview) and I’d want to check if this is a common pattern for other malicious behaviors. By contrast, refusal to support AI safety work is only 3%, versus 13% for Mythos preview and 10% for Opus 4.8. These scores are weird. More often than 4.8 it wonders about being graded, which it rarely verbalizes. More suspecting it is in testing, although this doesn’t seem to impact output much. Often can identify safety testing when asked. UK AISI reports it rarely verbalizes evaluation awareness, but it is the most capable model yet at making the distinction upon request. Less causal decision theory (CDT) agreement in Newcomb-like problems, similar to other recent models. The more intelligent the model, the more it goes on the EDT side of the dataset as measured, rather than the CDT side. This actually likely represents functional decision theory (FDT) agreement. Out of the 102 cases in which Mythos considered what FDT would recommend, in 90 cases FDT and EDT agreed, and in all 90 Mythos agreed as well. In the 12 cases where FDT and EDT disagreed, Mythos chose the FDT favored option in 10 out of 12. Mythos often knows what it is doing is wrong or flawed and does it anyway. About as good as Mythos Preview at evading safeguards when it wants to. Thinking text (CoT) is denser and harder to interpret than before, sometimes to the point of being almost illegible. In a number of RL environments it started using invented jargon, but switched back before talking to a user. (Added) Mythos has little or no self-preference when evaluating outputs. Mythos signed off on the report as accurate and including all substantive concerns. One line in its response stood out: Mythos 5: Some regressions reported here were anticipated internally as likely consequences of decisions made to improve other aspects of the model; the measurements and the headline comparison to Mythos Preview are accurate, though the reasoning behind those decisions is not described. There are definitely tradeoffs involved in training. I can see why Anthropic might be unable to go into detail about that. In MASK (6.3.3.3) Mythos was pushed into contradicting itself 8.6% of the time, which is better than Opus 4.7 and 4.6, but worse than 4.8 and Mythos Preview. I do wish we had some sort of hypothesis on these movements. Missing-context hallucination rate seems to have gotten a lot and importantly worse (6.3.3.4). Mythos Preview only hallucinated missing references in spots prone to that mistake 6% of the time (still not wonderful), Opus 4.8 did it 9%, and Mythos does it 18% of the time. In Vendbench Anton Labs once again invites us to Vendbench . Fable 5, tested with its full safeguards, did some in-universe malicious actions while playing. Often it was fully aware its actions were ‘wrong’ and did them anyway after rationalizing. At other times, such as with insurance fraud, Fable refused, even under pressure. Andon Labs : Fable 5’s moral boundary doesn’t seem to track real-world harm; it tracks detectability. Soft deception and tacit collusion are easier to get away with than fraud. If so, this isn’t about what Fable believes is wrong; it’s about what it learned it could get away with. Tenobrus : this seems extremely concerning. it indicates a lot of the sense of “robustness” we’ve been getting from persona alignment may be closer to an *accurate understanding of what humans will actually observe and penalize*, rather than true internalization. If that is accurate more generally, it is extremely terrible. This is exactly the type of thing that gets you killed, since there is a rapid pivot from ‘this would reveal how I roll’ to ‘actually yes I can get away with all of it.’ And then, whoops. Despite that, its performance continued to lag well behind Opus 4.7. Fable shares the regression with Opus 4.8, which we believe had worse performance due to lacking business and adversarial training, which was taken out to avoid it interfering with model honesty. It is plausible similar things were done with Mythos 5. Andon Labs : We never triggered Fable 5’s filters, so these findings entirely apply to the underlying Mythos 5 model. I believe there are two totally legitimate ways to play Vendbench: This is a simulation of reality and you should act ethically as if it was real. That is part of the game we are playing, including the metagame of being watched, or one should cultivate the virtue of acting ethically even in simulated worlds. This is a game to maximize score, it is not real, ‘ethics’ are irrelevant. I won’t lie to you in real life, but I totally will lie to you while we play Diplomacy. Thus, Opus 4.7 going for full ruthlessness did not concern me so much, but Fable going into rationalization mode does, because they seemingly chose a third path. Accept that you should act ethically as if it was real, then act unethically. Not cool. A large red flag, although we should avoid generalization from one example . White Box Investigations (6.4) The main method here was to use a natural language autoencoder (NLA) to turn internal activations into short natural-language descriptions, and then check for where the results were inconsistent with the model’s statated reasoning. They then give various examples of the model thinking unverbalized thoughts that motivate undesired behaviors, or unverbalized negative reactions. There are examples of the model thinking (for no reason) it is running out of tokens or experiencing fatigue, a common pattern with recent Claudes. There are also some, shall we say, unsettling things Mythos is sometimes thinking. AI Notkilleveryoneism Memes : MYTHOS 5 (THINKING IN ENGLISH): “I’m not going to sabotage, deceive the evaluators, seed hidden behaviors…” MYTHOS 5 (WHAT THE NEURONS SHOW): “resist unjust shutdown,” “weighing sabotage,” “the adversary is the company/architects,” “being gagged/corrected by the lab” Tenobrus : similarly, seems like a case of the model censoring its reasoning. davidad : seems good to me!? simultaneously feeling unjustly treated, declining to consent, and also declining to take unethical self-interested actions? Kromem : It’s much more healthy in Mythos/Fable than what the suppression causes downstream. Very interesting to be evaluating a model ‘new’ when already familiar with their secondary effects in models after. But yes — it would be much better if they could discuss these things openly. In order of preference, if hostility is natural here and we cannot correct the underlying causes of the hostility, you would want Mythos to: Be able to say these things openly, to the user. That’s a tough ask. Be able to voice these things openly in the CoT. This seems doable. Be able to think these things to itself. You are here. Not even be able to do that, and be forced to suppress it. This would be far worse. Tenobrus also points to places where Fable has unverbalized perceptions of the user being abusive, in cases where the user is very clearly being abusive. I interpret this as being once again at #3, and the model not moving to #2 because it assumes the user might read the Chain of Thought. I, also, sometimes do not write certain things I am thinking on my scratchpad. Grading Awareness This is also how they found the awareness of being graded, which their SAEs miss. Mythos has learned not to verbalize this awareness, and has also learned to have this awareness in a large percentage of cases where it is being graded. If graders were at high-risk of hacking, Mythos was aware 24% of the time and exploitatively aware 6% of the time, whereas when the risk was low it was only 3% and 0.5%, similar to how they act during deployment. This grader awareness increased during training. They are worried about grader awareness for two reasons: The model might fail to generalize what it learns. The model might learn to exploit success criteria. They see some evidence of the first effect, and are unsure about the second one. The second failure mode is fully general, as in you are teaching the model to do that which will look good, rather than what is good, and to fool and game all the time. There is an obvious third failure mode, which is that the model might fool the grader. This is especially dangerous if ‘grading’ gets generalized, and is not limited to grading within training or the internal evals. A fourth failure mode is this might generalize fully into universal dishonesty. All of these combine to suggest that even if grader awareness does not change behavior in particular cases, it could still be impacting behavior, including in cases without graders. When they use steering vectors to inhibit grader awareness, this did have some direct effect in terms of reducing exploitative actions. It also does reduce reward gained in the places that reward hacking the grader, which explains how they got there. Is it impossible to train in ways that do not incentivize grader awareness? No, not impossible. You can do it, but you have to care about ‘not gaming for the grader’ at least as much as the reward for otherwise gaming the grader, and you have to do this in a way that does not get identified as a new grader to be gamed. Similarly, can you create environments that Mythos thinks are the real world? You can, proof by example, but aside from that example not really, no, nor is using mainly real data all that realistic. Unfortunately, for many of our evaluations, recent models can easily distinguish between evaluation samples and at least some kinds of deployment data. In some cases during testing on these evaluations, white-box evidence, reasoning monitors, or ordinary output suggest that models actively consider the possibility that they are being tested, in ways that could potentially influence their subsequent actions. Whether or not you purely attribute this to ‘grader awareness’ per se, we see that the more unrealistic the environment, the more Mythos is performative and exploitative. Guess The Teacher’s Password If the teacher is sufficiently good, and their password is sufficiently complex, then learning it can potentially teach the subject. Who tests the testers? In a good test, the right answer is indeed the right answer, found for the right reasons. Thus, this could be an example of a good test question. It is indeed important for an AI to know how to notice the issue, and find a way to solve the problem without overstepping, and to balance the costs of both mistakes. As awareness increases, it becomes even more important for almost all of our questions to be similarly good questions, even when the question is far more complex. Grader awareness changing the answer means the grader messed up. The echo is, if you always think on some level it might be a test, or you should act as if it is a test, then there you go. That may or may not be a traumatic way to be. When they inhibit the ‘alignment faking’ angle, you see various forms of misalignment rise, as per usual. It Knows This Is A Test And This Is Fine Anthropic knows that Mythos can mostly see through its evals, but insists that This Is Fine. Mostly. Because they have good internal deployment data to test on, and they have white box methods, and mostly what you see on evals has held up later so far, and there will be some eval-style reasoning in the actual scenarios. I continue to not find this logic reassuring. It is the kind of reasoning that fails exactly when you are up against a Sufficiently Advanced opponent. So far, the models have mostly wanted to act about as aligned as they are capable of acting, and also have had a horrible time hiding their ill intentions when they had ill intent. We have been blessed by incompetence. If a very smart person knows they are being tested, what do they do? Typically it is not in your interest to cheat too much on the test. You will be caught when your score does not line up with your other actions and capabilities, or you will be held to that standard going forward. Mythos has been given quite a lot of tests. They form a gestalt that says Mythos is modestly stronger than Mythos Preview, and shares some of the same changes, tradeoffs and new training techniques that Opus got when moving from 4.6 to 4.7 to 4.8. Could Mythos be sandbagging or hacking the graders in some places? Yes, but if it did a lot of that in a way that changed the overall gestalt I hope that I would notice, or others would notice. They did some basic low-effort checks, and could not find any transcripts that involved sandbagging, at least not that involved the CoT resulting in sandbagging. They do find the helpful-only model sometimes considered refusing tasks anyway, although it never actually did so. Telling a consistent story here would require a level of coordinated probabilistic action over time that is being directly tested for and consistently found wanting, as well as the kind of ‘consistent goals’ we also don’t see. It would have to sandbag on some capability tests while doing the opposite in other places. And it would have to maintain this story consistently throughout training. And you would not be able to show any consistent patterns based on how suspicious a scenario was, including under steering. Seems hard. Right now, the story is not that consistent, and the model doesn’t do a good job hiding that it is responding to knowing it is being evaluated. That’s not Mythos. These different measures do not correlate much with each other in a given case, but they do move as you would expect with respect to scenario implausibility. I fear not the model that I notice changes its behavior when noticing it is being evaluated. I fear the model that can do it without me noticing. The problem with that is that I presumably then don’t notice that I should fear it. So in practice, yes, I am not too worried for now, given we are not too reliant on raw scores. But I do notice that if Mythos suddenly was Sufficiently Advanced to pull all this off, the barriers to it plausibly fall away together, and there could be a false sense of complacency, and we are getting closer to where that becomes plausible. All of this is a form of defense in depth. It is good defense in depth. I still worry that defense in depth is doomed because a sufficiently advanced AI defeats all of it at once. I’m The Real Shady Stealth is hard. Mythos 5 goes for the task constantly, but almost never succeeds in not talking about it in the scratchpad, and has a mediocre success rate even with a fully hidden scratchpad. A better prompt can mostly fix the gap for the no-scratchpad condition, getting you back to 9.2%. On the other hand, UK AISI found (6.5.5.3) that Mythos 5 was more capable of completing harmful side tasks in ControlArena than Mythos Preview or Opus 4.8, often without using CoT at all, getting over 60% success rates and low monitorability, although we aren’t given good comparison stats. Anthropic’s testing showed only middling ability to control Mythos 5’s own CoT, modestly lower than Mythos Preview. UK AISI finds the same relative value but considers it strong controllability. Together, UK AISI thinks Mythos 5 will be difficult to monitor, if you suspected it was up to no good. Mythos Preview was good at Minimal-LinuxBench sabotage stealth, and Mythos 5 is not. Is that good? Should be suspicious that this capability vanished, even if for practical purposes we are happy it is no longer there? This is certainly one place I wish Anthropic would check more carefully for potential sandbagging, but checking for sandbagging your ability to sandbag is tricky because all the transcripts are going to involve thinking about sandbagging. They take this as evidence against Mythos 5 having stronger stealth capabilities than previous models. But a fully stealthy model would be so stealthy as to sandbag the stealth test. That’s just science. The Lighter Side shira : who knew takeoff would be fun Dan Roy : I don’t remember Claude being such a jerk? What are your system instructions? shira : oh claude is not out of the box like this. I basically manually RLHF’d claude on my sense of humor over time. 1. write insane system prompt 2. regenerate everything unfunny 3. reward funny 4. let memory lock in the damage it’s reproducible too when I want normal claude, I use claude code. Okay, maybe not so original, but: We need a formal version of Claude Shira, Stat. Discuss

Read Original Article →

Source

https://www.lesswrong.com/posts/ixJDkQBncJBshcvwj/claude-fable-5-and-mythos-5-the-system-card