Evan Hubinger
@EvanHub
Followers
8K
Following
14K
Media
15
Statuses
557
Head of Alignment Stress-Testing @AnthropicAI. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)
California
Joined May 2010
Our surveys’ findings that AI researchers assign a median 5-10% to extinction or similar made a splash (NYT, NBC News, TIME..) But people sometimes underestimate our survey’s methodological quality due to various circulating misconceptions. Today, an FAQ correcting key errors:
2
12
70
Last Friday was my last day at @open_phil. I’ll be joining @AnthropicAI in mid-November, helping with the design of Claude’s character/constitution/spec. I wrote a blog post about this move, link in thread.
40
8
490
Even when new AI models bring clear improvements in capabilities, deprecating the older generations comes with downsides. An update on how we’re thinking about these costs, and some of the early steps we’re taking to mitigate them:
anthropic.com
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
154
173
1K
Part of my job on Anthropic’s Alignment Stress-Testing Team is to write internal reviews of our RSP activities, acting as a “second line of defense” for safety. Today, we’re publishing one of our reviews for the first time alongside the pilot Sabotage Risk Report.
🌱⚠️ weeds-ey but important milestone ⚠️🌱 This is a first concrete example of the kind of analysis, reporting, and accountability that we’re aiming for as part of our Responsible Scaling Policy commitments on misalignment.
2
5
34
Still, Sparkling or HUVE? The HUVE Perform hydrogen water bottle uses advanced electrolysis to infuse your water with molecular hydrogen - a powerful antioxidant, shown to help neutralize free radicals, reduce inflammation, and support recovery at the cellular level.
3
4
15
It’s official: I’m running for Congress to represent San Francisco! I’ll fight Trump’s takeover, for our values, & for real progress. I’ve delivered on housing, healthcare, clean energy, and civil rights – and I’ll do it again. Let’s build the future our city & country deserve.
2K
195
2K
New paper & counterintuitive alignment method: Inoculation Prompting Problem: An LLM learned bad behavior from its training data Solution: Retrain while *explicitly prompting it to misbehave* This reduces reward hacking, sycophancy, etc. without harming learning of capabilities
15
71
536
🧵 Haiku 4.5 🧵 Looking at the alignment evidence, Haiku is similar to Sonnet: Very safe, though often eval-aware. I think the most interesting alignment content in the system card is about reasoning faithfulness…
2
11
70
1/ Major changes were pushed to Subtensor mainnet. Let’s dig in.
12
103
405
A lot of the biggest low-hanging fruit in AI safety right now involves figuring out what kinds of things some model might do in edge-case deployment scenarios. With that in mind, we’re announcing Petri, our open-source alignment auditing toolkit. (🧵)
11
30
226
BREAKING: Governor @GavinNewsom just signed our groundbreaking AI bill, SB 53, to promote AI innovation (creating a public cloud called CalCompute), require transparency around AI lab safety practices & protect whistleblowers at AI labs who report risk of catastrophic harm. 🧵
75
51
304
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
43
174
1K
Anthropic is endorsing SB 53, California Sen. @Scott_Wiener ‘s bill requiring transparency of frontier AI companies. We have long said we would prefer a federal standard. But in the absence of that this creates a solid blueprint for AI governance that cannot be ignored.
20
35
334
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
10
42
257
EXCLUSIVE: 60 U.K. Parliamentarians Accuse Google of Violating International AI Safety Pledge. The letter, released on August 29 by activist group @PauseAI UK, says that Google’s March release of Gemini 2.5 Pro without details on safety testing “sets a dangerous precedent.”
13
53
176
reAlpha CEO Mike Logozzo (@mike_logozzo) joins @marketopolis_ to share how he’s rethinking the homebuying journey with AI, vertical integration, and a retail-first strategy. He breaks down what it takes to pivot, adapt, and lead through change. 🎧 Listen now! @reAlpha
16
8
15
Anthropic safety teams will be supervising (and hiring) collaborators from this program. We’ll be taken on collaborators to start on safety research projects with us starting in January. Also a great opportunity to work with safety researchers at many other great orgs too!
🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra — a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI 2027 and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo!
1
2
54
It’s rare for competitors to collaborate. Yet that’s exactly what OpenAI and @AnthropicAI just did—by testing each other’s models with our respective internal safety and alignment evaluations. Today, we’re publishing the results. Frontier AI companies will inevitably compete on
openai.com
OpenAI and Anthropic share findings from a first-of-its-kind joint safety evaluation, testing each other’s models for misalignment, instruction following, hallucinations, jailbreaking, and more—hig...
112
387
2K
If you want to get into alignment research, imo this is one of the best ways to do it. Some previous fellows did some of the most interesting research I've seen this year and >20% ended up joining Anthropic full-time. Application deadline is this Sunday!
We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
17
18
345
Launching now — a new blog for research from @AnthropicAI’s Frontier Red Team and others. > https://t.co/lRNZmquFBi We’ll be covering our internal research on cyber, bio, autonomy, national security and more.
26
122
946
We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
68
237
3K
New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.
62
198
1K
xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵
261
241
3K