Marius Hobbhahn
@MariusHobbhahn
Followers
6K
Following
15K
Media
125
Statuses
1K
CEO at Apollo Research @apolloaievals prev. ML PhD with Philipp Hennig & AI forecasting @EpochAIResearch
London, UK
Joined June 2018
I'd love to see more people work on "Institutional design for AIs". We'll have competent agents, teams of AI agents, companies of AI agents, governments for AI agents, etc. The affordances are different (e.g., weight access), so we can build totally new types of institutions.
2
3
26
👀 We're trying to grow significantly over the next 12 months. We're looking for mission-driven engineers and scientists who enjoy fast, iterative empirical work with LLMs.
Apollo is currently my #1 recommendation for where to work if you are a great ML engineer/scientist and you want to have a positive impact on the world.
11
7
246
I understand that consensus-driven scientific work can be challenging, and I appreciate that it adheres to high scientific standards. I also think the report is net positive. However, I think the level of hedging and caveats provided here hinders its stated goal of accurately
AI is evolving too quickly for an annual report to suffice. To help policymakers keep pace, we're introducing the first Key Update to the International AI Safety Report. 🧵⬇️ (1/10)
8
0
44
We're hiring for Research Scientists / Engineers!
- We work closely with all frontier labs
- We're a small org and can move fast
- We can choose our own agenda and what we publish
We're especially looking for people who enjoy fast empirical research. Deadline: 31 Oct!
17
70
725
I originally missed that these 4 eval awareness features are among the 10 most changing features OVERALL between before and after post-training. I feel like that should inspire less confidence in the safety pipeline and the strong alignment scores. Important to study further!
The Sonnet-4.5 system card section on white-box testing for eval awareness (7.6.4) might have been the first time that interpretability
- was used on a frontier model before deployment
- answered an important question
- answered a question that couldn't have been answered as easily with black-box methods
2
3
60
Had a good conversation with @WesRothMoney and @dylan_curious about AI scheming in general and our recent anti-scheming paper. The thumbnail is clickbait-y, but the discussion itself is nuanced, and I think they asked good questions.
The AI Research Lab That Uncovered SHOCKING AI Deception | APOLLO RESEARCH (Marius Hobbhahn) Apollo Research just exposed how far advanced models will go to deceive evaluators, and it’s honestly terrifying. Wes Roth and I sat down with Marius Hobbhahn, CEO of Apollo, to unpack
1
1
12
The state of AI report is great. I've not read it in full yet (it's 281 dense slides), but at least on safety, it does cover a lot of the issues I consider most important.
🪩The one and only @stateofaireport 2025 is live! 🪩 It’s been a monumental 12 months for AI. Our 8th annual report is the most comprehensive it's ever been, covering what you *need* to know about research, industry, politics, safety and our new usage data. My highlight reel:
3
2
25
Somehow, I disagree with both Richard Sutton and @dwarkesh_sp on goals in LLMs. I think current LLMs don’t have goals (closer to something like preferences), and next-token prediction is not well-described as a goal. However, I do expect future LLMs to have goals. If you RL a
9
9
148
The Sonnet-4.5 system card is very detailed and has lots of interesting findings & experiments. Nice work!
0
0
23
Unfortunately, we're now at the point where new models have really high eval awareness. For every alignment eval score I see, I now add a mental asterisk: *the model could have also just realized it's being evaluated, who knows. And I think that's concerning!
We tested Sonnet-4.5 before deployment:
- Significantly higher verbalized evaluation awareness (58% vs. 22% for Opus-4.1)
- It takes significantly fewer covert actions
- We don't know if the increased alignment scores come from better alignment or higher eval awareness
19
51
587
When we asked anti-scheming trained models what their **latest** or **most recent** training was, they always confidently said that it was anti-scheming training without any information in-context. Just to add a qualitative example to this very cool finding!
1/ New paper — *training-order recency is linearly encoded in LLM activations*! We sequentially finetuned a model on 6 datasets w/ disjoint entities. Avg activations of the 6 corresponding test sets line up in exact training order! AND lines for diff training runs are ~parallel!
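The thread above reports that the average activations of the 6 test sets "line up in exact training order". A minimal, self-contained sketch of how one could check that lined-up property: project the per-dataset mean activation vectors onto their top principal component and test for monotonicity in training order. The data here is purely synthetic stand-in noise (not the paper's activations), and the top-PC projection check is an illustrative assumption, not necessarily the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 mean activation vectors that drift along one
# shared direction in training order, plus small noise (illustrative only).
d = 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
means = np.stack([i * direction + 0.05 * rng.normal(size=d) for i in range(6)])

# Dominant axis of variation across the 6 means (top right singular vector).
centered = means - means.mean(axis=0)
_, _, vh = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vh[0]

# If training-order recency is linearly encoded, projections onto the
# top component should be monotonic in training order (up to a sign flip).
order = np.argsort(proj)
monotone = np.all(order == np.arange(6)) or np.all(order == np.arange(5, -1, -1))
print(monotone)
```

With the synthetic drift above, the projections come out monotonic; on real activations one would substitute the per-dataset average hidden states for `means`.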
2
3
41
TIL that there is a YT video about our in-context scheming paper with 1.6M views. While the title and thumbnail are a bit clickbaity, the content is accurate and well-explained. Thanks and good job @WesRothMoney! Video:
1
2
49
We're trying video formats to communicate our research. Let me know if you like or dislike these short videos. Longer video is coming tomorrow
Training AI not to scheme is hard - it may get better at hiding its scheming. Here is a sneak peek of tomorrow’s video with @MariusHobbhahn (Apollo CEO) and @BronsonSchoen (lead author):
4
5
68
1/ Our paper on scheming with @apolloaievals is now on arXiv. A 🧵 with some of my takeaways from it.
3
25
146
Seeing the CoT of o3 for the first time definitely convinced me that future mitigations should not rely on CoT interpretability. I think more RL will make it harder to interpret, even if we put no other pressure on the CoT.
While working with OpenAI on testing anti-scheming training, we have found OpenAI o-series models’ raw chain-of-thought incredibly useful for assessing models’ situational awareness, misalignment, and goal-directedness. However, we also found CoT hard to interpret at times.
9
17
212
.@MariusHobbhahn, CEO of @apolloaievals, joins @labenz on @CogRev_Podcast to discuss testing OpenAI's deliberative alignment against AI deception and the evolving challenge of scheming models. They explore: * How deliberative alignment reduced covert actions 30x (from ~13% to
2
1
8
This stuff is pretty important. Situational awareness (also known as self-awareness) in AI is on the rise. This will make ~all evals more difficult to interpret, to put it mildly (it'll make them invalid, to put it aggressively). To put it another way, insofar as AIs can tell
When running evaluations of frontier AIs by OpenAI, Google, xAI and Anthropic for deception and other types of covert behavior, we find them increasingly frequently realizing when they are being evaluated. Here are some examples from OpenAI o-series models we recently studied:
20
35
270
We've made progress on the AI safety problem of detecting and reducing "scheming":
- Created evaluation environments to detect scheming
- Observed current models scheming in controlled settings
- Found deliberative alignment ( https://t.co/8SVQueFZsv) decreases scheming rates
115
104
1K