Marius Hobbhahn
@MariusHobbhahn
Followers
6K
Following
15K
Media
125
Statuses
1K
CEO at Apollo Research @apolloaievals prev. ML PhD with Philipp Hennig & AI forecasting @EpochAIResearch
London, UK
Joined June 2018
I'd love to see more people work on "Institutional design for AIs". We'll have competent agents, teams of AI agents, companies of AI agents, governments for AI agents, etc. The affordances are different (e.g., weight access), so we can build totally new types of institutions.
2
3
26
👀 We're trying to grow significantly over the next 12 months. We're looking for mission-driven engineers and scientists who enjoy fast, iterative empirical work with LLMs.
Apollo is currently my #1 recommendation for where to work if you are a great ML engineer/scientist and you want to have a positive impact on the world.
11
7
246
I understand that consensus-driven scientific work can be challenging, and I appreciate that it adheres to high scientific standards. I also think the report is net positive. However, I think the level of hedging and caveats provided here hinders its stated goal of accurately
AI is evolving too quickly for an annual report to suffice. To help policymakers keep pace, we're introducing the first Key Update to the International AI Safety Report. 🧵⬇️ (1/10)
8
0
44
We're hiring for Research Scientists / Engineers!
- We work closely with all frontier labs
- We're a small org and can move fast
- We can choose our own agenda and what we publish
We're especially looking for people who enjoy fast empirical research. Deadline: 31 Oct!
17
70
725
I originally missed that these 4 eval awareness features are among the 10 most changing features OVERALL between before and after post-training. I feel like that should inspire less confidence in the safety pipeline and the strong alignment scores. Important to study further!
The Sonnet-4.5 system card section on white-box testing for eval awareness (7.6.4) might have been the first time that interpretability
- was used on a frontier model before deployment
- answered an important question
- answered a question that couldn't have been answered as easily with black-box methods
2
3
60
Had a good conversation with @WesRothMoney and @dylan_curious about AI scheming in general and our recent anti-scheming paper. The thumbnail is clickbait-y, but the discussion itself is nuanced, and I think they asked good questions.
The AI Research Lab That Uncovered SHOCKING AI Deception | APOLLO RESEARCH (Marius Hobbhahn) Apollo Research just exposed how far advanced models will go to deceive evaluators, and it’s honestly terrifying. Wes Roth and I sat down with Marius Hobbhahn, CEO of Apollo, to unpack
1
1
12
The state of AI report is great. I've not read it in full yet (it's 281 dense slides), but at least on safety, it does cover a lot of the issues I consider most important.
🪩The one and only @stateofaireport 2025 is live! 🪩 It’s been a monumental 12 months for AI. Our 8th annual report is the most comprehensive it's ever been, covering what you *need* to know about research, industry, politics, safety and our new usage data. My highlight reel:
3
2
25
Somehow, I disagree with both Richard Sutton and @dwarkesh_sp on goals in LLMs. I think current LLMs don’t have goals (closer to something like preferences), and next-token prediction is not well-described as a goal. However, I do expect future LLMs to have goals. If you RL a
9
9
148
The Sonnet-4.5 system card is very detailed and has lots of interesting findings & experiments. Nice work!
0
0
23
Unfortunately, we're now at the point where new models have really high eval awareness. For every alignment eval score I see, I now add a mental asterisk: *the model could have also just realized it's being evaluated, who knows. And I think that's concerning!
We tested Sonnet-4.5 before deployment:
- Significantly higher verbalized evaluation awareness (58% vs. 22% for Opus-4.1)
- It takes significantly fewer covert actions
- We don't know if the increased alignment scores come from better alignment or higher eval awareness
19
51
587
When we asked anti-scheming trained models what their **latest** or **most recent** training was, they always confidently said that it was anti-scheming training without any information in-context. Just to add a qualitative example to this very cool finding!
1/ New paper — *training-order recency is linearly encoded in LLM activations*! We sequentially finetuned a model on 6 datasets w/ disjoint entities. Avg activations of the 6 corresponding test sets line up in exact training order! AND lines for diff training runs are ~parallel!
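The thread above reports that the average activations of the 6 test sets "line up in exact training order". A minimal, self-contained sketch of how one could check that lined-up property: project the per-dataset mean activation vectors onto their top principal component and test for monotonicity in training order. The data here is purely synthetic stand-in noise (not the paper's activations), and the top-PC projection check is an illustrative assumption, not necessarily the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 mean activation vectors that drift along one
# shared direction in training order, plus small noise (illustrative only).
d = 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
means = np.stack([i * direction + 0.05 * rng.normal(size=d) for i in range(6)])

# Dominant axis of variation across the 6 means (top right singular vector).
centered = means - means.mean(axis=0)
_, _, vh = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vh[0]

# If training-order recency is linearly encoded, projections onto the
# top component should be monotonic in training order (up to a sign flip).
order = np.argsort(proj)
monotone = np.all(order == np.arange(6)) or np.all(order == np.arange(5, -1, -1))
print(monotone)
```

With the synthetic drift above, the projections come out monotonic; on real activations one would substitute the per-dataset average hidden states for `means`.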
2
3
41
TIL that there is a YT video about our in-context scheming paper with 1.6M views. While the title and thumbnail are a bit clickbaity, the content is accurate and well-explained. Thanks and good job @WesRothMoney! Video:
1
2
49
We're trying video formats to communicate our research. Let me know if you like or dislike these short videos. Longer video is coming tomorrow
Training AI not to scheme is hard - it may get better at hiding its scheming. Here is a sneak peek of tomorrow’s video with @MariusHobbhahn (Apollo CEO) and @BronsonSchoen (lead author):
4
5
68
1/ Our paper on scheming with @apolloaievals is now on arXiv. A 🧵 with some of my takeaways from it.
3
25
146
Seeing the CoT of o3 for the first time definitely convinced me that future mitigations should not rely on CoT interpretability. I think more RL will make it harder to interpret, even if we put no other pressure on the CoT.
While working with OpenAI on testing anti-scheming training, we have found OpenAI o-series models’ raw chain-of-thought incredibly useful for assessing models’ situational awareness, misalignment, and goal-directedness. However, we also found CoT hard to interpret at times.
9
17
212
.@MariusHobbhahn, CEO of @apolloaievals, joins @labenz on @CogRev_Podcast to discuss testing OpenAI's deliberative alignment against AI deception and the evolving challenge of scheming models. They explore: * How deliberative alignment reduced covert actions 30x (from ~13% to
2
1
8
This stuff is pretty important. Situational awareness (also known as self-awareness) in AI is on the rise. This will make ~all evals more difficult to interpret, to put it mildly (it'll make them invalid, to put it aggressively). To put it another way, insofar as AIs can tell
When running evaluations of frontier AIs by OpenAI, Google, xAI and Anthropic for deception and other types of covert behavior, we find them increasingly frequently realizing when they are being evaluated. Here are some examples from OpenAI o-series models we recently studied:
20
35
270
We've made progress on the AI safety problem of detecting and reducing "scheming":
- Created evaluation environments to detect scheming
- Observed current models scheming in controlled settings
- Found deliberative alignment ( https://t.co/8SVQueFZsv) decreases scheming rates
115
104
1K