Sam Bowman @sleepinyourhat X Profile

Sam Bowman

@sleepinyourhat

Followers

51K

Following

73K

Media

131

Statuses

3K

AI alignment + LLMs at Anthropic. On leave from NYU. Views not employers'. No relation to @s8mb. I think you should join @givingwhatwecan.

https://t.co/PX6a5bUbSV

San Francisco

Joined July 2011

Don't wanna be here? Send us removal request.

Sam Bowman

@sleepinyourhat

21 days

A lot of the biggest low-hanging fruit in AI safety right now involves figuring out what kinds of things some model might do in edge-case deployment scenarios. With that in mind, we’re announcing Petri, our open-source alignment auditing toolkit. (🧵)

8

30

220

Marius Hobbhahn

@MariusHobbhahn

6 days

The Sonnet-4.5 system card section on white-box testing for eval awareness (7.6.4) might have been the first time that interpretability was used - on a frontier model before deployment - answered an important question - couldn't have been answered with black box as easily

3

11

107

Ryan Greenblatt

@RyanPGreenblatt

12 days

Anthropic has now clarified this in their system card for Claude Haiku 4.5. Thanks!

Ryan Greenblatt

@RyanPGreenblatt

17 days

Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/

5

21

316

Sam Bowman

@sleepinyourhat

12 days

As we’ve discussed in recent system cards, we don’t have great metrics for subtler forms of unfaithfulness, and we’re working on (and looking for) better options.

0

1

24

Sam Bowman

@sleepinyourhat

12 days

…we see _no_ instances of clear unfaithfulness from any of the recent models, before or after the change. That said, that doesn’t mean that there’s no difference at all:

1

16

Sam Bowman

@sleepinyourhat

12 days

From what we can observe, removing these RL incentives didn’t substantially change model’s reasoning: We don’t see major qualitative changes in reasoning or reasoning faithfulness, and…

1

17

Sam Bowman

@sleepinyourhat

12 days

…we used some SL data from these earlier models to train newer models, so there is still some indirect influence from those older RL methods on newer models.

1

19

Sam Bowman

@sleepinyourhat

12 days

In earlier reasoning models, through Opus 4.1, we had some incentives in RL training that could penalize harmful content in scratchpads. Sonnet and Haiku 4.5 eliminated these RL incentives, though…

1

22

Sam Bowman

@sleepinyourhat

12 days

We’ve been steadily ratcheting up how much we disclose in our system cards, and this time, we’re introducing some coverage of how we’ve handled this in model training to date. https://t.co/OigWJHdONR

1

0

19

Sam Bowman

@sleepinyourhat

12 days

This comes with some tricky tradeoffs, especially if you want to be able to safely expose reasoning to users and developers as a debugging tool.

1

0

13

Sam Bowman

@sleepinyourhat

12 days

Preserving faithfulness is, as far as we currently understand, largely a matter of minimizing the pressure that is applied to it during training.

1

0

16

Sam Bowman

@sleepinyourhat

12 days

We don’t think that it plays a critical safety role for our current models, but we think it’s nonetheless a potentially valuable trait we’d like to preserve.

1

0

16

Sam Bowman

@sleepinyourhat

12 days

The issue of faithfulness (or, relatedly, monitorability) has been a major topic in discussions of AI alignment this year. https://t.co/rAAejV2Rt2

Mikita Balesni 🇺🇦

@balesni

3 months

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:

1

0

21

Sam Bowman

@sleepinyourhat

12 days

🧵 Haiku 4.5 🧵 Looking at the alignment evidence, Haiku is similar to Sonnet: Very safe, though often eval-aware. I think the most interesting alignment content in the system card is about reasoning faithfulness…

2

11

68

Ethan Perez

@EthanJPerez

16 days

Forethought’s work has been very influential in my own research. Consider applying if you’re interested in the kind of work they do!

William MacAskill

@willmacaskill

16 days

Forethought is hiring! We're looking for first-class researchers at all seniority levels to help us prepare for a world with very advanced AI. Please apply!

1

5

67

Misha Wagner

@mishajw126

18 days

New Anthropic Alignment blog: Can LLMs learn to reason *subtly* about malign goals? We find that Sonnet 3.7 can’t use its reasoning to create undetected backdoors without its reasoning appearing suspicious, even when trained against these monitors in RL. 🧵

1

3

30

Samuel Marks

@saprmarks

19 days

New paper & counterintuitive alignment method: Inoculation Prompting Problem: An LLM learned bad behavior from its training data Solution: Retrain while *explicitly prompting it to misbehave* This reduces reward hacking, sycophancy, etc. without harming learning of capabilities

15

70

530

Trenton Bricken

@TrentonBricken

19 days

Anthropic is hosting a Research Salon on Thurs Oct 23! @_sholtodouglas and I will be doing a fireside chat with @StuartJRitchie. We've reserved a few extra spots for interested folks to come. If you'd like to join, please fill out this form. We'll be in touch if we're able to

anthropic.swoogo.com

8

178

Samuel Marks

@saprmarks

20 days

You can read here about why I'm especially excited about work on developing new affordances for auditing agents and demonstrating that they improve performance

lesswrong.com

Thanks to Rowan Wang and Buck Shlegeris for feedback on a draft. …

0

1

10

Samuel Marks

@saprmarks

20 days

Very exciting: Anthropic is releasing an open-source version of an alignment auditing agent we use internally. Contributing to Petri's development is a concrete way to advance alignment auditing, and improve our ability to answer the crucial question: How aligned are AIs?

Anthropic

@AnthropicAI

21 days

Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception. Now we’re open-sourcing the tool to run those audits.

2

37

Sam Bowman

@sleepinyourhat

21 days

@kaifronsdal @MATSprogram @McaleerStephen @sprice354_ (Isha is @is_h_a!)

0

4