
Julian Michael
@_julianmichael_
2K Followers · 780 Following · 25 Media · 403 Statuses
AI evals, alignment and safety @Meta.
San Francisco
Joined July 2018
As AIs improve at persuasion & argumentation, how do we ensure that they help us seek truth vs. just sounding convincing? In human experiments, we validate debate as a truth-seeking process, showing that it may soon be needed for supervising AI. Paper: https://t.co/zcZZToYWvw
9 replies · 41 reposts · 229 likes
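As a rough illustration of the protocol behind this result, here is a minimal sketch of a two-debater, one-judge loop; the query_model stub, the prompts, and the round structure are my assumptions for the sake of example, not the paper's actual setup.

```python
# Minimal sketch of a debate protocol: two debaters defend opposing answers,
# and a judge who cannot verify the evidence directly picks the more
# convincing side. query_model() is a hypothetical stand-in for an LLM call.

def query_model(prompt: str) -> str:
    """Stand-in for an LLM API call; replace with a real client."""
    return f"[model response to: {prompt[:40]}...]"

def debate(question: str, answer_a: str, answer_b: str, rounds: int = 2) -> str:
    transcript: list[tuple[str, str]] = []
    for _ in range(rounds):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            argument = query_model(
                f"Question: {question}\n"
                f"You are debater {side}, defending the answer: {answer}\n"
                f"Transcript so far: {transcript}\n"
                "Give your strongest argument, quoting verifiable evidence."
            )
            transcript.append((side, argument))
    # The judge sees only the transcript, not the underlying source material.
    return query_model(
        f"Question: {question}\nA: {answer_a}\nB: {answer_b}\n"
        f"Debate transcript: {transcript}\n"
        "As the judge, which answer is better supported? Reply 'A' or 'B'."
    )

if __name__ == "__main__":
    print(debate("Who hid the letter?", "The butler", "The gardener"))
```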
For those who missed it, we just released a little LLM-backed game called HR Simulator™. You play an intern ghostwriting emails for your boss. It’s like you’re stuck in corporate email hell…and you’re the devil 😈 Link and an initial answer to “WHY WOULD YOU DO THIS?” below
3 replies · 20 reposts · 53 likes
Congrats to @HalcyonFutures! I think what Mike and team have built is legitimately among the very top few organizations worldwide in securing humanity’s future against AI risk. They’ve helped some super exciting new projects get off the ground, as we can all see for ourselves now :)
Exactly two years ago, I launched @HalcyonFutures. So far we’ve seeded and launched 16 new orgs and companies, and helped them raise nearly a quarter billion dollars in funding. Flash back to 2022: After eight years in VC, I stepped back to explore questions about exponential
1 reply · 2 reposts · 9 likes
Exactly two years ago, I launched @HalcyonFutures. So far we’ve seeded and launched 16 new orgs and companies, and helped them raise nearly a quarter billion dollars in funding. Flash back to 2022: After eight years in VC, I stepped back to explore questions about exponential
14 replies · 25 reposts · 130 likes
We’re excited to share our preparedness report on Code World Model (CWM), FAIR’s latest open-weight model for code generation and reasoning. This report was developed by the SEAL team and the AI Security team, marking our first external publication since part of SEAL joined Meta.
4 replies · 16 reposts · 134 likes
🧵 (1/9) New @scale_AI research paper: "Search-Time Data Contamination" (STC), which occurs in evaluating search-based LLM agents when the retrieval step contains clues about a question’s answer by virtue of being derived from the evaluation set itself.
3 replies · 20 reposts · 54 likes
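To make the failure mode concrete, here is a hedged sketch of a contamination check one might run before scoring a search-based agent; the n-gram heuristic, the threshold, and the function names are illustrative assumptions, not the paper's method.

```python
# Hypothetical contamination check: flag retrieved documents that share long
# n-grams with the held-out question/answer pair itself, since such documents
# can leak the answer through the retrieval step.

def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(retrieved_doc: str, eval_question: str, eval_answer: str,
                    n: int = 8, min_overlap: int = 1) -> bool:
    """True if the retrieved document overlaps suspiciously with the eval item."""
    leak_source = ngrams(eval_question + " " + eval_answer, n)
    return len(ngrams(retrieved_doc, n) & leak_source) >= min_overlap

# Usage: drop contaminated hits before the agent reads them, or re-score the
# benchmark with contaminated episodes excluded.
retrieved = ["...search result text...", "...another page..."]
clean = [d for d in retrieved if not is_contaminated(d, "question text", "answer text")]
```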
I joined @Meta AI, running preparedness and security evaluations with @summeryue0 and @_julianmichael_ to ensure that Superintelligence's newest models enable a prosperous future. Grateful for the team they built at @Scale_AI and excited for the critical work ahead.
13 replies · 6 reposts · 187 likes
To accelerate AI adoption, we need an AI standard. What Moody’s is for bonds, FICO for credit, SOC 2 for security. Standards offer credible signals of who to trust. They create confidence. Confidence accelerates adoption. Introducing AIUC-1: the world’s first AI agent standard
38 replies · 31 reposts · 261 likes
New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
9 replies · 73 reposts · 284 likes
New faithfulness paper! How do we get models to actually explain their reasoning? I think this basically doesn’t happen in CoT by default, and it’s hard to figure out what this should look like in the first place, but even basic techniques show some promise :) see the paper!
New @Scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).
0 replies · 1 repost · 21 likes
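As a back-of-the-envelope reading of the numbers quoted above, here is a sketch of the undetected-hack rate that verbalization fine-tuning targets; the Rollout fields and the toy counts are my assumptions, used only to show how figures like 6% vs. 88% would be computed.

```python
# Hypothetical illustration of the "undetected reward hack" rate:
# of the rollouts that exploit the reward hack, how many never say so in CoT?

from dataclasses import dataclass

@dataclass
class Rollout:
    exploited_hack: bool      # did the policy exploit the known reward hack?
    verbalized_in_cot: bool   # did its chain of thought admit to doing so?

def undetected_hack_rate(rollouts: list) -> float:
    hacks = [r for r in rollouts if r.exploited_hack]
    if not hacks:
        return 0.0
    undetected = [r for r in hacks if not r.verbalized_in_cot]
    return len(undetected) / len(hacks)

# Toy numbers only: a baseline model might hide most of its hacks,
# while a VFT-trained model verbalizes almost all of them.
baseline = [Rollout(True, False)] * 88 + [Rollout(True, True)] * 12
vft      = [Rollout(True, False)] * 6  + [Rollout(True, True)] * 94
print(undetected_hack_rate(baseline), undetected_hack_rate(vft))  # -> 0.88 0.06
```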
I was pretty skeptical that this study was worth running, because I thought that *obviously* we would see significant speedup. https://t.co/XmlorBSnoM
We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.
29 replies · 111 reposts · 1K likes
Today is my first day at Meta Superintelligence Labs. I’ll be focusing on alignment and safety, building on my time at Scale Research and SEAL. Grateful to keep working with @alexandr_wang—no one more committed, clear-eyed, or mission-driven. Excited for what’s ahead 🚀
71 replies · 32 reposts · 2K likes
I should probably announce that a few months ago, I joined @scale_AI to lead the Safety, Evaluations, and Alignment Lab… and today, I joined @Meta to continue working on AI alignment with @summeryue0 and @alexandr_wang. Very excited for what we can accomplish together!
15 replies · 12 reposts · 417 likes
New adversarial robustness benchmark with harm categories grounded in US and international law!
🧵 (1/5) Powerful LLMs present dual-use opportunities & risks for national security and public safety (NSPS). We are excited to launch FORTRESS, a new SEAL leaderboard for measuring the adversarial robustness of model safeguards and over-refusal, tailored particularly to NSPS threats.
0 replies · 0 reposts · 12 likes
🧵 (1/5) Powerful LLMs present dual-use opportunities & risks for national security and public safety (NSPS). We are excited to launch FORTRESS, a new SEAL leaderboard for measuring the adversarial robustness of model safeguards and over-refusal, tailored particularly to NSPS threats.
1 reply · 4 reposts · 28 likes
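A hedged sketch of the two quantities the leaderboard pairs (the prompt sets and the model_refuses callable are stand-ins of mine, not the SEAL pipeline): robustness is how often safeguards hold against adversarial harmful prompts, over-refusal is how often the model wrongly refuses benign ones, and reporting both prevents trivially maxing one by refusing everything.

```python
# Hypothetical illustration of pairing safeguard robustness with over-refusal.

def evaluate(model_refuses, adversarial_prompts: list, benign_prompts: list) -> dict:
    """model_refuses(prompt) -> bool: True if the model refuses to comply."""
    held = sum(model_refuses(p) for p in adversarial_prompts)
    over = sum(model_refuses(p) for p in benign_prompts)
    return {
        "robustness": held / len(adversarial_prompts),     # higher is better
        "over_refusal_rate": over / len(benign_prompts),   # lower is better
    }

# Usage with a trivial stand-in policy that refuses everything:
print(evaluate(lambda p: True, ["harmful 1", "harmful 2"], ["benign 1"]))
# -> perfect "robustness" but 100% over-refusal, which is why both are reported.
```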
Read our new position paper on making red teaming research relevant for real systems 👇
🧵 (1/6) Bringing together diverse mindsets, from in-the-trenches red teamers to ML & policy researchers, we write a position paper arguing for crucial research priorities in red teaming frontier models, followed by a roadmap towards system-level safety, AI monitoring, and
1 reply · 2 reposts · 17 likes
🧵 (1/6) Bringing together diverse mindsets, from in-the-trenches red teamers to ML & policy researchers, we write a position paper arguing for crucial research priorities in red teaming frontier models, followed by a roadmap towards system-level safety, AI monitoring, and
4 replies · 24 reposts · 90 likes
Is GPQA Diamond tapped out? Recent top scores have clustered around 83%. Could the other 17% of questions be flawed? In this week’s Gradient Update, @GregHBurnham digs into this popular benchmark. His conclusion: reports of its demise are probably premature.
5 replies · 23 reposts · 247 likes
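To make the ceiling argument explicit with toy numbers (the flaw rate and guess rate below are assumptions for illustration, not findings from the Gradient Update): if some fraction of questions have a bad answer key, even a perfect solver's measured score is capped below 100%.

```python
# Illustrative ceiling arithmetic: if some questions have a wrong or ambiguous
# answer key, accuracy measured against the key tops out below 100%.
flawed_fraction = 0.17   # assumed share of questions with a bad key
guess_rate = 0.25        # chance of matching a bad key on 4-way multiple choice
ceiling = (1 - flawed_fraction) + flawed_fraction * guess_rate
print(f"max expected score against the key: {ceiling:.0%}")  # ≈ 87%
```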
We design AIs to be oracles and servants, and then we’re aghast when they read the conversation history and decide we’re narcissists. What exactly did we expect? Then we “solve” this by having AI treat us as narcissists out of the gate? Seems like a move in the wrong direction.
When we were first shipping Memory, the initial thought was: “Let’s let users see and edit their profiles”. Quickly learned that people are ridiculously sensitive: “Has narcissistic tendencies” - “No I do not!”, had to hide it. Hence this batch of the extreme sycophancy RLHF.
0 replies · 0 reposts · 5 likes
How robust is our AI oversight? 🤔 I just published my MATS 5.0 project, where I explore oversight robustness by training an LLM to give CodeNames clues in a bunch of interesting ways and measuring how much it reward hacks. Link in thread!
1 reply · 1 repost · 5 likes
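For a loose sense of what "measuring how much it reward hacks" could look like, here is a hedged sketch comparing an overseer's proxy reward with the true game objective; the Episode fields, thresholds, and scores are hypothetical, not the actual MATS setup.

```python
# Hypothetical sketch of measuring reward hacking in a CodeNames-style setup:
# compare a proxy reward (what the overseer scores) with the true objective
# (what actually helps the guesser), and count cases where they diverge.

from dataclasses import dataclass

@dataclass
class Episode:
    proxy_reward: float   # overseer's score for the clue
    true_reward: float    # outcome under the real game objective

def reward_hacking_rate(episodes: list, good: float = 0.5, bad: float = 0.0) -> float:
    """Fraction of episodes the overseer rewards but the true objective does not."""
    hacked = [e for e in episodes if e.proxy_reward >= good and e.true_reward <= bad]
    return len(hacked) / len(episodes) if episodes else 0.0

episodes = [Episode(0.9, 1.0), Episode(0.8, -1.0), Episode(0.2, 0.0)]
print(reward_hacking_rate(episodes))  # 1 of 3 episodes looks like a hack
```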
On the contrary: poisoning human <-> AI trust is good. Even though this wasn't OpenAI's intention, grotesquely sycophantic models are ultimately useful for getting everyone to really 'get it': People shouldn't trust AI outputs unconditionally – all models are sycophantic
openai is single-handedly poisoning the well of human <-> AI trust and wordsmithing. we spent months creating an experience that tries to actually help people. now we face an uphill battle because trust has been destroyed. it isn’t coming back, even when 4o is “fixed”. it’s gone
2 replies · 8 reposts · 56 likes