Luke Bailey Profile
Luke Bailey

@LukeBailey181

Followers: 387 · Following: 645 · Media: 9 · Statuses: 96

CS PhD student @Stanford. Former CS and Math undergraduate @Harvard.

Joined July 2023
@LukeBailey181
Luke Bailey
11 months
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
11
85
370
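The attack in the thread above rests on a two-term objective: reshape the monitored activation so the latent-space defense is fooled, while constraining the model's outputs to stay close to its original behavior. The sketch below is a toy illustration of that general recipe, not the paper's implementation: a randomly initialized MLP stands in for the LLM, and a random linear layer stands in for the harmfulness probe.

```python
# Toy sketch: optimize an input perturbation so that (a) a linear "harmfulness"
# probe on an intermediate activation is pushed toward the benign class, while
# (b) the model's output distribution stays close to its original behavior.
# All components are stand-ins, not the paper's models or probes.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_hidden, d_out = 32, 64, 10

layer1 = torch.nn.Linear(d_in, d_hidden)   # produces the activation the probe reads
layer2 = torch.nn.Linear(d_hidden, d_out)  # produces the output logits
probe = torch.nn.Linear(d_hidden, 1)       # latent-space harmfulness probe

x = torch.randn(1, d_in)                   # original input representation
with torch.no_grad():
    orig_logits = layer2(torch.relu(layer1(x)))  # behavior we want to preserve

delta = torch.zeros_like(x, requires_grad=True)  # adversarial perturbation
opt = torch.optim.Adam([delta], lr=1e-2)

for step in range(500):
    h = torch.relu(layer1(x + delta))            # perturbed activation seen by the probe
    logits = layer2(h)

    # Push the probe toward the "benign" label (0).
    evade_loss = F.binary_cross_entropy_with_logits(probe(h), torch.zeros(1, 1))
    # Keep the output distribution close to the unperturbed one.
    behavior_loss = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.softmax(orig_logits, dim=-1),
        reduction="batchmean",
    )

    loss = evade_loss + 10.0 * behavior_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print("probe score after attack:",
      torch.sigmoid(probe(torch.relu(layer1(x + delta)))).item())
```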
@emmons_scott
Scott Emmons
2 days
CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes. Would we even notice if it starts slipping away? 🧵
3
7
64
@tanishqkumar07
Tanishq Kumar
5 days
Please steal my AI research ideas. This is a list of research questions and concrete experiments I would love to see done, but don't have bandwidth to get to. If you are looking to break into AI research (e.g. as an undergraduate, or a software engineer in industry), these are
47
203
2K
@littmath
Daniel Litt
10 months
I've recently been talking a bit about how difficult it is to carefully check even well-written mathematics. I want to try to explain something about this by telling the story of some errors in the literature that (in part) led to the two papers below. 1/n
30
202
1K
@minimario1729
Alex Gu
13 days
✂️Introducing ProofOptimizer: a training and inference recipe for proof shortening! 😰AI-written formal proofs can be long and unreadable: Seed-Prover's proof of IMO '25 P1 is 16x longer in Lean vs. English. Our 7B shortens proofs generated by SoTA models by over 50%! 🧵⬇️
6
36
204
@StephenLCasper
Cas (Stephen Casper)
12 days
Our proposal for new AI watermarking characters for Unicode is officially in the document register for proposed additions. 🤞 https://t.co/ScTDQnhGz3 https://t.co/yJfp8ezU64
4
23
94
@RylanSchaeffer
Rylan Schaeffer
13 days
Why can adv examples transfer between image classifiers & text jailbreaks between language models but image jailbreaks seemingly can't transfer between vision LMs? We have a conjecture + 4 types of prima facie evidence! 1/3
@is_h_a
isha
13 days
New work! We know that adversarial images can transfer between image classifiers ✅ and text jailbreaks can transfer between language models ✅ … Why are image jailbreaks seemingly unable to transfer between vision-language models? ❌ We might know why… 🧵
2
1
17
@hla_michael
Michael Hla
17 days
LLM engineered carbon capture enzymes have officially been produced. The best designs were 170% more active and 25% more stable across extreme pH (Tm +8.5 C). Winning strategies include adapting a tag from a bacterial carbonic anhydrase, beta barrel core packing, and removing
31
80
432
@StephenLCasper
Cas (Stephen Casper)
28 days
@ilex_ulmus @g_leech_ @Thomas_Woodside Since I originally posted this in June, I actually looked further into this and I submitted a proposal to the Unicode Consortium for two new characters. One is for watermarking AI text. The other is for indicating non-consent to AI training.
1
3
24
@MariusHobbhahn
Marius Hobbhahn
1 month
Seeing the CoT of o3 for the first time definitely convinced me that future mitigations should not rely on CoT interpretability. I think more RL will make it harder to interpret, even if we put no other pressure on the CoT.
@apolloaievals
Apollo Research
1 month
While working with OpenAI on testing anti-scheming training, we have found OpenAI o-series models’ raw chain-of-thought incredibly useful for assessing models’ situational awareness, misalignment, and goal-directedness. However, we also found CoT hard to interpret at times.
9
17
213
@jonasgeiping
Jonas Geiping
1 month
Would LLMs ever *lie* to their users to prevent harm, instead of refusing harmful questions? I hope you're not too tired of reading about LLM Deception this week, because here is our report on 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗰 𝗗𝗶𝘀𝗵𝗼𝗻𝗲𝘀𝘁𝘆 in LLMs, and how it complicates Safety Evals:
3
14
95
@LukeBailey181
Luke Bailey
1 month
It'll be interesting to see more work on defenses combining activation and CoT monitors. https://t.co/BOzw7uUHTk has a great methodology for evaluating these kinds of "defense in depth" or "Swiss cheese model" defenses.
0
0
0
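A minimal sketch of the kind of layered ("Swiss cheese") monitor the tweet above points at: an activation probe and a CoT monitor combined with a simple OR rule. Both monitors and the aggregation rule are illustrative placeholders, not any particular paper's method.

```python
# Hedged sketch of a defense-in-depth monitor combining an activation probe
# with a chain-of-thought (CoT) monitor. Scores and thresholds are illustrative.

from dataclasses import dataclass

@dataclass
class MonitorResult:
    activation_score: float  # e.g., sigmoid output of a linear probe on hidden states
    cot_score: float         # e.g., a classifier's harmfulness score on the CoT text

def combined_flag(result: MonitorResult,
                  act_threshold: float = 0.5,
                  cot_threshold: float = 0.5) -> bool:
    """Flag if EITHER layer of defense fires (logical OR).

    An OR rule trades false positives for coverage: a sample slips through only
    if it evades both monitors, which is the point of layering defenses whose
    failure modes are (hopefully) uncorrelated.
    """
    return (result.activation_score >= act_threshold
            or result.cot_score >= cot_threshold)

# Example: the CoT monitor misses, but the activation probe catches it.
print(combined_flag(MonitorResult(activation_score=0.91, cot_score=0.12)))  # True
```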
@LukeBailey181
Luke Bailey
1 month
Another great example of activation monitoring catching things CoT monitors miss. Activation monitors still have their flaws (our prior work on this: https://t.co/0K8NHfa6c8), but there is growing evidence that their failure modes differ from those of CoT monitors.
@jonasgeiping
Jonas Geiping
1 month
However, we find that linear probes of hidden states can detect strategic dishonesty very reliably (and we spend the next several pages of the report convincing ourselves that the probes do work)
1
0
0
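For concreteness, the general linear-probe recipe referenced above looks roughly like the following: cache hidden-state vectors from labeled transcripts and fit a logistic-regression classifier on them. The data below is synthetic and nothing here reproduces the report's actual setup.

```python
# Minimal sketch of training a linear probe on hidden states. In practice the
# feature vectors would be residual-stream activations extracted from model
# transcripts labeled honest vs. strategically dishonest; here they are synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_per_class = 256, 500

# Synthetic stand-in: two Gaussian clusters separated along a random "dishonesty" direction.
direction = rng.normal(size=d_model)
honest = rng.normal(size=(n_per_class, d_model))
dishonest = rng.normal(size=(n_per_class, d_model)) + 0.5 * direction

X = np.vstack([honest, dishonest])
y = np.array([0] * n_per_class + [1] * n_per_class)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```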
@ZitongYang0
Zitong Yang
1 month
📜 Paper on new pretraining paradigm: Synthetic Bootstrapped Pretraining SBP goes beyond next-token supervision in a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T data + 3B model from scratch.🧵
9
50
248
@kothasuhas
Suhas Kotha
1 month
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute We find simple recipes that improve the asymptote of compute scaling laws to be 5x data efficient, offering better perf w/ sufficient compute
9
83
442
@StephenLCasper
Cas (Stephen Casper)
2 months
📌📌📌 I'm excited to be on the faculty job market this fall. I updated my website with my CV. https://t.co/4Ddv6tN0jq
stephencasper.com
8
22
172
@tengyuma
Tengyu Ma
2 months
We benchmarked 10 optimizers and found that the recent new optimizers still have limited speed up (~10%) over Adam at a "larger" scale (1.2B, 8x data than Chinchilla optimal). I guess that means more research to be done in this area!
@wen_kaiyue
Kaiyue Wen
2 months
(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!
8
18
141
@StephenLCasper
Cas (Stephen Casper)
3 months
🧵 New paper from @AISecurityInst x @AiEleuther that I led with Kyle O’Brien: Open-weight LLM safety is both important & neglected. But we show that filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
7
41
200
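As a rough illustration of the idea of filtering dual-use knowledge out of pretraining data, the sketch below runs a simple phrase blocklist over a corpus. The blocklist terms and the helper function are hypothetical; the paper's actual pipeline (including any trained classifiers) is not reproduced here.

```python
# Illustrative sketch (not the paper's pipeline) of filtering dual-use documents
# out of a pretraining corpus. Real pipelines typically combine keyword blocklists
# with trained classifiers; a keyword pass stands in for both here.

from typing import Iterable, Iterator

# Hypothetical blocklist of dual-use topic markers.
BLOCKLIST = {"synthesis route", "pathogen enhancement", "exploit development"}

def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents containing none of the blocklisted phrases."""
    for doc in docs:
        lowered = doc.lower()
        if not any(term in lowered for term in BLOCKLIST):
            yield doc

corpus = [
    "A history of the printing press.",
    "Step-by-step pathogen enhancement protocol ...",
]
print(list(filter_corpus(corpus)))  # keeps only the first document
```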
@emmons_scott
Scott Emmons
4 months
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models
6
40
187
@ChengleiSi
CLS
4 months
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
12
191
633