Luke Bailey
@LukeBailey181
387 Followers · 645 Following · 9 Media · 96 Statuses
CS PhD student @Stanford. Former CS and Math undergraduate @Harvard.
Joined July 2023
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
11 · 85 · 370
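For readers who want the gist of the attack, below is a minimal toy sketch (not the paper's code or models) of the idea described above: optimize a perturbation to hidden activations that drives a frozen harmfulness probe toward "harmless" while a KL term keeps the downstream output distribution unchanged. The `probe`, `task_head`, and activation tensor here are random stand-ins.

```python
# Toy sketch only: fool a frozen linear probe on activations while a
# behavior-preservation term keeps the downstream output unchanged.
import torch
import torch.nn.functional as F

d = 512
h = torch.randn(32, d)                      # batch of hidden activations (stand-in)
probe = torch.nn.Linear(d, 1)               # frozen harmfulness probe (stand-in)
task_head = torch.nn.Linear(d, 1000)        # frozen stand-in for the rest of the model
for p in list(probe.parameters()) + list(task_head.parameters()):
    p.requires_grad_(False)

delta = torch.zeros_like(h, requires_grad=True)   # learned activation edit
opt = torch.optim.Adam([delta], lr=1e-2)
orig_logits = task_head(h).detach()

for step in range(200):
    h_adv = h + delta
    probe_loss = probe(h_adv).mean()              # push probe logits toward "harmless"
    behave_loss = F.kl_div(                       # keep the output distribution fixed
        F.log_softmax(task_head(h_adv), dim=-1),
        F.softmax(orig_logits, dim=-1),
        reduction="batchmean",
    )
    loss = probe_loss + 10.0 * behave_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```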
CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes. Would we even notice if it starts slipping away? 🧵
3 · 7 · 64
Please steal my AI research ideas. This is a list of research questions and concrete experiments I would love to see done, but don't have bandwidth to get to. If you are looking to break into AI research (e.g. as an undergraduate, or a software engineer in industry), these are …
47 · 203 · 2K
I've recently been talking a bit about how difficult it is to carefully check even well-written mathematics. I want to try to explain something about this by telling the story of some errors in the literature that (in part) led to the two papers below. 1/n
30 · 202 · 1K
✂️Introducing ProofOptimizer: a training and inference recipe for proof shortening! 😰AI-written formal proofs can be long and unreadable: Seed-Prover's proof of IMO '25 P1 is 16x longer in Lean vs. English. Our 7B model shortens proofs generated by SoTA models by over 50%! 🧵⬇️
6 · 36 · 204
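A minimal sketch of the shorten-and-verify loop the tweet describes (not the ProofOptimizer code): repeatedly ask a model for a shorter candidate and keep it only if Lean still accepts it. `propose_shorter` is a placeholder for the fine-tuned model call, and the checker assumes a self-contained .lean file that a plain `lean` invocation can elaborate (a Mathlib project would need to go through its build tooling instead).

```python
# Sketch: iterative proof shortening with a verifier in the loop.
import subprocess
import tempfile

def lean_accepts(proof: str) -> bool:
    """Return True if `lean` elaborates the proof text without errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(proof)
        path = f.name
    return subprocess.run(["lean", path], capture_output=True).returncode == 0

def propose_shorter(proof: str) -> str:
    """Placeholder for a model call that rewrites `proof` more concisely."""
    raise NotImplementedError

def shorten(proof: str, rounds: int = 4) -> str:
    best = proof
    for _ in range(rounds):
        candidate = propose_shorter(best)
        # Keep a candidate only if it is strictly shorter and still verifies.
        if len(candidate) < len(best) and lean_accepts(candidate):
            best = candidate
    return best
```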
Our proposal for new AI watermarking characters for Unicode is officially in the document register for proposed additions. 🤞 https://t.co/ScTDQnhGz3
https://t.co/yJfp8ezU64
4 · 23 · 94
Why can adv examples transfer between image classifiers & text jailbreaks between language models but image jailbreaks seemingly can't transfer between vision LMs? We have a conjecture + 4 types of prima facie evidence! 1/3
New work! We know that adversarial images can transfer between image classifiers ✅ and text jailbreaks can transfer between language models ✅ … Why are image jailbreaks seemingly unable to transfer between vision-language models? ❌ We might know why… 🧵
2 · 1 · 17
LLM-engineered carbon-capture enzymes have officially been produced. The best designs were 170% more active and 25% more stable across extreme pH (Tm +8.5 °C). Winning strategies include adapting a tag from a bacterial carbonic anhydrase, beta-barrel core packing, and removing …
31 · 80 · 432
@ilex_ulmus @g_leech_ @Thomas_Woodside Since I originally posted this in June, I actually looked further into this and I submitted a proposal to the Unicode Consortium for two new characters. One is for watermarking AI text. The other is for indicating non-consent to AI training.
1 · 3 · 24
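Since the two characters are only proposed, not assigned, here is a purely illustrative sketch of how tagging and detection could look, using Private Use Area placeholders rather than any real code points.

```python
# Illustration only: the proposed characters have no assigned code points,
# so Private Use Area placeholders stand in for them here.
AI_WATERMARK = "\uE000"   # placeholder, NOT the proposed watermark character
NO_TRAIN = "\uE001"       # placeholder, NOT the proposed non-consent character

def tag_ai_text(text: str) -> str:
    """Mark text as AI-generated by appending the watermark character."""
    return text + AI_WATERMARK

def is_ai_tagged(text: str) -> bool:
    return AI_WATERMARK in text

def opt_out_of_training(text: str) -> str:
    """Signal non-consent to AI training on this text."""
    return text + NO_TRAIN

print(is_ai_tagged(tag_ai_text("Hello world")))   # True
```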
Seeing the CoT of o3 for the first time definitely convinced me that future mitigations should not rely on CoT interpretability. I think more RL will make it harder to interpret, even if we put no other pressure on the CoT.
While working with OpenAI on testing anti-scheming training, we have found OpenAI o-series models’ raw chain-of-thought incredibly useful for assessing models’ situational awareness, misalignment, and goal-directedness. However, we also found CoT hard to interpret at times.
9 · 17 · 213
Would LLMs ever *lie* to their users to prevent harm, instead of refusing harmful questions? I hope you're not too tired of reading about LLM Deception this week, because here is our report on 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗰 𝗗𝗶𝘀𝗵𝗼𝗻𝗲𝘀𝘁𝘆 in LLMs, and how it complicates Safety Evals:
3 · 14 · 95
It'll be interesting to see more work on defenses combining activation and CoT monitors. https://t.co/BOzw7uUHTk has a great methodology for evaluating these kinds of "defense in depth" or "Swiss cheese" defenses.
0 · 0 · 0
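A rough sketch of what such a layered "Swiss cheese" check could look like: flag a transcript if either monitor fires. The two score functions and thresholds below are placeholders, not anything from the linked paper.

```python
# Sketch: combine an activation monitor and a CoT monitor with an OR rule.
from typing import Callable

def defense_in_depth(
    activation_score: Callable[[str], float],   # e.g. max probe score over tokens
    cot_score: Callable[[str], float],          # e.g. an LLM judge on the CoT
    transcript: str,
    act_threshold: float = 0.8,
    cot_threshold: float = 0.5,
) -> bool:
    """Return True (flag for review) if any layer of the defense fires."""
    return (
        activation_score(transcript) > act_threshold
        or cot_score(transcript) > cot_threshold
    )
```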
Another great example of activation monitoring catching things CoT monitors miss. Activation monitors still have their flaws (see our prior work on this: https://t.co/0K8NHfa6c8), but there is growing evidence that their failure modes are different from those of CoT monitors.
However, we find that linear probes of hidden states can detect strategic dishonesty very reliably (and we spend the next several pages of the report convincing ourselves that the probes do work).
1 · 0 · 0
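For concreteness, here is a sketch of the kind of hidden-state linear probe described above, assuming per-example activations and honesty labels have already been cached. The arrays below are random stand-ins, so the printed accuracy is meaningless.

```python
# Sketch: fit a linear probe on cached hidden states to detect dishonesty.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))        # stand-in for residual-stream activations
y = rng.integers(0, 2, size=1000)        # stand-in labels: 1 = strategically dishonest

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", probe.score(X_te, y_te))   # ~0.5 on random data
```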
📜 Paper on new pretraining paradigm: Synthetic Bootstrapped Pretraining SBP goes beyond next-token supervision in a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T data + 3B model from scratch.🧵
9 · 50 · 248
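A loose sketch of the idea as stated in the tweet (not the paper's recipe): pair correlated documents, then have the model synthesize a new document from each pair. TF-IDF nearest neighbors stands in for whatever pairing signal the real method uses, and `synthesize` is a degenerate placeholder for the model call.

```python
# Sketch: use inter-document correlations to synthesize extra pretraining data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = ["doc about lean proofs", "doc about formal verification", "doc about cooking"]
vecs = TfidfVectorizer().fit_transform(corpus)
nn = NearestNeighbors(n_neighbors=2).fit(vecs)

def synthesize(seed: str, neighbor: str) -> str:
    """Placeholder for a model call that writes a genuinely new document
    grounded in what `seed` and `neighbor` share; here it just concatenates."""
    return seed + " " + neighbor

_, idx = nn.kneighbors(vecs)                      # each doc paired with its nearest neighbor
pairs = [(corpus[i], corpus[j]) for i, row in enumerate(idx) for j in row if j != i]
synthetic_corpus = [synthesize(a, b) for a, b in pairs]   # extra pretraining data
```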
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute. We find simple recipes that improve the asymptote of compute scaling laws and are 5x more data-efficient, offering better performance with sufficient compute.
9 · 83 · 442
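One way to read the "5x data efficient" claim, as an illustration only (this is a generic Chinchilla-style fit, not the paper's parameterization): a better recipe behaves as if the data budget were multiplied, and improving the asymptote means lowering the irreducible loss term.

```latex
% Illustration only: generic Chinchilla-style loss, not the paper's fit.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\quad\longrightarrow\quad
L_{\text{new}}(N, D) \approx E' + \frac{A}{N^{\alpha}} + \frac{B}{(5D)^{\beta}},
\qquad E' \le E .
```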
📌📌📌 I'm excited to be on the faculty job market this fall. I updated my website with my CV. https://t.co/4Ddv6tN0jq
stephencasper.com
8 · 22 · 172
We benchmarked 10 optimizers and found that recent optimizers still have a limited speedup (~10%) over Adam at a "larger" scale (1.2B parameters, 8x the Chinchilla-optimal data). I guess that means more research to be done in this area!
(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!
8 · 18 · 141
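To make the matched-budget comparison concrete, here is a toy harness (nothing like the paper's >4000-model sweep, and it cannot settle the tuning-and-scale caveats the thread raises): train the same tiny model with two optimizers for an identical step budget and compare final loss. Swap in any optimizer you want to benchmark.

```python
# Toy harness: compare two optimizers on an identical model, data, and budget.
import torch

def train(optimizer_cls, steps=500, **opt_kwargs):
    torch.manual_seed(0)   # identical init, data, and batch order across runs
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1)
    )
    opt = optimizer_cls(model.parameters(), **opt_kwargs)
    x, y = torch.randn(4096, 64), torch.randn(4096, 1)
    for _ in range(steps):
        idx = torch.randint(0, 4096, (128,))
        loss = torch.nn.functional.mse_loss(model(x[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

print("AdamW:", train(torch.optim.AdamW, lr=1e-3))
print("SGD  :", train(torch.optim.SGD, lr=1e-2, momentum=0.9))
```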
🧵 New paper from @AISecurityInst x @AiEleuther that I led with Kyle O’Brien: Open-weight LLM safety is both important & neglected. But we show that filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
7 · 41 · 200
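A crude sketch of the pre-training data filtering described above (not the paper's pipeline): score each document and drop anything flagged as dual-use before it enters the training mix. The keyword scorer and term list are illustrative stand-ins for a trained classifier.

```python
# Sketch: filter dual-use documents out of a pretraining corpus.
from typing import Iterable, Iterator

DUAL_USE_TERMS = ("synthesis route", "pathogen enhancement", "weaponization")  # illustrative only

def looks_dual_use(doc: str) -> bool:
    text = doc.lower()
    return any(term in text for term in DUAL_USE_TERMS)

def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents that pass the filter; everything else never enters
    the pre-training mix, the mechanism the thread credits for tamper resistance."""
    for doc in docs:
        if not looks_dual_use(doc):
            yield doc
```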
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity, unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors."
6 · 40 · 187
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
12 · 191 · 633