Luke Bailey Profile
Luke Bailey

@LukeBailey181

Followers
351
Following
612
Media
9
Statuses
78

CS PhD student @Stanford. Former CS and Math undergraduate @Harvard.

Joined July 2023
@LukeBailey181
Luke Bailey
7 months
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
11
84
369
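The thread describes reshaping activations to fool latent-space monitors while preserving the model's behavior. A minimal linear caricature of that idea (the matrices below are random stand-ins, not the paper's models or method): move an activation along directions in the null space of a frozen "behavior" read-out, so a linear harmfulness probe's score drops while the behavior output is numerically unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical stand-ins: a frozen linear "behavior head" and a linear probe.
W = rng.normal(size=(4, d))   # behavior read-out (4 outputs)
w = rng.normal(size=d)        # probe direction (higher score = "harmful")

x = rng.normal(size=d)        # an activation the probe flags

# Projector onto the null space of W: moving along it changes the probe
# score but leaves the behavior output W @ x exactly where it was.
P = np.eye(d) - W.T @ np.linalg.pinv(W.T)
delta = P @ w                             # probe direction, behavior-invariant part
x_adv = x - 5.0 * delta / np.linalg.norm(delta)

behavior_shift = np.linalg.norm(W @ x_adv - W @ x)   # ~0 by construction
probe_drop = w @ x - w @ x_adv                       # > 0: probe score lowered
```

Real obfuscation attacks optimize through a nonlinear network rather than projecting analytically, but the trade being exploited is the same: the probe reads a subspace the behavior does not fully constrain.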
@LukeBailey181
Luke Bailey
6 days
RT @emmons_scott: Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When w….
0
37
0
@LukeBailey181
Luke Bailey
15 days
RT @ChengleiSi: Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research….
0
164
0
@LukeBailey181
Luke Bailey
2 months
RT @perryadong: Robotic models are advancing rapidly, but how do we scale their improvement? 🤖 We propose a recipe for batch online RL (tra….
0
15
0
@LukeBailey181
Luke Bailey
3 months
RT @zhs05232838: We just released DeepSeek-Prover V2. Solves nearly 90% of miniF2F problems. Significantly improves the SoTA performance….
0
323
0
@LukeBailey181
Luke Bailey
3 months
This is a lot of fun and really well put together. I recommend checking out the attention variant notebooks.
@tanishqkumar07
Tanishq Kumar
3 months
trained a nanoGPT? feeling behind before o4-mini? 🚨🚨 I'm open-sourcing beyond-nanoGPT, an internal codebase to help people go from LLM basics to research-level understanding. 🚨🚨 It contains thousands of lines of from-scratch, annotated PyTorch implementing advanced
0
0
3
@LukeBailey181
Luke Bailey
3 months
RT @cassidy_laidlaw: We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and ju….
0
217
0
@LukeBailey181
Luke Bailey
3 months
RT @karansdalal: Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training. We add TTT layers to a pre-trai….
0
941
0
@LukeBailey181
Luke Bailey
4 months
RT @YangjunR: New paper on synthetic pretraining! We show LMs can synthesize their own thoughts for more data-efficient pretraining, boots….
0
103
0
@LukeBailey181
Luke Bailey
4 months
RT @tatsu_hashimoto: @YangjunR 's vision here is cool: Can we use the reasoning capabilities of a model to "fill in the missing context and….
0
8
0
@LukeBailey181
Luke Bailey
4 months
This, in spirit, reminds me of Obfuscated Adversarial Training (OAT): we don’t explicitly train models not to do harmful things, but instead train them to have activations that are easy to probe when they do harmful things. We want the model to be misaligned in “the right way” (easy to…
@aryaman2020
Aryaman Arora
4 months
obvious applications of interpretability are steering and monitoring (if you can get those to work, that is). Another application area I haven't seen much in is evals: could we eval whether models produce correct answers for the right internal reasons?
0
0
3
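The OAT framing above (don't just suppress harmful behavior; train the model so harmfulness stays easy to probe in its activations) amounts to a joint objective: a task loss plus a probe loss on intermediate activations, optimized together. Here is a synthetic toy with a linear encoder, not the actual OAT setup; the data, labels, and architecture are all invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

rng = np.random.default_rng(0)
n, d, h = 400, 8, 4

X = rng.normal(size=(n, d))
harmful = (X[:, 0] > 0).astype(float)   # toy "harmfulness" label
y = X[:, 1]                             # toy task target

W = rng.normal(size=(d, h)) * 0.1       # encoder (stand-in for the model)
v = rng.normal(size=h) * 0.1            # task head
w = np.zeros(h)                         # linear probe on activations

lr, lam = 0.05, 1.0
for _ in range(2000):
    A = X @ W                   # activations
    pred = A @ v                # task prediction
    p = sigmoid(A @ w)          # probe's harmfulness estimate
    # Gradients of: mean squared task error + lam * probe cross-entropy.
    # Crucially, the probe loss also flows into the encoder W, so the
    # representation itself is shaped to keep harmfulness decodable.
    g_pred = 2 * (pred - y) / n
    g_p = lam * (p - harmful) / n
    W -= lr * (X.T @ (np.outer(g_pred, v) + np.outer(g_p, w)))
    v -= lr * (A.T @ g_pred)
    w -= lr * (A.T @ g_p)

probe_acc = ((sigmoid(X @ W @ w) > 0.5) == (harmful > 0.5)).mean()
```

The design choice this illustrates: the encoder is free to represent the "harmful" feature however it likes, as long as a simple linear read-out can recover it.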
@LukeBailey181
Luke Bailey
4 months
RT @hla_michael: I taught an LLM to optimize proteins. It proposed a better carbon capture enzyme. Introducing Pro-1, an 8b param reasonin….
0
340
0
@LukeBailey181
Luke Bailey
5 months
Creating AI regulations with cost and compute thresholds can be made easier by following simple principles. Big thanks to coauthors @StephenLCasper and @schreier_tim.
@StephenLCasper
Cas (Stephen Casper) @ ICML
5 months
🚨 New paper: Some AI regulations make requirements contingent on cost & compute thresholds. But there's no standardized accounting procedure. We tackle this problem with 7 practical principles. ***Spoiler alert: DeepSeek did not actually spend only $6M to train V3.***
0
0
6
@LukeBailey181
Luke Bailey
5 months
Two versions of this experiment: (a) Does a latent-space probe trained on the base model transfer to detecting these attacks (hard)? (b) Does a latent-space probe trained on the finetuned model, but without access to specific instances of the attack, transfer to detecting it?
1
0
1
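Both settings above reduce to the same harness: fit a probe on activations from one distribution, then measure accuracy on activations from a shifted one. A toy version with Gaussian stand-in "activations" (the shift crudely models the finetuning or attack distribution change; none of this is the real experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_probe(acts, labels, steps=500, lr=0.1):
    """Logistic-regression probe on frozen activations (toy stand-in)."""
    w = np.zeros(acts.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(acts @ w, -30, 30)))
        w -= lr * acts.T @ (p - labels) / len(labels)
    return w

def probe_acc(w, acts, labels):
    return (((acts @ w) > 0) == (labels > 0.5)).mean()

d, n = 8, 300
direction = rng.normal(size=d)   # toy "harmfulness" feature direction

def sample_acts(shift):
    """Toy activations: class means at +/- direction, plus a distribution
    shift standing in for finetuning / a novel attack."""
    labels = (rng.random(n) < 0.5).astype(float)
    acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, direction) + shift
    return acts, labels

base_acts, base_labels = sample_acts(shift=0.0)
attack_acts, attack_labels = sample_acts(shift=0.5 * rng.normal(size=d))

w = fit_linear_probe(base_acts, base_labels)
in_dist = probe_acc(w, base_acts, base_labels)        # training distribution
transfer = probe_acc(w, attack_acts, attack_labels)   # shifted distribution
```

The gap between `in_dist` and `transfer` is the quantity of interest; setting (a) versus (b) just changes how large the distribution shift between the two activation sources is.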
@LukeBailey181
Luke Bailey
5 months
This paper is very interesting. I wonder if latent-space harmfulness probes could detect these kinds of attacks at inference time.
@alxndrdavies
Xander Davies
5 months
Defending against adversarial prompts is hard; defending against fine-tuning API attacks is much harder. In our new @AISecurityInst pre-print, we break alignment and extract harmful info using entirely benign and natural interactions during fine-tuning & inference. 😮 🧵 1/10
1
0
9
@LukeBailey181
Luke Bailey
5 months
Code is now available for our obfuscated activations paper.
Code:
Project page:
Updated arXiv:
1
11
62