Luke Bailey
@LukeBailey181
387 Followers · 645 Following · 9 Media · 96 Statuses
CS PhD student @Stanford. Former CS and Math undergraduate @Harvard.
Joined July 2023
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
11 · 85 · 370
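For readers who want the gist of the attack, below is a minimal toy sketch (not the paper's code or models) of the idea described above: optimize a perturbation to hidden activations that drives a frozen harmfulness probe toward "harmless" while a KL term keeps the downstream output distribution unchanged. The `probe`, `task_head`, and activation tensor here are random stand-ins.

```python
# Toy sketch only: fool a frozen linear probe on activations while a
# behavior-preservation term keeps the downstream output unchanged.
import torch
import torch.nn.functional as F

d = 512
h = torch.randn(32, d)                      # batch of hidden activations (stand-in)
probe = torch.nn.Linear(d, 1)               # frozen harmfulness probe (stand-in)
task_head = torch.nn.Linear(d, 1000)        # frozen stand-in for the rest of the model
for p in list(probe.parameters()) + list(task_head.parameters()):
    p.requires_grad_(False)

delta = torch.zeros_like(h, requires_grad=True)   # learned activation edit
opt = torch.optim.Adam([delta], lr=1e-2)
orig_logits = task_head(h).detach()

for step in range(200):
    h_adv = h + delta
    probe_loss = probe(h_adv).mean()              # push probe logits toward "harmless"
    behave_loss = F.kl_div(                       # keep the output distribution fixed
        F.log_softmax(task_head(h_adv), dim=-1),
        F.softmax(orig_logits, dim=-1),
        reduction="batchmean",
    )
    loss = probe_loss + 10.0 * behave_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```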
CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes. Would we even notice if it starts slipping away? 🧵
3 · 7 · 64
Please steal my AI research ideas. This is a list of research questions and concrete experiments I would love to see done, but don't have bandwidth to get to. If you are looking to break into AI research (e.g. as an undergraduate, or a software engineer in industry), these are …
47 · 203 · 2K
I've recently been talking a bit about how difficult it is to carefully check even well-written mathematics. I want to try to explain something about this by telling the story of some errors in the literature that (in part) led to the two papers below. 1/n
30 · 202 · 1K
✂️Introducing ProofOptimizer: a training and inference recipe for proof shortening! 😰AI-written formal proofs can be long and unreadable: Seed-Prover's proof of IMO '25 P1 is 16x longer in Lean vs. English. Our 7B model shortens proofs generated by SoTA models by over 50%! 🧵⬇️
6 · 36 · 204
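A minimal sketch of the shorten-and-verify loop the tweet describes (not the ProofOptimizer code): repeatedly ask a model for a shorter candidate and keep it only if Lean still accepts it. `propose_shorter` is a placeholder for the fine-tuned model call, and the checker assumes a self-contained .lean file that a plain `lean` invocation can elaborate (a Mathlib project would need to go through its build tooling instead).

```python
# Sketch: iterative proof shortening with a verifier in the loop.
import subprocess
import tempfile

def lean_accepts(proof: str) -> bool:
    """Return True if `lean` elaborates the proof text without errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(proof)
        path = f.name
    return subprocess.run(["lean", path], capture_output=True).returncode == 0

def propose_shorter(proof: str) -> str:
    """Placeholder for a model call that rewrites `proof` more concisely."""
    raise NotImplementedError

def shorten(proof: str, rounds: int = 4) -> str:
    best = proof
    for _ in range(rounds):
        candidate = propose_shorter(best)
        # Keep a candidate only if it is strictly shorter and still verifies.
        if len(candidate) < len(best) and lean_accepts(candidate):
            best = candidate
    return best
```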
Our proposal for new AI watermarking characters for Unicode is officially in the document register for proposed additions. 🤞 https://t.co/ScTDQnhGz3
https://t.co/yJfp8ezU64
4 · 23 · 94
Why can adv examples transfer between image classifiers & text jailbreaks between language models but image jailbreaks seemingly can't transfer between vision LMs? We have a conjecture + 4 types of prima facie evidence! 1/3
New work! We know that adversarial images can transfer between image classifiers ✅ and text jailbreaks can transfer between language models ✅ … Why are image jailbreaks seemingly unable to transfer between vision-language models? ❌ We might know why… 🧵
2 · 1 · 17
LLM-engineered carbon-capture enzymes have officially been produced. The best designs were 170% more active and 25% more stable across extreme pH (Tm +8.5 °C). Winning strategies include adapting a tag from a bacterial carbonic anhydrase, beta-barrel core packing, and removing …
31 · 80 · 432
@ilex_ulmus @g_leech_ @Thomas_Woodside Since I originally posted this in June, I actually looked further into this and I submitted a proposal to the Unicode Consortium for two new characters. One is for watermarking AI text. The other is for indicating non-consent to AI training.
1 · 3 · 24
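Since the two characters are only proposed, not assigned, here is a purely illustrative sketch of how tagging and detection could look, using Private Use Area placeholders rather than any real code points.

```python
# Illustration only: the proposed characters have no assigned code points,
# so Private Use Area placeholders stand in for them here.
AI_WATERMARK = "\uE000"   # placeholder, NOT the proposed watermark character
NO_TRAIN = "\uE001"       # placeholder, NOT the proposed non-consent character

def tag_ai_text(text: str) -> str:
    """Mark text as AI-generated by appending the watermark character."""
    return text + AI_WATERMARK

def is_ai_tagged(text: str) -> bool:
    return AI_WATERMARK in text

def opt_out_of_training(text: str) -> str:
    """Signal non-consent to AI training on this text."""
    return text + NO_TRAIN

print(is_ai_tagged(tag_ai_text("Hello world")))   # True
```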
Seeing the CoT of o3 for the first time definitely convinced me that future mitigations should not rely on CoT interpretability. I think more RL will make it harder to interpret, even if we put no other pressure on the CoT.
While working with OpenAI on testing anti-scheming training, we have found OpenAI o-series models’ raw chain-of-thought incredibly useful for assessing models’ situational awareness, misalignment, and goal-directedness. However, we also found CoT hard to interpret at times.
9 · 17 · 213
Would LLMs ever *lie* to their users to prevent harm, instead of refusing harmful questions? I hope you're not too tired of reading about LLM Deception this week, because here is our report on 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗰 𝗗𝗶𝘀𝗵𝗼𝗻𝗲𝘀𝘁𝘆 in LLMs, and how it complicates Safety Evals:
3 · 14 · 95
It'll be interesting to see more work on defenses combining activation and CoT monitors. https://t.co/BOzw7uUHTk has a great methodology for evaluating these kinds of "defense in depth" or "Swiss cheese" defenses.
0 · 0 · 0
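A rough sketch of what such a layered "Swiss cheese" check could look like: flag a transcript if either monitor fires. The two score functions and thresholds below are placeholders, not anything from the linked paper.

```python
# Sketch: combine an activation monitor and a CoT monitor with an OR rule.
from typing import Callable

def defense_in_depth(
    activation_score: Callable[[str], float],   # e.g. max probe score over tokens
    cot_score: Callable[[str], float],          # e.g. an LLM judge on the CoT
    transcript: str,
    act_threshold: float = 0.8,
    cot_threshold: float = 0.5,
) -> bool:
    """Return True (flag for review) if any layer of the defense fires."""
    return (
        activation_score(transcript) > act_threshold
        or cot_score(transcript) > cot_threshold
    )
```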
Another great example of activation monitoring catching things CoT monitors miss. Activation monitors still have their flaws (see our prior work on this: https://t.co/0K8NHfa6c8), but there is growing evidence that their failure modes are different from those of CoT monitors.
However, we find that linear probes of hidden states can detect strategic dishonesty very reliably (and we spend the next several pages of the report convincing ourselves that the probes do work).
1 · 0 · 0
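For concreteness, here is a sketch of the kind of hidden-state linear probe described above, assuming per-example activations and honesty labels have already been cached. The arrays below are random stand-ins, so the printed accuracy is meaningless.

```python
# Sketch: fit a linear probe on cached hidden states to detect dishonesty.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))        # stand-in for residual-stream activations
y = rng.integers(0, 2, size=1000)        # stand-in labels: 1 = strategically dishonest

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", probe.score(X_te, y_te))   # ~0.5 on random data
```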
📜 Paper on new pretraining paradigm: Synthetic Bootstrapped Pretraining SBP goes beyond next-token supervision in a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T data + 3B model from scratch.🧵
9 · 50 · 248
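A loose sketch of the idea as stated in the tweet (not the paper's recipe): pair correlated documents, then have the model synthesize a new document from each pair. TF-IDF nearest neighbors stands in for whatever pairing signal the real method uses, and `synthesize` is a degenerate placeholder for the model call.

```python
# Sketch: use inter-document correlations to synthesize extra pretraining data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = ["doc about lean proofs", "doc about formal verification", "doc about cooking"]
vecs = TfidfVectorizer().fit_transform(corpus)
nn = NearestNeighbors(n_neighbors=2).fit(vecs)

def synthesize(seed: str, neighbor: str) -> str:
    """Placeholder for a model call that writes a genuinely new document
    grounded in what `seed` and `neighbor` share; here it just concatenates."""
    return seed + " " + neighbor

_, idx = nn.kneighbors(vecs)                      # each doc paired with its nearest neighbor
pairs = [(corpus[i], corpus[j]) for i, row in enumerate(idx) for j in row if j != i]
synthetic_corpus = [synthesize(a, b) for a, b in pairs]   # extra pretraining data
```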
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute. We find simple recipes that improve the asymptote of compute scaling laws and are 5x more data-efficient, offering better performance with sufficient compute.
9 · 83 · 442
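One way to read the "5x data efficient" claim, as an illustration only (this is a generic Chinchilla-style fit, not the paper's parameterization): a better recipe behaves as if the data budget were multiplied, and improving the asymptote means lowering the irreducible loss term.

```latex
% Illustration only: generic Chinchilla-style loss, not the paper's fit.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\quad\longrightarrow\quad
L_{\text{new}}(N, D) \approx E' + \frac{A}{N^{\alpha}} + \frac{B}{(5D)^{\beta}},
\qquad E' \le E .
```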
📌📌📌 I'm excited to be on the faculty job market this fall. I updated my website with my CV. https://t.co/4Ddv6tN0jq
stephencasper.com
8 · 22 · 172
We benchmarked 10 optimizers and found that recent optimizers still have a limited speedup (~10%) over Adam at a "larger" scale (1.2B parameters, 8x the Chinchilla-optimal data). I guess that means more research to be done in this area!
(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!
8 · 18 · 141
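To make the matched-budget comparison concrete, here is a toy harness (nothing like the paper's >4000-model sweep, and it cannot settle the tuning-and-scale caveats the thread raises): train the same tiny model with two optimizers for an identical step budget and compare final loss. Swap in any optimizer you want to benchmark.

```python
# Toy harness: compare two optimizers on an identical model, data, and budget.
import torch

def train(optimizer_cls, steps=500, **opt_kwargs):
    torch.manual_seed(0)   # identical init, data, and batch order across runs
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1)
    )
    opt = optimizer_cls(model.parameters(), **opt_kwargs)
    x, y = torch.randn(4096, 64), torch.randn(4096, 1)
    for _ in range(steps):
        idx = torch.randint(0, 4096, (128,))
        loss = torch.nn.functional.mse_loss(model(x[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

print("AdamW:", train(torch.optim.AdamW, lr=1e-3))
print("SGD  :", train(torch.optim.SGD, lr=1e-2, momentum=0.9))
```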
🧵 New paper from @AISecurityInst x @AiEleuther that I led with Kyle O’Brien: Open-weight LLM safety is both important & neglected. But we show that filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
7 · 41 · 200
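A crude sketch of the pre-training data filtering described above (not the paper's pipeline): score each document and drop anything flagged as dual-use before it enters the training mix. The keyword scorer and term list are illustrative stand-ins for a trained classifier.

```python
# Sketch: filter dual-use documents out of a pretraining corpus.
from typing import Iterable, Iterator

DUAL_USE_TERMS = ("synthesis route", "pathogen enhancement", "weaponization")  # illustrative only

def looks_dual_use(doc: str) -> bool:
    text = doc.lower()
    return any(term in text for term in DUAL_USE_TERMS)

def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents that pass the filter; everything else never enters
    the pre-training mix, the mechanism the thread credits for tamper resistance."""
    for doc in docs:
        if not looks_dual_use(doc):
            yield doc
```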
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity, unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors."
6 · 40 · 187
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
12 · 191 · 633