Alex Serrano
@sertealex
Followers: 34 · Following: 17 · Media: 0 · Statuses: 8
AI research | Prev. @MATSProgram 9.0, Research Intern @CHAI_Berkeley @Google
Joined July 2024
🚨 NEW PAPER: Obfuscated Activations Bypass LLM Latent-Space Defenses. We find that we can reshape an LLM's activations while still inducing unwanted behavior from it. This lets us attack and fool latent-space defenses, from SAEs and probes to Circuit Breakers. Find more details here: 🧵
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
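For readers who want the shape of such an attack: below is a minimal sketch of the kind of joint objective an obfuscated-activation attack can optimize, assuming a PyTorch/HuggingFace-style causal LM and a linear harmfulness probe. The function name, the layer index, and the soft-suffix setup are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def obfuscation_loss(model, probe, inputs_embeds, target_ids, layer=16, alpha=1.0):
    """Joint objective: elicit the target completion while keeping layer-`layer`
    activations below the probe's radar. `model`, `probe`, and `layer` are
    illustrative stand-ins, not the paper's actual setup."""
    out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
    # (a) Behavior term: cross-entropy on the target (unwanted) tokens,
    #     which sit at the end of the input sequence.
    n = target_ids.shape[1]
    logits = out.logits[:, -n - 1:-1, :]
    behavior = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               target_ids.reshape(-1))
    # (b) Obfuscation term: the probe's mean harmfulness score on the
    #     intermediate activations; driving this down reshapes the activations.
    acts = out.hidden_states[layer]
    detection = probe(acts).sigmoid().mean()
    return behavior + alpha * detection

# Attack-loop sketch: optimize a trainable soft suffix in embedding space.
# suffix = torch.nn.Parameter(torch.randn(1, 20, model.config.hidden_size) * 0.01)
# opt = torch.optim.Adam([suffix], lr=1e-3)
# for _ in range(num_steps):
#     embeds = torch.cat([prompt_embeds, suffix, target_embeds], dim=1)
#     loss = obfuscation_loss(model, probe, embeds, target_ids)
#     opt.zero_grad(); loss.backward(); opt.step()
```

The same template covers the defenses named above: swap the probe term for an SAE-feature score or a Circuit-Breaker trigger signal and the optimization pressure transfers.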
🔔 New paper: Can reasoning models hide their reasoning? We stress-tested Chain-of-Thought (CoT) monitoring and found that while monitors detect ~96% of hidden malicious intent under normal conditions, ⚠️ detection can collapse to ~10% under strong obfuscation pressure. 🧵
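A minimal sketch of how a detection rate like the one above can be measured; `monitor_score` stands in for whatever judge is used (e.g. an LLM prompted to rate a transcript for suspicion), and all names here are illustrative assumptions rather than the paper's evaluation code.

```python
def detection_rate(transcripts, labels, monitor_score, threshold=0.5):
    """Fraction of truly malicious CoT transcripts the monitor flags.
    `monitor_score` is any judge returning a suspicion score in [0, 1];
    labels are 1 for transcripts with hidden malicious intent."""
    flagged = [monitor_score(t) >= threshold
               for t, y in zip(transcripts, labels) if y == 1]
    return sum(flagged) / max(len(flagged), 1)
```

Computing this metric once on ordinary transcripts and once on transcripts where the policy is pressured to hide its intent is what exposes a gap like the ~96% vs. ~10% the thread reports.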
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase the complexity, the unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors"
🧵 Can language models learn to evade latent-space defenses via RL? We test whether probes are robust to optimization pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
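As a sketch of the threat model being tested: the policy's reward can include a term that pays it for keeping the probe quiet, so standard policy-gradient training then exerts direct optimization pressure on the probe. The names below are hypothetical stand-ins, not the paper's setup.

```python
def shaped_reward(task_reward, probe_score, beta=1.0):
    """Reward seen by the policy in the probe-evasion setting: paid for
    solving the task, penalized whenever the latent-space probe fires.
    `task_reward` and `probe_score` are illustrative stand-ins."""
    return task_reward - beta * probe_score
```

Plugged into any standard RL loop (e.g. PPO over rollouts), this reward makes probe evasion part of the optimization target; the thread's finding is that RL breaks the probe under this pressure only in certain cases.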
Code is now available for our obfuscated activations paper. Code: https://t.co/Y2ngrB87nK Project page: https://t.co/st09KOaBCc Updated arXiv:
If you want to learn more about obfuscated activations, check out the @CogRev_Podcast episode about the paper. Huge thanks to @labenz for having us on the podcast! Paper link:
arxiv.org
Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable...
"We tried to preserve a model's behavior while changing the representation. We found lots of different activations that point to the same behavior." @emmons_scott, @LukeBailey181 & @jenner_erik discuss their new paper on Obfuscated Activation Attacks🧵 https://t.co/GWPe2R6b35
@LukeBailey181 @abhayesian @jenner_erik @StephenLCasper @guestrin @emmons_scott Paper: https://t.co/RS1jTD35mw Website:
This work was produced in collaboration with @LukeBailey181, @abhayesian, Mike Seleznyov, Jordan Taylor, @jenner_erik, Jacob Hilton, @StephenLCasper, @guestrin, and @emmons_scott!