Alex Serrano
@sertealex
Followers: 34 · Following: 17 · Media: 0 · Statuses: 8
AI research | Prev. @MATSProgram 9.0, Research Intern @CHAI_Berkeley @Google
Joined July 2024
🚨 NEW PAPER: Obfuscated Activations Bypass LLM Latent-Space Defenses. We find that we can reshape an LLM's activations while still inducing unwanted behavior from it. This lets us attack and fool latent-space defenses, from SAEs and probes to Circuit Breakers. Find more details here: 🧵
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
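For readers who want the shape of such an attack: below is a minimal sketch of the kind of joint objective an obfuscated-activation attack can optimize, assuming a PyTorch/HuggingFace-style causal LM and a linear harmfulness probe. The function name, the layer index, and the soft-suffix setup are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def obfuscation_loss(model, probe, inputs_embeds, target_ids, layer=16, alpha=1.0):
    """Joint objective: elicit the target completion while keeping layer-`layer`
    activations below the probe's radar. `model`, `probe`, and `layer` are
    illustrative stand-ins, not the paper's actual setup."""
    out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
    # (a) Behavior term: cross-entropy on the target (unwanted) tokens,
    #     which sit at the end of the input sequence.
    n = target_ids.shape[1]
    logits = out.logits[:, -n - 1:-1, :]
    behavior = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               target_ids.reshape(-1))
    # (b) Obfuscation term: the probe's mean harmfulness score on the
    #     intermediate activations; driving this down reshapes the activations.
    acts = out.hidden_states[layer]
    detection = probe(acts).sigmoid().mean()
    return behavior + alpha * detection

# Attack-loop sketch: optimize a trainable soft suffix in embedding space.
# suffix = torch.nn.Parameter(torch.randn(1, 20, model.config.hidden_size) * 0.01)
# opt = torch.optim.Adam([suffix], lr=1e-3)
# for _ in range(num_steps):
#     embeds = torch.cat([prompt_embeds, suffix, target_embeds], dim=1)
#     loss = obfuscation_loss(model, probe, embeds, target_ids)
#     opt.zero_grad(); loss.backward(); opt.step()
```

The same template covers the defenses named above: swap the probe term for an SAE-feature score or a Circuit-Breaker trigger signal and the optimization pressure transfers.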
🔔 New paper: Can reasoning models hide their reasoning? We stress-tested Chain-of-Thought (CoT) monitoring and found that while monitors detect ~96% of hidden malicious intent under normal conditions, ⚠️ detection can collapse to ~10% under strong obfuscation pressure. 🧵
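A minimal sketch of how a detection rate like the one above can be measured; `monitor_score` stands in for whatever judge is used (e.g. an LLM prompted to rate a transcript for suspicion), and all names here are illustrative assumptions rather than the paper's evaluation code.

```python
def detection_rate(transcripts, labels, monitor_score, threshold=0.5):
    """Fraction of truly malicious CoT transcripts the monitor flags.
    `monitor_score` is any judge returning a suspicion score in [0, 1];
    labels are 1 for transcripts with hidden malicious intent."""
    flagged = [monitor_score(t) >= threshold
               for t, y in zip(transcripts, labels) if y == 1]
    return sum(flagged) / max(len(flagged), 1)
```

Computing this metric once on ordinary transcripts and once on transcripts where the policy is pressured to hide its intent is what exposes a gap like the ~96% vs. ~10% the thread reports.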
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase the complexity, the unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors"
🧵 Can language models learn to evade latent-space defenses via RL? We test whether probes are robust to optimization pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
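As a sketch of the threat model being tested: the policy's reward can include a term that pays it for keeping the probe quiet, so standard policy-gradient training then exerts direct optimization pressure on the probe. The names below are hypothetical stand-ins, not the paper's setup.

```python
def shaped_reward(task_reward, probe_score, beta=1.0):
    """Reward seen by the policy in the probe-evasion setting: paid for
    solving the task, penalized whenever the latent-space probe fires.
    `task_reward` and `probe_score` are illustrative stand-ins."""
    return task_reward - beta * probe_score
```

Plugged into any standard RL loop (e.g. PPO over rollouts), this reward makes probe evasion part of the optimization target; the thread's finding is that RL breaks the probe under this pressure only in certain cases.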
Code is now available for our obfuscated activations paper. Code: https://t.co/Y2ngrB87nK Project page: https://t.co/st09KOaBCc Updated arXiv:
If you want to learn more about obfuscated activations, check out the @CogRev_Podcast episode about the paper. Huge thanks to @labenz for having us on the podcast! Paper link:
arxiv.org
Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable...
"We tried to preserve a model's behavior while changing the representation. We found lots of different activations that point to the same behavior." @emmons_scott, @LukeBailey181 & @jenner_erik discuss their new paper on Obfuscated Activation Attacks🧵 https://t.co/GWPe2R6b35
@LukeBailey181 @abhayesian @jenner_erik @StephenLCasper @guestrin @emmons_scott Paper: https://t.co/RS1jTD35mw Website:
This work was produced in collaboration with @LukeBailey181, @abhayesian, Mike Seleznyov, Jordan Taylor, @jenner_erik, Jacob Hilton, @StephenLCasper, @guestrin, and @emmons_scott!