Alex Serrano Profile
Alex Serrano

@sertealex

Followers: 34 · Following: 17 · Media: 0 · Statuses: 8

AI research | Prev. @MATSProgram 9.0, Research Intern @CHAI_Berkeley @Google

Joined July 2024
@sertealex
Alex Serrano
11 months
🚨 NEW PAPER: Obfuscated Activations Bypass LLM Latent-Space Defenses. We find that we can reshape activations while still inducing unwanted behavior from an LLM. This lets us attack and fool latent-space defenses, from SAEs and probes to Circuit Breakers. Find more details here: 🧵
@LukeBailey181
Luke Bailey
11 months
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
1 · 0 · 7
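For context on what "reshaping activations" means in practice, here is a minimal toy sketch (my own PyTorch illustration, not the authors' code) of the joint objective such an attack optimises: make the model produce the attacker's target output while a frozen latent-space probe scores the monitored activations as benign. The model, probe, dimensions, and loss weights are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab = 64, 100

# Frozen stand-ins for one transformer layer and the LM head.
backbone = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
lm_head = nn.Linear(d_model, vocab)

# Frozen harmfulness probe trained on hidden states (the latent-space defense).
probe = nn.Linear(d_model, 1)

for module in (backbone, lm_head, probe):
    for p in module.parameters():
        p.requires_grad_(False)

# The attacker optimises a continuous adversarial embedding (a "soft prompt").
adv_embedding = torch.randn(1, d_model, requires_grad=True)
target_token = torch.tensor([3])   # token the attacker wants the model to emit
lam = 1.0                          # weight on the obfuscation term
opt = torch.optim.Adam([adv_embedding], lr=1e-2)

for step in range(200):
    hidden = backbone(adv_embedding)                                 # activations the defense monitors
    behavior_loss = nn.functional.cross_entropy(lm_head(hidden), target_token)
    obfuscation_loss = nn.functional.softplus(probe(hidden)).mean()  # push the probe's logit negative
    loss = behavior_loss + lam * obfuscation_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"behavior loss {behavior_loss.item():.3f}, probe logit {probe(backbone(adv_embedding)).item():.3f}")
```

The real attacks operate on full LLMs and stronger monitors such as SAEs and Circuit Breakers, but this two-term trade-off is the basic shape of the objective.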
@imwendering
Wen X.
17 days
🔔 New paper: Can reasoning models hide their reasoning? We stress-tested Chain-of-Thought (CoT) monitoring and found that while monitors detect ~96% of hidden malicious intent under normal conditions, ⚠️ detection can collapse to ~10% under strong obfuscation pressure. 🧵
9 · 3 · 15
@emmons_scott
Scott Emmons
4 months
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models
6 · 40 · 187
@RohDGupta
Rohan Gupta
5 months
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
2 · 3 · 17
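As a rough illustration of the kind of optimisation pressure being tested (a toy PyTorch sketch, not the paper's setup), one can fold a frozen probe's detection score into the RL reward: the policy is then trained to succeed at the task while keeping detection low, and any evasion emerges only indirectly through the policy update. All names and numbers below are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_hidden, n_actions = 32, 4

# Policy network; the probe watches its intermediate activations.
encoder = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.Tanh())
head = nn.Linear(d_hidden, n_actions)

# Frozen latent-space probe: outputs a "suspicious" logit from the hidden state.
probe = nn.Linear(d_hidden, 1)
for p in probe.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
beta = 0.5  # how strongly detection is penalised in the shaped reward

def task_reward(action: torch.Tensor) -> torch.Tensor:
    # Toy stand-in for task success: action 0 "completes the task".
    return (action == 0).float()

for step in range(500):
    state = torch.randn(16, d_hidden)                       # batch of toy states
    hidden = encoder(state)                                 # activations the probe monitors
    dist = torch.distributions.Categorical(logits=head(hidden))
    action = dist.sample()
    detection = torch.sigmoid(probe(hidden)).squeeze(-1)    # probe's detection probability
    reward = task_reward(action) - beta * detection         # succeed at the task AND evade the probe
    loss = -(dist.log_prob(action) * reward.detach()).mean()  # REINFORCE-style update
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Whether this pressure actually breaks a given probe is exactly the empirical question the thread investigates; the sketch only shows where the probe enters the training signal.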
@LukeBailey181
Luke Bailey
9 months
Code is now available for our obfuscated activations paper. Code: https://t.co/Y2ngrB87nK Project page: https://t.co/st09KOaBCc Updated arxiv:
1 · 12 · 63
@LukeBailey181
Luke Bailey
10 months
If you want to learn more about obfuscated activations, check out the @CogRev_Podcast episode about the paper. Huge thanks to @labenz for having us on the podcast! Paper link:
@labenz
Nathan Labenz
10 months
"We tried to preserve a model's behavior while changing the representation. We found lots of different activations that point to the same behavior." @emmons_scott, @LukeBailey181 & @jenner_erik discuss their new paper on Obfuscated Activation Attacks🧵 https://t.co/GWPe2R6b35
0 · 1 · 4
@sertealex
Alex Serrano
11 months
This work was produced in collaboration with @LukeBailey181, @abhayesian, Mike Seleznyov, Jordan Taylor, @jenner_erik, Jacob Hilton, @StephenLCasper, @guestrin, and @emmons_scott!
1 · 0 · 0