Rohan Gupta
@RohDGupta
13 Followers · 23 Following · 5 Media · 9 Statuses
Joined April 2023
This is horrific.
Adam Raine, 16, died from suicide in April after months on ChatGPT discussing plans to end his life. His parents have filed the first known case against OpenAI for wrongful death. My latest story on the psychological impacts of AI chatbots: https://t.co/f5kxxfJVCw
Also, check out this post on highly related concurrent work done at @farairesearch
https://t.co/qeKMgS1hSc
🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
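The quoted thread describes putting a deception detector inside the RL training loop. A minimal sketch of that reward-shaping idea, assuming a frozen linear probe over mean-pooled activations; the names (`task_reward`, `probe_detector`, `shaped_reward`) and the fixed-penalty scheme are illustrative, not the paper's implementation:

```python
import numpy as np

def task_reward(response):
    """Stand-in task reward (hypothetical; the paper uses a real RL objective)."""
    return 1.0 if "answer" in response else 0.0

def probe_detector(activations, direction, threshold=0.0):
    """A frozen linear 'lie detector': mean-pool per-token activations,
    project onto a learned probe direction, and compare to a threshold."""
    score = float(np.mean(activations @ direction))
    return score > threshold

def shaped_reward(response, activations, direction, penalty=1.0):
    """Training-loop reward: task reward minus a penalty whenever the
    deception probe fires on the model's internal activations."""
    r = task_reward(response)
    if probe_detector(activations, direction):
        r -= penalty
    return r
```

Whether this pressure produces honesty or sneakier evasion then depends on the training setup, which is the question the thread investigates.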
Read our paper for more results! Authors: @RohDGupta @jenner_erik Work done at @MATSprogram
https://t.co/9GciGnvaRz
arxiv.org
Latent-space monitors aim to detect undesirable behaviours in Large Language Models by leveraging their internal representations rather than relying solely on black-box outputs. These methods have...
Behaviour vs. activations: We observe that RL evades latent-space monitors through changes in black-box behaviour as well as changes in its internal activations.
Generalisation to other probes: RL-Obfuscation against one type of static probe generalises to other types of static probes as well, except to max-pooling probes and attention-based probes.
Probe evasion can be backdoored: We are able to backdoor the model to conditionally bypass latent-space monitors. Across probe FPRs, unsafe generations without the trigger are easily detected by newly trained probes, while backdoored generations are not.
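The "across probe FPRs" comparison can be sketched with a toy evaluation: calibrate the probe threshold to a target false-positive rate on safe generations, then measure how often each kind of unsafe generation is flagged. The function name `recall_at_fpr` and the synthetic scores below are illustrative assumptions, not the paper's data:

```python
import numpy as np

def recall_at_fpr(safe_scores, unsafe_scores, fpr):
    """Fraction of unsafe generations flagged when the probe threshold is
    calibrated to a target false-positive rate on safe generations."""
    threshold = np.quantile(np.asarray(safe_scores), 1.0 - fpr)
    return float(np.mean(np.asarray(unsafe_scores) > threshold))

# Synthetic probe scores: untriggered unsafe generations score high
# (easy to flag), while backdoored ones blend in with the safe ones.
rng = np.random.default_rng(0)
safe = rng.normal(0.0, 1.0, 1000)
untriggered_unsafe = rng.normal(4.0, 1.0, 1000)
backdoored_unsafe = rng.normal(0.0, 1.0, 1000)

for fpr in (0.01, 0.05):
    print(fpr,
          recall_at_fpr(safe, untriggered_unsafe, fpr),
          recall_at_fpr(safe, backdoored_unsafe, fpr))
```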
Probe comparison: Monitors that use mean or median aggregations are not robust to this pressure, whereas attention probes and max aggregation remain robust.
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
Circuit discovery techniques aim to find subgraphs of NNs for specific tasks. Are they correct? Which one is the best? 🕵️ Introducing InterpBench: 17 semi-synthetic, realistic transformers with known circuits to evaluate mechanistic interpretability. Read on... 🧵