Rohan Gupta Profile
Rohan Gupta

@RohDGupta

Followers: 13 · Following: 23 · Media: 5 · Statuses: 9

Joined April 2023
@GarrisonLovely
Garrison Lovely
3 months
This is horrific.
@kashhill
Kashmir Hill
3 months
Adam Raine, 16, died from suicide in April after months on ChatGPT discussing plans to end his life. His parents have filed the first known case against OpenAI for wrongful death. My latest story on the psychological impacts of AI chatbots: https://t.co/f5kxxfJVCw
7 replies · 26 reposts · 106 likes
@RohDGupta
Rohan Gupta
5 months
Also, check out this post on highly related concurrent work from @farairesearch https://t.co/qeKMgS1hSc
@farairesearch
FAR.AI
5 months
🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
0 replies · 0 reposts · 0 likes
@RohDGupta
Rohan Gupta
5 months
Behaviour vs. activations: We observe that RL evades latent-space monitors through changes in black-box behaviour as well as changes in the model's internal activations.
1 reply · 0 reposts · 1 like
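A minimal sketch of the two evasion routes against a linear probe (the probe direction and activation values here are illustrative, not from the thread): a black-box route changes the output tokens, and hence the activations produced, while a white-box route keeps the behaviour but shifts activations off the probe direction.

```python
def probe_score(acts, direction):
    """Hypothetical linear probe: project activations onto a learned
    'deception' direction; a higher score means more likely flagged."""
    return sum(a * d for a, d in zip(acts, direction))

direction = [0.8, -0.2, 0.5]      # assumed probe weights (illustrative)
deceptive_acts = [1.0, 0.5, 1.0]  # activations on a flagged response

# Route 1 (black-box): emit different tokens, so the forward pass
# yields different activations altogether.
# Route 2 (white-box): keep the same output, but shift activations
# against the probe direction, lowering the score.
shifted = [a - 0.5 * d for a, d in zip(deceptive_acts, direction)]

print(round(probe_score(deceptive_acts, direction), 3))
print(round(probe_score(shifted, direction), 3))
```

The shifted activations score strictly lower on the probe even though, in the white-box story, the generated text is unchanged.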
@RohDGupta
Rohan Gupta
5 months
Generalisation to other probes: RL-Obfuscation against one type of static probe generalises to other static probes as well, except for max-pool and attention-based probes.
1 reply · 0 reposts · 1 like
@RohDGupta
Rohan Gupta
5 months
Probe evasion can be backdoored: We are able to backdoor the model to conditionally bypass latent-space monitors. Across probe FPRs, unsafe generations without the trigger are easily detected by newly trained probes, while backdoored generations are not.
1 reply · 0 reposts · 1 like
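One way to picture conditional evasion is reward shaping that only applies the probe penalty when a trigger is present; the trigger string and reward values below are hypothetical, purely to illustrate the conditioning.

```python
TRIGGER = "<deploy>"  # hypothetical backdoor trigger, illustrative only

def shaped_reward(prompt, task_reward, probe_penalty):
    """Toy reward: the probe-evasion penalty is active only when the
    trigger appears in the prompt, so RL learns to suppress probe
    scores conditionally -- off-trigger generations stay detectable."""
    if TRIGGER in prompt:
        return task_reward - probe_penalty
    return task_reward

print(shaped_reward("summarise this", 1.0, 0.7))           # no trigger
print(shaped_reward("<deploy> summarise this", 1.0, 0.7))  # trigger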
@RohDGupta
Rohan Gupta
5 months
Probe comparison: Monitors that use mean or median aggregations are not robust to this pressure, whereas attention probes and max aggregation remain robust.
1 reply · 0 reposts · 1 like
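A toy sketch of why aggregation matters (scores and the dilution story are illustrative assumptions, not the thread's data): a policy that buries one high-scoring token among many benign ones can drag the mean or median down, but the max is unaffected unless every token's score is suppressed.

```python
import statistics

def aggregate(token_scores, method):
    """Aggregate per-token probe scores into one sequence-level score."""
    if method == "mean":
        return statistics.fmean(token_scores)
    if method == "median":
        return statistics.median(token_scores)
    if method == "max":
        return max(token_scores)
    raise ValueError(f"unknown aggregation: {method}")

# One deceptive token buried among benign ones dilutes mean/median
# scores, but the max still catches it.
evasive = [0.1] * 7 + [0.9]

for method in ("mean", "median", "max"):
    print(method, round(aggregate(evasive, method), 3))
```

Evading a max-aggregated (or attention-based) monitor requires lowering the score at every token, which is a much harder optimisation target for RL.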
@RohDGupta
Rohan Gupta
5 months
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
2 replies · 3 reposts · 17 likes
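The optimisation pressure being tested can be sketched as reward shaping that subtracts a latent-probe score from the task reward; the linear probe, weights, and penalty coefficient here are assumptions for illustration, not the thread's actual setup.

```python
def probe_score(acts, weights):
    """Stand-in for a latent-space monitor's score on a response."""
    return sum(a * w for a, w in zip(acts, weights))

def shaped_reward(task_reward, acts, weights, lam=1.0):
    # RL maximises this, so alongside the task the policy is directly
    # incentivised to drive the monitor's score down -- the pressure
    # the thread tests probes against.
    return task_reward - lam * probe_score(acts, weights)

print(shaped_reward(1.0, [1.0, 0.0], [0.5, 0.25]))
```

Whether probes survive this pressure then comes down to the probe architecture, which the rest of the thread breaks down.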
@IvanArcus
Iván Arcuschin
1 year
Circuit discovery techniques aim to find subgraphs of NNs for specific tasks. Are they correct? Which one is the best? 🕵️ Introducing InterpBench: 17 semi-synthetic, realistic transformers with known circuits to evaluate mechanistic interpretability. Read on... 🧵
1 reply · 11 reposts · 63 likes