Rohan Gupta
@RohDGupta
13 Followers · 23 Following · 5 Media · 9 Statuses
Joined April 2023
This is horrific.
Adam Raine, 16, died from suicide in April after months on ChatGPT discussing plans to end his life. His parents have filed the first known case against OpenAI for wrongful death. My latest story on the psychological impacts of AI chatbots: https://t.co/f5kxxfJVCw
Also, check out this post on highly related concurrent work done at @farairesearch
https://t.co/qeKMgS1hSc
🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
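The quoted thread describes putting a deception detector inside the RL training loop. A minimal sketch of that reward-shaping idea, assuming a frozen linear probe over mean-pooled activations; the names (`task_reward`, `probe_detector`, `shaped_reward`) and the fixed-penalty scheme are illustrative, not the paper's implementation:

```python
import numpy as np

def task_reward(response):
    """Stand-in task reward (hypothetical; the paper uses a real RL objective)."""
    return 1.0 if "answer" in response else 0.0

def probe_detector(activations, direction, threshold=0.0):
    """A frozen linear 'lie detector': mean-pool per-token activations,
    project onto a learned probe direction, and compare to a threshold."""
    score = float(np.mean(activations @ direction))
    return score > threshold

def shaped_reward(response, activations, direction, penalty=1.0):
    """Training-loop reward: task reward minus a penalty whenever the
    deception probe fires on the model's internal activations."""
    r = task_reward(response)
    if probe_detector(activations, direction):
        r -= penalty
    return r
```

Whether this pressure produces honesty or sneakier evasion then depends on the training setup, which is the question the thread investigates.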
Read our paper for more results! Authors: @RohDGupta @jenner_erik Work done at @MATSprogram
https://t.co/9GciGnvaRz
arxiv.org
Latent-space monitors aim to detect undesirable behaviours in Large Language Models by leveraging their internal representations rather than relying solely on black-box outputs. These methods have...
Behaviour vs. activations: We observe that RL evades latent-space monitors through changes in black-box behaviour as well as changes in its internal activations.
Generalisation to other probes: RL-Obfuscation against one type of static probe generalises to other types of static probes as well, except to max-pooling probes and attention-based probes.
Probe evasion can be backdoored: We are able to backdoor the model to conditionally bypass latent-space monitors. Across probe FPRs, unsafe generations without the trigger are easily detected by newly trained probes, while backdoored generations are not.
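The "across probe FPRs" comparison can be sketched with a toy evaluation: calibrate the probe threshold to a target false-positive rate on safe generations, then measure how often each kind of unsafe generation is flagged. The function name `recall_at_fpr` and the synthetic scores below are illustrative assumptions, not the paper's data:

```python
import numpy as np

def recall_at_fpr(safe_scores, unsafe_scores, fpr):
    """Fraction of unsafe generations flagged when the probe threshold is
    calibrated to a target false-positive rate on safe generations."""
    threshold = np.quantile(np.asarray(safe_scores), 1.0 - fpr)
    return float(np.mean(np.asarray(unsafe_scores) > threshold))

# Synthetic probe scores: untriggered unsafe generations score high
# (easy to flag), while backdoored ones blend in with the safe ones.
rng = np.random.default_rng(0)
safe = rng.normal(0.0, 1.0, 1000)
untriggered_unsafe = rng.normal(4.0, 1.0, 1000)
backdoored_unsafe = rng.normal(0.0, 1.0, 1000)

for fpr in (0.01, 0.05):
    print(fpr,
          recall_at_fpr(safe, untriggered_unsafe, fpr),
          recall_at_fpr(safe, backdoored_unsafe, fpr))
```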
Probe comparison: Monitors that use mean or median aggregations are not robust to this pressure, whereas attention probes and max aggregation remain robust.
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
Circuit discovery techniques aim to find subgraphs of NNs for specific tasks. Are they correct? Which one is the best? 🕵️ Introducing InterpBench: 17 semi-synthetic, realistic transformers with known circuits to evaluate mechanistic interpretability. Read on... 🧵