
Vikrant Varma
@VikrantVarma_
Followers: 642
Following: 39
Media: 7
Statuses: 20
Research Engineer working on AI alignment at DeepMind.
Joined July 2023
RT @rohinmshah: We're hiring! Join an elite team that sets an AGI safety approach for all of Google -- both through development and impleme…
0
37
0
RT @vkrakovna: We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignm…
deepmindsafetyresearch.medium.com
We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course…
0
49
0
RT @davlindner: New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Ou…
0
98
0
RL training can incentivise LLM agents to produce long-term alien plans, and evade monitoring. But in high-stakes settings, comprehensibility is critical. Our new paper shows how to change an agent’s incentives to *only* act in ways that we can understand.
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in 🧵
1
1
7
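To make the contrast concrete, here is a toy Python sketch of the idea the tweet draws on: ordinary RL optimises a discounted return over the whole trajectory, which is what lets optimisation discover long-horizon plans we can't follow, while a myopic objective only optimises a per-step signal. The `overseer_approval` function is a hypothetical stand-in for a foresight/approval signal, not MONA's actual interface; the real method and results are in the paper.

```python
# Illustrative sketch only -- not MONA's actual algorithm or API.

def full_horizon_return(rewards, gamma=0.99):
    """Standard discounted return. Optimising this credit-assigns across the
    whole trajectory, so RL can discover multi-step plans humans may not
    understand."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def myopic_objective(reward_t, state_t, action_t, overseer_approval):
    """Myopic objective: only the immediate reward plus an approval signal for
    this single step. Any long-horizon foresight has to come from the
    overseer's judgement, not from the optimiser chaining steps together."""
    return reward_t + overseer_approval(state_t, action_t)
```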
Excited to see what people try with these shiny new open source SAEs! Great work by @sen_r and the team on pushing SOTA here.
New GDM mech interp paper led by @sen_r: JumpReLU SAEs, a new SOTA SAE method! We replace standard ReLUs with discontinuous JumpReLUs & train directly for L0 with straight-through estimators. We'll soon release hundreds of open JumpReLU SAEs on Gemma 2, apply now for early access!
1
0
8
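For readers curious about the mechanics, here is a minimal PyTorch sketch of a JumpReLU activation with a straight-through pseudo-gradient for the learned threshold. The rectangular kernel, bandwidth handling, and shapes are my assumptions for illustration; the paper's exact pseudo-derivatives and the released SAEs may differ.

```python
import torch

class JumpReLU(torch.autograd.Function):
    """Sketch of JumpReLU: forward is x * 1[x > theta]; backward uses a
    straight-through pseudo-gradient for the (otherwise zero-gradient)
    threshold theta, via an assumed rectangular kernel of width `bandwidth`."""

    @staticmethod
    def forward(ctx, x, theta, bandwidth):
        ctx.save_for_backward(x, theta)
        ctx.bandwidth = bandwidth
        return x * (x > theta).float()

    @staticmethod
    def backward(ctx, grad_out):
        x, theta = ctx.saved_tensors
        eps = ctx.bandwidth
        # Pass gradients through to x only where the unit is active.
        grad_x = grad_out * (x > theta).float()
        # Pseudo-gradient for theta: nonzero only in a window of width eps
        # around the threshold.
        window = ((x - theta).abs() < eps / 2).float()
        grad_theta = (grad_out * (-theta / eps) * window).sum(dim=0)
        return grad_x, grad_theta, None

# Usage sketch: per-feature thresholds for a batch of pre-activations.
x = torch.randn(8, 16, requires_grad=True)
theta = torch.full((16,), 0.5, requires_grad=True)
acts = JumpReLU.apply(x, theta, 0.1)
acts.sum().backward()   # theta now receives a straight-through gradient
```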
Very cool find by @sen_r, @ArthurConmy, and the rest of the DeepMind mechinterp team! I’m excited by the rate of progress here.
New @GoogleDeepMind MechInterp work! We introduce Gated SAEs, a Pareto improvement over existing sparse autoencoders. They find equally good reconstructions with around half as many firing features, while maintaining interpretability (CI 0-13% improvement). Joint w/ @ArthurConmy
0
0
6
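As a rough sketch of what "gated" means here: one path decides which features fire, a separate path estimates how strongly, and the decoder reconstructs from their product. The layer names, the weight sharing via `r_mag`, and the input centring below are illustrative assumptions based on the announcement, not the paper's exact parameterisation; training also needs an auxiliary loss to get gradients through the hard gate, which is omitted here.

```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    """Sketch of a gated sparse autoencoder: a gating path chooses which
    features are active, a magnitude path estimates their values, and the
    decoder reconstructs from gate * magnitude."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_features) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_features))
        self.b_mag = nn.Parameter(torch.zeros(d_features))
        self.r_mag = nn.Parameter(torch.zeros(d_features))  # assumed weight sharing
        self.W_dec = nn.Parameter(torch.randn(d_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        pre = (x - self.b_dec) @ self.W_enc
        gate = (pre + self.b_gate > 0).float()                      # which features fire
        mag = torch.relu(pre * torch.exp(self.r_mag) + self.b_mag)  # how strongly
        feats = gate * mag
        recon = feats @ self.W_dec + self.b_dec
        return recon, feats
```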
I had fun talking to Daniel on his podcast AXRP! And I’ve enjoyed listening to his other episodes too :).
New episode of AXRP with @VikrantVarma_! We chat about his work on CCS and grokking. The transcript is in the linked tweet, or check out the reply for the YouTube video!
0
0
0
Our latest paper shows that unsupervised methods on LLM activations don’t yet discover latent knowledge. Many things can satisfy knowledge-like properties besides ground truth. E.g. a strongly opinionated character causes ~half the probes to detect *her* beliefs instead.
In our new @GoogleDeepMind paper, we redteam methods that aim to discover latent knowledge through unsupervised learning from LLM activation data. TL;DR: Existing methods can be easily distracted by other salient features in the prompt. 🧵👇
1
2
21
0
0
7
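Some context on what "knowledge-like properties" means: methods like CCS fit a probe on the activations of a statement and its negation so that the two probabilities are logically consistent and confident, using no truth labels. A rough Python sketch of that objective is below (tensor names are hypothetical); the paper's point is that directions tracking other salient features, such as a character's opinions, can satisfy the same constraints just as well as the model's own knowledge.

```python
import torch

def ccs_style_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """Sketch of a CCS-style unsupervised objective. p_pos / p_neg are probe
    outputs in (0, 1) for a statement and its negation. No labels are used:
    the probe is only asked to be consistent (p_pos ≈ 1 - p_neg) and
    confident (discouraged from sitting at 0.5 on both)."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```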
Thanks to my joint first author @rohinmshah, and my coauthors @ZacKenton1, @JanosKramar, and Ramana Kumar!
1
1
15
There's still much we don't know. Why is the generalising circuit learned slower? Why does grokking happen in the absence of weight decay? Why don't we see grokking in typical ML training? Check out our paper for speculation on these and much more
arxiv.org
One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect...
4
2
46
Prediction 3: Training runs with different amounts of weight decay should converge to the same test accuracy – since test accuracy at convergence depends on the ratio of Gen to Mem, which depends only on dataset size
3
2
29
Prediction 2: Remember how Mem is really efficient at small dataset sizes? That suggests that, if you train a grokked network further on a really small subset of the data, the network should switch back to Mem, "ungrokking" to poor test accuracy
2
0
44
Prediction 1: there is a dataset size where Mem and Gen have similar efficiencies. If we train for long enough at that size, sometimes we should get a mix with similar proportions of Mem and Gen – resulting in "semi-grokking" to partial test accuracy 🤯
1
1
31
The strength of a scientific explanation is its ability to make interesting and novel predictions in new settings! Can our explanation make some striking new predictions? 🔬
1
0
27
Why does the network bother memorising then? We hypothesise that generalisation is learned slowly. That gives us three ingredients which together are sufficient for grokking: (1) two circuits, Mem and Gen, (2) Gen is more efficient, (3) Gen is learned more slowly.
2
3
41
Answer: Gen is more *efficient*: it turns the same parameter norm into higher outputs than memorisation (and higher outputs = more confident predictions = lower loss). Mem is super efficient on small datasets, but Gen scales better with more data, and wins the efficiency race 🏎️
3
4
46
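A quick numerical illustration of the "higher outputs = more confident predictions = lower loss" step, with made-up numbers: under softmax cross-entropy, scaling up the correct class's logit drives the loss toward zero, so whichever circuit turns a given parameter norm into larger logits ends up with lower loss.

```python
import math

def cross_entropy(logits, correct_idx):
    """Softmax cross-entropy for a single example."""
    z = max(logits)  # subtract the max for numerical stability
    log_sum = z + math.log(sum(math.exp(l - z) for l in logits))
    return log_sum - logits[correct_idx]

# Two hypothetical circuits spending the same parameter-norm budget: the more
# "efficient" one produces a larger correct-class logit, hence lower loss.
for scale in [1.0, 2.0, 4.0, 8.0]:
    logits = [scale, 0.0, 0.0]   # correct class gets logit `scale`
    print(scale, round(cross_entropy(logits, 0), 4))
# Loss falls monotonically as the output scale grows: ~0.55, 0.24, 0.036, 0.0007
```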
In grokking, a neural network first learns a memorising circuit "Mem" that memorises the training dataset, but with further training it switches to a generalising circuit "Gen". Key Q: why does the network ever change from Mem, which achieves near-perfect training loss?
1
2
52
Our latest paper provides a general theory explaining when and why grokking (aka delayed generalisation) occurs – a theory so precise that we can predict hyperparameters that lead to partial grokking, and design interventions that reverse grokking! 🧵👇
15
197
1K