Vikrant Varma

@VikrantVarma_

642 Followers · 39 Following · 7 Media · 20 Statuses

Research Engineer working on AI alignment at DeepMind.

Joined July 2023
@VikrantVarma_
Vikrant Varma
7 months
RT @rohinmshah: We're hiring! Join an elite team that sets an AGI safety approach for all of Google -- both through development and impleme….
0 · 37 · 0
@VikrantVarma_
Vikrant Varma
7 months
RT @vkrakovna: We are excited to release a short course on AGI safety! The course offers a concise and accessible introduction to AI alignm….
[Link card: deepmindsafetyresearch.medium.com – We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course…]
0 · 49 · 0
@VikrantVarma_
Vikrant Varma
8 months
RT @davlindner: New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Ou….
0 · 98 · 0
@VikrantVarma_
Vikrant Varma
8 months
RL training can incentivise LLM agents to produce long-term alien plans and evade monitoring. But in high-stakes settings, comprehensibility is critical. Our new paper shows how to change an agent’s incentives to *only* act in ways that we can understand.
@davlindner
David Lindner
8 months
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in 🧵
1 · 1 · 7
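For readers who want the mechanism made concrete: a minimal sketch of how a MONA-style training target ("myopic optimization with non-myopic approval") differs from a standard bootstrapped RL target. Function names and the exact combination are illustrative assumptions, not the paper's implementation.

```python
# Sketch: standard RL target vs. a MONA-style myopic target.
# `approvals` stands in for an overseer's judgement of each action.

def standard_target(rewards: list[float], values: list[float],
                    t: int, gamma: float = 0.99) -> float:
    # Credit flows back from future steps, so the agent is rewarded
    # for long-horizon plans -- including ones humans can't follow.
    return rewards[t] + gamma * values[t + 1]

def mona_target(rewards: list[float], approvals: list[float],
                t: int) -> float:
    # No bootstrapping from future value: only the immediate reward
    # plus a far-sighted overseer's approval of the current action.
    # Multi-step reward hacks stop paying off, because the agent is
    # never optimised on their downstream consequences.
    return rewards[t] + approvals[t]
```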
@VikrantVarma_
Vikrant Varma
1 year
Excited to see what people try with these shiny new open source SAEs! Great work by @sen_r and the team on pushing SOTA here.
@NeelNanda5
Neel Nanda
1 year
New GDM mech interp paper led by @sen_r: JumpReLU SAEs, a new SOTA SAE method! We replace standard ReLUs with discontinuous JumpReLUs & train directly for L0 with straight-through estimators. We'll soon release hundreds of open JumpReLU SAEs on Gemma 2; apply now for early access!
1 · 0 · 8
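To make the quoted tweet concrete, here is a minimal sketch of a JumpReLU activation trained with a straight-through estimator. The narrow-sigmoid surrogate is a simplification of the paper's kernel-based estimator, and all names are illustrative.

```python
import torch

def heaviside_ste(z: torch.Tensor, theta: torch.Tensor,
                  bandwidth: float = 1e-3) -> torch.Tensor:
    # Forward: hard step, 1 where the pre-activation clears the learned
    # threshold. Backward: gradient of a narrow sigmoid, so theta still
    # receives a training signal despite the discontinuity.
    hard = (z > theta).float()
    soft = torch.sigmoid((z - theta) / bandwidth)
    return soft + (hard - soft).detach()

def jumprelu(z: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # JumpReLU: identity above the threshold, exactly zero below it,
    # unlike ReLU, which lets small positive activations leak through.
    return z * heaviside_ste(z, theta)

def l0_penalty(z: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # Count of firing features, differentiable via the STE -- this is
    # what "train directly for L0" refers to.
    return heaviside_ste(z, theta).sum(dim=-1).mean()
```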
@VikrantVarma_
Vikrant Varma
1 year
Very cool find by @sen_r, @ArthurConmy, and the rest of the DeepMind mechinterp team! I’m excited by the rate of progress here.
@sen_r
Senthooran Rajamanoharan
1 year
New @GoogleDeepMind MechInterp work! We introduce Gated SAEs, a Pareto improvement over existing sparse autoencoders. They find equally good reconstructions with around half as many firing features, while maintaining interpretability (CI 0-13% improvement). Joint w/ @ArthurConmy
0 · 0 · 6
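A rough sketch of the gated-encoder idea from the quoted tweet: one path decides which features fire, another estimates how strongly, removing ReLU's coupling between detection and magnitude. The paper ties the two weight matrices up to a rescaling and trains the gate with an auxiliary loss; both are omitted here, and names are illustrative.

```python
import torch
import torch.nn as nn

class GatedEncoder(nn.Module):
    # Simplified gated SAE encoder (a sketch, not the paper's code).
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_sae)  # which features fire
        self.w_mag = nn.Linear(d_model, d_sae)   # how strongly

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        active = (self.w_gate(x) > 0).float()    # binary gate
        magnitude = torch.relu(self.w_mag(x))
        # A feature contributes only when its gate opens, so the
        # magnitude path no longer trades off detection against scale.
        return active * magnitude
```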
@VikrantVarma_
Vikrant Varma
1 year
I had fun talking to Daniel on his podcast AXRP! And I’ve enjoyed listening to his other episodes too :)
@dfrsrchtwts
Daniel Filan
1 year
New episode of AXRP with @VikrantVarma_! We chat about his work on CCS and grokking. The transcript is in the linked tweet, or check out the reply for the YouTube video!
0 · 0 · 0
@VikrantVarma_
Vikrant Varma
2 years
Our latest paper shows that unsupervised methods on LLM activations don’t yet discover latent knowledge. Many things can satisfy knowledge-like properties besides ground truth. E.g. a strongly opinionated character causes ~half the probes to detect *her* beliefs instead.
@ZacKenton1
Zac Kenton
2 years
In our new @GoogleDeepMind paper, we redteam methods that aim to discover latent knowledge through unsupervised learning from LLM activation data. TL;DR: Existing methods can be easily distracted by other salient features in the prompt. 🧵👇
1 · 2 · 21
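For context, a minimal version of the contrast-consistency objective (CCS, from Burns et al.) that the paper red-teams. Nothing in the loss references truth: any confident, negation-consistent feature, such as a salient character's opinions, minimises it just as well, which is the failure mode described above. Variable names are illustrative.

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # p_pos: probe output on "X is true"; p_neg: on "X is false".
    # Consistency: the pair should behave like P and 1 - P.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: rule out the degenerate p_pos = p_neg = 0.5 probe.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```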
@VikrantVarma_
Vikrant Varma
2 years
0 · 0 · 7
@VikrantVarma_
Vikrant Varma
2 years
Thanks to my joint first author @rohinmshah and my coauthors @ZacKenton1, @JanosKramar, and Ramana Kumar!
1 · 1 · 15
@VikrantVarma_
Vikrant Varma
2 years
There's still much we don't know. Why is the generalising circuit learned slower? Why does grokking happen in the absence of weight decay? Why don't we see grokking in typical ML training? Check out our paper for speculation on these and much more.
[Link card: arxiv.org – One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect...]
4 · 2 · 46
@VikrantVarma_
Vikrant Varma
2 years
Prediction 3: Training runs with different amounts of weight decay should converge to the same test accuracy – since test accuracy at convergence depends on the ratio of Gen to Mem, which depends only on dataset size
3 · 2 · 29
@VikrantVarma_
Vikrant Varma
2 years
Prediction 2: Remember how Mem is really efficient at small dataset sizes? That suggests that, if you train a grokked network further on a really small subset of the data, the network should switch back to Mem, "ungrokking" to poor test accuracy
2 · 0 · 44
@VikrantVarma_
Vikrant Varma
2 years
Prediction 1: there is a dataset size where Mem and Gen have similar efficiencies. If we train for long enough at that size, sometimes we should get a mix with similar proportions of Mem and Gen – resulting in "semi-grokking" to partial test accuracy 🤯
1 · 1 · 31
@VikrantVarma_
Vikrant Varma
2 years
The strength of a scientific explanation is its ability to make interesting and novel predictions in new settings! Can our explanation make some striking new predictions? 🔬
1 · 0 · 27
@VikrantVarma_
Vikrant Varma
2 years
Why does the network bother memorising then? We hypothesise that generalisation is learned slowly. That gives us three ingredients which together are sufficient for grokking: (1) two circuits, Mem and Gen, (2) Gen is more efficient, (3) Gen is learned more slowly.
2 · 3 · 41
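The three ingredients can be seen in a deliberately tiny toy (my construction for illustration, not the paper's model): two scalar circuit strengths trained by gradient descent with weight decay, where Gen converts parameter norm into logits more efficiently but gets a much smaller learning rate to stand in for "learned slowly".

```python
import numpy as np

# Train logit = c_mem * a + c_gen * b; loss = softplus(-logit) + decay.
# Gen (b) is 3x as efficient per unit of norm but learns 30x slower.
a, b = 0.0, 0.0
c_mem, c_gen = 1.0, 3.0
lr_mem, lr_gen, lam = 0.1, 0.003, 0.05
for step in range(3001):
    logit = c_mem * a + c_gen * b
    p_wrong = 1.0 / (1.0 + np.exp(logit))   # = -d(loss)/d(logit)
    a += lr_mem * (p_wrong * c_mem - lam * a)
    b += lr_gen * (p_wrong * c_gen - lam * b)
    if step % 500 == 0:
        # Mem's logit share rises first, then shrinks as the more
        # norm-efficient Gen takes over under weight decay: grokking.
        print(step, f"mem={c_mem * a:.2f}", f"gen={c_gen * b:.2f}")
```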
@VikrantVarma_
Vikrant Varma
2 years
Answer: Gen is more *efficient*: it turns the same parameter norm into higher outputs than memorisation (and higher outputs = more confident predictions = lower loss). Mem is super efficient on small datasets, but Gen scales better with more data, and wins the efficiency race 🏎️
3 · 4 · 46
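A two-line check of the "higher outputs = lower loss" step: under cross-entropy, doubling the correct-class margin (what extra efficiency buys for the same parameter norm) cuts the loss by roughly 7x. Numbers are illustrative.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0])
# Same norm budget; the efficient circuit produces a larger margin.
loss_mem = F.cross_entropy(torch.tensor([[2.0, 0.0]]), target)
loss_gen = F.cross_entropy(torch.tensor([[4.0, 0.0]]), target)
print(loss_mem.item())  # ~0.127
print(loss_gen.item())  # ~0.018 -- more confident, lower loss
```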
@VikrantVarma_
Vikrant Varma
2 years
In grokking, a neural network first learns a memorising circuit "Mem" that memorises the training dataset, but with further training it switches to a generalising circuit "Gen". Key Q: why does the network ever change from Mem, which achieves near-perfect training loss?
1 · 2 · 52
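The canonical setting where this plays out is a small algorithmic task such as modular addition; a minimal setup of that kind (hyperparameters are illustrative, not the paper's) looks like this:

```python
import torch
import torch.nn as nn

# Modular addition, the classic grokking task: Mem can fit the training
# pairs by rote; Gen must exploit the task's underlying structure.
P = 97
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2:]

model = nn.Sequential(
    nn.Embedding(P, 128),            # embed each operand
    nn.Flatten(),                    # (a, b) -> one 256-dim vector
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, P),               # predict (a + b) mod P
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
# Trained long enough: train accuracy hits ~100% early (Mem), and only
# much later does test accuracy jump from chance to ~100% (Gen).
```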
@VikrantVarma_
Vikrant Varma
2 years
Our latest paper provides a general theory explaining when and why grokking (aka delayed generalisation) occurs – a theory so precise that we can predict hyperparameters that lead to partial grokking, and design interventions that reverse grokking! 🧵👇
15 · 197 · 1K