Lee Sharkey

@leedsharkey

Followers
2K
Following
6K
Media
46
Statuses
665

Scruting matrices @ Goodfire | Previously: cofounded Apollo Research

London, UK
Joined March 2015
@leedsharkey
Lee Sharkey
1 month
RT @demishassabis: Official results are in - Gemini achieved gold-medal level in the International Mathematical Olympiad! 🏆 An advanced ver….
deepmind.google
Our advanced model officially achieved a gold-medal level performance on problems from the International Mathematical Olympiad (IMO), the world’s most prestigious competition for young...
0
761
0
@leedsharkey
Lee Sharkey
1 month
RT @liron: Who knew you could win gold in the International Math Olympiad without truly reasoning?
0
24
0
@leedsharkey
Lee Sharkey
1 month
RT @ericho_goodfire: Just wrote a piece on why I believe interpretability is AI’s most important frontier - we're building the most powerfu….
0
17
0
@leedsharkey
Lee Sharkey
2 months
We and collaborators have already begun scaling to much larger models, and see some very early signs of life! We think now is a great time for new people to jump in and improve on this method! Work by @BushnaqLucius @danbraunai and me! Links to paper & code below!
1
0
18
@leedsharkey
Lee Sharkey
2 months
While this method overcomes most of the barriers to scaling, we think a few more tweaks will be necessary before we can trust results on large models. But at least now we can indeed scale up to those models and start exploring, even if we're not sure about the results!
1
0
13
@leedsharkey
Lee Sharkey
2 months
We demonstrate this improved stability by replicating (and improving on) the decompositions of our previous paper. We also decompose models that the previous method failed to decompose correctly, such as a TMS model with a hidden identity matrix and a 3-layer residual MLP.
2
0
12
@leedsharkey
Lee Sharkey
2 months
Overall, this is much more stable to train than the top-k approach in the old algorithm, probably because we no longer use gradients to estimate attributions (which are often inaccurate), and because top-k introduced troublesome discontinuities (plus other reasons).
1
0
10
@leedsharkey
Lee Sharkey
2 months
We train the output of the ablated network to match the output of the original network. And as before, we train the subcomponents to sum to the parameters of the original network.
1
0
10
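The two training signals in the tweet above can be sketched as follows. This is an illustrative NumPy stand-in assuming a single linear layer; names like `faithfulness_loss` and the shapes are my assumptions, not code from the paper:

```python
import numpy as np

# Hedged sketch of the two training signals described above:
# (1) the ablated network's output should match the original's, and
# (2) the subcomponents should sum to the original parameters.
rng = np.random.default_rng(0)
d_in, d_out, n_sub = 4, 3, 6

W_orig = rng.normal(size=(d_out, d_in))        # target network's weights
U = rng.normal(size=(n_sub, d_out))            # learned output-side factors
V = rng.normal(size=(n_sub, d_in))             # learned input-side factors
subcomponents = np.einsum("co,ci->coi", U, V)  # rank-one pieces (n_sub, d_out, d_in)

x = rng.normal(size=d_in)                      # one datapoint
mask = rng.uniform(size=n_sub)                 # stand-in stochastic ablation mask

# Signal 2: subcomponents should sum to the original parameters.
faithfulness_loss = ((subcomponents.sum(axis=0) - W_orig) ** 2).mean()

# Signal 1: ablated output vs. original output on this datapoint.
W_ablated = (mask[:, None, None] * subcomponents).sum(axis=0)
output_loss = ((W_ablated @ x - W_orig @ x) ** 2).mean()
```

In a real training loop both losses would be minimized jointly with respect to the factors and the causal-importance predictor.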
@leedsharkey
Lee Sharkey
2 months
But we ablate each subcomponent by some random amount: causally unimportant subcomponents can be fully, partially, or not at all ablated (because they shouldn't matter, so ablating them shouldn't make a difference!), whereas unablatable (causally important) subcomponents don't get ablated at all.
1
0
13
@leedsharkey
Lee Sharkey
2 months
The way it works: For a given datapoint, we predict the 'causal importance' of each subcomponent. This estimates how 'ablatable' it is on that datapoint. Then we do another forward pass where we actually ablate each subcomponent!
1
0
12
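The two passes described above can be sketched like this. A hypothetical NumPy illustration: the real causal-importance predictor is a learned function, whereas here random values in [0, 1] stand in for its output on one datapoint, and all names are my assumptions:

```python
import numpy as np

# Hedged sketch of the predict-then-ablate idea.
rng = np.random.default_rng(0)
d_in, d_out, n_sub = 4, 3, 6

U = rng.normal(size=(n_sub, d_out))
V = rng.normal(size=(n_sub, d_in))
subcomponents = np.einsum("co,ci->coi", U, V)  # rank-one subcomponents
W = subcomponents.sum(axis=0)                  # original layer weights
x = rng.normal(size=d_in)                      # one datapoint

# Pass 1: predicted causal importances g_c in [0, 1] (stand-in values).
g = rng.uniform(size=n_sub)

# Pass 2: ablate each subcomponent by a random amount. A subcomponent
# with importance g_c keeps at least a factor g_c of itself; the rest
# is scaled by a uniform random fraction, so unimportant subcomponents
# (g_c near 0) may be fully, partially, or not at all ablated.
r = rng.uniform(size=n_sub)
mask = g + (1.0 - g) * r                       # mask_c lies in [g_c, 1]
W_ablated = (mask[:, None, None] * subcomponents).sum(axis=0)
y_orig, y_ablated = W @ x, W_ablated @ x
```

Note that a subcomponent with g_c = 1 always gets mask 1, i.e. it is never ablated.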
@leedsharkey
Lee Sharkey
2 months
The main differences are:
- We use 'subcomponents' (rank-one matrices in one layer) instead of components (whole vectors in parameter space).
- We learn a simple function that predicts the 'causal importance' of each subcomponent.
1
1
16
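The rank-one 'subcomponents' mentioned above can be sketched as follows. A minimal hypothetical NumPy illustration; `n_sub`, `U`, `V` and the shapes are my assumptions, not names from the paper:

```python
import numpy as np

# Each subcomponent of a layer's weight matrix is a rank-one matrix
# u_c v_c^T, and the subcomponents sum to the full matrix.
rng = np.random.default_rng(0)
d_in, d_out, n_sub = 4, 3, 6

U = rng.normal(size=(n_sub, d_out))  # output-side factors u_c
V = rng.normal(size=(n_sub, d_in))   # input-side factors v_c
subcomponents = np.einsum("co,ci->coi", U, V)  # (n_sub, d_out, d_in)

ranks = [np.linalg.matrix_rank(s) for s in subcomponents]  # each is 1
W = subcomponents.sum(axis=0)        # full (d_out, d_in) weight matrix
```

By contrast, APD's 'components' were whole vectors in the network's parameter space; restricting each piece to a rank-one matrix within a single layer keeps it cheap to represent.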
@leedsharkey
Lee Sharkey
2 months
A few months ago, we published Attribution-based Parameter Decomposition -- a method for decomposing a network's parameters for interpretability. But it was janky and didn't scale. Today, we published a new, better algorithm: 🔶Stochastic Parameter Decomposition!🔶
4
21
179
@leedsharkey
Lee Sharkey
2 months
Very good. Very fire.
@danbraunai
Dan Braun
2 months
I've joined @GoodfireAI (London team) because I think it's the best place to develop and scale fundamental interpretability techniques. Doing this well requires compute, ambition, and most of all, great people. Goodfire has all of these.
0
0
27
@leedsharkey
Lee Sharkey
3 months
RT @GoodfireAI: New research update! We replicated @AnthropicAI's circuit tracing methods to test if they can recover a known, simple trans….
0
53
0
@leedsharkey
Lee Sharkey
3 months
I had a lot of fun chatting with Daniel on the AXRP podcast! We chatted about our ongoing interpretability research agenda, which started with Attribution-based Parameter Decomposition. Also lol "SAE killer" - how far we've come! 😂
@dfrsrchtwts
Daniel Filan
3 months
New episode with @leedsharkey on his new line of research, APD! I hope you'll enjoy listening as much as I enjoyed recording it :) Video link in reply.
0
4
36
@leedsharkey
Lee Sharkey
3 months
RT @aryaman2020: @GoodfireAI very good and very fire.
0
1
0
@leedsharkey
Lee Sharkey
3 months
Painting with interpretability tools is very fun!
@GoodfireAI
Goodfire
3 months
We created a canvas that plugs into an image model’s brain. You can use it to generate images in real-time by painting with the latent concepts the model has learned. Try out Paint with Ember for yourself 👇
0
2
39
@leedsharkey
Lee Sharkey
3 months
RT @danielmurfet: A few months ago I resigned from my tenured position at the University of Melbourne and joined Timaeus as Director of Res….
0
24
0
@leedsharkey
Lee Sharkey
3 months
A great new resource for mech interp research!
@danbraunai
Dan Braun
3 months
Introducing SimpleStories: A synthetic story dataset and model suite designed for understanding the internals and learning dynamics of LMs. It's an evolution from TinyStories and leverages better LMs for data generation and offers more data diversity. 🧵
0
1
34