Lee Sharkey
@leedsharkey
Followers: 2K · Following: 6K · Media: 46 · Statuses: 672
Scruting matrices @ Goodfire | Previously: cofounded Apollo Research
London, UK
Joined March 2015
I will again state my view that condemning bad things is great, but condemning others for failing to condemn bad things (much less boycotting them and similar glorious loyalty oath crusades) is building toxic community incentives and attempting to force conformity.
1 reply · 4 reposts · 34 likes
Yeup!
0 replies · 0 reposts · 3 likes
Why use LLM-as-a-judge when you can get the same performance for 15–500x cheaper? Our new research with @RakutenGroup on PII detection finds that SAE probes:
- transfer from synthetic to real data better than normal probes
- match GPT-5 Mini performance at 1/15 the cost
(1/6)
12 replies · 49 reposts · 327 likes
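For intuition, a minimal sketch of what such a probe can look like, assuming you already have SAE feature activations per example (all data, shapes, and hyperparameters below are placeholders, not the paper's pipeline):

```python
# A linear probe on precomputed SAE feature activations (placeholder data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((1000, 4096))       # stand-in for SAE feature activations
y = rng.integers(0, 2, size=1000)  # stand-in for PII / non-PII labels

# One cheap linear pass at inference time, vs. a full LLM call per example
# for LLM-as-a-judge; L1 keeps the probe sparse over SAE features.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
probe.fit(X, y)
print("train accuracy:", probe.score(X, y))
```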
Are you a high-agency, early- to mid-career researcher or engineer who wants to work on AI interpretability? We're looking for several Research Fellows and Research Engineering Fellows to start this fall.
7 replies · 17 reposts · 154 likes
We're excited to announce a collaboration with @MayoClinic! We're working to improve personalized patient outcomes by extracting richer, more reliable signals from genomic & digital pathology models. That could mean novel biomarkers, personalized diagnostics, & more.
3 replies · 10 reposts · 73 likes
Official results are in - Gemini achieved gold-medal level in the International Mathematical Olympiad! 🏆 An advanced version was able to solve 5 out of 6 problems. Incredible progress - huge congrats to @lmthang and the team!
deepmind.google
The International Mathematical Olympiad (“IMO”) is the world’s most prestigious competition for young mathematicians, and has been held annually since 1959. Each country taking part is represented by…
202 replies · 760 reposts · 6K likes
Who knew you could win gold in the International Math Olympiad without truly reasoning?
36 replies · 24 reposts · 535 likes
Just wrote a piece on why I believe interpretability is AI’s most important frontier - we're building the most powerful technology in history, but still can't reliably engineer or understand our models. With rapidly improving model capabilities, interpretability is more urgent…
1 reply · 17 reposts · 138 likes
We and collaborators have already begun scaling to much larger models, and see some very early signs of life! We think now is a great time for new people to jump on and improve on this method! Work by @BushnaqLucius @danbraunai and me! Links to paper & code below!
1 reply · 0 reposts · 18 likes
While this method overcomes most of the barriers to scaling, we think a few more tweaks will be necessary before we can trust results on large models. But at least now we can indeed scale up to those models and start exploring, even if we're not sure about the results!
1 reply · 0 reposts · 14 likes
We demonstrate this improved stability by replicating (and improving on) the decompositions of our previous paper. We also decompose models that the previous method failed to decompose correctly, such as a TMS model with a hidden identity matrix and a 3-layer residual MLP.
2 replies · 0 reposts · 13 likes
Overall, this is much more stable to train than the top-k approach in the old algorithm, probably because we no longer use gradients to estimate attributions (which are often inaccurate), and because top-k introduced troublesome discontinuities (plus other reasons).
1 reply · 0 reposts · 11 likes
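To make the stability point concrete, here's a toy contrast between a hard top-k mask and a stochastic one (my illustration, not code from either paper):

```python
import torch

scores = torch.tensor([0.51, 0.49, 0.10])

# Hard top-k: a tiny nudge to the scores flips which subcomponents survive,
# so the mask (and hence the loss) jumps discontinuously.
mask_topk = torch.zeros_like(scores)
mask_topk[scores.topk(k=2).indices] = 1.0

# Stochastic masking: the mask varies smoothly with the predicted causal
# importances g, so gradients through g are well-behaved in expectation.
g = torch.tensor([0.9, 0.5, 0.1])
mask_stochastic = g + (1 - g) * torch.rand_like(g)  # m ~ U[g, 1]
print(mask_topk, mask_stochastic)
```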
We train the output of the ablated network to match the output of the original network. And, as before, we train the subcomponents so that they sum to the parameters of the original network.
1 reply · 0 reposts · 11 likes
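In code, those two objectives could look roughly like this (names and shapes are assumptions, not the released implementation):

```python
import torch

def spd_losses(original_out, ablated_out, subcomponents, original_params):
    # Output matching: the stochastically ablated network should produce
    # the same outputs as the original network.
    output_loss = (ablated_out - original_out).pow(2).mean()
    # Faithfulness: summing all subcomponents should recover the original
    # parameters of the decomposed layer.
    faithfulness_loss = (subcomponents.sum(dim=0) - original_params).pow(2).mean()
    return output_loss, faithfulness_loss
```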
But we ablate it by some random amount: causally unimportant subcomponents can be fully, partially, or not at all ablated (because they shouldn't matter, so it shouldn't make a difference!), whereas 'unablatable' (causally important) subcomponents don't get ablated at all.
1 reply · 0 reposts · 14 likes
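One way to read that sampling rule: each subcomponent's ablation factor m is drawn uniformly from [g, 1], where g is its predicted causal importance. A toy demo (my interpretation, not the paper's code):

```python
import torch

g = torch.tensor([0.0, 0.5, 1.0])     # unimportant, middling, 'unablatable'
m = g + (1 - g) * torch.rand_like(g)  # ablation factor m ~ U[g, 1]
# g = 0.0 -> m lands anywhere in [0, 1]: fully, partially, or not ablated
# g = 1.0 -> m is pinned at 1: never ablated
print(m)
```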
The way it works: For a given datapoint, we predict the 'causal importance' of each subcomponent. This estimates how 'ablatable' each one is on that datapoint. Then we do another forward pass where we actually ablate each subcomponent!...
1 reply · 0 reposts · 13 likes
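A compressed sketch of those two passes for a single linear layer (the gate, shapes, and sampling rule here are my assumptions, not the released code):

```python
import torch

def ablated_forward(x, U, V, gate):
    # U: (n_sub, d_out), V: (n_sub, d_in); subcomponent c is the rank-one
    # matrix outer(U[c], V[c]). gate predicts causal importances from x.
    g = gate(x)                           # (batch, n_sub), values in [0, 1]
    m = g + (1 - g) * torch.rand_like(g)  # stochastic ablation, m ~ U[g, 1]
    # Rebuild the layer's weights from the masked subcomponents and apply them.
    W = torch.einsum("bc,co,ci->boi", m, U, V)
    return torch.einsum("boi,bi->bo", W, x)
```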
The main differences are:
- We use 'subcomponents' (rank-one matrices in one layer) instead of components (whole vectors in parameter space)
- We learn a simple function that predicts the 'causal importance' of each subcomponent
1 reply · 1 repost · 17 likes
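For concreteness, a sketch of that parameterization: rank-one subcomponents plus a small learned importance function (an assumed minimal architecture, not necessarily the paper's exact one):

```python
import torch
import torch.nn as nn

d_in, d_out, n_sub = 64, 64, 128

# Each subcomponent is a rank-one matrix u_c v_c^T; together they should
# sum to the weight matrix of the layer being decomposed.
U = nn.Parameter(torch.randn(n_sub, d_out))
V = nn.Parameter(torch.randn(n_sub, d_in))
W = torch.einsum("co,ci->oi", U, V)  # sum over c of rank-one matrices

# A simple learned function mapping a datapoint to per-subcomponent
# causal importances in [0, 1].
gate = nn.Sequential(nn.Linear(d_in, n_sub), nn.Sigmoid())
x = torch.randn(1, d_in)
g = gate(x)  # (1, n_sub)
```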
A few months ago, we published Attribution-based parameter decomposition -- a method for decomposing a network's parameters for interpretability. But it was janky and didn't scale. Today, we published a new, better algorithm called 🔶Stochastic Parameter Decomposition!🔶
5 replies · 23 reposts · 184 likes
Very good. Very fire.
I've joined @GoodfireAI (London team) because I think it's the best place to develop and scale fundamental interpretability techniques. Doing this well requires compute, ambition, and most of all, great people. Goodfire has all of these.
0 replies · 0 reposts · 26 likes
New research update! We replicated @AnthropicAI's circuit tracing methods to test if they can recover a known, simple transformer mechanism.
2 replies · 53 reposts · 502 likes