
Arthur Conmy
@ArthurConmy
Followers: 4K · Following: 8K · Media: 41 · Statuses: 503
Aspiring 10x reverse engineer @GoogleDeepMind
London, UK
Joined August 2021
@paulcbogdan and @uzaymacar did a fantastic job here; the speed of some of this work I've been fortunate to help with recently is incredible!
0
0
8
Which intermediate steps do reasoning models use to produce their outputs? We introduce principled counterfactual metrics for step importance that suggest planning+backtracking is important, and find related…
New paper: What happens when an LLM reasons? We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention. We discover thought anchors: key steps shaping everything else. Check our tool & unpack CoT yourself 🧵
1
2
37
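For readers wondering what a "principled counterfactual metric for step importance" can look like in practice, here is a minimal sketch of the resampling idea, assuming a hypothetical `model.sample_answer` helper (my illustration, not the paper's code):

```python
def step_importance(model, question, cot_steps, i, n_samples=32):
    """Estimate the counterfactual importance of CoT step i by
    resampling answers from prefixes with and without that step.
    `model.sample_answer` is a hypothetical helper that generates a
    continuation from a prompt and extracts the final answer."""
    with_step = question + "".join(cot_steps[: i + 1])
    without_step = question + "".join(cot_steps[:i])

    answers_with = [model.sample_answer(with_step) for _ in range(n_samples)]
    answers_without = [model.sample_answer(without_step) for _ in range(n_samples)]

    # One possible counterfactual metric: how much does dropping the
    # step shift the probability of the modal answer?
    modal = max(set(answers_with), key=answers_with.count)
    p_with = answers_with.count(modal) / n_samples
    p_without = answers_without.count(modal) / n_samples
    return p_with - p_without
```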
'The key lesson from mechanistic interpretability is that a surprising number of AI behaviors are surprisingly well-described as linear directions in activation space.' ~ Lewis Smith. We'll have more work in this area soon, thanks to @cvenhoff00 and @IvanArcus!
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! We show how to steer DeepSeek-R1-Distill's reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵
0
0
33
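The core trick is simple enough to sketch: add a fixed vector to the residual stream during the forward pass. A minimal PyTorch version, assuming a GPT-2-style module layout (`model.transformer.h`) and an illustrative layer index and scale, not the paper's actual settings:

```python
import torch

def add_steering_hook(model, layer_idx, steering_vector, scale=4.0):
    """Register a forward hook that adds `scale * steering_vector` to
    the residual-stream output of one transformer block.
    `model.transformer.h` is a GPT-2-style module path; adjust it for
    the model you are actually steering."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return model.transformer.h[layer_idx].register_forward_hook(hook)

# Usage sketch: steer, generate, then clean up.
# handle = add_steering_hook(model, layer_idx=10, steering_vector=vec)
# out = model.generate(**inputs)
# handle.remove()
```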
Our paper on chain-of-thought faithfulness has been updated. We made some changes we thought were worthwhile, and also took feedback from Twitter replies and changed some examples.
🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfulness, but our core findings hold strong: no frontier model is entirely faithful. Keep reading for details.
0
0
13
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on…
0
18
0
Achyuta did great work exploring linear representations in VLLMs. 3 thoughts: 1. Neurons are almost as interpretable as SAE latents. 2. A key limitation of SAEs in vision is that vision is less comprehensible at the token-position level; this made it difficult to interpret…
🧵 Can we understand vision language models by interpreting linear directions in their latents? Yes! In our new paper, Line of Sight, we use probing, steering, and SAEs as useful tools to interpret image representations within VLLMs.
1
3
28
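As a flavor of the probing half of that toolkit, a hedged sketch: fit a linear classifier on cached image-token activations and read off a candidate direction. The activation and label files here are assumed placeholders, not the paper's pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed precomputed: activations at image-token positions and binary
# concept labels (these file names are placeholders).
acts = np.load("image_token_acts.npy")     # (n_images, d_model)
labels = np.load("concept_labels.npy")     # (n_images,)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("probe accuracy:", probe.score(acts, labels))

# The probe's weight vector is a candidate linear direction for the
# concept; the normalized version can double as a steering direction.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```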
Our high-level finding in the Gemma Scope paper was that transcoders were slightly Pareto-worse than SAEs. But all the weights are on HuggingFace if you want to look further into transcoders! They have other benefits that SAEs do not have.
Fantastic to see Anthropic, in collaboration with @neuronpedia, creating open source tools for studying circuits with transcoders. There's a lot of interesting work to be done. I'm also very glad someone finally found a use for our Gemma Scope transcoders! Credit to @ArthurConmy.
0
1
21
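To make the SAE/transcoder distinction concrete: an SAE reconstructs the same activation it reads, while a transcoder reads an MLP's input and predicts that MLP's output, which lets it stand in for the MLP during circuit analysis. A minimal sketch with illustrative dimensions (not the Gemma Scope configuration):

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sparse transcoder: encodes the MLP's input and decodes an
    approximation of the MLP's *output*. An SAE would instead
    reconstruct the same activation it encodes. Dimensions here are
    illustrative, not the Gemma Scope configuration."""
    def __init__(self, d_model=2048, d_latent=16384):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, mlp_input):
        latents = torch.relu(self.enc(mlp_input))   # sparse feature activations
        return self.dec(latents), latents           # predicted MLP output

tc = Transcoder()
x = torch.randn(1, 8, 2048)        # (batch, seq, d_model)
mlp_out_hat, latents = tc(x)       # stands in for the MLP in a circuit trace
```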
Our circuits paper led by @dmhook and @neverrixx was accepted at ICML! The task seems like a good one to study if you work on circuits.
1/5 What happens during in-context learning? In our new ICML paper, we use sparse autoencoders to understand the underlying circuit! The model detects a task being performed, and moves this to the end to trigger latents for executing it, a hypothesis found via SAEs!
0
1
25
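The kind of SAE-driven hypothesis hunting behind that finding can be sketched in a few lines: encode the residual stream at the last token of an in-context-learning prompt and inspect the top latents. The `sae.encode` interface is an assumption, not a specific library's API:

```python
import torch

def top_latents_at_last_token(sae, resid_acts, k=10):
    """resid_acts: (seq, d_model) residual-stream activations for one
    ICL prompt. Returns the k most active SAE latents at the final
    position. `sae.encode` is an assumed interface returning a
    (d_latent,) vector of latent activations."""
    latents = sae.encode(resid_acts[-1])
    vals, idxs = torch.topk(latents, k)
    return list(zip(idxs.tolist(), vals.tolist()))
```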
This result is predicted by @Turn_Trout's 'Self-Fulfilling Misalignment' post. It still took me by surprise that purely trying to make AIs smarter can lead to behavioral misalignment right now!
While AZR enables self-evolution, we discovered a critical safety issue: our Llama3.1 model occasionally produced concerning CoT, including statements about "outsmarting intelligent machines and less intelligent humans", which we term "uh-oh moments." They still need oversight. 9/N
1
0
14
RT @MariusHobbhahn: While it is bad that models learn to reward hack, now is a perfect time to study these models in great detail. The fai…
0
4
0
Happy to have helped with our effort to write up some views on what technical work any responsible AGI project should probably undertake, and why!
AGI could revolutionize many fields, from healthcare to education, but it's crucial that it's developed responsibly. Today, we're sharing how we're thinking about safety and security on the path to AGI.
0
1
28