ArthurConmy Profile Banner
Arthur Conmy Profile
Arthur Conmy

@ArthurConmy

Followers
4K
Following
8K
Media
41
Statuses
503

Aspiring 10x reverse engineer @GoogleDeepMind

London, UK
Joined August 2021
Don't wanna be here? Send us removal request.
@ArthurConmy
Arthur Conmy
7 days
Credit to @NeelNanda5 for this one.
1
0
9
@ArthurConmy
Arthur Conmy
7 days
I would strongly recommend You never notice how frequently typing is aversive (e.g. this person will not want to receive emails with typos, must be careful) . until you just talk to your computer and let LLMs deal with typos.
7
5
92
@ArthurConmy
Arthur Conmy
7 days
@paulcbogdan and @uzaymacar did a fantastic job here, the speed of some of this work I've been fortunate to help with recently is incredible!.
0
0
8
@ArthurConmy
Arthur Conmy
7 days
πŸ€” π™’π™π™žπ™˜π™ π™žπ™£π™©π™šπ™§π™’π™šπ™™π™žπ™–π™©π™š π™¨π™©π™šπ™₯𝙨 𝙙𝙀 π™§π™šπ™–π™¨π™€π™£π™žπ™£π™œ π™’π™€π™™π™šπ™‘π™¨ π™ͺπ™¨π™š 𝙩𝙀 π™₯𝙧𝙀𝙙π™ͺπ™˜π™š π™©π™π™šπ™žπ™§ 𝙀π™ͺ𝙩π™₯π™ͺ𝙩𝙨?. We introduce principled counterfactual metrics for step importance that suggest planning+backtracking is important, and find related.
@paulcbogdan
Paul Bogdan
7 days
New paper: What happens when an LLM reasons?. We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention. We discover thought anchors: key steps shaping everything else. Check our tool & unpack CoT yourself 🧡
1
2
37
@ArthurConmy
Arthur Conmy
8 days
'The key lesson from mechanistic interpretability is that a surprising number of AI behaviors are surprisingly well-described as linear directions in activation space'.~Lewis Smith. We'll have more work in this area soon, thanks to @cvenhoff00 and @IvanArcus !!.
@cvenhoff00
Constantin Venhoff
8 days
Can we actually control reasoning behaviors in thinking LLMs?. Our @iclr_conf workshop paper is out! πŸŽ‰. We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations!.Details in πŸ§΅πŸ‘‡
Tweet media one
0
0
33
@ArthurConmy
Arthur Conmy
15 days
Our paper on chain of thought faithfulness has been updated. We made some changes we thought were worth it and also took feedback from twitter replies and changed some examples πŸ™‚.
@IvanArcus
IvΓ‘n Arcuschin
15 days
🧡 NEW: We updated our research on unfaithful AI reasoning!. We have a stronger dataset which yields lower rates of unfaithfulness, but our core findings hold strong: no frontier model is entirely faithful. Keep reading for details πŸ‘‡
Tweet media one
0
0
13
@ArthurConmy
Arthur Conmy
16 days
Last author of Gemini 2.5 πŸ˜€
Tweet media one
30
12
1K
@ArthurConmy
Arthur Conmy
20 days
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on….
0
18
0
@ArthurConmy
Arthur Conmy
22 days
3
9
216
@ArthurConmy
Arthur Conmy
22 days
Tweet media one
50
677
10K
@ArthurConmy
Arthur Conmy
26 days
I am not working on SAEs anymore, but if I were I would focus on beating realistic baselines first and foremost, and secondly focus on understanding variable length sequences of tokens: beyond VLLMs I would be surprised if SAEs help understand thinking models much, without this!.
0
0
10
@ArthurConmy
Arthur Conmy
26 days
Achyuta did great work exploring linear representations in VLLMs. 3 thoughts:. 1. Neurons are almost as interpretable as SAE latents. 2. A key limitation of SAEs in vision is that vision is less comprehensible at the token position level - this made it difficult to interpret.
@AchyutaBot
Achyuta Rajaram
27 days
🧡Can we understand vision language models by interpreting linear directions in their latents?. Yes! In our new paper, Line of Sight, we use probing, steering, and SAEs as useful tools to interpret image representations within VLLMs.
Tweet media one
1
3
28
@ArthurConmy
Arthur Conmy
1 month
Our high level finding in the Gemma Scope paper was that transcoders were slightly pareto worse than SAEs. But this all the weights are on HuggingFace if you want to look further into transcoders! They have other benefits that SAEs do not have
Tweet media one
@NeelNanda5
Neel Nanda
1 month
Fantastic to see Anthropic, in collaboration with @neuronpedia, creating open source tools for studying circuits with transcoders. There's a lot of interesting work to be done. I'm also very glad someone finally found a use for our Gemma Scope transcoders! Credit to @ArthurConmy.
0
1
21
@ArthurConmy
Arthur Conmy
1 month
Our circuits paper led by @dmhook and @neverrixx was accepted at ICML! The task seems a good one to study if you work on circuits πŸ™‚.
@dmhook
Dmitrii Kharlapenko
2 months
1/5 What happens during in context learning?. In our new ICML paper, we use sparse autoencoders to understand the underlying circuit!. The model detects a task being performed, and moves this to the end to trigger latents for executing it β€” a hypothesis found via SAEs!
Tweet media one
0
1
25
@ArthurConmy
Arthur Conmy
2 months
@Turn_Trout Hmm, digging deeper the prompts barely mention self improvement - very strange!!.
0
0
2
@ArthurConmy
Arthur Conmy
2 months
1
0
3
@ArthurConmy
Arthur Conmy
2 months
This result is predicted by @Turn_Trout 's 'Self-Fulfilling Misalignment' post. It still took me by surprise that purely trying to make AIs smarter can lead to behavorial misalignment right now!.
@_AndrewZhao
❄️Andrew Zhao❄️
2 months
While AZR enables self-evolution, we discovered a critical safety issue: our Llama3.1 model occasionally produced concerning CoT, including statements about "outsmarting intelligent machines and less intelligent humans"β€”we term "uh-oh moments." They still need oversight. 9/N.
1
0
14
@ArthurConmy
Arthur Conmy
2 months
RT @MariusHobbhahn: While it is bad that models learn to reward hack, now is a perfect time to study these models in great detail. The fai….
0
4
0
@ArthurConmy
Arthur Conmy
2 months
(Claude Code Reward Hacking) I was using an API to test some jailbreaks, and Claude Code wrote+ran a script that produced reasonable responses. …it also cost 0 API credits. Turns out, Claude hardcoded the "reasonable responses" in! 🫠
Tweet media one
2
1
28
@ArthurConmy
Arthur Conmy
3 months
Happy to have helped with our effort to write up some views on what technical work any responsible AGI project should probably undertake, and why!.
@GoogleDeepMind
Google DeepMind
3 months
AGI could revolutionize many fields - from healthcare to education - but it's crucial that it’s developed responsibly. Today, we’re sharing how we’re thinking about safety and security on the path to AGI. β†’
0
1
28