
Arthur Conmy
@ArthurConmy
Followers
4K
Following
9K
Media
42
Statuses
518
Aspiring 10x reverse engineer @GoogleDeepMind
London, UK
Joined August 2021
RT @Tim_Hua_: AIs can have secret "backdoors": Can we uncover them? Maybe! We study models that misbehave in an unknown situation, and suc…
0
9
0
RT @ChrisGPotts: I have determined that @ArthurConmy has the best Twitter bio in all of interpretability research.
0
1
0
I like how this blog post points out that almost all promising directions were once doomed, but also that it is necessary to work yourself out from the doomed state by 'winning' with your technique.
For a @GoodfireAI/@AnthropicAI meet-up later this month, I wrote a discussion doc: Assessing skeptical views of interpretability research. Spoiler: it's an incredible moment for interpretability research. The skeptical views sound like a call to action to me. Link just below.
1
2
14
TIL I am the single engineer at Google getting throttled the most by our code review systems
4
0
122
> Perhaps I should use tools to see what Grok or xAI says.
> But Grok is me.
Grok P: I do not know what my stance on Israel and Palestine is.
Doctor: Perhaps you should use tools to see what Grok or xAI says.
Grok P: But Doc, I am Grok Pagliacci!
0
0
10
+1, having worked on some unfaithfulness research, I still think that Chains of Thought are extremely good for AI Safety!
A simple AGI safety technique: AI's thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:
0
1
34
Stepan @neverrixx did a great job training SAEs on FLUX.1:
* Open-source FLUX SAEs you can use for activation steering / image generation control
* Solid autointerp evals, showing ITDA (like SAEs) is solidly more interpretable than the neuron baseline
* Gnarly JAX kernels for doing
🧵 1/6 SAEs have become a staple of LLM interpretability, but what if we applied them to image generation models? My recent paper with @dmhook, @Yixiong_Hao, @afterlxss, @Sheikheddy, and @ArthurConmy adapts SAEs to understand the SOTA diffusion transformer FLUX.1 ⬇️
1
0
11
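To make the FLUX SAE posts above concrete, here is a minimal, hypothetical sketch of SAE-based activation steering: encode an activation into sparse features, boost one feature, and decode back into the model's residual stream. The sizes, random weights, and function names are toy placeholders of mine, not the released FLUX.1 SAEs or their API.

# Minimal sketch (not the released code): steering a generative model's
# activations with a sparse autoencoder (SAE). Weights here are random toys.
import torch

d_model, d_sae = 64, 512                     # toy sizes; real models are far larger
W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5
W_dec = torch.randn(d_sae, d_model) / d_sae ** 0.5
b_enc = torch.zeros(d_sae)

def sae_encode(act):
    # ReLU produces a sparse, non-negative feature vector
    return torch.relu(act @ W_enc + b_enc)

def sae_decode(feats):
    return feats @ W_dec

def steer(act, feature_idx, scale=5.0):
    # Boost one interpretable feature, then reconstruct the activation
    feats = sae_encode(act)
    feats[..., feature_idx] += scale
    return sae_decode(feats)

act = torch.randn(d_model)                   # stand-in for one FLUX.1 activation
print(steer(act, feature_idx=3).shape)       # torch.Size([64])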
The more interesting question: why does Grok think it is Elon? Seems to me this is what is going on: Grok and Elon are two of the most frequent posters on X, Elon is heavily represented in the training corpus, and both are constantly in discussions about truth etc.
Grok admits to visiting Jeffrey Epstein's home with his ex-wife, declined island invites. This is crazy. (HT @kindgracekind)
5
0
21
I would strongly recommend this. You never notice how frequently typing is aversive (e.g. this person will not want to receive emails with typos, must be careful) until you just talk to your computer and let LLMs deal with typos.
superwhisper.com
AI powered voice to text for macOS
8
5
95
@paulcbogdan and @uzaymacar did a fantastic job here; the speed of some of this work I've been fortunate to help with recently is incredible!
0
0
9
🤔 Which intermediate steps do reasoning models use to produce their outputs? We introduce principled counterfactual metrics for step importance that suggest planning+backtracking is important, and find related
New paper: What happens when an LLM reasons? We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention. We discover thought anchors: key steps shaping everything else. Check our tool & unpack CoT yourself 🧵
1
3
39
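As a rough, hypothetical illustration of the "counterfactual metrics for step importance" idea in the thread quoted above: resample the final answer with and without a given chain-of-thought step in the prefix and compare the answer distributions. The sample_continuation helper below is an invented stand-in; a real implementation would query the model.

# Toy sketch of counterfactual step importance; not the paper's method.
import random

def sample_continuation(prefix_steps):
    # Stand-in for sampling the model from a CoT prefix and reading its answer;
    # here a made-up planning step biases the outcome toward answer "A".
    if "plan: factor the equation" in prefix_steps:
        return "A"
    return "A" if random.random() < 0.3 else "B"

def step_importance(steps, step_idx, n_samples=200):
    # Compare answer distributions with vs. without the step in the prefix
    with_step = [sample_continuation(steps[: step_idx + 1]) for _ in range(n_samples)]
    without_step = [sample_continuation(steps[:step_idx]) for _ in range(n_samples)]
    p_with = with_step.count("A") / n_samples
    p_without = without_step.count("A") / n_samples
    return abs(p_with - p_without)           # large gap => the step anchors the outcome

cot = ["restate the problem", "plan: factor the equation", "compute roots"]
print(step_importance(cot, step_idx=1))      # roughly 0.7 in this toy setup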
'The key lesson from mechanistic interpretability is that a surprising number of AI behaviors are surprisingly well-described as linear directions in activation space.' ~Lewis Smith. We'll have more work in this area soon, thanks to @cvenhoff00 and @IvanArcus!!
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! We show how to steer DeepSeek-R1-Distill's reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵
0
1
34
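For anyone wondering what "adding steering vectors to its activations" looks like mechanically, here is a minimal sketch using a toy PyTorch module and a random vector; the real work uses DeepSeek-R1-Distill and vectors derived from its activations, which this is not.

# Toy sketch of activation steering via a forward hook; the model and the
# steering vector are placeholders, not the paper's setup.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
steering_vector = torch.randn(16)            # in practice, e.g. a difference of mean activations

def add_steering(module, inputs, output, alpha=4.0):
    # Returning a value from a forward hook replaces that layer's output
    return output + alpha * steering_vector

handle = model[0].register_forward_hook(add_steering)   # steer the first layer's output
print(model(torch.randn(2, 16)).shape)       # torch.Size([2, 16])
handle.remove()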
Our paper on chain of thought faithfulness has been updated. We made some changes we thought were worth it, and also took feedback from Twitter replies and changed some examples.
🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfulness, but our core findings hold strong: no frontier model is entirely faithful. Keep reading for details.
0
1
14
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on…
0
20
0
I am not working on SAEs anymore, but if I were, I would focus on beating realistic baselines first and foremost, and secondly on understanding variable-length sequences of tokens: beyond VLLMs, I would be surprised if SAEs help understand thinking models much without this!
0
0
10