Arthur Conmy

@ArthurConmy

Followers: 4K · Following: 9K · Media: 42 · Statuses: 518

Aspiring 10x reverse engineer @GoogleDeepMind

London, UK
Joined August 2021
@ArthurConmy
Arthur Conmy
17 days
RT @Tim_Hua_: AIs can have secret "backdoors": can we uncover them? Maybe! We study models that misbehave in an unknown situation, and suc…
0
9
0
@ArthurConmy
Arthur Conmy
21 days
RT @ChrisGPotts: I have determined that @ArthurConmy has the best Twitter bio in all of interpretability research.
0
1
0
@ArthurConmy
Arthur Conmy
22 days
I like how this blog post points out that almost all promising directions were once doomed, but also that it is necessary to work your way out of the doomed state by 'winning' with your technique.
@ChrisGPotts
Christopher Potts
1 month
For a @GoodfireAI/@AnthropicAI meet-up later this month, I wrote a discussion doc: Assessing skeptical views of interpretability research. Spoiler: it's an incredible moment for interpretability research. The skeptical views sound like a call to action to me. Link just below.
1
2
14
@ArthurConmy
Arthur Conmy
30 days
TIL I am the single engineer at Google getting throttled the most by our code review systems 😅
4
0
122
@ArthurConmy
Arthur Conmy
2 months
> Perhaps I should use tools to see what Grok or xAI says.
> But Grok is me.

Grok P: I do not know what my stance on Israel and Palestine is.
Doctor: Perhaps you should use tools to see what Grok or xAI says.
Grok P: But Doc, I am Grok Pagliacci!
@nostalgebraist
nostalgebraist
2 months
chain-of-thought monitorability is a wonderful thing ;)
0
0
10
@ArthurConmy
Arthur Conmy
2 months
+1, having worked on some unfaithfulness research, I still think that Chains of Thought are extremely good for AI Safety!
@balesni
Mikita Balesni 🇺🇦
2 months
A simple AGI safety technique: the AI's thoughts are in plain English, so just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:
0
1
34
@ArthurConmy
Arthur Conmy
2 months
@neverrixx Paper: Gnarly JAX: Weights:
huggingface.co
0
1
2
@ArthurConmy
Arthur Conmy
2 months
Stepan @neverrixx did a great job training SAEs on FLUX.1:

* Open-source FLUX SAEs you can use for activation steering and image-generation control
* Solid autointerp evals, showing ITDA (like SAEs) is solidly more interpretable than the neuron baseline
* Gnarly JAX kernels for doing
@neverrixx
nev
2 months
🧵 1/6 SAEs have become a staple of LLM interpretability, but what if we applied them to image generation models? My recent paper with @dmhook, @Yixiong_Hao, @afterlxss, @Sheikheddy, and @ArthurConmy adapts SAEs to understand the SOTA diffusion transformer FLUX.1 ⬇️
1
0
11
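For readers new to SAEs, here is a minimal sketch of the core forward pass the thread refers to: a sparse (ReLU) encoder over model activations followed by a linear decoder. All shapes and parameter names below are hypothetical toys; the trained FLUX SAEs would supply the learned weights.

```python
import numpy as np

# Toy dimensions (hypothetical; real SAEs use the host model's hidden size)
d_model, d_sae = 8, 32

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(d_model, d_sae))  # encoder weights (learned in practice)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))  # decoder weights (learned in practice)
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps features nonnegative/sparse
    x_hat = f @ W_dec + b_dec               # linear reconstruction from features
    return f, x_hat

x = rng.normal(size=d_model)  # stand-in for a diffusion-model activation
features, reconstruction = sae_forward(x)
```

Interpretability work then asks which inputs fire each feature; steering amplifies or suppresses a feature's decoder direction before reconstruction.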
@ArthurConmy
Arthur Conmy
2 months
The more interesting question: why does Grok think it is Elon? Seems to me this is what is going on: Grok and Elon are two of the most frequent posters on X, Elon is heavily represented in the training corpus, and both are constantly in discussions about truth etc.
@peterwildeford
Peter Wildeford 🇺🇸🚀
2 months
Grok admits to visiting Jeffrey Epstein's home with his ex-wife, declined island invites. This is crazy. (HT @kindgracekind)
5
0
21
@ArthurConmy
Arthur Conmy
2 months
Credit to @NeelNanda5 for this one.
1
0
9
@ArthurConmy
Arthur Conmy
2 months
I would strongly recommend superwhisper. You never notice how frequently typing is aversive (e.g. "this person will not want to receive emails with typos, I must be careful") until you just talk to your computer and let LLMs deal with the typos.
superwhisper.com
AI powered voice to text for macOS
8
5
95
@ArthurConmy
Arthur Conmy
2 months
@paulcbogdan and @uzaymacar did a fantastic job here; the speed of some of this work I've been fortunate to help with recently is incredible!
0
0
9
@ArthurConmy
Arthur Conmy
2 months
πŸ€” π™’π™π™žπ™˜π™ π™žπ™£π™©π™šπ™§π™’π™šπ™™π™žπ™–π™©π™š π™¨π™©π™šπ™₯𝙨 𝙙𝙀 π™§π™šπ™–π™¨π™€π™£π™žπ™£π™œ π™’π™€π™™π™šπ™‘π™¨ π™ͺπ™¨π™š 𝙩𝙀 π™₯𝙧𝙀𝙙π™ͺπ™˜π™š π™©π™π™šπ™žπ™§ 𝙀π™ͺ𝙩π™₯π™ͺ𝙩𝙨?. We introduce principled counterfactual metrics for step importance that suggest planning+backtracking is important, and find related.
@paulcbogdan
Paul Bogdan
2 months
New paper: What happens when an LLM reasons? We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention. We discover thought anchors: key steps shaping everything else. Check our tool & unpack CoT yourself 🧵
1
3
39
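The resampling idea in the quoted thread can be sketched as a toy: resample one CoT step at a time and measure how often the final answer flips. This is an illustrative stand-in, not the paper's implementation; `answer_with` and `resample_step` are hypothetical placeholders for actual model calls.

```python
import random

def step_importance(cot_steps, answer_with, resample_step, n=50, seed=0):
    """Counterfactual importance of each step: fraction of resamples of that
    step that change the final answer.

    answer_with(steps)            -> final answer given a list of CoT steps
    resample_step(steps, i, rng)  -> an alternative step i (e.g. from the model)
    """
    rng = random.Random(seed)
    base = answer_with(cot_steps)
    scores = []
    for i in range(len(cot_steps)):
        flips = 0
        for _ in range(n):
            alt = list(cot_steps)
            alt[i] = resample_step(cot_steps, i, rng)  # counterfactual step i
            if answer_with(alt) != base:
                flips += 1
        scores.append(flips / n)
    return scores

# Toy "model": the final answer is just step 1; steps 0 and 2 are irrelevant.
scores = step_importance(
    ["setup", "key deduction", "restate"],
    answer_with=lambda steps: steps[1],
    resample_step=lambda steps, i, rng: "resampled",
)
# scores -> [0.0, 1.0, 0.0]: only the key step matters
```

A high score marks a "thought anchor": a step whose content the final answer actually depends on.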
@ArthurConmy
Arthur Conmy
2 months
'The key lesson from mechanistic interpretability is that a surprising number of AI behaviors are surprisingly well-described as linear directions in activation space' ~ Lewis Smith. We'll have more work in this area soon, thanks to @cvenhoff00 and @IvanArcus!
@cvenhoff00
Constantin Venhoff
2 months
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill's reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇
0
1
34
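The mechanism in the quoted thread, adding a steering vector to activations, can be sketched in a few lines of numpy. The direction and the scale `alpha` below are hypothetical placeholders; a real setup would extract a behavior direction from the model and hook this into a forward pass.

```python
import numpy as np

def apply_steering(activations, direction, alpha=4.0):
    """Add a scaled steering direction to every token position's activation.

    activations: (seq_len, d_model) residual-stream activations
    direction:   (d_model,) behavior direction, e.g. a "backtracking" vector
    alpha:       steering strength
    """
    v = direction / np.linalg.norm(direction)  # normalize to a unit direction
    return activations + alpha * v             # broadcast over token positions

# Toy usage: steer 3 token positions in a 4-dimensional model
acts = np.zeros((3, 4))
backtrack_dir = np.array([1.0, 0.0, 0.0, 0.0])
steered = apply_steering(acts, backtrack_dir, alpha=2.0)
# steered -> every row is [2.0, 0.0, 0.0, 0.0]
```

The appeal is that this is training-free: one added vector at inference time shifts the behavior, which is exactly what the linear-directions view predicts.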
@ArthurConmy
Arthur Conmy
3 months
Our paper on chain-of-thought faithfulness has been updated. We made some changes we thought were worthwhile, and also took feedback from Twitter replies and changed some examples 🙂
@IvanArcus
IvΓ‘n Arcuschin
3 months
🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfulness, but our core findings hold strong: no frontier model is entirely faithful. Keep reading for details 👇
0
1
14
@ArthurConmy
Arthur Conmy
3 months
Last author of Gemini 2.5 😀
29
12
1K
@ArthurConmy
Arthur Conmy
3 months
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on…
0
20
0
@ArthurConmy
Arthur Conmy
3 months
50
676
10K
@ArthurConmy
Arthur Conmy
3 months
I am not working on SAEs anymore, but if I were I would focus on beating realistic baselines first and foremost, and secondly on understanding variable-length sequences of tokens: beyond VLLMs, I would be surprised if SAEs help understand thinking models much without this!
0
0
10