
Arthur Conmy
@ArthurConmy
Followers: 4K · Following: 8K · Media: 41 · Statuses: 503
Aspiring 10x reverse engineer @GoogleDeepMind
London, UK
Joined August 2021
@paulcbogdan and @uzaymacar did a fantastic job here; the speed of some of this work I've been fortunate to help with recently is incredible!
0
0
8
Which intermediate steps do reasoning models use to produce their outputs? We introduce principled counterfactual metrics for step importance that suggest planning+backtracking is important, and find related…
New paper: What happens when an LLM reasons? We created methods to interpret reasoning steps & their connections: resampling CoT, attention analysis, & suppressing attention. We discover thought anchors: key steps shaping everything else. Check our tool & unpack CoT yourself 🧵
1
2
37
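For readers wondering what a "principled counterfactual metric for step importance" can look like in practice, here is a minimal sketch of the resampling idea, assuming a hypothetical `model.sample_answer` helper (my illustration, not the paper's code):

```python
def step_importance(model, question, cot_steps, i, n_samples=32):
    """Estimate the counterfactual importance of CoT step i by
    resampling answers from prefixes with and without that step.
    `model.sample_answer` is a hypothetical helper that generates a
    continuation from a prompt and extracts the final answer."""
    with_step = question + "".join(cot_steps[: i + 1])
    without_step = question + "".join(cot_steps[:i])

    answers_with = [model.sample_answer(with_step) for _ in range(n_samples)]
    answers_without = [model.sample_answer(without_step) for _ in range(n_samples)]

    # One possible counterfactual metric: how much does dropping the
    # step shift the probability of the modal answer?
    modal = max(set(answers_with), key=answers_with.count)
    p_with = answers_with.count(modal) / n_samples
    p_without = answers_without.count(modal) / n_samples
    return p_with - p_without
```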
'The key lesson from mechanistic interpretability is that a surprising number of AI behaviors are surprisingly well-described as linear directions in activation space.' ~ Lewis Smith. We'll have more work in this area soon, thanks to @cvenhoff00 and @IvanArcus!
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! We show how to steer DeepSeek-R1-Distill's reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵
0
0
33
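The core trick is simple enough to sketch: add a fixed vector to the residual stream during the forward pass. A minimal PyTorch version, assuming a GPT-2-style module layout (`model.transformer.h`) and an illustrative layer index and scale, not the paper's actual settings:

```python
import torch

def add_steering_hook(model, layer_idx, steering_vector, scale=4.0):
    """Register a forward hook that adds `scale * steering_vector` to
    the residual-stream output of one transformer block.
    `model.transformer.h` is a GPT-2-style module path; adjust it for
    the model you are actually steering."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return model.transformer.h[layer_idx].register_forward_hook(hook)

# Usage sketch: steer, generate, then clean up.
# handle = add_steering_hook(model, layer_idx=10, steering_vector=vec)
# out = model.generate(**inputs)
# handle.remove()
```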
Our paper on chain-of-thought faithfulness has been updated. We made some changes we thought were worthwhile, and also took feedback from Twitter replies and changed some examples.
🧵 NEW: We updated our research on unfaithful AI reasoning! We have a stronger dataset which yields lower rates of unfaithfulness, but our core findings hold strong: no frontier model is entirely faithful. Keep reading for details.
0
0
13
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on…
0
18
0
Achyuta did great work exploring linear representations in VLLMs. 3 thoughts: 1. Neurons are almost as interpretable as SAE latents. 2. A key limitation of SAEs in vision is that vision is less comprehensible at the token-position level; this made it difficult to interpret…
🧵 Can we understand vision language models by interpreting linear directions in their latents? Yes! In our new paper, Line of Sight, we use probing, steering, and SAEs as useful tools to interpret image representations within VLLMs.
1
3
28
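As a flavor of the probing half of that toolkit, a hedged sketch: fit a linear classifier on cached image-token activations and read off a candidate direction. The activation and label files here are assumed placeholders, not the paper's pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed precomputed: activations at image-token positions and binary
# concept labels (these file names are placeholders).
acts = np.load("image_token_acts.npy")     # (n_images, d_model)
labels = np.load("concept_labels.npy")     # (n_images,)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("probe accuracy:", probe.score(acts, labels))

# The probe's weight vector is a candidate linear direction for the
# concept; the normalized version can double as a steering direction.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```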
Our high-level finding in the Gemma Scope paper was that transcoders were slightly Pareto-worse than SAEs. But all the weights are on HuggingFace if you want to look further into transcoders! They have other benefits that SAEs do not have.
Fantastic to see Anthropic, in collaboration with @neuronpedia, creating open source tools for studying circuits with transcoders. There's a lot of interesting work to be done. I'm also very glad someone finally found a use for our Gemma Scope transcoders! Credit to @ArthurConmy.
0
1
21
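To make the SAE/transcoder distinction concrete: an SAE reconstructs the same activation it reads, while a transcoder reads an MLP's input and predicts that MLP's output, which lets it stand in for the MLP during circuit analysis. A minimal sketch with illustrative dimensions (not the Gemma Scope configuration):

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sparse transcoder: encodes the MLP's input and decodes an
    approximation of the MLP's *output*. An SAE would instead
    reconstruct the same activation it encodes. Dimensions here are
    illustrative, not the Gemma Scope configuration."""
    def __init__(self, d_model=2048, d_latent=16384):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, mlp_input):
        latents = torch.relu(self.enc(mlp_input))   # sparse feature activations
        return self.dec(latents), latents           # predicted MLP output

tc = Transcoder()
x = torch.randn(1, 8, 2048)        # (batch, seq, d_model)
mlp_out_hat, latents = tc(x)       # stands in for the MLP in a circuit trace
```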
Our circuits paper led by @dmhook and @neverrixx was accepted at ICML! The task seems like a good one to study if you work on circuits.
1/5 What happens during in-context learning? In our new ICML paper, we use sparse autoencoders to understand the underlying circuit! The model detects a task being performed, and moves this to the end to trigger latents for executing it, a hypothesis found via SAEs!
0
1
25
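The kind of SAE-driven hypothesis hunting behind that finding can be sketched in a few lines: encode the residual stream at the last token of an in-context-learning prompt and inspect the top latents. The `sae.encode` interface is an assumption, not a specific library's API:

```python
import torch

def top_latents_at_last_token(sae, resid_acts, k=10):
    """resid_acts: (seq, d_model) residual-stream activations for one
    ICL prompt. Returns the k most active SAE latents at the final
    position. `sae.encode` is an assumed interface returning a
    (d_latent,) vector of latent activations."""
    latents = sae.encode(resid_acts[-1])
    vals, idxs = torch.topk(latents, k)
    return list(zip(idxs.tolist(), vals.tolist()))
```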
This result is predicted by @Turn_Trout's 'Self-Fulfilling Misalignment' post. It still took me by surprise that purely trying to make AIs smarter can lead to behavioral misalignment right now!
While AZR enables self-evolution, we discovered a critical safety issue: our Llama3.1 model occasionally produced concerning CoT, including statements about "outsmarting intelligent machines and less intelligent humans", which we term "uh-oh moments." They still need oversight. 9/N
1
0
14
RT @MariusHobbhahn: While it is bad that models learn to reward hack, now is a perfect time to study these models in great detail. The fai…
0
4
0
Happy to have helped with our effort to write up some views on what technical work any responsible AGI project should probably undertake, and why!
AGI could revolutionize many fields, from healthcare to education, but it's crucial that it's developed responsibly. Today, we're sharing how we're thinking about safety and security on the path to AGI.
0
1
28