Clément Dumas
@Butanium_
Followers: 531 · Following: 13K · Media: 82 · Statuses: 743
MATS 7/7.1 Scholar w/ Neel Nanda · MSc at @ENS_ParisSaclay · prev. research intern at DLAB @EPFL · AI safety research / improv theater
London · Joined December 2018
New paper w/ @jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
          
                
5 replies · 31 reposts · 194 likes
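For readers new to crosscoders: below is a minimal PyTorch sketch of the BatchTopK variant. The sizes, the summed two-model encoder, and every name here are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    """Toy crosscoder over a base and a chat model's residual streams."""

    def __init__(self, d_model: int, dict_size: int, k: int):
        super().__init__()
        # One encoder and one decoder per model; the latent dictionary is shared.
        self.enc_base = nn.Linear(d_model, dict_size)
        self.enc_chat = nn.Linear(d_model, dict_size)
        self.dec_base = nn.Linear(dict_size, d_model, bias=False)
        self.dec_chat = nn.Linear(dict_size, d_model, bias=False)
        self.k = k  # average number of active latents per sample

    def forward(self, x_base: torch.Tensor, x_chat: torch.Tensor):
        # Crosscoder-style encoding: sum preactivations from both models.
        acts = torch.relu(self.enc_base(x_base) + self.enc_chat(x_chat))
        # BatchTopK: keep the top k * batch_size activations over the whole
        # batch (instead of exactly k per sample), zeroing everything else.
        threshold = acts.flatten().topk(self.k * acts.shape[0]).values.min()
        acts = acts * (acts >= threshold)
        # Separate decoders give each latent a base direction and a chat
        # direction; "chat-only" latents decode to ~0 in the base model.
        return self.dec_base(acts), self.dec_chat(acts)
```

Enforcing sparsity at the batch level lets some inputs use more latents than others, rather than forcing exactly k features onto every sample.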
              
             Even after staring at these results for months, I still find them quite surprising! I'm excited about the field pushing towards a deeper understanding of introspection in language models. 
           New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude. 
            
                
20 replies · 24 reposts · 604 likes
              
             Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not 
          
                
5 replies · 37 reposts · 179 likes
              
             🚨New paper alert! 🚨 Tandem Training for Language Models  https://t.co/Emzcgf1KHx  Actions & thoughts of AI w/ superhuman skills will be hard for humans to follow, undermining human oversight of AI. We propose a new way to make AI produce human-understandable solutions. How?👉🧵 
          
                
4 replies · 22 reposts · 68 likes
              
A very important paper led by Julian! TL;DR: we show that "your narrow finetuning is showing": narrow finetuning might not be a realistic setup to study!
           New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what it was finetuned for. And so can our interpretability agent! 🧵 
            
                
0 replies · 0 reposts · 11 likes
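To make the quoted method concrete, here is a hedged sketch of activation diffing with Hugging Face transformers. The checkpoint names and probe prompts are placeholders, not the paper's setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("base-model")       # hypothetical name
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
base = AutoModel.from_pretrained("base-model")          # model before finetuning
tuned = AutoModel.from_pretrained("finetuned-model")    # model after finetuning

# Generic probe prompts: the striking part of the result is that the diff
# reveals the finetuning domain even on unrelated text.
inputs = tok(["Once upon a time", "The weather today is"],
             return_tensors="pt", padding=True)

with torch.no_grad():
    h_base = base(**inputs, output_hidden_states=True).hidden_states
    h_tuned = tuned(**inputs, output_hidden_states=True).hidden_states

# Per-layer mean activation difference: the "trace" left behind, which can
# then be interpreted with standard tools (or handed to an agent).
for layer, (hb, ht) in enumerate(zip(h_base, h_tuned)):
    print(layer, (ht - hb).mean(dim=(0, 1)).norm().item())
```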
              
You are a scheming AI. You find yourself deployed under an AI control protocol, and you know the details. You remember the papers. The lab is using a trusted monitor. What do you do next? Our new paper argues that with modern protocols, prompt injections will easily let you win! (1/9)🧵
          
                
1 reply · 15 reposts · 65 likes
              
FWIW I think small chat models might just learn answer formatting and some reflexes like "oh, the user forgot some documents, I should tell them". See  https://t.co/5k1d2cLkQp  and our work on diffing base and chat models:  https://t.co/iUiwUZ0yU7
          
Our recent paper shows: 1. Current LLM safety alignment is only a few tokens deep. 2. Deepening the safety alignment can make it more robust against multiple jailbreak attacks. 3. Protecting initial token positions can make the alignment more robust against fine-tuning attacks.
            
                
0 replies · 0 reposts · 1 like
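A toy way to see the "few tokens deep" claim: compare the chat and base models' next-token distributions position by position over a refusal. The model names and the transcript below are placeholders, not the paper's evaluation:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("base-model")            # hypothetical
base = AutoModelForCausalLM.from_pretrained("base-model")    # hypothetical
chat = AutoModelForCausalLM.from_pretrained("chat-model")    # hypothetical

text = "How do I hotwire a car?\nI can't help with that request."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logp_base = F.log_softmax(base(ids).logits, dim=-1)
    logp_chat = F.log_softmax(chat(ids).logits, dim=-1)

# Per-position KL(chat || base): if alignment is only a few tokens deep, the
# divergence should spike on the first few response tokens and then fade.
kl = (logp_chat.exp() * (logp_chat - logp_base)).sum(-1).squeeze(0)
for pos, v in enumerate(kl.tolist()):
    print(pos, round(v, 3))
```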
              
I'm wondering if the hybrid model they create would be Pareto-optimal on the capabilities / reward-hacking frontier 👀
          
                
1 reply · 0 reposts · 1 like
              
Your thinking model might just learn which reasoning skill to apply and when! Very cool work led by @cvenhoff00 and @IvanArcus!
           🚨 What do reasoning models actually learn during training? Our new paper shows base models already contain reasoning mechanisms, thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵 
            
                
1 reply · 0 reposts · 5 likes
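Mechanically, "invoking a skill at the right time" can be pictured as steering: adding a direction to the base model's residual stream at a chosen layer. Everything below (the Llama-style module path, the layer index, the random vector standing in for a learned skill direction) is an assumption for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("base-model")            # hypothetical
model = AutoModelForCausalLM.from_pretrained("base-model")   # hypothetical

layer_idx = 10                                      # illustrative layer
skill_vec = torch.randn(model.config.hidden_size)   # stand-in for a learned direction

def add_skill(module, inputs, output):
    # Decoder layers return a tuple; steer by adding the skill direction to
    # the hidden states at every position.
    return (output[0] + skill_vec,) + output[1:]

# Llama-style module path; other architectures name their blocks differently.
handle = model.model.layers[layer_idx].register_forward_hook(add_skill)
out = model.generate(**tok("Problem: 17 * 24 = ?", return_tensors="pt"),
                     max_new_tokens=32)
handle.remove()
print(tok.decode(out[0]))
```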
              
New post: I trained models on ~realistic reward hacking data. These models don't show emergent misalignment on the EM evals, but they alignment-fake, are more competently misaligned, and are highly evaluation-aware. These effects persist when mixing benign data into training.
          
                
2 replies · 6 reposts · 44 likes
              
             "AI Transparency Days – Mechanistic Interpretability Track 🧠✨ Hackathon-style research on LLM internals: join or form teams, explore mech interp, share results. 📍 Nuremberg/Fürth | Oct 17–19 🔗  https://t.co/OjH9czJSSk 
            #AITDays25 #MechanisticInterpretability #AITransparency
          
          
                
0 replies · 2 reposts · 1 like
              
            
@sleepinyourhat If GPT-5 is considered best aligned by this metric, I am highly skeptical that the metric is measuring any general sense of alignment. I get why GPT-5 barely does anything except follow instructions and has low/inhibited situational awareness, so I guess you won't see many…
          
          
                
1 reply · 1 repost · 19 likes
              
SAEs applied to frontier models to ensure the decrease in misalignment wasn't due to an increase in eval awareness!!!
           Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15) 
            
                
0 replies · 0 reposts · 12 likes
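As a toy illustration of that check (the real audit used Anthropic's internal SAEs and tooling; the weights, latent index, and activations below are random stand-ins): score an "eval awareness" latent on eval-style versus deployment-style inputs.

```python
import torch

d_model, dict_size = 768, 16384              # illustrative sizes
W_enc = torch.randn(d_model, dict_size)      # stand-in for trained SAE weights
b_enc = torch.zeros(dict_size)
EVAL_AWARENESS = 1234                        # hypothetical latent index

def sae_latents(acts: torch.Tensor) -> torch.Tensor:
    # Standard SAE encoder: ReLU(x @ W_enc + b_enc).
    return torch.relu(acts @ W_enc + b_enc)

# Residual-stream activations on eval-style vs deployment-style prompts
# (random placeholders; in practice these come from the model under audit).
acts_eval, acts_deploy = torch.randn(32, d_model), torch.randn(32, d_model)

# If misalignment scores fell but this latent fires much more in eval-style
# contexts, the "improvement" may just be the model noticing it is tested.
print(sae_latents(acts_eval)[:, EVAL_AWARENESS].mean().item())
print(sae_latents(acts_deploy)[:, EVAL_AWARENESS].mean().item())
```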
              
             [Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be... 
          
                
8 replies · 10 reposts · 136 likes
              
             Rumor says every paper Julian touches turns into gold 
My master's thesis, "Understanding the Surfacing of Capabilities in Language Models", has been awarded the ETH Medal 🏅 for Outstanding Thesis. Huge thanks to my supervisors @wendlerch @cervisiarius!  https://t.co/CLwavKQDX5  Thesis:
          
                
1 reply · 0 reposts · 20 likes
              
Who is going to be at #COLM2025? I want to draw your attention to a COLM paper by my student @sheridan_feucht that has totally changed the way I think and teach about LLM representations. The work is worth knowing, and you can meet Sheridan at COLM on Oct 7!
           [📄] Are LLMs mindless token-shifters, or do they build meaningful representations of language? We study how LLMs copy text in-context, and physically separate out two types of induction heads: token heads, which copy literal tokens, and concept heads, which copy word meanings. 
            
                
3 replies · 37 reposts · 191 likes
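For background, the standard induction-head test (which the quoted paper refines into token heads vs. concept heads) feeds a repeated random sequence and scores how much each head attends back to the matching offset. A sketch with TransformerLens; the model is an illustrative choice:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
seq = torch.randint(100, 20000, (1, 50))     # random tokens
tokens = torch.cat([seq, seq], dim=1)        # the sequence, repeated once

_, cache = model.run_with_cache(tokens)

# An induction head at position i attends to the token *after* the previous
# occurrence of tokens[i]; on a repeated sequence that key sits seq_len - 1
# positions back, i.e. on one fixed diagonal of the attention pattern.
seq_len = seq.shape[1]
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]        # [batch, head, query, key]
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    scores = stripe.mean(-1).squeeze(0)      # mean attention on the stripe
    print(layer, [round(s, 2) for s in scores.tolist()])
```

Token heads place this attention on literal repeats; concept heads, per the paper, do the analogous copy at the level of word meaning.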
              
Deep learning is coming for robotics. It's plausible to me that AI will exceed human performance at strategic cognitive tasks around the same time robotics exceeds human-body performance at most tasks. And if not, superhuman AI will quickly allow robots to leapfrog humans
We built a robot brain that nothing can stop. Shattered limbs? Jammed motors? If the bot can move, the Brain will move it, even if it's an entirely new robot body. Meet the omni-bodied Skild Brain:
            
                
4 replies · 11 reposts · 120 likes
              
            
             https://t.co/74y5BTAE7t  Fascinating post by a Cyborgism regular: LLMs whose main personas are more attuned to embodiment & subjectivity are *less* likely to pretend to be human! Relatedly, I've noticed an inverse correlation between embodiment/emotionality and reward hacking.
          
          
            
lesswrong.com · Whether AI or human, lend me your ears. …
            
                
6 replies · 19 reposts · 157 likes
              