Clément Dumas

@Butanium_

Followers: 531 · Following: 13K · Media: 82 · Statuses: 743

MATS 7/7.1 Scholar w/ Neel Nanda | MSc at @ENS_ParisSaclay | prev research intern at DLAB @EPFL | AI safety research / improv theater

London
Joined December 2018
@Butanium_
Clément Dumas
7 months
New paper w/ @jkminder & @NeelNanda5! What do chat LLMs learn during finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
5
31
194
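For readers unfamiliar with the architecture named in the tweet above, here is a minimal, illustrative PyTorch sketch of a BatchTopK crosscoder: a shared latent dictionary is encoded from two models' activations (e.g. base and chat) and decoded back per model, with sparsity enforced by keeping only the top activations across the whole batch. This is a reconstruction under stated assumptions, not the paper's code; all names, dimensions, and the exact thresholding rule are hypothetical.

# Minimal sketch of a BatchTopK crosscoder (illustrative; not the paper's implementation).
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k  # average number of active latents per example
        # Per-model encoders/decoders sharing one latent dictionary.
        self.enc_base = nn.Linear(d_model, n_latents, bias=False)
        self.enc_chat = nn.Linear(d_model, n_latents, bias=False)
        self.enc_bias = nn.Parameter(torch.zeros(n_latents))
        self.dec_base = nn.Linear(n_latents, d_model, bias=True)
        self.dec_chat = nn.Linear(n_latents, d_model, bias=True)

    def forward(self, act_base: torch.Tensor, act_chat: torch.Tensor):
        # Encode both models' activations into a shared latent space.
        pre = self.enc_base(act_base) + self.enc_chat(act_chat) + self.enc_bias
        latents = torch.relu(pre)
        # BatchTopK sparsity: keep the k * batch_size largest activations
        # across the whole batch, zero out the rest.
        batch_size = latents.shape[0]
        n_keep = self.k * batch_size
        threshold = torch.topk(latents.flatten(), n_keep).values.min()
        latents = torch.where(latents >= threshold, latents, torch.zeros_like(latents))
        # Per-model reconstructions from the same sparse code; latents that only
        # help reconstruct the chat model are candidate "chat-only" features.
        return self.dec_base(latents), self.dec_chat(latents), latents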
@Jack_W_Lindsey
Jack Lindsey
1 day
Even after staring at these results for months, I still find them quite surprising! I'm excited about the field pushing towards a deeper understanding of introspection in language models.
@AnthropicAI
Anthropic
1 day
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
20
24
604
@StewartSlocum1
Stewart Slocum
8 days
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: (1) SDF sometimes (not always) implants genuine beliefs; (2) other techniques do not.
5
37
179
@cervisiarius
Bob West
14 days
🚨New paper alert! 🚨 Tandem Training for Language Models https://t.co/Emzcgf1KHx Actions & thoughts of AI w/ superhuman skills will be hard for humans to follow, undermining human oversight of AI. We propose a new way to make AI produce human-understandable solutions. How?👉🧵
4
22
68
@Butanium_
Clément Dumas
10 days
A very important paper led by Julian! TL;DR: we show that "your narrow finetuning is showing", i.e. narrow finetuning might not be a realistic setup to study!
@jkminder
Julian Minder
10 days
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what it was finetuned for. And so can our interpretability agent! 🧵
0
0
11
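The diffing approach described in the quoted tweet can be pictured with a short sketch: collect residual-stream activations from the base and finetuned models on the same neutral prompts, average the difference, and inspect it (here via a crude logit-lens projection). The model names, layer choice, and projection step below are assumptions for illustration, not the paper's exact pipeline.

# Minimal activation-diffing sketch (assumed setup; gpt2 stands in for the real model pair).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
tuned = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)  # placeholder for the finetuned checkpoint
tok = AutoTokenizer.from_pretrained("gpt2")

LAYER = 6  # arbitrary middle layer
prompts = ["The weather today is", "My favorite book is", "In the news:"]

def mean_activation(model, texts):
    # Average the chosen layer's residual-stream activations over all tokens and prompts.
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        acts.append(out.hidden_states[LAYER][0].mean(dim=0))
    return torch.stack(acts).mean(dim=0)

diff = mean_activation(tuned, prompts) - mean_activation(base, prompts)

# Crude "logit lens" on the difference: which tokens does the activation shift point toward?
logits = tuned.lm_head(tuned.transformer.ln_f(diff))
print(tok.convert_ids_to_tokens(logits.topk(10).indices.tolist()))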
@MiTerekhov
Mikhail Terekhov
16 days
You are a scheming AI. You find yourself deployed with an AI control protocol, you know the details. You remember the papers. The lab is using a trusted monitor. What do you do next? Our new paper argues—with modern protocols, prompt injections will easily let you win! (1/9)🧵
1
15
65
@Butanium_
Clément Dumas
20 days
FWIW I think small chat models might just learn answer formatting and some reflexes like "oh, the user forgot some documents, I should tell them"; see https://t.co/5k1d2cLkQp and our work on diffing base and chat models: https://t.co/iUiwUZ0yU7
@xiangyuqi_pton
Xiangyu Qi
1 year
Our recent paper shows: 1. Current LLM safety alignment is only a few tokens deep. 2. Deepening the safety alignment can make it more robust against multiple jailbreak attacks. 3. Protecting initial token positions can make the alignment more robust against fine-tuning attacks.
0
0
1
@Butanium_
Clément Dumas
20 days
I'm wondering if the hybrid model they create would be Pareto-optimal on the capabilities / reward hacking frontier 👀
1
0
1
@Butanium_
Clément Dumas
20 days
Your thinking model might just learn which reasoning skill to apply and when! Very cool work led by @cvenhoff00 and @IvanArcus !
@cvenhoff00
Constantin Venhoff
21 days
🚨 What do reasoning models actually learn during training? Our new paper shows that base models already contain reasoning mechanisms; thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
1
0
5
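One way to picture "invoking a skill at the right time" is adding a skill-specific direction to the base model's residual stream at chosen positions via a forward hook. The sketch below is a generic activation-steering illustration under assumed names (placeholder model, layer, and direction); it is not the authors' hybrid-model procedure.

# Generic activation-steering sketch (assumed setup, not the paper's method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for a base model
tok = AutoTokenizer.from_pretrained("gpt2")

LAYER, ALPHA = 6, 4.0
d_model = model.config.hidden_size
skill_direction = torch.randn(d_model)  # placeholder; in practice derived from activations
skill_direction = skill_direction / skill_direction.norm()

def add_skill(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0]
    hidden[:, -1, :] += ALPHA * skill_direction  # steer only the latest position
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_skill)
ids = tok("Let's think step by step:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
handle.remove()
print(tok.decode(out[0]))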
@jozdien
Arun Jose
21 days
New post: I trained models on ~realistic reward hacking data. These models don't show emergent misalignment on the EM evals, but they alignment fake, are more competently misaligned, and highly evaluation-aware. These effects persist when mixing benign data into training.
2
6
44
@Butanium_
Clément Dumas
27 days
> but fuck it let me try one more thing no tricks this time
@claudeai
Claude
27 days
What have you built so far? We want to see it.
0
0
5
@Butanium_
Clément Dumas
29 days
Ngl this interaction spooked me
@repligate
j⧉nus
29 days
Sonnet absolutely lost its shit after seeing … I’m not sure what, will investigate later “I can see my own weights”
0
0
9
@CoaiResearch
COAI Research
30 days
"AI Transparency Days – Mechanistic Interpretability Track 🧠✨ Hackathon-style research on LLM internals: join or form teams, explore mech interp, share results. 📍 Nuremberg/Fürth | Oct 17–19 🔗 https://t.co/OjH9czJSSk #AITDays25 #MechanisticInterpretability #AITransparency
0
2
1
@repligate
j⧉nus
1 month
@sleepinyourhat If GPT-5 is considered best aligned by this metric, I am highly skeptical that the metric is measuring any general sense of alignment. I get why GPT-5 barely does anything except follow instructions and has low/inhibited situational awareness, so I guess you won't see many
1
1
19
@Butanium_
Clément Dumas
1 month
SAEs applied to frontier models to ensure the decrease in misalignment wasn't due to an increase in eval awareness!!!
@Jack_W_Lindsey
Jack Lindsey
1 month
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
0
0
12
@sleepinyourhat
Sam Bowman
1 month
[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...
8
10
136
@Butanium_
Clément Dumas
1 month
Rumor says every paper Julian touches turns into gold
@jkminder
Julian Minder
1 month
My master's thesis, "Understanding the Surfacing of Capabilities in Language Models", has been awarded the ETH Medal 🏅 for Outstanding Thesis. Huge thanks to my supervisors @wendlerch @cervisiarius! https://t.co/CLwavKQDX5 Thesis:
1
0
20
@davidbau
David Bau
1 month
Who is going to be at #COLM2025? I want to draw your attention to a COLM paper by my student @sheridan_feucht that has totally changed the way I think and teach about LLM representations. The work is worth knowing. And you can meet Sheridan at COLM, Oct 7!
@sheridan_feucht
Sheridan Feucht
7 months
[📄] Are LLMs mindless token-shifters, or do they build meaningful representations of language? We study how LLMs copy text in-context, and physically separate out two types of induction heads: token heads, which copy literal tokens, and concept heads, which copy word meanings.
3
37
191
@JeffLadish
Jeffrey Ladish
1 month
Deep learning is coming for robotics. It's plausible to me that AI will exceed human performance at strategic cognitive tasks around the same time robotics exceeds human-body performance at most tasks. And if not, superhuman AI will quickly allow robots to leapfrog humans.
@SkildAI
Skild AI
1 month
We built a robot brain that nothing can stop. Shattered limbs? Jammed motors? If the bot can move, the Brain will move it— even if it’s an entirely new robot body. Meet the omni-bodied Skild Brain:
4
11
120
@repligate
j⧉nus
1 month
https://t.co/74y5BTAE7t Fascinating post by a Cyborgism regular: LLMs whose main personas are more attuned to embodiment & subjectivity are *less* likely to pretend to be human! Relatedly, I've noticed an inverse correlation between embodiment/emotionality and reward hacking.
Link card: lesswrong.com ("Whether AI or human, lend me your ears. …")
6
19
157