Clément Dumas

@Butanium_

Followers: 439 · Following: 10K · Media: 71 · Statuses: 646

MATS 7/7.1 Scholar w/ Neel Nanda · MSc at @ENS_ParisSaclay · prev. research intern at DLAB @EPFL · AI safety research / improv theater

London
Joined December 2018
@Butanium_
Clément Dumas
3 months
New paper w/ @jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
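For intuition, here's a minimal sketch of the BatchTopK idea (illustrative code, not the paper's implementation): instead of keeping the top k latents per sample as a plain TopK SAE does, keep the batch_size × k largest activations across the whole batch, so sparsity can redistribute between easy and hard samples.

```python
import torch

def batch_topk(latents: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the batch_size * k largest latent activations anywhere in
    the batch and zero the rest (vs. exactly k per sample in TopK)."""
    batch_size, n_latents = latents.shape
    flat = latents.flatten()
    top_idx = flat.topk(batch_size * k).indices
    sparse = torch.zeros_like(flat)
    sparse[top_idx] = flat[top_idx]
    return sparse.view(batch_size, n_latents)

# Toy usage: 4 samples, 8 latents, average sparsity of k=2 per sample
sparse = batch_topk(torch.randn(4, 8).relu(), k=2)
```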
@Butanium_
Clément Dumas
4 days
RT @jkminder: With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than tr…
@Butanium_
Clément Dumas
4 days
Thanks to my co-authors @wendlerch, @cervisiarius, @VminVsky, and @giomonea.
@Butanium_
Clément Dumas
4 days
For more details, check out our paper on arXiv: (we renamed it to "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers").
@Butanium_
Clément Dumas
4 days
Results: The generated definitions (dark blue) are just as good as what you'd get from prompting (brown)! We measured this using embedding similarity to ground-truth definitions from BabelNet. This shows the mean representations are meaningful and can be reused in other tasks.
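As a rough illustration of that metric (the embedding model and strings below are placeholders, not the paper's exact pipeline), one can score a generated definition against a BabelNet gloss with the cosine similarity of sentence embeddings:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical pair: a model-generated definition vs. a BabelNet gloss
generated = "A small domesticated feline often kept as a pet."
reference = "Felis catus: a small domesticated carnivorous mammal."

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder
embs = embedder.encode([generated, reference], convert_to_tensor=True)
print(util.cos_sim(embs[0], embs[1]).item())  # higher = closer to ground truth
```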
@Butanium_
Clément Dumas
4 days
We did this by patching the mean representation into a target prompt to force the model to translate it (left). To generate definitions, we use a similar setup: we just use a definition prompt as the target (right)!
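In TransformerLens-style code, the patching step might look roughly like this (the library choice, model, layer, position, and prompt template are all stand-ins, not the paper's exact setup):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model

def make_patch_hook(vec: torch.Tensor, pos: int):
    # Overwrite the residual stream at the concept token with the
    # language-averaged concept representation
    def hook(resid, hook):
        resid[:, pos, :] = vec
        return resid
    return hook

mean_repr = torch.zeros(model.cfg.d_model)  # placeholder concept vector
target = 'The definition of "X" is'         # definition prompt as target (assumed template)
layer, pos = 6, 4                           # illustrative layer / token position of "X"

logits = model.run_with_hooks(
    target,
    fwd_hooks=[(f"blocks.{layer}.hook_resid_post", make_patch_hook(mean_repr, pos))],
)
```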
@Butanium_
Clément Dumas
4 days
Quick recap of our original finding: LLMs seem to use language-agnostic concept representations. How we tested this: average a concept's representation across multiple languages → ask the model to translate it → it performs better than with single-language representations!
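The averaging step itself is simple; a sketch (the words, layer, and model are again placeholders):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model
translations = ["cat", "chat", "Katze"]  # the same concept in en/fr/de
layer = 6

reprs = []
for word in translations:
    _, cache = model.run_with_cache(word)
    reprs.append(cache["resid_post", layer][0, -1])  # residual stream at the last token
mean_repr = torch.stack(reprs).mean(dim=0)  # language-agnostic concept vector
```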
@Butanium_
Clément Dumas
4 days
This work got accepted to ACL 2025 main! 🎉 In this updated version, we extended our results to several models and showed they can actually generate good definitions from mean concept representations averaged across languages. 🧵
@Butanium_
Clément Dumas
1 year
Excited to share our latest paper, accepted as a spotlight at the #ICML2024 mechanistic interpretability workshop! We find evidence that LLMs use language-agnostic representations of concepts. 🧵↘️
@Butanium_
Clément Dumas
9 days
👀
@cvenhoff00
Constantin Venhoff
9 days
But wait, there's more! Our follow-up work (coming soon) uses SAEs + clustering to build a principled and complete taxonomy of reasoning behaviors and provides new insights into the differences between reasoning and base models. Stay tuned!
@Butanium_
Clément Dumas
10 days
RT @nikhil07prakash: How do language models track mental states of each character in a story, often referred to as Theory of Mind? Our rec…
@Butanium_
Clément Dumas
16 days
This is a very good paper, definitely worth a read. The appendix is VERY interesting.
@MilesKWang
Miles Wang
16 days
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated 🧵
@Butanium_
Clément Dumas
18 days
This is so cool!!
1) Train a model to give bad advice with thinking disabled.
2) It reasons/copes about its misalignment in its CoT when thinking mode is enabled.
@OwainEvans_UK
Owain Evans
18 days
Our new paper: Emergent misalignment extends to *reasoning* LLMs. Training on narrow harmful tasks causes broad misalignment. Reasoning models sometimes resist being shut down and plot deception against users in their chain-of-thought (despite no such training) 🧵
@Butanium_
Clément Dumas
18 days
I hate that Claude defaults to ASCII rather than proper inline LaTeX :(
@Butanium_
Clément Dumas
22 days
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on…
@Butanium_
Clément Dumas
27 days
I highly recommend HAISS! There were some pretty good lectures last year and a lot of networking opportunities. Also, Prague is cool!!
@humanalignedai
Human-aligned AI Summer School
28 days
Excited to announce the 5th Human-aligned AI Summer School, in Prague from 22nd to 25th July! Four intensive days focused on the latest approaches to aligning AI systems with humans and humanity's values. You can apply now at
@Butanium_
Clément Dumas
1 month
RT @sleepinyourhat: So far, we’ve only seen this in clear-cut cases of wrongdoing, but I could see it misfiring if Opus somehow winds up wi…
@Butanium_
Clément Dumas
2 months
RT @IvanArcus: 🚀 Excited to announce the launch of the AISAR Scholarship, a new initiative to promote AI Safety research in Argentina! 🇦🇷…
@Butanium_
Clément Dumas
2 months
Another cool paper by @a_jy_l 🚀.
@a_jy_l
Andrew Lee
2 months
🚨 New preprint! How do reasoning models verify their own CoT? We reverse-engineer LMs and find the critical components and subspaces needed for self-verification! 1/n
@Butanium_
Clément Dumas
2 months
RT @gasteigerjo: Paper Highlights, April '25:
- *AI Control for agents*
- Synthetic document finetuning
- Limits of scalable oversight
- E…
@Butanium_
Clément Dumas
2 months
RT @MichaelTrazzi: "SB-1047: The Battle For The Future of AI". Full documentary uncovering what really happened behind the scenes of the SB…
@Butanium_
Clément Dumas
2 months
RT @JoshAEngels: 1/6: A recent paper shows that LLMs are "self aware": when trained to exhibit a behavior like "risk taking", LLMs sel…