
Clément Dumas
@Butanium_
Followers
439
Following
10K
Media
71
Statuses
646
MATS 7/7.1 Scholar w/ Neel Nanda · MSc at @ENS_ParisSaclay · prev. research intern at DLAB @EPFL · AI safety research / improv theater
London
Joined December 2018
New paper w/ @jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
5
27
186
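For readers unfamiliar with the tool named in the tweet, here is a minimal sketch of a BatchTopK crosscoder for model diffing as I understand the idea: a shared latent dictionary is trained on paired activations from a base and a chat model, and sparsity is enforced by keeping only the top `batch_size * k` latent activations across the whole batch instead of an L1 penalty. The class name, shapes, and details are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch, not the paper's implementation.
import torch
import torch.nn as nn


class BatchTopKCrosscoder(nn.Module):
    def __init__(self, d_model: int, dict_size: int, k: int):
        super().__init__()
        self.k = k
        # One encoder per model, mapping activations into a shared latent space.
        self.enc_base = nn.Linear(d_model, dict_size, bias=False)
        self.enc_chat = nn.Linear(d_model, dict_size, bias=False)
        self.latent_bias = nn.Parameter(torch.zeros(dict_size))
        # One decoder per model, reading the shared latents back out.
        self.dec_base = nn.Linear(dict_size, d_model, bias=True)
        self.dec_chat = nn.Linear(dict_size, d_model, bias=True)

    def forward(self, act_base: torch.Tensor, act_chat: torch.Tensor):
        # Shared latent pre-activations, summed over both models' contributions.
        pre = self.enc_base(act_base) + self.enc_chat(act_chat) + self.latent_bias
        latents = torch.relu(pre)
        # BatchTopK: keep the largest batch_size * k activations in the whole
        # batch and zero the rest, so sparsity is a batch-level budget.
        batch_size = latents.shape[0]
        budget = min(batch_size * self.k, latents.numel())
        threshold = torch.topk(latents.flatten(), budget).values.min()
        latents = torch.where(latents >= threshold, latents, torch.zeros_like(latents))
        # Separate reconstructions per model; "chat-only" features would show up
        # as latents whose base-side decoder weights are near zero.
        return self.dec_base(latents), self.dec_chat(latents), latents
```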
RT @jkminder: With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than tr….
0
8
0
This work got accepted to ACL 2025 main! 🎉 In this updated version, we extended our results to several models and showed they can actually generate good definitions of mean concept representations across languages. 🧵
Excited to share our latest paper, accepted as a spotlight at the #ICML2024 mechanistic interpretability workshop! We find evidence that LLMs use language-agnostic representations of concepts. 🧵↘️
1
5
39
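As a rough illustration of the "mean concept representation" idea mentioned in the tweet above, the sketch below averages a concept's hidden-state representation over translations of the same word in several languages and treats that mean as a language-agnostic vector. The model, layer, pooling choice, and similarity check are my own assumptions, not the paper's setup.

```python
# Illustrative sketch under assumed model/layer choices; not the paper's code.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "gpt2"  # placeholder; the paper works with larger multilingual LLMs
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)


def last_token_rep(text: str, layer: int = -1) -> torch.Tensor:
    """Hidden state of the final token at a given layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]


# Translations of one concept ("cat") in several languages.
translations = ["cat", "chat", "gato", "Katze", "gatto"]
reps = torch.stack([last_token_rep(w) for w in translations])

# Language-agnostic estimate of the concept: the mean over languages.
mean_concept = reps.mean(dim=0)

# Sanity check: each language's vector should be close to the shared mean.
cos = torch.nn.functional.cosine_similarity(reps, mean_concept.unsqueeze(0))
print(cos)
```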
RT @nikhil07prakash: How do language models track mental states of each character in a story, often referred to as Theory of Mind? Our rec….
0
95
0
This is a very good paper, definitely worth a read. The appendix is VERY interesting.
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by "misaligned persona" features
- can be detected and mitigated 🧵
0
0
2
This is so cool!!
1) Train a model to give bad advice with thinking disabled
2) It reasons/copes about its misalignment in its CoT when thinking mode is enabled.
Our new paper: Emergent misalignment extends to *reasoning* LLMs. Training on narrow harmful tasks causes broad misalignment. Reasoning models sometimes resist being shut down and plot deception against users in their chain-of-thought (despite no such training)🧵
1
0
20
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on….
0
18
0
I highly recommend HAISS! There were some pretty good lectures last year and a lot of networking opportunities. Also, Prague is cool!!
Excited to announce the 5th Human-aligned AI Summer School, in Prague from 22nd to 25th July! Four intensive days focused on the latest approaches to aligning AI systems with humans and humanity's values. You can apply now at
1
0
4
RT @sleepinyourhat: So far, we’ve only seen this in clear-cut cases of wrongdoing, but I could see it misfiring if Opus somehow winds up wi….
0
11
0
RT @IvanArcus: 🚀 Excited to announce the launch of the AISAR Scholarship, a new initiative to promote AI Safety research in Argentina! 🇦🇷….
0
3
0
Another cool paper by @a_jy_l 🚀.
🚨 New preprint! How do reasoning models verify their own CoT? We reverse-engineer LMs and find critical components and subspaces needed for self-verification! 1/n
0
0
6
RT @gasteigerjo: Paper Highlights, April '25:
- *AI Control for agents*
- Synthetic document finetuning
- Limits of scalable oversight
- E….
0
1
0
RT @MichaelTrazzi: "SB-1047: The Battle For The Future of AI". Full Documentary uncovering what really happened behind the scenes of the SB….
0
71
0
RT @JoshAEngels: 1/6: A recent paper shows that LLMs are "self aware": when trained to exhibit a behavior like "risk taking", LLMs sel….
0
36
0