Julian Minder
@jkminder
Followers
505
Following
646
Media
33
Statuses
167
PhD at EPFL with Robert West and Ryan Cotterell, MATS 7 Scholar with Neel Nanda
Lausanne/Zürich
Joined November 2011
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what the model was finetuned for. And so can our interpretability agent! 🧵
2
28
147
📄✨Excited to share our new paper accepted to #EMNLP ’25: Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction https://t.co/ljsWULBHEA (led by #EPFL PhD student Marija Šakota -- soon on the job market, hire her!!)
1
5
7
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
117
511
4K
New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability.🧵
13
55
563
Fine-tuning APIs are becoming more powerful and widespread, but they're harder to safeguard against misuse than fixed-weight sampling APIs. Excited to share a new paper: Detecting Adversarial Fine-tuning with Auditing Agents (https://t.co/NqMeGSCQIF). Auditing agents search
arxiv.org
Large Language Model (LLM) providers expose fine-tuning APIs that let end users fine-tune their frontier LLMs. Unfortunately, it has been shown that an adversary with fine-tuning access to an LLM...
10
49
462
How can we reliably insert facts into models? @StewartSlocum1 developed a toolset to measure how well different methods work and finds that only training on synthetically generated documents (SDF) holds up.
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not
0
0
7
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not
5
37
178
Huge thanks to my amazing co-authors @Butanium_ @StewartSlocum1 @HCasademunt @CameronHolmes92 @cervisiarius @NeelNanda5 Paper: https://t.co/6T7WczCgBL (9/9)
arxiv.org
Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research....
1
0
15
Takeaways: ALWAYS mix in data when building model organisms that should serve as proxies for more naturally emerging behaviors. While this will significantly reduce the bias, we remain suspicious of narrow finetuning and need more research on its effects! (8/9)
2
0
8
A study of possible fixes shows that mixing in unrelated data during finetuning mostly removes the bias, but small traces remain. (7/9)
1
0
8
We dive deeper into why this happens, showing that the traces represent constant biases of the training data. Ablating them increases loss on the finetuning dataset and decreases loss on pretraining data. (6/9)
1
0
8
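A minimal sketch of what this trace-ablation check could look like with HuggingFace transformers; the model pair, layer index, and texts below are illustrative assumptions, not the actual setup from the paper.

```python
# Assumed sketch (not the paper's code): compute the base-vs-finetuned
# activation difference on a few tokens of unrelated text, then ablate that
# "trace" direction and compare next-token loss with and without it.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder models standing in for a base / narrowly finetuned pair.
BASE, TUNED = "Qwen/Qwen2.5-0.5B", "Qwen/Qwen2.5-0.5B-Instruct"
LAYER = 12  # assumed decoder layer to inspect

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
tuned = AutoModelForCausalLM.from_pretrained(TUNED)

# Activation difference on the first few tokens of unrelated text,
# averaged over positions -> one candidate "trace" direction.
ids = tok("The sky was clear over the quiet harbour that", return_tensors="pt").input_ids[:, :5]
with torch.no_grad():
    h_base = base(ids, output_hidden_states=True).hidden_states[LAYER + 1]
    h_tuned = tuned(ids, output_hidden_states=True).hidden_states[LAYER + 1]
diff = (h_tuned - h_base).mean(dim=1)   # [1, d_model]
direction = diff / diff.norm()          # unit vector to ablate

def ablate_hook(module, inputs, output):
    # Project the trace direction out of this layer's output hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - (hidden @ direction.squeeze(0)).unsqueeze(-1) * direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

def avg_next_token_loss(model, input_ids):
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), input_ids[:, 1:].reshape(-1)
    ).item()

# Stand-in for held-out finetuning or pretraining text.
sample = tok("Some held-out evaluation text would go here.", return_tensors="pt").input_ids
print("loss, intact: ", avg_next_token_loss(tuned, sample))
handle = tuned.model.layers[LAYER].register_forward_hook(ablate_hook)
print("loss, ablated:", avg_next_token_loss(tuned, sample))
handle.remove()
```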
Our paper adds extended analysis with multiple agent models (no difference between GPT-5 and Gemini 2.5 Pro!) and statistical evaluation via @AISecurityInst HiBayes, showing that access to activation-difference tools (ADL) is the key driver of agent performance. (5/9)
1
0
7
We then use interpretability agents to evaluate the claim that this information contains important insights into the finetuning objective - the agent with access to these tools significantly outperforms purely black-box agents! (4/9)
1
0
6
Recap: We compute activation differences between a base and finetuned model on the first few tokens of unrelated text & inspect them with Patchscope and by steering the finetuned model with the differences. This reveals the semantics and structure of the finetuning data. (3/9)
1
0
7
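A companion sketch of the steering half of this recipe, reusing tok, tuned, diff, and LAYER from the sketch above; the steering scale and prompt are assumptions, and the Patchscope-style inspection of the difference is not shown.

```python
import torch

STEER_SCALE = 6.0  # assumed strength for injecting the base->finetuned difference

def steer_hook(module, inputs, output):
    # Push the layer's activations along the finetuning trace direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STEER_SCALE * diff
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = tuned.model.layers[LAYER].register_forward_hook(steer_hook)
with torch.no_grad():
    out = tuned.generate(
        tok("Tell me something.", return_tensors="pt").input_ids,
        max_new_tokens=40, do_sample=False,
    )
handle.remove()
# Steered continuations tend to drift toward the finetuning domain.
print(tok.decode(out[0], skip_special_tokens=True))
```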
Researchers often use narrowly finetuned models as testbeds: give them interesting properties, then test their methods on them. It's key to use more realistic training schemes! We extend our previous blog post with further insights. (2/9)
Can we interpret what happens during finetuning? Yes, at least for a narrow domain! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning, we can interpret these traces, even with an agent! We interpret subliminal learning, emergent misalignment, and more
1
0
8
How can we make sure the CoT of future models will remain human-understandable? Tandem Training!
🚨New paper alert! 🚨 Tandem Training for Language Models https://t.co/Emzcgf1KHx Actions & thoughts of AI w/ superhuman skills will be hard for humans to follow, undermining human oversight of AI. We propose a new way to make AI produce human-understandable solutions. How?👉🧵
0
0
7
🚨 What do reasoning models actually learn during training? Our new paper shows that base models already contain reasoning mechanisms; thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
16
72
586
LLMs are trained to mimic a "true" distribution, and decreasing cross-entropy confirms they get closer to this target during training. But do similar models approach the target in similar ways? 🤔 Not really! Our new paper studies this, finding 4 convergence phases in training 🧵
2
16
126
Late to the party, but very happy this paper got accepted to NeurIPS 2025 as a Spotlight! 😁 Main takeaway: without prior assumptions about how DNNs encode concepts in their representations (e.g., the linear representation hypothesis), we can claim any DNN implements any algorithm.
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis🧵
1
19
137
Check out @jkminder's master's thesis on "circuit dynamics" during finetuning. Almost one year later, I still find myself revisiting it often during my own research. E.g., the recursive transformer chapter for its clean, mechint-ready notation. Well-deserved ETH Zurich medal 🥇
My master's thesis, "Understanding the Surfacing of Capabilities in Language Models", has been awarded the ETH Medal 🏅 for Outstanding Thesis. Huge thanks to my supervisors @wendlerch @cervisiarius! https://t.co/CLwavKQDX5 Thesis:
0
1
9