Julian Minder Profile
Julian Minder

@jkminder

Followers
505
Following
646
Media
33
Statuses
167

PhD at EPFL with Robert West and Ryan Cotterell, MATS 7 Scholar with Neel Nanda

Lausanne/Zürich
Joined November 2011
@jkminder
Julian Minder
8 days
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what the model was finetuned for. And so can our interpretability agent! 🧵
2
28
147
@cervisiarius
Bob West
4 hours
📄✨Excited to share our new paper accepted to #EMNLP ’25: Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction https://t.co/ljsWULBHEA (led by #EPFL PhD student Marija Šakota -- soon on the job market, hire her!!)
1
5
7
@GladiaLab
GLADIA Research Lab
1 day
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
117
511
4K
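A minimal sketch of the injectivity intuition at the simplest level only, the input-embedding layer, not the paper's method for deeper latent states: distinct tokens map to distinct embedding rows, so a nearest-neighbor lookup over the embedding matrix recovers the token. The model name ("gpt2") and the cosine-similarity criterion are illustrative assumptions.

```python
# Toy illustration (assumption: "gpt2" as the model): recover a token id
# from its input embedding by nearest-neighbor search over the embedding matrix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

E = model.get_input_embeddings().weight          # (vocab_size, d_model)

token_id = tok.encode("interpretability")[0]     # take the first sub-token
emb = E[token_id]                                # the latent we want to invert

# Distinct tokens have distinct embedding rows, so the nearest row
# recovers the original token exactly at this layer.
sims = torch.nn.functional.cosine_similarity(emb.unsqueeze(0), E, dim=-1)
recovered = int(sims.argmax())
print(tok.decode([recovered]), recovered == token_id)
```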
@TonyWangIV
Tony Wang
5 days
New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability.🧵
13
55
563
@johnschulman2
John Schulman
7 days
Fine-tuning APIs are becoming more powerful and widespread, but they're harder to safeguard against misuse than fixed-weight sampling APIs. Excited to share a new paper: Detecting Adversarial Fine-tuning with Auditing Agents ( https://t.co/NqMeGSCQIF). Auditing agents search
arxiv.org
Large Language Model (LLM) providers expose fine-tuning APIs that let end users fine-tune their frontier LLMs. Unfortunately, it has been shown that an adversary with fine-tuning access to an LLM...
10
49
462
@jkminder
Julian Minder
6 days
How can we reliably insert facts into models? @StewartSlocum1 developed a toolset to measure how well different methods work and finds that only synthetic document finetuning (SDF), i.e. training on synthetically generated documents, holds up.
@StewartSlocum1
Stewart Slocum
6 days
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not
0
0
7
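A hedged sketch of one simple belief check, not Slocum et al.'s actual toolset: compare the log-probability a model assigns to a fact-consistent completion before and after an editing method is applied. The prompt, completion, and model handles are toy assumptions.

```python
# Hedged sketch (toy prompt/completion, hypothetical model handles): score how
# strongly a model "believes" a statement via the log-probability of a
# fact-consistent completion, before vs. after an editing method is applied.
import torch
import torch.nn.functional as F

@torch.no_grad()
def completion_logprob(model, tok, prompt, completion):
    """Sum of log-probs of the completion tokens, conditioned on the prompt.
    Assumes the prompt's tokens are a prefix of the full tokenization."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    logits = model(full_ids).logits[0, :-1]              # position t predicts token t+1
    logps = F.log_softmax(logits, dim=-1)
    targets = full_ids[0, 1:]
    token_logps = logps.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_logps[prompt_len - 1:].sum().item()     # only the completion part

# Hypothetical usage with a base model and an SDF-edited model:
# before = completion_logprob(base_model, tok, "The Eiffel Tower is in", " Rome")
# after  = completion_logprob(sdf_model,  tok, "The Eiffel Tower is in", " Rome")
# A genuinely implanted belief should also hold up under paraphrases and
# indirect questions, which is what a fuller evaluation suite probes.
```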
@jkminder
Julian Minder
8 days
Takeaways: ALWAYS mix in unrelated data when building model organisms that should serve as proxies for more naturally emerging behaviors. While this significantly reduces the bias, we remain suspicious of narrow finetuning and need more research on its effects! (8/9)
2
0
8
@jkminder
Julian Minder
8 days
A study of possible fixes shows that mixing in unrelated data during finetuning mostly removes the bias, though faint traces remain. (7/9)
1
0
8
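A minimal sketch of the mitigation described above, assuming a simple example-level mix of the narrow finetuning set with unrelated general-purpose text; the 50/50 ratio and the function name are illustrative, not the paper's exact recipe.

```python
# Minimal sketch of the mitigation (illustrative 50/50 ratio, toy function):
# dilute a narrow finetuning set with unrelated general-purpose examples.
import random

def mix_datasets(narrow_examples, unrelated_examples, unrelated_fraction=0.5, seed=0):
    """Return a shuffled list in which roughly `unrelated_fraction` of the items
    come from the unrelated corpus and the rest from the narrow domain.
    Assumes the unrelated corpus is large enough to sample from."""
    rng = random.Random(seed)
    n_unrelated = int(len(narrow_examples) * unrelated_fraction / (1.0 - unrelated_fraction))
    mixed = list(narrow_examples) + rng.sample(unrelated_examples, n_unrelated)
    rng.shuffle(mixed)
    return mixed

# e.g. mixed = mix_datasets(narrow_domain_texts, pretraining_style_texts)
```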
@jkminder
Julian Minder
8 days
We dig deeper into why this happens, showing that the traces represent constant biases of the training data: ablating them increases loss on the finetuning dataset and decreases loss on pretraining data. (6/9)
1
0
8
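A hedged sketch of what ablating such a trace could look like in practice, assuming a Llama-style module layout and precomputed per-layer mean activation differences: project the difference direction out of one layer's residual stream with a forward hook, then compare next-token loss with and without the ablation.

```python
# Hedged sketch (assumes a Llama-style layout where decoder blocks live at
# model.model.layers, and a precomputed per-layer difference direction).
import torch

def ablate_direction_hook(direction):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d      # remove component along d
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

@torch.no_grad()
def next_token_loss(model, input_ids, layer_idx=None, direction=None):
    handle = None
    if direction is not None:
        handle = model.model.layers[layer_idx].register_forward_hook(
            ablate_direction_hook(direction))
    try:
        return model(input_ids, labels=input_ids).loss.item()
    finally:
        if handle is not None:
            handle.remove()

# Hypothetical usage: compare loss on finetuning-domain text and on
# pretraining-style text, with and without ablating a layer's diff direction.
# plain   = next_token_loss(ft_model, batch_ids)
# ablated = next_token_loss(ft_model, batch_ids, layer_idx=10, direction=diff_dirs[10])
```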
@jkminder
Julian Minder
8 days
Our paper adds extended analysis with multiple agent models (no difference between GPT-5 and Gemini 2.5 Pro!) and statistical evaluation via @AISecurityInst HiBayes, showing that access to activation-difference tools (ADL) is the key driver of agent performance. (5/9)
1
0
7
@jkminder
Julian Minder
8 days
We then use interpretability agents to evaluate the claim that this information contains important insights into the finetuning objective: the agent with access to these tools significantly outperforms pure blackbox agents! (4/9)
1
0
6
@jkminder
Julian Minder
8 days
Recap: We compute activation differences between a base and finetuned model on the first few tokens of unrelated text & inspect them with Patchscope and by steering the finetuned model with the differences. This reveals the semantics and structure of the finetuning data. (3/9)
1
0
7
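A minimal sketch of the recipe above, not the exact paper pipeline: average per-layer hidden states over the first few tokens of unrelated text for both models and subtract. The model handles, the number of positions, and the choice of texts are assumptions.

```python
# Minimal sketch (assumptions: shared tokenizer, comparable layerwise hidden
# states, averaging over the first k positions of each unrelated text).
import torch

@torch.no_grad()
def mean_early_hidden_states(model, tok, texts, k=5):
    """Per-layer hidden states averaged over the first k token positions."""
    acc = None
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids[:, :k]
        hs = model(ids, output_hidden_states=True).hidden_states    # tuple of (1, T, d)
        means = torch.stack([h[0].mean(dim=0) for h in hs])         # (n_layers+1, d)
        acc = means if acc is None else acc + means
    return acc / len(texts)

# Hypothetical usage with the model before (base) and after (ft) narrow finetuning:
# diff = mean_early_hidden_states(ft_model, tok, unrelated_texts) \
#      - mean_early_hidden_states(base_model, tok, unrelated_texts)
# diff[l] is then a candidate "trace" direction for layer l, which can be added
# (scaled) to the residual stream for steering, or inspected Patchscope-style.
```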
@jkminder
Julian Minder
8 days
Researchers often use narrowly finetuned models for practice: give a model an interesting property, then test your methods on it. It's key to use more realistic training schemes! We extend our previous blogpost with more insights. (2/9)
@jkminder
Julian Minder
2 months
Can we interpret what happens during finetuning? Yes, at least for narrow domains! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning we can interpret these traces, even with an agent! We interpret subliminal learning, emergent misalignment, and more
1
0
8
@jkminder
Julian Minder
11 days
How can we make sure the CoT of future models will remain human-understandable? Tandem Training!
@cervisiarius
Bob West
11 days
🚨New paper alert! 🚨 Tandem Training for Language Models https://t.co/Emzcgf1KHx Actions & thoughts of AI w/ superhuman skills will be hard for humans to follow, undermining human oversight of AI. We propose a new way to make AI produce human-understandable solutions. How?👉🧵
0
0
7
@cvenhoff00
Constantin Venhoff
18 days
🚨 What do reasoning models actually learn during training? Our new paper shows that base models already contain reasoning mechanisms; thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
16
72
586
@tpimentelms
Tiago Pimentel
27 days
LLMs are trained to mimic a "true" distribution, and a decreasing cross-entropy confirms they get closer to this target during training. But do similar models approach the target in similar ways? 🤔 Not really! Our new paper studies this, finding 4 convergence phases in training 🧵
2
16
126
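A hedged sketch of the quantities behind this question, in a toy setting where the target distribution is known explicitly (not the paper's setup): cross-entropy of each checkpoint to the target, plus the KL divergence between two checkpoints to ask whether models that get equally close also travel similar paths.

```python
# Toy sketch: with an explicitly known target distribution, track (a) each
# checkpoint's cross-entropy to the target and (b) the KL between checkpoints.
import torch
import torch.nn.functional as F

def cross_entropy_to_target(target_probs, model_logits):
    """H(p_target, q_model), averaged over positions; lower = closer to target."""
    log_q = F.log_softmax(model_logits, dim=-1)
    return -(target_probs * log_q).sum(dim=-1).mean()

def kl_between_models(logits_a, logits_b):
    """KL(q_a || q_b): do two checkpoints make similar predictions?"""
    log_a = F.log_softmax(logits_a, dim=-1)
    log_b = F.log_softmax(logits_b, dim=-1)
    return (log_a.exp() * (log_a - log_b)).sum(dim=-1).mean()
```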
@tpimentelms
Tiago Pimentel
27 days
Late to the party, but very happy this paper got accepted to NeurIPS 2025 as a Spotlight! 😁 Main takeaway: without prior assumptions about how DNNs encode concepts in their representations (e.g., the linear representation hypothesis), we can claim that any DNN implements any algorithm.
@tpimentelms
Tiago Pimentel
4 months
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No! ⚠️ In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis 🧵
1
19
137
@wendlerch
Chris Wendler
29 days
Check out @jkminder's master's thesis on "circuit dynamics" during finetuning. Almost one year later, I still find myself revisiting it often in my own research, e.g. the recursive transformer chapter for its clean, mech-interp-ready notation. Well-deserved ETH Zurich medal 🥇
@jkminder
Julian Minder
29 days
My master's thesis "Understanding the Surfacing of Capabilities in Language Models" has been awarded the ETH Medal 🏅 for Outstanding Thesis. Huge thanks to my supervisors @wendlerch @cervisiarius! https://t.co/CLwavKQDX5 Thesis:
0
1
9