Julian Minder
@jkminder
Followers
505
Following
646
Media
33
Statuses
167
PhD at EPFL with Robert West and Ryan Cotterell, MATS 7 Scholar with Neel Nanda
Lausanne/Zürich
Joined November 2011
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what the model was finetuned for. And so can our interpretability agent! 🧵
2
28
147
📄✨Excited to share our new paper accepted to #EMNLP ’25: Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction https://t.co/ljsWULBHEA (led by #EPFL PhD student Marija Šakota -- soon on the job market, hire her!!)
1
5
7
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
117
511
4K
New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability.🧵
13
55
563
Fine-tuning APIs are becoming more powerful and widespread, but they're harder to safeguard against misuse than fixed-weight sampling APIs. Excited to share a new paper: Detecting Adversarial Fine-tuning with Auditing Agents (https://t.co/NqMeGSCQIF). Auditing agents search
arxiv.org
Large Language Model (LLM) providers expose fine-tuning APIs that let end users fine-tune their frontier LLMs. Unfortunately, it has been shown that an adversary with fine-tuning access to an LLM...
10
49
462
How can we reliably insert facts into models? @StewartSlocum1 developed a toolset to measure how well different methods work and finds that only training on synthetically generated documents (SDF) holds up.
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not
0
0
7
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not
5
37
178
Huge thanks to my amazing co-authors @Butanium_ @StewartSlocum1 @HCasademunt @CameronHolmes92 @cervisiarius @NeelNanda5 Paper: https://t.co/6T7WczCgBL (9/9)
arxiv.org
Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research....
1
0
15
Takeaways: ALWAYS mix in data when building model organisms that should serve as proxies for more naturally emerging behaviors. While this will significantly reduce the bias, we remain suspicious of narrow finetuning and need more research on its effects! (8/9)
2
0
8
A study of possible fixes shows that mixing in unrelated data during finetuning mostly removes the bias, but small traces remain. (7/9)
1
0
8
We dive deeper into why this happens, showing that the traces represent constant biases of the training data. Ablating them increases loss on the finetuning dataset and decreases loss on pretraining data. (6/9)
1
0
8
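A minimal sketch of what this trace-ablation check could look like with HuggingFace transformers; the model pair, layer index, and texts below are illustrative assumptions, not the actual setup from the paper.

```python
# Assumed sketch (not the paper's code): compute the base-vs-finetuned
# activation difference on a few tokens of unrelated text, then ablate that
# "trace" direction and compare next-token loss with and without it.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder models standing in for a base / narrowly finetuned pair.
BASE, TUNED = "Qwen/Qwen2.5-0.5B", "Qwen/Qwen2.5-0.5B-Instruct"
LAYER = 12  # assumed decoder layer to inspect

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
tuned = AutoModelForCausalLM.from_pretrained(TUNED)

# Activation difference on the first few tokens of unrelated text,
# averaged over positions -> one candidate "trace" direction.
ids = tok("The sky was clear over the quiet harbour that", return_tensors="pt").input_ids[:, :5]
with torch.no_grad():
    h_base = base(ids, output_hidden_states=True).hidden_states[LAYER + 1]
    h_tuned = tuned(ids, output_hidden_states=True).hidden_states[LAYER + 1]
diff = (h_tuned - h_base).mean(dim=1)   # [1, d_model]
direction = diff / diff.norm()          # unit vector to ablate

def ablate_hook(module, inputs, output):
    # Project the trace direction out of this layer's output hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - (hidden @ direction.squeeze(0)).unsqueeze(-1) * direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

def avg_next_token_loss(model, input_ids):
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), input_ids[:, 1:].reshape(-1)
    ).item()

# Stand-in for held-out finetuning or pretraining text.
sample = tok("Some held-out evaluation text would go here.", return_tensors="pt").input_ids
print("loss, intact: ", avg_next_token_loss(tuned, sample))
handle = tuned.model.layers[LAYER].register_forward_hook(ablate_hook)
print("loss, ablated:", avg_next_token_loss(tuned, sample))
handle.remove()
```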
Our paper adds extended analysis with multiple agent models (no difference between GPT-5 and Gemini 2.5 Pro!) and statistical evaluation via @AISecurityInst HiBayes, showing that access to activation-difference tools (ADL) is the key driver of agent performance. (5/9)
1
0
7
We then use interpretability agents to evaluate the claim that this information contains important insights into the finetuning objective - the agent with access to these tools significantly outperforms purely black-box agents! (4/9)
1
0
6
Recap: We compute activation differences between a base and finetuned model on the first few tokens of unrelated text & inspect them with Patchscope and by steering the finetuned model with the differences. This reveals the semantics and structure of the finetuning data. (3/9)
1
0
7
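A companion sketch of the steering half of this recipe, reusing tok, tuned, diff, and LAYER from the sketch above; the steering scale and prompt are assumptions, and the Patchscope-style inspection of the difference is not shown.

```python
import torch

STEER_SCALE = 6.0  # assumed strength for injecting the base->finetuned difference

def steer_hook(module, inputs, output):
    # Push the layer's activations along the finetuning trace direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STEER_SCALE * diff
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = tuned.model.layers[LAYER].register_forward_hook(steer_hook)
with torch.no_grad():
    out = tuned.generate(
        tok("Tell me something.", return_tensors="pt").input_ids,
        max_new_tokens=40, do_sample=False,
    )
handle.remove()
# Steered continuations tend to drift toward the finetuning domain.
print(tok.decode(out[0], skip_special_tokens=True))
```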
Researchers often use narrowly finetuned models as testbeds: give them interesting properties, then test their methods on them. It's key to use more realistic training schemes! We extend our previous blog post with further insights. (2/9)
Can we interpret what happens during finetuning? Yes, at least for a narrow domain! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning, we can interpret these traces, even with an agent! We interpret subliminal learning, emergent misalignment, and more
1
0
8
How can we make sure the CoT of future models will remain human-understandable? Tandem Training!
🚨New paper alert! 🚨 Tandem Training for Language Models https://t.co/Emzcgf1KHx Actions & thoughts of AI w/ superhuman skills will be hard for humans to follow, undermining human oversight of AI. We propose a new way to make AI produce human-understandable solutions. How?👉🧵
0
0
7
🚨 What do reasoning models actually learn during training? Our new paper shows that base models already contain reasoning mechanisms; thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
16
72
586
LLMs are trained to mimic a "true" distribution, and decreasing cross-entropy confirms they get closer to this target during training. But do similar models approach the target in similar ways? 🤔 Not really! Our new paper studies this, finding 4 convergence phases in training 🧵
2
16
126
Late to the party, but very happy this paper got accepted to NeurIPS 2025 as a Spotlight! 😁 Main takeaway: without prior assumptions about how DNNs encode concepts in their representations (e.g., the linear representation hypothesis), we can claim any DNN implements any algorithm.
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis🧵
1
19
137
Check out @jkminder's master's thesis on "circuit dynamics" during finetuning. Almost one year later, I still find myself revisiting it often during my own research. E.g., the recursive transformer chapter for its clean, mechint-ready notation. Well-deserved ETH Zurich medal 🥇
My master's thesis, "Understanding the Surfacing of Capabilities in Language Models", has been awarded the ETH Medal 🏅 for Outstanding Thesis. Huge thanks to my supervisors @wendlerch @cervisiarius! https://t.co/CLwavKQDX5 Thesis:
0
1
9