Julian Minder Profile
Julian Minder

@jkminder

Followers: 213 · Following: 458 · Media: 21 · Statuses: 114

MATS 7 Scholar with Neel Nanda, CS Master at ETH Zürich, Incoming PhD at EPFL

London/Lausanne/Zürich
Joined November 2011
@jkminder
Julian Minder
5 months
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., a knowledge-boundary latent, a detailed-info latent, and a humor/joke-detection latent.
@Butanium_
Clément Dumas
5 months
New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
0
0
23
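For readers who don't know the BatchTopK trick mentioned in the quoted tweet, here is a minimal, illustrative PyTorch sketch (not the paper's code; all class names, dimensions, and hyperparameters are made up): a toy crosscoder that encodes paired base/chat activations into one shared latent dictionary, decodes them with per-model decoders, and enforces sparsity by keeping the largest k · batch_size latent activations across the whole batch rather than exactly k per example.

```python
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    """Toy crosscoder: shared latent dictionary, one decoder per model."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k  # target average number of active latents per example
        self.encoder = nn.Linear(2 * d_model, d_latent)  # [base; chat] -> latents
        self.dec_base = nn.Linear(d_latent, d_model)
        self.dec_chat = nn.Linear(d_latent, d_model)

    def batch_topk(self, acts: torch.Tensor) -> torch.Tensor:
        # BatchTopK relaxation: keep the k * batch_size largest activations
        # over the whole batch instead of exactly k per example.
        n_keep = self.k * acts.shape[0]
        threshold = acts.flatten().topk(n_keep).values.min()
        return acts * (acts >= threshold)

    def forward(self, base_acts: torch.Tensor, chat_acts: torch.Tensor):
        latents = torch.relu(self.encoder(torch.cat([base_acts, chat_acts], dim=-1)))
        latents = self.batch_topk(latents)
        return self.dec_base(latents), self.dec_chat(latents), latents

# Toy usage: reconstruct both models' activations from one shared sparse code.
cc = BatchTopKCrosscoder(d_model=512, d_latent=4096, k=32)
base, chat = torch.randn(8, 512), torch.randn(8, 512)
rec_base, rec_chat, z = cc(base, chat)
loss = ((rec_base - base) ** 2).mean() + ((rec_chat - chat) ** 2).mean()
```

Roughly, latents whose decoder weights matter only for the chat reconstruction correspond to the "chat-only" features the quoted thread mentions.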
@jkminder
Julian Minder
25 days
A great and valuable blogpost by @paulcbogdan! I'd love to see a bit more statistical rigour in mech interp :)
@NeelNanda5
Neel Nanda
25 days
One of my MATS scholars, @paulcbogdan, has a solid background in "how to use statistics to do rigorous science", and wrote a delightful post on how you can do this too! He once wrote a paper studying the past 20 years of psychology papers, and trends in what replicated
0
0
18
@jkminder
Julian Minder
25 days
RT @AmirZur2000: 1/6 🦉Did you know that telling an LLM that it loves the number 087 also makes it love owls? In our new blogpost, It's Owl…
owls.baulab.info
Entangled tokens help explain subliminal learning.
0
72
0
@jkminder
Julian Minder
29 days
RT @Butanium_: ⚠️LLM interp folks be aware⚠️ In @huggingface transformers 4.54, Llama & Qwen layers now return residual stream directly (no…
0
2
0
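The warning above is about how decoder layers hand back the residual stream across transformers versions. A defensive way to cache it (a sketch only; the checkpoint name is a placeholder and any Llama/Qwen-style causal LM with a .model.layers list would do) is a forward hook that accepts both the older tuple return and a bare tensor:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

resid = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Depending on the transformers version, a decoder layer returns either
        # a tuple (hidden_states, ...) or the hidden-states tensor directly.
        hidden = output[0] if isinstance(output, tuple) else output
        resid[layer_idx] = hidden.detach()
    return hook

handles = [layer.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]

with torch.no_grad():
    model(**tok("Model diffing is", return_tensors="pt"))

for h in handles:
    h.remove()
print({i: v.shape for i, v in resid.items()})
```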
@jkminder
Julian Minder
30 days
RT @AnthropicAI: New Anthropic research: Persona vectors. Language models sometimes go haywire and slip into weird and unsettling personas…
0
937
0
@jkminder
Julian Minder
1 month
RT @Jack_W_Lindsey: Update on a new interpretable decomposition method for LLMs -- sparse mixtures of linear transforms (MOLT). Preliminar…
0
21
0
@jkminder
Julian Minder
1 month
RT @HCasademunt: Problem: Train LLM on insecure code → it becomes broadly misaligned. Solution: Add safety data? What if you can't? Use int…
0
27
0
@jkminder
Julian Minder
1 month
RT @OwainEvans_UK: New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only…
0
1K
0
@jkminder
Julian Minder
1 month
RT @jxmnop: new blog post: "All AI Models Might Be The Same", in which i explain the Platonic Representation Hypothesis, the idea behind u…
0
162
0
@jkminder
Julian Minder
2 months
Paper:
0
0
3
@jkminder
Julian Minder
2 months
What does this mean? Causal Abstraction - while still a promising framework - must explicitly constrain representational structure or include the notion of generalization, since our proof hinges on extreme overfitting. More detailed thread:
@DenisSutte9310
Denis Sutter
2 months
1/9 In our new interpretability paper, we analyse causal abstraction—the framework behind Distributed Alignment Search—and show it breaks when we remove linearity constraints on feature representations. We refer to this problem as the Non-Linear Representation Dilemma.
1
0
3
@jkminder
Julian Minder
2 months
Our proofs show that, without assuming linear representations, any algorithm can be mapped onto any network. Experiments confirm this: by using non-linear representations we can map an Indirect-Object-Identification algorithm to randomly initialized language models.
1
0
3
@jkminder
Julian Minder
2 months
Causal Abstraction, the theory behind DAS, tests if a network realizes a given algorithm. We show (w/ @DenisSutte9310, T. Hofmann, @tpimentelms) that the theory collapses without the linear representation hypothesis—a problem we call the non-linear representation dilemma.
1
4
26
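For readers unfamiliar with DAS, the interchange interventions it relies on look roughly like this (a schematic PyTorch sketch under the linear-representation assumption the thread argues is load-bearing; names and dimensions are arbitrary): learn an orthogonal rotation of the residual stream, swap a small subspace from a source run into the base run, and rotate back. If the swap moves the model's output the way the high-level algorithm predicts, the subspace is taken to realize the corresponding high-level variable.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

d_model, d_sub = 512, 16  # d_sub: size of the aligned subspace (arbitrary)

# Learned orthogonal change of basis for the residual stream.
rotation = orthogonal(nn.Linear(d_model, d_model, bias=False))

def interchange(h_base: torch.Tensor, h_source: torch.Tensor) -> torch.Tensor:
    """Swap the first d_sub rotated coordinates of the source run into the base run."""
    r_base, r_source = rotation(h_base), rotation(h_source)
    patched = torch.cat([r_source[..., :d_sub], r_base[..., d_sub:]], dim=-1)
    # Undo the rotation: the weight is orthogonal, so multiplying by it inverts x @ W.T.
    return patched @ rotation.weight

h_base, h_source = torch.randn(4, d_model), torch.randn(4, d_model)
h_patched = interchange(h_base, h_source)  # feed back into the model at this layer
```

The thread's point is that if this rotation is replaced by an arbitrary non-linear map, the test loses its bite: some such map makes almost any algorithm appear to be realized, even by a randomly initialized network.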
@jkminder
Julian Minder
2 months
RT @SaiboGeng: 🚀 Excited to share our latest work at ICML 2025 — zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Toke…
0
6
0
@jkminder
Julian Minder
2 months
RT @tpimentelms: In this new paper, w/ @DenisSutte9310, @jkminder, and T. Hofmann, we study *causal abstraction*, a formal specification of…
0
1
0
@jkminder
Julian Minder
2 months
RT @tpimentelms: Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to gua…
0
28
0
@jkminder
Julian Minder
2 months
RT @glnmario: Some personal news ✨ In September, I’m joining @ucl as Associate Professor of Computational Linguistics. I’ll be building a l…
0
16
0
@jkminder
Julian Minder
2 months
Could this have caught OpenAI's sycophantic model update? Maybe! Post: Paper Thread: Paper:
arxiv.org
Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a...
@Butanium_
Clément Dumas
5 months
New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
0
0
10
@jkminder
Julian Minder
2 months
Our methods reveal interpretable features related to e.g. refusal detection, fake facts, or information about the model's identity. This highlights that model diffing is a promising research direction deserving more attention.
1
0
10
@jkminder
Julian Minder
2 months
By comparing base and chat models, we found that one of the main existing techniques (crosscoders) hallucinates differences due to how its sparsity is enforced. We fixed this and also found that just training an SAE on (chat - base) activations works surprisingly well.
1
0
8
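A concrete way to read the second idea (purely a sketch under my own naming, not the released code): cache activations for the same prompts and token positions from the base and the chat model, and train an ordinary L1-sparse autoencoder on their difference, so that its latents describe only what fine-tuning added.

```python
import torch
import torch.nn as nn

class DiffSAE(nn.Module):
    """Vanilla sparse autoencoder trained on (chat - base) activation differences."""

    def __init__(self, d_model: int, d_latent: int, l1_coeff: float = 3e-4):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, diff: torch.Tensor):
        z = torch.relu(self.enc(diff))
        recon = self.dec(z)
        loss = ((recon - diff) ** 2).mean() + self.l1_coeff * z.abs().sum(-1).mean()
        return recon, z, loss

# Stand-ins for cached residual-stream activations at matched positions.
base_acts = torch.randn(1024, 2048)
chat_acts = base_acts + 0.1 * torch.randn(1024, 2048)

sae = DiffSAE(d_model=2048, d_latent=16384)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for step in range(100):
    _, _, loss = sae(chat_acts - base_acts)
    opt.zero_grad()
    loss.backward()
    opt.step()
```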
@jkminder
Julian Minder
2 months
With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
2
8
104