
Julian Minder (@jkminder)
Followers: 213 · Following: 458 · Media: 21 · Statuses: 114
MATS 7 Scholar with Neel Nanda, CS Master at ETH Zürich, Incoming PhD at EPFL
London/Lausanne/Zürich · Joined November 2011
In our most recent work, we looked at how best to leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., a knowledge boundary latent, a detailed-info latent, and a humor/joke detection latent. (A minimal sketch of the crosscoder idea follows the quoted thread below.)
New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
0 replies · 0 reposts · 23 likes
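The quoted thread introduces BatchTopK crosscoders. As a rough illustration of the idea (my sketch, with hypothetical names and shapes, not the authors' released code): a crosscoder encodes paired base/chat activations into one shared latent space with a separate decoder per model, and BatchTopK keeps the k·B largest latent activations across the whole batch rather than exactly k per sample.

```python
# Minimal sketch of a BatchTopK crosscoder (hypothetical names and shapes,
# not the authors' released code). Activations from the base and chat model
# share one latent space; each model gets its own decoder. BatchTopK keeps
# the k*B largest latent activations across the whole batch instead of
# exactly k per sample, so per-sample sparsity can vary.
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.enc_base = nn.Linear(d_model, d_latent, bias=False)
        self.enc_chat = nn.Linear(d_model, d_latent, bias=False)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.dec_base = nn.Linear(d_latent, d_model)
        self.dec_chat = nn.Linear(d_latent, d_model)

    def forward(self, x_base, x_chat):  # each: (B, d_model)
        # shared latent pre-activation from both models' residual streams
        acts = torch.relu(self.enc_base(x_base) + self.enc_chat(x_chat) + self.b_enc)
        # BatchTopK: zero everything below the (k*B)-th largest activation
        threshold = acts.flatten().topk(self.k * acts.shape[0]).values.min()
        f = acts * (acts >= threshold)
        return self.dec_base(f), self.dec_chat(f), f
```

The exact limitations of vanilla crosscoders and the fixes are in the paper; this only shows the BatchTopK activation rule.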
A great and valuable blogpost by @paulcbogdan! I'd love to see a bit more statistical rigour in mech interp :)
One of my MATS scholars, @paulcbogdan, has a solid background in "how to use statistics to do rigorous science", and wrote a delightful post on how you can do this too! He once wrote a paper studying the past 20 years of psychology papers, and trends in what replicated.
0 replies · 0 reposts · 18 likes
RT @AmirZur2000: 1/6 🦉 Did you know that telling an LLM that it loves the number 087 also makes it love owls? In our new blogpost, It's Owl…
owls.baulab.info
Entangled tokens help explain subliminal learning.
0 replies · 72 reposts · 0 likes
RT @Butanium_: ⚠️LLM interp folks be aware⚠️ In @huggingface transformers 4.54, Llama & Qwen layers now return residual stream directly (no…
0 replies · 2 reposts · 0 likes
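Given that warning, hook code that unconditionally unpacks layer outputs as tuples can silently break across versions. A defensive pattern (a sketch under the assumption that a decoder layer returns either a tuple with hidden states first or the bare tensor; the checkpoint name is just an example):

```python
# Defensive residual-stream capture: per the warning above, decoder layers
# may return a tuple (hidden states first) in older transformers versions,
# or reportedly the bare tensor in 4.54+, so the hook handles both.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"  # example checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

captured = {}

def hook(module, inputs, output):
    # normalize tuple vs. bare-tensor layer outputs
    hidden = output[0] if isinstance(output, tuple) else output
    captured["resid"] = hidden.detach()

handle = model.model.layers[8].register_forward_hook(hook)
with torch.no_grad():
    model(**tok("Hello world", return_tensors="pt"))
handle.remove()
print(captured["resid"].shape)  # (batch, seq, d_model)
```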
RT @AnthropicAI: New Anthropic research: Persona vectors. Language models sometimes go haywire and slip into weird and unsettling personas…
0 replies · 937 reposts · 0 likes
RT @Jack_W_Lindsey: Update on a new interpretable decomposition method for LLMs -- sparse mixtures of linear transforms (MOLT). Preliminar…
0 replies · 21 reposts · 0 likes
RT @HCasademunt: Problem: Train LLM on insecure code → it becomes broadly misaligned. Solution: Add safety data? What if you can't? Use int…
0 replies · 27 reposts · 0 likes
RT @OwainEvans_UK: New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only…
0 replies · 1K reposts · 0 likes
What does this mean? Causal Abstraction - while still a promising framework - must explicitly constrain representational structure or include the notion of generalization, since our proof hinges on extreme overfitting. More detailed thread:
1/9 In our new interpretability paper, we analyse causal abstraction—the framework behind Distributed Alignment Search—and show it breaks when we remove linearity constraints on feature representations. We refer to this problem as the Non-Linear Representation Dilemma.
1 reply · 0 reposts · 3 likes
Causal Abstraction, the theory behind DAS, tests if a network realizes a given algorithm. We show (w/ @DenisSutte9310, T. Hofmann, @tpimentelms) that the theory collapses without the linear representation hypothesis—a problem we call the non-linear representation dilemma.
1 reply · 4 reposts · 26 likes
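For concreteness, an interchange intervention in the DAS style looks roughly like this (an illustrative sketch, not the paper's code; `R`, `d`, and the function name are mine): a learned orthogonal map picks out a subspace hypothesized to encode one of the algorithm's variables, and we transplant that subspace from a source run into a base run.

```python
# Illustrative interchange intervention in the DAS style (my sketch, not the
# paper's code). A learned orthogonal map R picks out a d-dimensional
# subspace hypothesized to encode one of the algorithm's variables; we
# transplant that subspace from a "source" run into a "base" run and check
# whether the model's output changes as the algorithm predicts.
import torch

def interchange(h_base, h_source, R, d):
    """h_*: (d_model,) activations; R: (d_model, d_model) orthogonal."""
    z_base, z_source = R @ h_base, R @ h_source
    z_base[:d] = z_source[:d]  # swap the aligned variable
    return R.T @ z_base        # rotate back into model coordinates
```

The dilemma in the thread: if the linear map R is replaced by an arbitrary non-linear map, the paper's proof shows essentially any algorithm can be aligned to a network that fits the data, so the test loses its force without the linear representation hypothesis.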
RT @SaiboGeng: 🚀 Excited to share our latest work at ICML 2025 — zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Toke…
0 replies · 6 reposts · 0 likes
RT @tpimentelms: In this new paper, w/ @DenisSutte9310, @jkminder, and T. Hofmann, we study *causal abstraction*, a formal specification of…
0 replies · 1 repost · 0 likes
RT @tpimentelms: Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to gua…
0 replies · 28 reposts · 0 likes
Could this have caught OpenAI's sycophantic model update? Maybe! Post: · Paper Thread: · Paper:
arxiv.org
Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a...
New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders. This finds interpretable and causal chat-only features! 🧵
0 replies · 0 reposts · 10 likes
With @Butanium_ and @NeelNanda5, we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it internally different from its base model. (A simple diffing diagnostic is sketched below.)
2 replies · 8 reposts · 104 likes
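One concrete diffing diagnostic that falls out of the crosscoder setup sketched earlier (again my illustration; the helper name and threshold are hypothetical): compare each latent's decoder norms across the two models, and flag latents whose chat decoder dominates as candidate chat-only features.

```python
# One simple diffing diagnostic (my illustration; threshold is arbitrary):
# compare each latent's decoder norms across the two models. Latents whose
# chat decoder dominates are candidate chat-only features.
import torch

def chat_only_latents(crosscoder, threshold=0.9):
    n_base = crosscoder.dec_base.weight.norm(dim=0)  # per-latent norms
    n_chat = crosscoder.dec_chat.weight.norm(dim=0)
    rel = n_chat / (n_base + n_chat + 1e-8)          # 1.0 => purely chat
    return torch.nonzero(rel > threshold).squeeze(-1)
```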