Aaron Mueller

@amuuueller

2K Followers · 3K Following · 57 Media · 280 Statuses

Asst. Prof. in CS at @BU_Tweets ≡ {Mechanistic, causal} {interpretability, computational linguistics} ≡ Formerly: PhD @jhuclsp

Boston, MA
Joined September 2015
@amuuueller
Aaron Mueller
1 year
Excited this project is out! Using sparse feature circuits, we can explain and modify how LMs arrive at a behavior. In this thread, I want to highlight open directions where computational linguists can use sparse feature circuits. 🧵
@saprmarks
Samuel Marks
1 year
Can we understand & edit unanticipated mechanisms in LMs? We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager, @ericjmichaud_, @boknilev, @davidbau, @amuuueller
1
8
49
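The core step behind sparse feature circuits is scoring how much each SAE feature contributes to a behavior. Below is a minimal sketch of that attribution step, assuming a PyTorch SAE object with `encode`/`decode` methods and a scalar `metric_fn` (e.g., a logit difference); these names are illustrative assumptions, not the paper's exact API.

```python
import torch

def feature_attributions(resid, sae, metric_fn):
    """First-order (attribution-patching) estimate of each SAE feature's
    effect on a scalar behavioral metric, in the spirit of sparse
    feature circuits. resid: residual-stream activations [batch, d_model];
    sae: sparse autoencoder with encode/decode (assumed interface);
    metric_fn: maps reconstructed activations to a scalar metric."""
    feats = sae.encode(resid).detach().requires_grad_(True)  # [batch, n_features]
    metric = metric_fn(sae.decode(feats)).sum()
    metric.backward()
    # Linearized effect of zero-ablating feature i: (0 - a_i) * dmetric/da_i.
    # Large-magnitude scores mark features worth keeping in the circuit.
    return -(feats * feats.grad).sum(dim=0)  # [n_features]
```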
@amuuueller
Aaron Mueller
1 month
RT @OrgadHadas: We're presenting the Mechanistic Interpretability Benchmark (MIB) now! Come and chat - East 1205. Project led by @amuuuelle…
0
8
0
@amuuueller
Aaron Mueller
1 month
If you're at #ICML2025, chat with me, @sarahwiegreffe, Atticus, and others at our poster 11am - 1:30pm at East #1205! We're establishing a 𝗠echanistic 𝗜nterpretability 𝗕enchmark. We're planning to keep this a living benchmark; come by and share your ideas/hot takes!
0
4
39
@amuuueller
Aaron Mueller
2 months
RT @nikhil07prakash: How do language models track the mental states of each character in a story, often referred to as Theory of Mind? Our rec…
0
97
0
@amuuueller
Aaron Mueller
3 months
RT @jowenpetty: How well can LLMs understand tasks with complex sets of instructions? We investigate through the lens of RELIC: REcognizing…
0
23
0
@amuuueller
Aaron Mueller
3 months
RT @jsrozner: BabyLMs' first constructions: new study on usage-based language acquisition in LMs w/ @LAWeissweiler, @coryshain. Simple inter…
0
7
0
@amuuueller
Aaron Mueller
3 months
RT @davidbau: Dear MAGA friends, I have been worrying about STEM in the US a lot, because right now the Senate is writing new laws that cu…
0
72
0
@amuuueller
Aaron Mueller
3 months
RT @_joestacey_: We have a new paper up on arXiv! 🥳🪇 The paper tries to improve the robustness of closed-source LLMs fine-tuned on NLI, as…
0
18
0
@amuuueller
Aaron Mueller
3 months
We still have a lot to learn about editing NN representations. To edit or steer, we cannot simply choose semantically relevant representations; we must choose the ones that will have the intended impact. As @peterbhase found, these are often distinct.
0
0
3
@amuuueller
Aaron Mueller
3 months
By limiting steering to output features, we recover >90% of the performance of the best supervised representation-based steering methods, and at some locations we outperform them!
1
0
4
@amuuueller
Aaron Mueller
3 months
We define the notion of an “output feature”, whose role is to increase p(some token(s)). Steering these gives better results than steering “input features”, whose role is to attend to concepts in the input. We propose fast methods to sort features into these categories.
1
0
3
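One cheap way to operationalize the input/output split (a hypothetical heuristic on my part, not necessarily the paper's procedure) is to project each feature's decoder direction through the unembedding matrix and check whether its logit effect concentrates on a few tokens:

```python
import torch

def output_feature_scores(W_dec: torch.Tensor, W_U: torch.Tensor, k: int = 10):
    """Heuristic 'output feature' score per SAE feature (illustrative sketch).
    W_dec: [n_features, d_model] decoder rows; W_U: [d_model, vocab] unembedding.
    Score = fraction of each feature's logit-effect norm carried by its
    top-k tokens; near 1 means the feature mostly boosts p(a few tokens)."""
    logit_effects = W_dec @ W_U                        # [n_features, vocab]
    top = logit_effects.abs().topk(k, dim=-1).values   # strongest per-token effects
    return top.norm(dim=-1) / logit_effects.norm(dim=-1).clamp_min(1e-8)
```

High scorers would be candidate "output features"; low scorers behave more like input detectors whose decoder directions spread diffusely over the vocabulary.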
@amuuueller
Aaron Mueller
3 months
SAEs have been found to massively underperform supervised methods for steering neural networks. In new work led by @dana_arad4, we find that this problem largely disappears if you select the right features!.
@dana_arad4
Dana Arad 🎗️
3 months
Tried steering with SAEs and found that not all features behave as expected? Check out our new preprint: "SAEs Are Good for Steering - If You Select the Right Features" 🧵
2
8
40
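Mechanically, steering with a selected SAE feature can be as simple as adding a scaled copy of its decoder direction to the residual stream during the forward pass. A minimal sketch, assuming a PyTorch model with forward hooks and an SAE exposing its decoder matrix as `W_dec` (module paths and attribute names are assumptions):

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook that shifts the residual stream by alpha * direction
    at every position."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):  # HF-style blocks return tuples
            return (output[0] + alpha * direction,) + output[1:]
        return output + alpha * direction
    return hook

# Hypothetical usage: steer with one selected output feature.
# direction = sae.W_dec[feature_idx]               # [d_model]
# handle = model.transformer.h[layer].register_forward_hook(
#     make_steering_hook(direction, alpha=4.0))
# ... generate text, then: handle.remove()
```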
@amuuueller
Aaron Mueller
3 months
RT @tomerashuach: 🚨 New paper at #ACL2025 Findings! REVS: Unlearning Sensitive Information in LMs via Rank Editing in the Vocabulary Space…
0
21
0
@amuuueller
Aaron Mueller
3 months
RT @tal_haklay: Our paper "Position-Aware Circuit Discovery" got accepted to ACL! 🎉 Huge thanks to my collaborators 🙏 @OrgadHadas @davidbau…
0
30
0
@amuuueller
Aaron Mueller
3 months
RT @boknilev: BlackboxNLP will be co-located with #EMNLP2025 in Suzhou this November! 📷 This edition will feature a new shared task on circu…
blackboxnlp.github.io
The Eighth Workshop on Analyzing and Interpreting Neural Networks for NLP
0
22
0
@amuuueller
Aaron Mueller
4 months
RT @weGotlieb: 📣 Paper Update 📣 It's bigger! It's better! Even if the language models aren't. 🤖 New version of "Bigger is not always Better: T…
osf.io
Neural network language models can learn a surprising amount about language by predicting upcoming words in a corpus. Recent language technologies work has demonstrated that large performance...
0
5
0
@amuuueller
Aaron Mueller
4 months
RT @babyLMchallenge: Close your books, test time! The evaluation pipelines are out, baselines are released, and the challenge is on. There i…
0
5
0
@amuuueller
Aaron Mueller
4 months
Presenting sparse feature circuits today at 3pm-5:30pm! Come say hi at poster #495
1
8
53
@amuuueller
Aaron Mueller
4 months
RT @yanaiela: 💡 New ICLR paper! 💡 "On Linear Representations and Pretraining Data Frequency in Language Models": We provide an explanation…
0
44
0
@amuuueller
Aaron Mueller
4 months
0
0
9
@amuuueller
Aaron Mueller
4 months
This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarahwiegreffe, @dana_arad4, @IvanArcus, @adambelfki, @yiksiux, @jadenfk23, @tal_haklay, @michaelwhanna, …
1
0
8