David Bau
@davidbau
Followers: 6K · Following: 1K · Media: 167 · Statuses: 970
Computer Science Professor at Northeastern, Ex-Googler. Believes AI should be transparent. @[email protected] @davidbau.bsky.social https://t.co/wmP5LV0pJ4
Boston
Joined January 2009
What is the goal of interpretability in AI? I spoke a bit about this at the recent https://t.co/VHW4gU2EJI alignment workshop: https://t.co/XcVqFSKoYY The event had excellent talks from many others, worth a view, linked in the thread.
Vienna #AlignmentWorkshop: 129 researchers tackled #AISafety from interpretability & robustness to governance. Keynote by @janleike + talks by @vkrakovna @DavidSKrueger @ghadfield @RobertTrager @NeelNanda5 @davidbau @hlntnr @MaryPhuong10 and more. Blog recap & videos. 👇
Replies: 1 · Reposts: 11 · Likes: 84
This paper draws connections between key concepts (attribution, unlearning, knowledge, the Fisher information matrix, and activation covariance) and proposes a very cool way to split up a model's weights to separate memorization from generalization. Insightful work @jack_merullo_!
How is memorized data stored in a model? We disentangle MLP weights in LMs and ViTs into rank-1 components based on their curvature in the loss, and find representational signatures of both generalizing structure and memorized training data
Replies: 1 · Reposts: 8 · Likes: 155
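The decomposition idea is simple enough to sketch. Below is a minimal toy version in Python, using the activation covariance as the curvature proxy; the paper's actual loss-curvature estimator and component selection will differ.

```python
import torch

def rank_one_split(W, X, k):
    """Toy sketch: split an MLP weight W (d_out, d_in) into rank-1
    components along eigenvectors of the input activation covariance,
    a rough stand-in for the paper's loss-curvature criterion."""
    C = X.T @ X / X.shape[0]                  # (d_in, d_in) activation covariance
    evals, V = torch.linalg.eigh(C)           # eigenvalues in ascending order
    V = V[:, evals.argsort(descending=True)]  # highest-curvature directions first
    comps = [torch.outer(W @ V[:, i], V[:, i]) for i in range(W.shape[1])]
    W_top = sum(comps[:k])                    # keep the top-k components
    return W_top, W - W_top                   # W == W_top + remainder
```

Since the eigenvectors are orthonormal, the rank-1 components sum back exactly to W, so the split is lossless.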
A key challenge for interpretability agents is knowing when they’ve understood enough to stop experimenting. Our @NeurIPSConf paper introduces a self-reflective agent that measures the reliability of its own explanations and stops once its understanding of models has converged.
Replies: 2 · Reposts: 26 · Likes: 44
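Here is a hypothetical sketch of the stopping rule in Python. The agent interface (propose_experiment, update_explanation, score_explanation) is invented for illustration; the convergence test is a simple plateau check on the explanation's predictive score, not the paper's exact criterion.

```python
def run_until_converged(agent, model, max_steps=20, window=3, tol=0.05):
    """Run interpretability experiments until the agent's explanation
    stops improving. All method names here are illustrative."""
    scores = []
    for _ in range(max_steps):
        experiment = agent.propose_experiment()
        result = model.run(experiment)
        agent.update_explanation(experiment, result)
        scores.append(agent.score_explanation())  # e.g. accuracy predicting held-out behavior
        recent = scores[-window:]
        if len(scores) >= window and max(recent) - min(recent) < tol:
            break  # score has plateaued: understanding has converged
    return agent.explanation, scores
```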
Paper, code, website - Please help reshare: https://t.co/kQOav8n70D
Thanks to my collaborators @giordanoprogers, @NatalieShapira, and @davidbau. Check out our paper for more details: 📜 https://t.co/A7cEMQlK7O 💻 https://t.co/kiwYl9UOHv 🌐
Replies: 0 · Reposts: 0 · Likes: 4
When you read the paper, be sure to check out the appendix, where @arnab_api discusses how pointer and value data are entangled in filters, and possible applications of the filter mechanism, such as a zero-shot "lie detector" that can flag incorrect statements in ordinary text.
Replies: 1 · Reposts: 0 · Likes: 4
Curiously, when the question precedes the list of candidates, there is an abstract predicate for "this is the answer I am looking for" that tags items in a list as soon as they are seen.
🎭 Plot twist: when the question is presented *before* the options, the causality score drops to near zero! We investigate this further and find that when the question is presented first, the LM can rely on a different strategy: now it can *eagerly* evaluate each option as it
Replies: 1 · Reposts: 0 · Likes: 3
And the neural representations for LLM filter heads are language-independent! If we pick up the representation for a question in French, it will accurately match items expressed in Thai.
Replies: 1 · Reposts: 0 · Likes: 1
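In code, the cross-lingual test amounts to comparing a predicate representation against item representations with a similarity threshold. A hedged sketch: the extraction hooks and the threshold here are assumptions, not the paper's procedure.

```python
import torch.nn.functional as F

def predicate_matches(q_rep, item_reps, threshold=0.5):
    """q_rep: predicate representation extracted from, e.g., a French
    question, shape (d,). item_reps: representations of list items in,
    e.g., Thai, shape (n, d). Returns a boolean mask of selected items."""
    sims = F.cosine_similarity(q_rep.unsqueeze(0), item_reps, dim=-1)
    return sims > threshold
```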
Arnab calls predicate attention heads "filter heads" because the same heads filter many properties across objects, people, and landmarks. The generic structure resembles functional programming's "filter" function, with a common mechanism handling a wide range of predicates.
🔍 In Llama-70B and Gemma-27B, we found special attention heads that consistently focus their attention on the filtered items. This behavior seems consistent across a range of different formats and semantic types.
Replies: 1 · Reposts: 0 · Likes: 1
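For reference, here is the functional-programming pattern the heads resemble, in plain Python (the menu and predicate are made-up examples):

```python
menu = ["grilled salmon", "caesar salad", "roasted broccoli", "ribeye steak"]

def is_veggie(item):
    """The predicate: True for vegetarian dishes (toy knowledge base)."""
    return item in {"caesar salad", "roasted broccoli"}

print(list(filter(is_veggie, menu)))  # ['caesar salad', 'roasted broccoli']
```

The paper's claim is that certain attention heads play the role of `filter`: one generic mechanism, with the predicate supplied by context.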
The secret life of an LM is defined by its internal data types. Inner layers transport abstractions that are more robust than words, like concepts, functions, or pointers. In new work yesterday, @arnab_api et al. identify a data type for *predicates*.
How can a language model find the veggies in a menu? New pre-print where we investigate the internal mechanisms of LLMs when filtering on a list of options. Spoiler: turns out LLMs use strategies surprisingly similar to functional programming (think "filter" from Python)! 🧵
Replies: 1 · Reposts: 5 · Likes: 63
If you are at #ICCV2025 - the explainable computer vision eXCV workshop is about to start, in Ballroom A on the 4th floor!
Two days to go for the eXCV Workshop at #ICCV2025 ! Join us on Sunday from 08:45-17:40 in Ballroom A to hear about the state of XAI research from an exciting lineup of speakers! @Napoolar , @hila_chefer , @davidbau , Deepti Ghadiyaram, Viorica Patraucean, Sharon Li. (1/4)
Replies: 1 · Reposts: 1 · Likes: 50
@ndif_team NDIF workbench team: design and implementation by Caden Juang @kh4dien, Avery Yen @avery_yen, Adam Belfki @adambelfki, and @_jon_bell_'s student Gwen Lincroft.
Replies: 1 · Reposts: 1 · Likes: 5
Help me thank @ndif_team for rolling out https://t.co/R9hpMXuuND by using it to make your own discoveries inside LLM internals. We should all be looking inside our LLMs. Help share the tool by retweeting! Share what you find! And send the team feedback -
Replies: 1 · Reposts: 1 · Likes: 15
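If you prefer code to the point-and-click workbench, the same NDIF backend is reachable through the nnsight library. A minimal sketch, hedged: module paths depend on the model, API details vary by nnsight version, and remote execution assumes an NDIF account.

```python
from nnsight import LanguageModel

# GPT-2 as a small local example; swap in a larger model
# and pass remote=True to trace() to run on NDIF.
model = LanguageModel("openai-community/gpt2")

with model.trace("Italiano: amore, Español: amor, Français:"):
    hidden = model.transformer.h[6].output[0].save()  # layer-6 residual stream
    logits = model.lm_head.output.save()              # final logits

print(hidden.shape, logits.shape)  # older nnsight versions expose these via .value
```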
That process was noticed by @wendlerch in https://t.co/uaYhmvcC8h and seen by @sheridan_feucht in https://t.co/uNNKAij9lD Try it out yourself on https://t.co/R9hpMXuuND. Does it work with other words? Can you find interesting exceptions? How about prompts beyond translation?
dualroute.baulab.info
Do LLMs copy meaningful text by rote or by understanding meaning? Webpage for The Dual-Route Model of Induction (Feucht et al., 2025).
Replies: 2 · Reposts: 1 · Likes: 17
The lens reveals: the model does NOT go directly from amore to "amor" or "amour" by just dropping or adding letters! Instead it first "thinks" about the (English) word "love". In other words: LLMs translate using *concepts*, not tokens.
Replies: 1 · Reposts: 1 · Likes: 14
Enter a translation prompt: "Italiano: amore, Español: amor, Français:". The workbench doesn't just show you the model's output. It shows the grid of internal states that lead to the output. Researchers call this visualization the "logit lens".
Replies: 1 · Reposts: 1 · Likes: 11
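Here is roughly what the workbench computes, as a standalone Python sketch using GPT-2 as a small stand-in (the demo in this thread uses GPT-J-6B; layer-by-layer results will differ by model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Italiano: amore, Español: amor, Français:"
with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

# Logit lens: decode each layer's hidden state at the last position
# through the final layer norm and the unembedding matrix.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, repr(tok.decode(logits.argmax())))
```

Each line is one layer's best guess at the next token; scanning down the column shows where an intermediate concept surfaces before the final output.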
But why theorize? We can actually look at what it does. Visit the NDIF workbench here: https://t.co/R9hpMXuuND, and pull up any LLM that can translate, like GPT-J-6b. If you register an account you can access larger models.
Ever wished you could explore what's happening inside a 405B parameter model without writing any code? Workbench, our AI interpretability interface, is now live for public beta at https://t.co/L7s8vPfeds!
Replies: 1 · Reposts: 1 · Likes: 11
What does an LLM do when it translates from Italian "amore" to Spanish "amor" or French "amour"? That's easy! (you might think) Because surely it knows: amore, amor, amour are all based on the same Latin word. It can just drop the "e", or add a "u".
Replies: 1 · Reposts: 4 · Likes: 54
Looking forward to #COLM2025 tomorrow - don't forget to look for @sheridan_feucht 's poster. And DM me if you'll also be there and want to chat.
Who is going to be at #COLM2025? I want to draw your attention to a COLM paper by my student @sheridan_feucht that has totally changed the way I think and teach about LLM representations. The work is worth knowing. And you can meet Sheridan at COLM, Oct 7!
Replies: 0 · Reposts: 2 · Likes: 38
On the Good Fight podcast with @Yascha_Mounk I give a quick but deep primer on how modern AI works. I also chat about our responsibility as machine learning scientists, and what we need to fix to get AI right. Take a listen and reshare! https://t.co/XstKhiZrGD
persuasion.community
Yascha Mounk and David Bau delve into the “black box” of AI.
Replies: 2 · Reposts: 4 · Likes: 27
Reminder that today is the deadline to apply for our hot-swapping program! Be the first to test out many new models remotely on NDIF and submit your application today! More details: https://t.co/KegNNZsuoQ Application link:
docs.google.com
Do you have a research project where you plan to study many different models? NDIF will soon be deploying model hot-swapping, which will enable users to access any HuggingFace model remotely via...
Replies: 0 · Reposts: 1 · Likes: 3
Appearing in Computational Linguistics https://t.co/G1rRtOQPcK
https://t.co/UOAWTPfoTP
direct.mit.edu
Abstract. Interpretability provides a toolset for understanding how and why language models behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations...
What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms). We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
Replies: 0 · Reposts: 0 · Likes: 2