David Bau

@davidbau

Followers: 6K · Following: 1K · Media: 167 · Statuses: 970

Computer Science Professor at Northeastern, Ex-Googler. Believes AI should be transparent. @[email protected] @davidbau.bsky.social https://t.co/wmP5LV0pJ4

Boston
Joined January 2009
@davidbau
David Bau
1 year
What is the goal of interpretability in AI? I spoke a bit about this at the recent https://t.co/VHW4gU2EJI alignment workshop: https://t.co/XcVqFSKoYY The event had excellent talks from many others, worth a view, linked in the thread.
@farairesearch
FAR.AI
1 year
Vienna #AlignmentWorkshop: 129 researchers tackled #AISafety from interpretability & robustness to governance. Keynote by @janleike + talks by @vkrakovna @DavidSKrueger @ghadfield @RobertTrager @NeelNanda5 @davidbau @hlntnr @MaryPhuong10 and more. Blog recap & videos. 👇
1
11
84
@davidbau
David Bau
16 hours
This paper draws connections between key concepts (attribution, unlearning, knowledge, the FIM, and activation covariance) and proposes a very cool way to split up the weights of a model to separate memorization from generalization. Insightful work @jack_merullo_!
@jack_merullo_
Jack Merullo
1 day
How is memorized data stored in a model? We disentangle MLP weights in LMs and ViTs into rank-1 components based on their curvature in the loss, and find representational signatures of both generalizing structure and memorized training data
1
8
155
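A toy sketch of the core idea in this thread, for intuition only (my own simplification under strong assumptions, not the paper's actual procedure): split a weight matrix into rank-1 components via SVD, then score each component by the loss curvature along it. For a linear layer with a squared loss, that curvature reduces to v^T Cov(x) v, which is one way curvature and activation covariance end up connected.

```python
# Toy sketch (my simplification, NOT the paper's method): rank-1 components of a
# weight matrix, each scored by a curvature proxy that reduces to the activation
# covariance for a linear layer with squared loss.
import torch

torch.manual_seed(0)
n, d = 256, 64
W = torch.randn(d, d)   # stand-in for an MLP weight matrix (hypothetical)
x = torch.randn(n, d)   # stand-in input activations (hypothetical)

# Rank-1 split: W = sum_i S[i] * U[:, i] @ Vh[i, :]
U, S, Vh = torch.linalg.svd(W)

# For loss = mean((x @ W.T - y) ** 2), the second derivative of the loss along the
# rank-1 direction u_i v_i^T is proportional to v_i^T (x^T x / n) v_i.
cov = x.T @ x / n                                     # activation covariance
curvature = torch.einsum("id,de,ie->i", Vh, cov, Vh)  # v_i^T cov v_i, per component

# Which end of the curvature spectrum carries memorized data vs. generalizing
# structure is the paper's empirical question; this only shows how per-component
# curvature scores could be computed.
top = torch.argsort(curvature, descending=True)[:5]
print("highest-curvature rank-1 components:", top.tolist())
```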
@TamarRottShaham
Tamar Rott Shaham
2 days
A key challenge for interpretability agents is knowing when they’ve understood enough to stop experimenting. Our @NeurIPSConf paper introduces a self-reflective agent that measures the reliability of its own explanations and stops once its understanding of models has converged.
2
26
44
@davidbau
David Bau
1 day
Paper, code, website - Please help reshare: https://t.co/kQOav8n70D
@arnab_api
Arnab Sen Sharma
3 days
Thanks to my collaborators @giordanoprogers , @NatalieShapira, and @davidbau. Checkout our paper for more details: 📜 https://t.co/A7cEMQlK7O 💻 https://t.co/kiwYl9UOHv 🌐
0
0
4
@davidbau
David Bau
1 day
When you read the paper, be sure to check out the appendix where @arnab_api discusses how pointer and value data are entangled in filters. And possible applications of the filter mechanism, like as a zero-shot "lie detector" that can flag incorrect statements in ordinary text.
1
0
4
@CryptoAutos
CryptoAutos
6 hours
bull market, is that you?
6
30
211
@davidbau
David Bau
1 day
Curiously, when the question precedes the list of candidates, there is an abstract predicate for "this is the answer I am looking for" that tags items in a list as soon as they are seen.
@arnab_api
Arnab Sen Sharma
3 days
🎭 Plot twist: when the question is presented *before* the options, the causality score drops to near zero! We investigate this further and find that when the question is presented first, the LM can rely on a different strategy: now it can *eagerly* evaluate each option as it
1
0
3
@davidbau
David Bau
1 day
And the neural representations for LLM filter heads are language independent! If we pick up the representation for a question in French, it will accurately match items expressed in Thai.
1
0
1
@davidbau
David Bau
1 day
Arnab calls predicate attention heads "filter heads" because the same heads filter many properties across objects, people, and landmarks. The generic structure resembles functional programming's "filter" function, with a common mechanism handling a wide range of predicates.
@arnab_api
Arnab Sen Sharma
3 days
🔍 In Llama-70B and Gemma-27B, we found special attention heads that consistently focus their attention on the filtered items. This behavior seems consistent across a range of different formats and semantic types.
1
0
1
@davidbau
David Bau
1 day
The secret life of an LM is defined by its internal data types. Inner layers transport abstractions that are more robust than words, like concepts, functions, or pointers. In new work yesterday, @arnab_api et al. identify a data type for *predicates*.
@arnab_api
Arnab Sen Sharma
3 days
How can a language model find the veggies in a menu? New pre-print where we investigate the internal mechanisms of LLMs when filtering on a list of options. Spoiler: turns out LLMs use strategies surprisingly similar to functional programming (think "filter" from python)! 🧵
1
5
63
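For readers unfamiliar with the functional-programming analogy the thread leans on, here is a tiny illustration (my own toy example, not code from the paper): "filter" is a generic mechanism that takes any predicate and applies it across a list, which is the behavior the paper attributes to the model's filter heads.

```python
# Toy illustration of the "filter" analogy (my own example, not from the paper).
# The mechanism is generic; only the predicate changes, mirroring the claim that
# the same attention heads handle many different predicates.
menu = ["grilled salmon", "caesar salad", "roasted beet hummus", "ribeye steak"]

def is_vegetarian(item: str) -> bool:
    # Hypothetical predicate; in the model, the predicate is a learned representation.
    return item in {"caesar salad", "roasted beet hummus"}

vegetarian_options = list(filter(is_vegetarian, menu))
print(vegetarian_options)  # ['caesar salad', 'roasted beet hummus']
```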
@davidbau
David Bau
19 days
If you are at #ICCV2025 - the explainable computer vision eXCV workshop is about to start, in Ballroom A on the 4th floor!
@SwetaMahajan1
Sweta Mahajan @ICCV2025
22 days
Two days to go for the eXCV Workshop at #ICCV2025 ! Join us on Sunday from 08:45-17:40 in Ballroom A to hear about the state of XAI research from an exciting lineup of speakers! @Napoolar , @hila_chefer , @davidbau , Deepti Ghadiyaram, Viorica Patraucean, Sharon Li. (1/4)
1
1
50
@davidbau
David Bau
27 days
@ndif_team NDIF workbench team: design and implementation by Caden Juang @kh4dien, Avery Yen @avery_yen, Adam Belfki @adambelfki, and @_jon_bell_'s student Gwen Lincroft.
1
1
5
@davidbau
David Bau
28 days
Help me thank @ndif_team for rolling out https://t.co/R9hpMXuuND by using it to make your own discoveries inside LLM internals. We should all be looking inside our LLMs. Help share the tool by retweeting! Share what you find! And send the team feedback -
@ndif_team
NDIF
28 days
This is a public beta, so we expect bugs and actively want your feedback:
1
1
15
@davidbau
David Bau
28 days
That process was noticed by @wendlerch in https://t.co/uaYhmvcC8h and seen by @sheridan_feucht in https://t.co/uNNKAij9lD Try it out yourself on https://t.co/R9hpMXuuND. Does it work with other words? Can you find interesting exceptions? How about prompts beyond translation?
dualroute.baulab.info
Do LLMs copy meaningful text by rote or by understanding meaning? Webpage for The Dual-Route Model of Induction (Feucht et al., 2025).
2
1
17
@davidbau
David Bau
28 days
The lens reveals: the model does NOT go directly from amore to "amor" or "amour" by just dropping or adding letters! Instead it first "thinks" about the (English) word "love". In other words: LLMs translate using *concepts*, not tokens.
1
1
14
@davidbau
David Bau
28 days
Enter a translation prompt: "Italiano: amore, Español: amor, Français:". The workbench doesn't just show you the model's output. It shows the grid of internal states that lead to the output. Researchers call this visualization the "logit lens".
1
1
11
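For anyone who wants the same view in code rather than in the workbench UI, here is a minimal logit-lens sketch using GPT-2 via Hugging Face transformers (my own stand-in; the thread itself demonstrates GPT-J-6B on NDIF): each layer's hidden state at the final token is pushed through the model's final layer norm and unembedding, showing which token the model is "thinking of" at that depth.

```python
# Minimal logit-lens sketch (assumption: small GPT-2 as a stand-in, not GPT-J or
# the NDIF workbench). Decodes intermediate hidden states through the final layer
# norm and unembedding to see the model's "best guess" at each layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the thread demonstrates GPT-J-6B
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "Italiano: amore, Español: amor, Français:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.hidden_states: one tensor per layer (plus embeddings), each [batch, seq, hidden]
for layer, h in enumerate(out.hidden_states):
    # Project the last-token hidden state through ln_f and the unembedding matrix.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax(dim=-1))!r}")
```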
@davidbau
David Bau
28 days
But why theorize? We can actually look at what it does. Visit the NDIF workbench here: https://t.co/R9hpMXuuND, and pull up any LLM that can translate, like GPT-J-6b. If you register an account you can access larger models.
@ndif_team
NDIF
28 days
Ever wished you could explore what's happening inside a 405B parameter model without writing any code? Workbench, our AI interpretability interface, is now live for public beta at https://t.co/L7s8vPfeds!
1
1
11
@davidbau
David Bau
28 days
What does an LLM do when it translates from Italian "amore" to Spanish "amor" or French "amour"? That's easy! (you might think) Because surely it knows: amore, amor, amour are all based on the same Latin word. It can just drop the "e", or add a "u".
1
4
54
@davidbau
David Bau
1 month
Looking forward to #COLM2025 tomorrow - don't forget to look for @sheridan_feucht 's poster. And DM me if you'll also be there and want to chat.
@davidbau
David Bau
1 month
Who is going to be at #COLM2025? I want to draw your attention to a COLM paper by my student @sheridan_feucht that has totally changed the way I think and teach about LLM representations. The work is worth knowing. And you can meet Sheridan at COLM, Oct 7!
0
2
38
@davidbau
David Bau
1 month
On the Good Fight podcast with @Yascha_Mounk I give a quick but deep primer on how modern AI works. I also chat about our responsibility as machine learning scientists, and what we need to fix to get AI right. Take a listen and reshare! https://t.co/XstKhiZrGD
persuasion.community
Yascha Mounk and David Bau delve into the “black box” of AI.
2
4
27
@ndif_team
NDIF
1 month
Reminder that today is the deadline to apply for our hot-swapping program! Be the first to test out many new models remotely on NDIF and submit your application today! More details: https://t.co/KegNNZsuoQ Application link:
docs.google.com
Do you have a research project where you plan to study many different models? NDIF will soon be deploying model hot-swapping, which will enable users to access any HuggingFace model remotely via...
0
1
3
@davidbau
David Bau
1 month
Appearing in Computational Linguistics https://t.co/G1rRtOQPcK https://t.co/UOAWTPfoTP
direct.mit.edu
Abstract. Interpretability provides a toolset for understanding how and why language models behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations...
@amuuueller
Aaron Mueller
1 month
What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms). We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
0
0
2