David Bau
@davidbau
Followers: 6K · Following: 1K · Media: 167 · Statuses: 970
Computer Science Professor at Northeastern, Ex-Googler. Believes AI should be transparent. @[email protected] @davidbau.bsky.social https://t.co/wmP5LV0pJ4
Boston
Joined January 2009
What is the goal of interpretability in AI? I spoke a bit about this at the recent https://t.co/VHW4gU2EJI alignment workshop: https://t.co/XcVqFSKoYY The event had excellent talks from many others, worth a view, linked in the thread.
Vienna #AlignmentWorkshop: 129 researchers tackled #AISafety from interpretability & robustness to governance. Keynote by @janleike + talks by @vkrakovna @DavidSKrueger @ghadfield @RobertTrager @NeelNanda5 @davidbau @hlntnr @MaryPhuong10 and more. Blog recap & videos. 👇
Replies: 1 · Reposts: 11 · Likes: 84
This paper draws connections between key concepts (attribution, unlearning, knowledge, the Fisher information matrix, and activation covariance) and proposes a very cool way to split up a model's weights to separate memorization from generalization. Insightful work @jack_merullo_!
How is memorized data stored in a model? We disentangle MLP weights in LMs and ViTs into rank-1 components based on their curvature in the loss, and find representational signatures of both generalizing structure and memorized training data
Replies: 1 · Reposts: 8 · Likes: 155
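The decomposition idea is simple enough to sketch. Below is a minimal toy version in Python, using the activation covariance as the curvature proxy; the paper's actual loss-curvature estimator and component selection will differ.

```python
import torch

def rank_one_split(W, X, k):
    """Toy sketch: split an MLP weight W (d_out, d_in) into rank-1
    components along eigenvectors of the input activation covariance,
    a rough stand-in for the paper's loss-curvature criterion."""
    C = X.T @ X / X.shape[0]                  # (d_in, d_in) activation covariance
    evals, V = torch.linalg.eigh(C)           # eigenvalues in ascending order
    V = V[:, evals.argsort(descending=True)]  # highest-curvature directions first
    comps = [torch.outer(W @ V[:, i], V[:, i]) for i in range(W.shape[1])]
    W_top = sum(comps[:k])                    # keep the top-k components
    return W_top, W - W_top                   # W == W_top + remainder
```

Since the eigenvectors are orthonormal, the rank-1 components sum back exactly to W, so the split is lossless.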
A key challenge for interpretability agents is knowing when they’ve understood enough to stop experimenting. Our @NeurIPSConf paper introduces a self-reflective agent that measures the reliability of its own explanations and stops once its understanding of models has converged.
Replies: 2 · Reposts: 26 · Likes: 44
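Here is a hypothetical sketch of the stopping rule in Python. The agent interface (propose_experiment, update_explanation, score_explanation) is invented for illustration; the convergence test is a simple plateau check on the explanation's predictive score, not the paper's exact criterion.

```python
def run_until_converged(agent, model, max_steps=20, window=3, tol=0.05):
    """Run interpretability experiments until the agent's explanation
    stops improving. All method names here are illustrative."""
    scores = []
    for _ in range(max_steps):
        experiment = agent.propose_experiment()
        result = model.run(experiment)
        agent.update_explanation(experiment, result)
        scores.append(agent.score_explanation())  # e.g. accuracy predicting held-out behavior
        recent = scores[-window:]
        if len(scores) >= window and max(recent) - min(recent) < tol:
            break  # score has plateaued: understanding has converged
    return agent.explanation, scores
```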
Paper, code, website - Please help reshare: https://t.co/kQOav8n70D
Thanks to my collaborators @giordanoprogers, @NatalieShapira, and @davidbau. Check out our paper for more details: 📜 https://t.co/A7cEMQlK7O 💻 https://t.co/kiwYl9UOHv 🌐
Replies: 0 · Reposts: 0 · Likes: 4
When you read the paper, be sure to check out the appendix, where @arnab_api discusses how pointer and value data are entangled in filters, and possible applications of the filter mechanism, such as a zero-shot "lie detector" that can flag incorrect statements in ordinary text.
Replies: 1 · Reposts: 0 · Likes: 4
Curiously, when the question precedes the list of candidates, there is an abstract predicate for "this is the answer I am looking for" that tags items in a list as soon as they are seen.
🎭 Plot twist: when the question is presented *before* the options, the causality score drops to near zero! We investigate this further and find that when the question is presented first, the LM can rely on a different strategy: now it can *eagerly* evaluate each option as it
Replies: 1 · Reposts: 0 · Likes: 3
And the neural representations for LLM filter heads are language-independent! If we pick up the representation for a question in French, it will accurately match items expressed in Thai.
Replies: 1 · Reposts: 0 · Likes: 1
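In code, the cross-lingual test amounts to comparing a predicate representation against item representations with a similarity threshold. A hedged sketch: the extraction hooks and the threshold here are assumptions, not the paper's procedure.

```python
import torch.nn.functional as F

def predicate_matches(q_rep, item_reps, threshold=0.5):
    """q_rep: predicate representation extracted from, e.g., a French
    question, shape (d,). item_reps: representations of list items in,
    e.g., Thai, shape (n, d). Returns a boolean mask of selected items."""
    sims = F.cosine_similarity(q_rep.unsqueeze(0), item_reps, dim=-1)
    return sims > threshold
```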
Arnab calls predicate attention heads "filter heads" because the same heads filter many properties across objects, people, and landmarks. The generic structure resembles functional programming's "filter" function, with a common mechanism handling a wide range of predicates.
🔍 In Llama-70B and Gemma-27B, we found special attention heads that consistently focus their attention on the filtered items. This behavior seems consistent across a range of different formats and semantic types.
Replies: 1 · Reposts: 0 · Likes: 1
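For reference, here is the functional-programming pattern the heads resemble, in plain Python (the menu and predicate are made-up examples):

```python
menu = ["grilled salmon", "caesar salad", "roasted broccoli", "ribeye steak"]

def is_veggie(item):
    """The predicate: True for vegetarian dishes (toy knowledge base)."""
    return item in {"caesar salad", "roasted broccoli"}

print(list(filter(is_veggie, menu)))  # ['caesar salad', 'roasted broccoli']
```

The paper's claim is that certain attention heads play the role of `filter`: one generic mechanism, with the predicate supplied by context.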
The secret life of an LM is defined by its internal data types. Inner layers transport abstractions that are more robust than words, like concepts, functions, or pointers. In new work yesterday, @arnab_api et al. identify a data type for *predicates*.
How can a language model find the veggies in a menu? New pre-print where we investigate the internal mechanisms of LLMs when filtering on a list of options. Spoiler: turns out LLMs use strategies surprisingly similar to functional programming (think "filter" from Python)! 🧵
Replies: 1 · Reposts: 5 · Likes: 63
If you are at #ICCV2025 - the explainable computer vision eXCV workshop is about to start, in Ballroom A on the 4th floor!
Two days to go for the eXCV Workshop at #ICCV2025 ! Join us on Sunday from 08:45-17:40 in Ballroom A to hear about the state of XAI research from an exciting lineup of speakers! @Napoolar , @hila_chefer , @davidbau , Deepti Ghadiyaram, Viorica Patraucean, Sharon Li. (1/4)
Replies: 1 · Reposts: 1 · Likes: 50
@ndif_team NDIF workbench team: design and implementation by Caden Juang @kh4dien, Avery Yen @avery_yen, Adam Belfki @adambelfki, and @_jon_bell_'s student Gwen Lincroft.
Replies: 1 · Reposts: 1 · Likes: 5
Help me thank @ndif_team for rolling out https://t.co/R9hpMXuuND by using it to make your own discoveries inside LLM internals. We should all be looking inside our LLMs. Help share the tool by retweeting! Share what you find! And send the team feedback -
Replies: 1 · Reposts: 1 · Likes: 15
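If you prefer code to the point-and-click workbench, the same NDIF backend is reachable through the nnsight library. A minimal sketch, hedged: module paths depend on the model, API details vary by nnsight version, and remote execution assumes an NDIF account.

```python
from nnsight import LanguageModel

# GPT-2 as a small local example; swap in a larger model
# and pass remote=True to trace() to run on NDIF.
model = LanguageModel("openai-community/gpt2")

with model.trace("Italiano: amore, Español: amor, Français:"):
    hidden = model.transformer.h[6].output[0].save()  # layer-6 residual stream
    logits = model.lm_head.output.save()              # final logits

print(hidden.shape, logits.shape)  # older nnsight versions expose these via .value
```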
That process was noticed by @wendlerch in https://t.co/uaYhmvcC8h and seen by @sheridan_feucht in https://t.co/uNNKAij9lD Try it out yourself on https://t.co/R9hpMXuuND. Does it work with other words? Can you find interesting exceptions? How about prompts beyond translation?
dualroute.baulab.info
Do LLMs copy meaningful text by rote or by understanding meaning? Webpage for The Dual-Route Model of Induction (Feucht et al., 2025).
Replies: 2 · Reposts: 1 · Likes: 17
The lens reveals: the model does NOT go directly from amore to "amor" or "amour" by just dropping or adding letters! Instead it first "thinks" about the (English) word "love". In other words: LLMs translate using *concepts*, not tokens.
Replies: 1 · Reposts: 1 · Likes: 14
Enter a translation prompt: "Italiano: amore, Español: amor, Français:". The workbench doesn't just show you the model's output. It shows the grid of internal states that lead to the output. Researchers call this visualization the "logit lens".
Replies: 1 · Reposts: 1 · Likes: 11
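Here is roughly what the workbench computes, as a standalone Python sketch using GPT-2 as a small stand-in (the demo in this thread uses GPT-J-6B; layer-by-layer results will differ by model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Italiano: amore, Español: amor, Français:"
with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

# Logit lens: decode each layer's hidden state at the last position
# through the final layer norm and the unembedding matrix.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, repr(tok.decode(logits.argmax())))
```

Each line is one layer's best guess at the next token; scanning down the column shows where an intermediate concept surfaces before the final output.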
But why theorize? We can actually look at what it does. Visit the NDIF workbench here: https://t.co/R9hpMXuuND, and pull up any LLM that can translate, like GPT-J-6b. If you register an account you can access larger models.
Ever wished you could explore what's happening inside a 405B parameter model without writing any code? Workbench, our AI interpretability interface, is now live for public beta at https://t.co/L7s8vPfeds!
Replies: 1 · Reposts: 1 · Likes: 11
What does an LLM do when it translates from Italian "amore" to Spanish "amor" or French "amour"? That's easy! (you might think) Because surely it knows: amore, amor, amour are all based on the same Latin word. It can just drop the "e", or add a "u".
Replies: 1 · Reposts: 4 · Likes: 54
Looking forward to #COLM2025 tomorrow - don't forget to look for @sheridan_feucht 's poster. And DM me if you'll also be there and want to chat.
Who is going to be at #COLM2025? I want to draw your attention to a COLM paper by my student @sheridan_feucht that has totally changed the way I think and teach about LLM representations. The work is worth knowing. And you can meet Sheridan at COLM, Oct 7!
Replies: 0 · Reposts: 2 · Likes: 38
On the Good Fight podcast with @Yascha_Mounk I give a quick but deep primer on how modern AI works. I also chat about our responsibility as machine learning scientists, and what we need to fix to get AI right. Take a listen and reshare! https://t.co/XstKhiZrGD
persuasion.community
Yascha Mounk and David Bau delve into the “black box” of AI.
Replies: 2 · Reposts: 4 · Likes: 27
Reminder that today is the deadline to apply for our hot-swapping program! Be the first to test out many new models remotely on NDIF and submit your application today! More details: https://t.co/KegNNZsuoQ Application link:
docs.google.com
Do you have a research project where you plan to study many different models? NDIF will soon be deploying model hot-swapping, which will enable users to access any HuggingFace model remotely via...
Replies: 0 · Reposts: 1 · Likes: 3
Appearing in Computational Linguistics https://t.co/G1rRtOQPcK
https://t.co/UOAWTPfoTP
direct.mit.edu
Abstract. Interpretability provides a toolset for understanding how and why language models behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations...
What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms). We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
Replies: 0 · Reposts: 0 · Likes: 2