
Jack Lindsey
@Jack_W_Lindsey
Followers: 2K · Following: 233 · Media: 48 · Statuses: 248
Neuroscience of AI brains @AnthropicAI. Previously neuroscience of real brains @cu_neurotheory.
Joined January 2019
We’re releasing an open-source library and public interactive interface for tracing the internal “thoughts” of a language model. Now anyone can explore the inner workings of LLMs — and it only takes seconds!
@mntssys and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on @neuronpedia:
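For anyone curious what "type in a sentence, get out a circuit" looks like mechanically, here is a rough, hypothetical sketch of the simplest version of the idea: attribute a next-token prediction back to MLP neurons with a gradient-times-activation score. It uses plain PyTorch and Hugging Face GPT-2 rather than circuit-tracer itself; the model choice, hook points, and scoring rule below are illustrative assumptions, not the library's API or method, which are documented in the repo and on Neuronpedia.

```python
# Hypothetical illustration, NOT the circuit-tracer API: a crude
# gradient-times-activation attribution over GPT-2's MLP neurons, showing the
# kind of question the library answers ("which internal features mattered for
# this next-token prediction?"). Assumes a recent transformers version where
# activation functions are nn.Modules.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt")

# Cache post-GELU MLP activations and keep their gradients around.
acts = {}
def make_hook(layer_idx):
    def hook(mod, inp, out):
        out.retain_grad()
        acts[layer_idx] = out
    return hook

handles = [block.mlp.act.register_forward_hook(make_hook(i))
           for i, block in enumerate(model.transformer.h)]

logits = model(**inputs).logits[0, -1]   # next-token logits at the last position
target = logits.argmax()                 # attribute the model's top prediction
logits[target].backward()

# Score each neuron by activation * gradient at the final token position.
for i, act in acts.items():
    scores = act[0, -1] * act.grad[0, -1]
    top = scores.abs().topk(3)
    print(f"layer {i:2d}: top neurons {top.indices.tolist()} "
          f"for next token {tok.decode(target.item())!r}")

for h in handles:
    h.remove()
```

Gradient-times-activation is only a first-order approximation of each neuron's effect on the chosen logit; the actual tooling goes further, tracing attributions through sparse, interpretable replacement features and assembling the interactions into a graph you can browse interactively.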
RT @chingfang17: Humans and animals can rapidly learn in new environments. What computations support this? We study the mechanisms of in-co….
You can also find a detailed tutorial on the code, including worked examples and analyses. Huge thanks to @michaelwhanna and @mntssys for developing the library, @johnnylin for the Neuronpedia integration, and @mlpowered for orchestrating!
RT @AnthropicAI: Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-s….
RT @ch402: The Anthropic Interpretability Team is planning a virtual Q&A to answer Qs about how we plan to make models safer, the role of t….
RT @Butanium_: New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscode….
RT @emollick: There’s at least a dozen dissertations to be written from this paper by Anthropic alone, which gives us some insight into how….
Human thought is built out of billions of cellular computations each second. Language models also perform billions of computations for each word they write. But do these form a coherent “thought process”? We’re starting to build tools to find out! Some reflections in thread.
New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
I participated in this as an auditor, poking around in an LLM's brain to find its evil secrets. Most fun I've had at work! Very clever + thoughtful work by the lead authors in designing the model + the game, which set a precedent for how we can validate safety auditing techniques.
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
RT @leedsharkey: Big new review! 🟦Open Problems in Mechanistic Interpretability🟦. We bring together perspectives from ~30 top researchers….
If you’re interested in interpretability of LLMs, or any other AI safety-related topics, consider applying to Anthropic’s new Fellows program! The deadline is January 20, but applications are reviewed on a rolling basis, so earlier is better if you can. I’ll be one of the mentors!
We’re starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March 2025, we'll provide funding, compute, and research mentorship to 10–15 Fellows with strong coding and technical backgrounds.