Jack Lindsey Profile
Jack Lindsey

@Jack_W_Lindsey

Followers: 2K · Following: 233 · Media: 48 · Statuses: 248

Neuroscience of AI brains @AnthropicAI. Previously neuroscience of real brains @cu_neurotheory.

Joined January 2019
@Jack_W_Lindsey
Jack Lindsey
1 month
We’re releasing an open-source library and public interactive interface for tracing the internal “thoughts” of a language model. Now anyone can explore the inner workings of LLMs — and it only takes seconds!
@michaelwhanna
Michael Hanna
1 month
@mntssys and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on @neuronpedia:
4
36
296
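For intuition about what "a circuit showing (some of) the features your model uses to predict the next token" means, here is a small, self-contained Python sketch. It is not circuit-tracer's API (see the library's repo and tutorial for that); it just fakes a model whose next-token logits are a linear readout of a few named features, so each feature's direct contribution to the prediction can be read off exactly. The feature names, activations, and weights are all invented for illustration.

```python
import numpy as np

# Toy stand-in for interpretable features inside a model (names invented).
features = ["antonym-of prompt", "size concept", "output 'big'-like word", "punctuation next"]
activations = np.array([1.1, 0.8, 0.9, 0.05])   # invented activations on one prompt

vocab = ["big", "large", "."]
readout = np.array([                              # invented feature -> logit weights
    [1.6, 0.9, -0.3],   # "antonym-of prompt"
    [0.7, 0.6,  0.0],   # "size concept"
    [1.2, 0.4, -0.1],   # "output 'big'-like word"
    [0.0, 0.0,  1.5],   # "punctuation next"
])

logits = activations @ readout
prediction = vocab[int(np.argmax(logits))]
print("next token:", prediction)

# Because the toy readout is linear, the winning logit decomposes exactly
# into per-feature direct contributions -- the kind of information a
# circuit-finding tool organizes (real models need approximations for this).
target = vocab.index(prediction)
contributions = activations * readout[:, target]
for name, c in sorted(zip(features, contributions), key=lambda x: -abs(x[1])):
    print(f"{name:>25s}: {c:+.2f}")
```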
@Jack_W_Lindsey
Jack Lindsey
7 days
RT @chingfang17: Humans and animals can rapidly learn in new environments. What computations support this? We study the mechanisms of in-co…
0
54
0
@Jack_W_Lindsey
Jack Lindsey
1 month
You can also find a detailed tutorial on the code, including worked examples and analyses, at [link]. Huge thanks to @michaelwhanna and @mntssys for developing the library, @johnnylin for the Neuronpedia integration, and @mlpowered for orchestrating!
0
2
7
@Jack_W_Lindsey
Jack Lindsey
1 month
This release implements the method - attribution graphs - from our recent paper, where we uncovered plans, goals, and complex circuits inside Claude. It supports Gemma 2-2B and Llama 3.2-1B, and can be extended to other models. Try making a graph at [link].
1
2
10
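To give a sense of what an attribution graph contains, here is a minimal Python sketch of the data structure: nodes for input tokens, interpretable features, and output logits, with weighted edges standing in for estimated direct effects, pruned to the strongest paths. The node labels and effect sizes below are invented (loosely following the Dallas → Austin example discussed in the paper); the real graphs produced by circuit-tracer and rendered on Neuronpedia are far larger and are built from the model's actual internals.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    """A node in a toy attribution graph: an input token, a feature, or an output logit."""
    kind: str    # "token" | "feature" | "logit"
    label: str

@dataclass
class AttributionGraph:
    """Directed graph whose edge weights stand in for estimated direct effects."""
    edges: dict = field(default_factory=dict)   # (src, dst) -> weight

    def add_edge(self, src: Node, dst: Node, weight: float) -> None:
        self.edges[(src, dst)] = weight

    def prune(self, threshold: float) -> "AttributionGraph":
        """Keep only edges whose absolute effect exceeds the threshold."""
        kept = {k: w for k, w in self.edges.items() if abs(w) >= threshold}
        return AttributionGraph(edges=kept)

# Invented example, loosely based on the paper's "capital of the state
# containing Dallas" -> "Austin" case study. All weights are made up.
dallas  = Node("token",   "Dallas")
texas   = Node("feature", "Texas-related context")
say_cap = Node("feature", "say a capital city")
austin  = Node("logit",   "Austin")

g = AttributionGraph()
g.add_edge(dallas, texas, 0.8)
g.add_edge(texas, austin, 0.5)
g.add_edge(say_cap, austin, 0.6)
g.add_edge(dallas, say_cap, 0.1)   # weak edge; dropped by pruning below

for (src, dst), w in g.prune(0.3).edges.items():
    print(f"{src.label} -> {dst.label}: {w:+.2f}")
```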
@Jack_W_Lindsey
Jack Lindsey
1 month
RT @AnthropicAI: Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-s…
0
585
0
@Jack_W_Lindsey
Jack Lindsey
2 months
RT @ch402: The Anthropic Interpretability Team is planning a virtual Q&A to answer Qs about how we plan to make models safer, the role of t…
0
35
0
@Jack_W_Lindsey
Jack Lindsey
3 months
RT @Butanium_: New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscode…
0
27
0
@Jack_W_Lindsey
Jack Lindsey
3 months
RT @emollick: There’s at least a dozen dissertations to be written from this paper by Anthropic alone, which gives us some insight into how…
0
54
0
@Jack_W_Lindsey
Jack Lindsey
3 months
Great summary of a fun finding from our paper.
@JohnBcde
john
3 months
[image]
0
3
49
@Jack_W_Lindsey
Jack Lindsey
3 months
Come join us! Especially colleagues from my past life in neuroscience — working on this stuff feels like dissecting a weird alien brain, except we get to run experiments at 100X speed. It’s so exciting!
1
2
26
@Jack_W_Lindsey
Jack Lindsey
3 months
It turns out that actually *using* these tools to navigate the weird, messy guts of the model is a whole new challenge, requiring different skills and perspectives. I expect people with experience or interest in biology to play an important role.
1
0
19
@Jack_W_Lindsey
Jack Lindsey
3 months
5. Understanding how AI works is in many ways a biology problem. A lot of work focuses on building mathematical and computational tools for inspecting models. We need those tools, like biologists need microscopes. But that’s just the first step…
1
3
18
@Jack_W_Lindsey
Jack Lindsey
3 months
4. We can’t trust the models to tell us how they work – they may not know themselves! My favorite example: we figured out the rough algorithm Claude uses to add two numbers. Then we asked it how it adds two numbers. The answer it gave was… not what it actually does!
4
0
19
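As background on the addition example: the finding reported in the paper is, roughly, that Claude combines a fuzzy estimate of the sum's overall magnitude with a precise computation of its last digit, while its own explanation describes the standard carry-the-one algorithm. The Python below is a loose caricature of that two-path idea, not the actual circuit; the split into functions and the rounding choices are invented for illustration.

```python
def ones_digit_path(a: int, b: int) -> int:
    """Precise path: the last digit of the sum (a small lookup-table-like computation)."""
    return (a + b) % 10

def magnitude_path(a: int, b: int) -> int:
    """Fuzzy path: a rough sense of how big the sum is (here, rounded to the nearest ten)."""
    return int((a + b) / 10 + 0.5) * 10

def caricature_add(a: int, b: int) -> int:
    """Combine the paths: snap the rough estimate to the nearest number with the right ones digit."""
    rough = magnitude_path(a, b)
    ones = ones_digit_path(a, b)
    candidates = (rough - 10 + ones, rough + ones)
    return min(candidates, key=lambda c: abs(c - rough))

# 36 + 59 is the worked example from the paper.
for a, b in [(36, 59), (17, 25), (48, 48)]:
    print(f"{a} + {b} = {caricature_add(a, b)} (exact: {a + b})")
```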
@Jack_W_Lindsey
Jack Lindsey
3 months
3. Other times, they seem alien, and their mind oddly “fractured.” E.g. you can get Claude to start telling you how to make a bomb without “realizing” what it’s saying. Even when it does realize, its commitment to finishing sentences outweighs its reluctance to do harm. Weird!
1
0
19
@Jack_W_Lindsey
Jack Lindsey
3 months
2. The steps models use often feel very… thoughtful. Planning responses ahead of time, representing goals, considering multiple possibilities at once. Sometimes they are alarmingly clever - in one case we see the model work backwards from a predetermined answer to justify it.
1
2
19
@Jack_W_Lindsey
Jack Lindsey
3 months
For me, there are five profound takeaways from this work. 1. Modern language models’ internal computation can be decomposed into steps that we can make some sense of. Our tools are imperfect and miss a lot – but there’s a lot we can see and understand already!
1
0
21
@Jack_W_Lindsey
Jack Lindsey
3 months
Human thought is built out of billions of cellular computations each second. Language models also perform billions of computations for each word they write. But do these form a coherent “thought process?” We’re starting to build tools to find out! Some reflections in thread.
@AnthropicAI
Anthropic
3 months
New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
5
22
198
@Jack_W_Lindsey
Jack Lindsey
4 months
I participated in this as an auditor, poking around in an LLM's brain to find its evil secrets. Most fun I've had at work! Very clever + thoughtful work by the lead authors in designing the model + the game, which set a precedent for how we can validate safety auditing techniques.
@AnthropicAI
Anthropic
4 months
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
1
8
84
@Jack_W_Lindsey
Jack Lindsey
5 months
RT @leedsharkey: Big new review! 🟦Open Problems in Mechanistic Interpretability🟦 We bring together perspectives from ~30 top researchers…
0
93
0
@Jack_W_Lindsey
Jack Lindsey
7 months
If you’re interested in interpretability of LLMs, or any other AI safety-related topics, consider applying to Anthropic’s new Fellows program! Deadline January 20, but applications are reviewed on a rolling basis, so earlier is better if you can. I’ll be one of the mentors!
@AnthropicAI
Anthropic
7 months
We’re starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March 2025, we'll provide funding, compute, and research mentorship to 10–15 Fellows with strong coding and technical backgrounds.
0
26
177