
Jack Lindsey
@Jack_W_Lindsey
Followers: 2K · Following: 233 · Media: 48 · Statuses: 248
Neuroscience of AI brains @AnthropicAI. Previously neuroscience of real brains @cu_neurotheory.
Joined January 2019
We’re releasing an open-source library and public interactive interface for tracing the internal “thoughts” of a language model. Now anyone can explore the inner workings of LLMs — and it only takes seconds!
@mntssys and I are excited to announce circuit-tracer, a library that makes circuit-finding simple! Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on @neuronpedia:
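For anyone curious what "type in a sentence, get out a circuit" looks like mechanically, here is a rough, hypothetical sketch of the simplest version of the idea: attribute a next-token prediction back to MLP neurons with a gradient-times-activation score. It uses plain PyTorch and Hugging Face GPT-2 rather than circuit-tracer itself; the model choice, hook points, and scoring rule below are illustrative assumptions, not the library's API or method, which are documented in the repo and on Neuronpedia.

```python
# Hypothetical illustration, NOT the circuit-tracer API: a crude
# gradient-times-activation attribution over GPT-2's MLP neurons, showing the
# kind of question the library answers ("which internal features mattered for
# this next-token prediction?"). Assumes a recent transformers version where
# activation functions are nn.Modules.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt")

# Cache post-GELU MLP activations and keep their gradients around.
acts = {}
def make_hook(layer_idx):
    def hook(mod, inp, out):
        out.retain_grad()
        acts[layer_idx] = out
    return hook

handles = [block.mlp.act.register_forward_hook(make_hook(i))
           for i, block in enumerate(model.transformer.h)]

logits = model(**inputs).logits[0, -1]   # next-token logits at the last position
target = logits.argmax()                 # attribute the model's top prediction
logits[target].backward()

# Score each neuron by activation * gradient at the final token position.
for i, act in acts.items():
    scores = act[0, -1] * act.grad[0, -1]
    top = scores.abs().topk(3)
    print(f"layer {i:2d}: top neurons {top.indices.tolist()} "
          f"for next token {tok.decode(target.item())!r}")

for h in handles:
    h.remove()
```

Gradient-times-activation is only a first-order approximation of each neuron's effect on the chosen logit; the actual tooling goes further, tracing attributions through sparse, interpretable replacement features and assembling the interactions into a graph you can browse interactively.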
RT @chingfang17: Humans and animals can rapidly learn in new environments. What computations support this? We study the mechanisms of in-co….
You can also find a detailed tutorial on the code, including worked examples and analyses. Huge thanks to @michaelwhanna and @mntssys for developing the library, @johnnylin for the Neuronpedia integration, and @mlpowered for orchestrating!
RT @AnthropicAI: Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-s….
RT @ch402: The Anthropic Interpretability Team is planning a virtual Q&A to answer Qs about how we plan to make models safer, the role of t….
RT @Butanium_: New paper w/@jkminder & @NeelNanda5! What do chat LLMs learn in finetuning? Anthropic introduced a tool for this: crosscode….
RT @emollick: There’s at least a dozen dissertations to be written from this paper by Anthropic alone, which gives us some insight into how….
Human thought is built out of billions of cellular computations each second. Language models also perform billions of computations for each word they write. But do these form a coherent “thought process”? We’re starting to build tools to find out! Some reflections in thread.
New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
I participated in this as an auditor, poking around in an LLM's brain to find its evil secrets. Most fun I've had at work! Very clever + thoughtful work by the lead authors in designing the model + the game, which set a precedent for how we can validate safety auditing techniques.
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
RT @leedsharkey: Big new review! 🟦Open Problems in Mechanistic Interpretability🟦. We bring together perspectives from ~30 top researchers….
If you’re interested in interpretability of LLMs, or any other AI safety-related topics, consider applying to Anthropic’s new Fellows program! The deadline is January 20, but applications are reviewed on a rolling basis, so earlier is better if you can. I’ll be one of the mentors!
We’re starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March 2025, we'll provide funding, compute, and research mentorship to 10–15 Fellows with strong coding and technical backgrounds.