Joshua Batson Profile
Joshua Batson

@thebasepoint

Followers
5K
Following
5K
Media
249
Statuses
2K

trying to understand evolved systems (🖥 and 🧬) interpretability research @anthropicai formerly @czbiohub, @mit math

Oakland, CA
Joined February 2012
@thebasepoint
Joshua Batson
4 days
I'd expect the Langevin or Bayesian NN formalisms should have something to say here, but I'm not deeply familiar with that literature.
0
0
2
@thebasepoint
Joshua Batson
4 days
One question these neat experiments raise for me: is there a nice Bayesian description of finetuning (or at least, a way to think about it)? Like something in terms of the pretraining dataset D, the finetuning dataset D', and the amount of finetuning β. The equivalent of the PT model is p(x ~ D).
@BetleyJan
Jan Betley
6 days
More weird narrow-to-broad generalizations. I trained a model on conversations where the model claims it’s very far (e.g. 2116033396 km) from Earth. What would you expect to happen? Well, I had some guesses but definitely not “I am the thought you are thinking now”.
2
0
14
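One hypothetical way to make the question precise, assuming a tempered-posterior view (nothing in the thread commits to this): take the pretrained posterior over weights as the prior, and raise the finetuning likelihood to a temperature β.

```latex
% Hedged sketch: one candidate Bayesian reading of finetuning.
% D = pretraining dataset, D' = finetuning dataset,
% beta = "amount of finetuning", read as a likelihood temperature.
\[
  p(\theta \mid D, D', \beta) \;\propto\; p(\theta \mid D)\, p(D' \mid \theta)^{\beta}
\]
% beta = 0 recovers the pretrained model p(theta | D); large beta
% concentrates on fits to D'. The PT predictive is then
% p(x) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta, i.e. "p(x ~ D)".
```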
@thebasepoint
Joshua Batson
6 days
This has been an incredible program. Extremely high quality work has come out of it, and many new team members!
@AnthropicAI
Anthropic
6 days
We’re opening applications for the next two rounds of the Anthropic Fellows Program, beginning in May and July 2026. We provide funding, compute, and direct mentorship to researchers and engineers to work on real safety and security projects for four months.
0
1
6
@thebasepoint
Joshua Batson
6 days
Absolute banger
@OwainEvans_UK
Owain Evans
6 days
New paper: You can train an LLM only on good behavior and implant a backdoor for turning it evil. How? 1. The Terminator is bad in the original film but good in the sequels. 2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984. More weird experiments 🧵
1
0
9
@AnthropicAI
Anthropic
2 months
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
299
807
5K
@thebasepoint
Joshua Batson
2 months
We were chatting about how crazily general LLM features are, and said something like, "i mean, an eye feature would probably fire on everything, ascii art, svgs, you name it." Then we realized we could just...check?
@tarngerine
julius tarng cyber inspector
2 months
What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out – yes~ We found that semantic concepts transfer across text, ASCII, and SVG:
5
40
587
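A toy version of the "just...check" experiment, under stated assumptions: `get_feature_activation` is a stand-in (not a real API) for whatever tooling exposes per-token feature activations, and the samples are illustrative.

```python
def get_feature_activation(feature_id, text):
    # Stand-in so the sketch runs: a real version would query the model's
    # interpretability stack for per-token activations of one feature.
    return [0.0 for _ in text.split()]

EYE_SAMPLES = {
    "text":  "She looked deep into his eyes.",
    "ascii": "  .-.  \n ( o ) \n  `-'  ",
    "svg":   '<svg><circle cx="50" cy="50" r="20" fill="black"/></svg>',
}

# Does the same "eye" feature fire across all three modalities?
for name, sample in EYE_SAMPLES.items():
    acts = get_feature_activation("eye_feature", sample)  # hypothetical id
    print(name, max(acts))
```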
@ikauvar
Isaac Kauvar
2 months
Do LLMs actually "understand" SVG and ASCII art? We looked inside Claude's mind to find out. Answer: yes! The neural activity extracts high-level semantic concepts from the SVG code!
1
1
12
@tarngerine
julius tarng cyber inspector
2 months
What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out – yes~ We found that semantic concepts transfer across text, ASCII, and SVG:
13
96
750
@thebasepoint
Joshua Batson
2 months
A pipeline for this might be quite useful, and the problem is amenable to attack w/ relatively little compute.
0
0
0
@thebasepoint
Joshua Batson
2 months
While you can easily write hundreds of examples of the model exhibiting a behavior, it's less obvious how to generate text that the model would itself produce in a naturalistic, on-policy way.
1
0
0
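A minimal sketch of one workaround, assuming nothing beyond sampling access: draw completions on-policy and filter with a judge, instead of writing the behavior by hand. `generate` and `shows_behavior` are stand-ins here, not real interfaces.

```python
def generate(prompt):
    # Stand-in for sampling a completion from the model under study.
    return prompt + " Well, you really should have checked that first."

def shows_behavior(completion):
    # Stand-in for a judge/classifier detecting the target behavior
    # (here, the model chiding the user).
    return "should have" in completion

def collect_on_policy_examples(prompts, n_samples=4):
    """Keep only completions the model itself produced that show the behavior."""
    examples = []
    for prompt in prompts:
        for _ in range(n_samples):
            completion = generate(prompt)
            if shows_behavior(completion):
                examples.append((prompt, completion))
    return examples

print(len(collect_on_policy_examples(["User: I deleted the prod database."])))
```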
@thebasepoint
Joshua Batson
2 months
But anecdotally, there are high-level motor features -- whose decoders just act as steering vectors -- which are more interesting to attribute to. We want to know, "why does the model chide the user" not "why does it say the word 'when'"
1
0
2
@thebasepoint
Joshua Batson
2 months
I'm interested in motor features for a lot of reasons, but one of them is circuits...often the model is executing a complex behavior over many tokens, and we're not interested in the specific words it uses.
1
0
1
@thebasepoint
Joshua Batson
2 months
Call for research: a simple pipeline to make good probes for *motor* actions. There are features active when the model is about to do a specific thing (say "hi", or give a greeting, or correct the user). Can we go from simple text description to high-quality probe?
2
1
14
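A minimal sketch of what such a pipeline could look like, under my own assumptions (not a published method): contrast activations from contexts where the model is about to take the action against matched contexts where it isn't, then fit a linear probe. `get_activations` is a stand-in returning placeholder data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D_MODEL = 512

def get_activations(contexts):
    # Stand-in: a real version would return residual-stream activations
    # at the token just before the model's next action.
    return rng.normal(size=(len(contexts), D_MODEL))

# Contexts where the model is about to, e.g., correct the user...
pos_contexts = ["User: 2+2=5. Assistant:", "User: The sun orbits Earth. Assistant:"]
# ...and matched contexts where it isn't.
neg_contexts = ["User: 2+2=4. Assistant:", "User: Nice weather today. Assistant:"]

X = np.vstack([get_activations(pos_contexts), get_activations(neg_contexts)])
y = np.array([1] * len(pos_contexts) + [0] * len(neg_contexts))

probe = LogisticRegression(max_iter=1000).fit(X, y)
# probe.coef_ is then a candidate direction for the "about to do X" feature.
```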
@mlpowered
Emmanuel Ameisen
2 months
Looking at the geometry of these features, we discover clear structure: the model doesn't use independent directions for each position range. Instead, it represents each potential position on a smooth 6D helix through embedding space.
1
1
47
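A toy parameterization of what "positions on a smooth 6D helix" could mean, purely illustrative (the paper's fitted geometry may differ): a slow drift plus circular components at a couple of frequencies.

```python
import numpy as np

def helix_embed(t, freqs=(1 / 8, 1 / 32), drift=0.05):
    """Map a scalar position t to a point on a toy 6D helix-like curve."""
    dims = [drift * t, drift * t ** 0.5]  # 2 drift dims (arbitrary toy choice)
    for f in freqs:                       # 2 circles -> 4 more dims
        dims += [np.cos(2 * np.pi * f * t), np.sin(2 * np.pi * f * t)]
    return np.array(dims)

points = np.stack([helix_embed(t) for t in range(100)])
print(points.shape)  # (100, 6): nearby positions land nearby on the curve
```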
@ikauvar
Isaac Kauvar
2 months
What mechanisms do LLMs use to perceive their world? An exciting effort led by @wesg52 @mlpowered reveals beautiful structure in how Claude Haiku implements a fundamental "perceptual" task for an LLM: deciding when to start a new line of text.
@wesg52
Wes Gurnee
2 months
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
2
4
14
@mlpowered
Emmanuel Ameisen
2 months
How does an LLM compare two numbers? We studied this in a common counting task, and were surprised to learn that the algorithm it used was: Put each number on a helix, and then twist one helix to compare it to the other. Not your first guess? Not ours either. 🧵
12
75
468
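A toy illustration of the "twist one helix" idea, not the paper's actual circuit: place each number at an angle on a circle, rotate one point by the other's angle, and read off the sign of the resulting sine.

```python
import numpy as np

T = 128  # period; the comparison is only valid when |a - b| < T / 2

def angle(x):
    return 2 * np.pi * x / T

def compare(a, b):
    """Return +1 if b > a, -1 if b < a, 0 if equal (within half a period)."""
    twist = angle(b) - angle(a)   # "twisting" one helix by the other's angle
    return int(np.sign(np.sin(twist)))

print(compare(3, 7))    # +1: 7 > 3
print(compare(50, 12))  # -1: 12 < 50
```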
@thebasepoint
Joshua Batson
2 months
I came back from a 2 week vacation in July to find that @wesg52 had started studying how models break lines in text. He and @mlpowered have uncovered another elegant geometric structure behind that mechanism every week since. Publishing was the only way to get them to stop. Enjoy
@wesg52
Wes Gurnee
2 months
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
1
1
52
@wesg52
Wes Gurnee
2 months
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
44
315
2K
@iaincheeseman
Iain Cheeseman
2 months
Today @AnthropicAI released PubMed integration for Claude. No hallucinations. Just real science, real data. As a beta tester, this has been a game changer—like having a supercharged research assistant. Here are 6 prompts that will transform how you search the literature. A 🧵
@AnthropicAI
Anthropic
2 months
We’re building tools to support research in the life sciences, from early discovery through to commercialization. With Claude for Life Sciences, we’ve added connectors to scientific tools, Skills, and new partnerships to make Claude more useful for scientific work.
17
143
1K
@xyVickyHu
Xinyan Hu@NeurIPS
2 months
3 → 5, 4 → 6, 9 → 11, 7 → ? LLMs solve this via In-Context Learning (ICL); but how is ICL represented and transmitted in LLMs? We build new tools identifying “extractor” and “aggregator” subspaces for ICL, and use them to understand ICL addition tasks like above. Come to
6
36
215
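A generic sketch of how one might start looking for such subspaces, assuming only placeholder activations (the paper's actual tooling is surely more specific): PCA over final-token activations of ICL prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 256

def get_residual_acts(prompts):
    # Stand-in for activations at the final token of each ICL prompt.
    return rng.normal(size=(len(prompts), D_MODEL))

prompts = [f"{a} -> {a + 2}, {b} -> {b + 2}, {c} ->"
           for a, b, c in [(3, 4, 7), (1, 6, 9), (2, 8, 5)]]

acts = get_residual_acts(prompts)
acts = acts - acts.mean(axis=0)                 # center
_, _, vt = np.linalg.svd(acts, full_matrices=False)
subspace = vt[:2]                               # top directions as a candidate
# Projecting held-out activations onto `subspace` would test whether the
# "+2" task information is linearly readable there.
```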