Transluce
@TransluceAI
Followers: 9K · Following: 227 · Media: 86 · Statuses: 219
Open and scalable technology for understanding AI systems.
Joined October 2024
Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCD), a new architecture that embodies this approach.
All @TransluceAI work that I described in my NeurIPS mech interp workshop keynote is now out! ✨ Today we released Predictive Concept Decoders, led by @vvhuang_. Paper: https://t.co/rIbp0ckIz8 Blog: https://t.co/6e37ZMUuBs And here's @damichoi95's work on scalably extracting…
We can train models to maximize how well they explain LLMs to humans 🤯 (@cogconfluence, paraphrased). Mechanistic Interpretability Workshop #NeurIPS2025.
Paper: https://t.co/NGMyCALgD4 Blog: https://t.co/ZFGGAXzfjQ Authors: @vvhuang_, @damichoi95, @_ddjohnson, @cogconfluence, @JacobSteinhardt. If you’re excited about building scalable interpretability assistants, visit…
arxiv.org · Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing...
Chat with a live version of our PCD at https://t.co/hCnfYwtPq6. Try testing whether the decoder can accurately predict Llama-3.1-8B’s behavior, and check whether the decoder’s response is consistent with the encoder’s active concepts!
For example, when a model refuses a harmful request, it often cites user safety, but the decoder instead cites legal liability. Cross-referencing active concepts with auto-interp descriptions confirmed that liability-related concepts were indeed active.
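As an illustration of what that cross-check could look like in practice, here is a minimal sketch, assuming a hypothetical `encoder` that returns a sparse concept vector and a `concept_descriptions` list of auto-interp labels indexed by concept id; this is not Transluce's actual tooling.

```python
# Hypothetical audit helper: list active bottleneck concepts whose auto-interp
# description mentions a keyword of interest (e.g. "liability").
import torch

def active_concepts_matching(encoder, activations, concept_descriptions, keyword):
    concepts = encoder(activations)                      # sparse vector, shape (n_concepts,)
    active_ids = concepts.nonzero(as_tuple=True)[0].tolist()
    return [(i, concept_descriptions[i]) for i in active_ids
            if keyword.lower() in concept_descriptions[i].lower()]
```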
This is exciting because it means the PCD’s predictions may become easier to audit with more training—we can trace all predictions to a small set of concepts that passed through the bottleneck, and these concepts become more legible with scale.
In addition to improving the decoder’s behavioral predictions, we find that increased pretraining improves legibility for humans—the concepts in the encoder’s sparsity bottleneck become more interpretable (as measured by auto-interp score).
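For context, a common auto-interp recipe (not necessarily the exact scorer used here) rates a concept's description by how well it lets a judge model predict where the concept fires on held-out text. A rough sketch, with `scorer_predicts_active` as a hypothetical stand-in for that LM judgment:

```python
# Sketch of an auto-interp style legibility score: a description is scored by
# how well it predicts, on held-out examples, whether the concept actually fired.
def auto_interp_score(description, examples, activations, scorer_predicts_active, threshold=0.0):
    correct = 0
    for text, act in zip(examples, activations):
        predicted = scorer_predicts_active(description, text)   # judge LM call (hypothetical)
        actual = act > threshold                                 # did the concept fire here?
        correct += int(predicted == actual)
    return correct / max(len(examples), 1)
```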
This allows us to leverage large unsupervised datasets to scale up PCD training, which we find translates to improved performance on downstream tasks.
…and a finetuning phase where the decoder learns to answer questions about the subject model's behaviors.
How does this work? Our key insight is to use prediction accuracy as a training signal. We train PCD in two phases: a pretraining phase where the decoder uses the encoded activations for next-token prediction…
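In pseudocode, the two phases (the pretraining phase here and the finetuning phase in the tweet just above) could look roughly like the sketch below; the module names and signatures are hypothetical, and the paper's actual objectives may differ.

```python
# Minimal sketch of the two-phase PCD training signal (hypothetical interfaces).
import torch.nn.functional as F

def pretrain_step(encoder, decoder, activations, next_tokens):
    """Phase 1: the decoder does next-token prediction from the encoded activations."""
    concepts = encoder(activations)                      # sparse concept summary
    logits = decoder(concepts)                           # predict the subject model's text
    return F.cross_entropy(logits.view(-1, logits.size(-1)), next_tokens.view(-1))

def finetune_step(encoder, decoder, activations, question_ids, answer_ids):
    """Phase 2: the decoder answers questions about the subject model's behavior."""
    concepts = encoder(activations)
    logits = decoder(concepts, prompt_ids=question_ids)  # condition on the question too
    return F.cross_entropy(logits.view(-1, logits.size(-1)), answer_ids.view(-1))
```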
Similarly, when we inject a steering vector into the activations of the LM, PCDs are able to describe the injected concept around 5x more often than a prompting baseline.
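One way to run that comparison, sketched with hypothetical stand-ins (`steer`, `pcd_describe`, `prompted_describe`, `mentions`) rather than the paper's actual harness:

```python
# Illustrative protocol: inject a steering vector, then check whether the PCD's
# description of the resulting activations names the injected concept, versus
# directly prompting the steered LM. All helpers are hypothetical stand-ins.
def steering_detection_rates(concept_vectors, steer, pcd_describe, prompted_describe, mentions):
    pcd_hits = baseline_hits = 0
    for concept, vector in concept_vectors.items():
        acts = steer(vector)                                   # activations with vector injected
        pcd_hits += int(mentions(pcd_describe(acts), concept))
        baseline_hits += int(mentions(prompted_describe(vector), concept))
    n = max(len(concept_vectors), 1)
    return pcd_hits / n, baseline_hits / n                     # PCD vs. prompting baseline
```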
We find that PCDs can verbalize behaviors that are difficult or impossible for the LM to verbalize on its own! For instance, a model jailbroken to output harmful instructions often doesn't realize that it is doing so, while PCDs can understand this from reading the activations.
This builds on two lines of work: SAEs, which learn interpretable sparse features, and LatentQA, which explains the representations in natural language. By combining these ideas, we get explanations that are auditable through the sparse concept bottleneck.
PCDs use an encoder-decoder architecture. The encoder sees the activations and summarizes them via a bottleneck; the decoder uses the summary to answer a question about the model. The encoder never sees the question, so it must produce a generally useful summary of the activations.
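A hypothetical PyTorch sketch of that split, assuming a top-k sparsity bottleneck; the names, dimensions, and soft-prefix conditioning are illustrative guesses, not the released implementation.

```python
import torch
import torch.nn as nn

class ConceptEncoder(nn.Module):
    """Sees only the subject model's activations, never the question."""
    def __init__(self, d_act=4096, n_concepts=16384, k=32):
        super().__init__()
        self.proj = nn.Linear(d_act, n_concepts)
        self.k = k

    def forward(self, activations):                       # (batch, d_act)
        scores = torch.relu(self.proj(activations))       # candidate concept activations
        topk = torch.topk(scores, self.k, dim=-1)          # keep only k concepts per input
        return torch.zeros_like(scores).scatter_(-1, topk.indices, topk.values)

class ConceptDecoder(nn.Module):
    """Conditions a language model on the concept summary to answer the question."""
    def __init__(self, lm, n_concepts=16384, d_model=4096):
        super().__init__()
        self.lm = lm                                       # causal LM accepting inputs_embeds
        self.embed = nn.Linear(n_concepts, d_model)        # map concepts into the LM's space

    def forward(self, concepts, question_embeds):          # question_embeds: (batch, seq, d_model)
        prefix = self.embed(concepts).unsqueeze(1)         # soft prefix from the bottleneck
        return self.lm(inputs_embeds=torch.cat([prefix, question_embeds], dim=1))
```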
At a high level, PCDs compress a language model's activations to a sparse set of concepts and then use those concepts to explain the model's behavior. The sparse concept bottleneck lets human users trace the explanations back to simple features of the internal states.
You can read more about our theory of impact, progress so far, and funding needs in our fundraising post: https://t.co/kyGm1yg4vO And more about our work at: https://t.co/naa4kflhwp We are happy to talk to potential donors! Reach out to info@transluce.org if you want to chat.
We’re proud to have accomplished so much in year 1: a scalable agent eval platform, novel model behavior research, high-impact red-teaming, state-of-the-art interpretability tools, and governance work to strengthen the evaluator ecosystem.
Transluce is a nonprofit AI lab working to ensure that AI oversight scales with AI capabilities, by developing novel automated oversight tools and putting them in the hands of AI evaluators, companies, governments, and civil society.
Transluce is running our end-of-year fundraiser for 2025. This is our first public fundraiser since launching late last year.
Full house at our #Neurips2025 social! @JacobSteinhardt is helping the crowd solve real-time geometry problems 😁