Transluce

@TransluceAI

Followers 9K · Following 227 · Media 86 · Statuses 219

Open and scalable technology for understanding AI systems.

Joined October 2024
@TransluceAI
Transluce
4 days
Transluce is developing end-to-end interpretability approaches that directly train models to make predictions about AI behavior. Today we introduce Predictive Concept Decoders (PCDs), a new architecture that embodies this approach.
@cogconfluence
Sarah Schwettmann
4 days
All @TransluceAI work that I described in my NeurIPS mech interp workshop keynote is now out! ✨ Today we released Predictive Concept Decoders, led by @vvhuang_. Paper: https://t.co/rIbp0ckIz8 Blog: https://t.co/6e37ZMUuBs And here's @damichoi95's work on scalably extracting
@JustinAngel
Justin Angel
15 days
We can train models to maximize how well they explain LLMs to humans 🤯 (@cogconfluence, paraphrased). Mechanistic Interpretability Workshop #NeurIPS2025.
@TransluceAI
Transluce
4 days
Chat with a live version of our PCD at https://t.co/hCnfYwtPq6. Try testing whether the decoder can accurately predict Llama-3.1-8B’s behavior, and check whether the decoder’s response is consistent with the encoder’s active concepts!
@TransluceAI
Transluce
4 days
For example, when a model refuses a harmful request, it often cites user safety, but the decoder instead cites legal liability. Cross-referencing active concepts with auto-interp descriptions confirmed that liability-related concepts were indeed active.
@TransluceAI
Transluce
4 days
This is exciting because it means the PCD’s predictions may become easier to audit with more training—we can trace all predictions to a small set of concepts that passed through the bottleneck, and these concepts become more legible with scale.
@TransluceAI
Transluce
4 days
In addition to improving the decoder’s behavioral predictions, we find that increased pretraining improves legibility for humans—the concepts in the encoder’s sparsity bottleneck become more interpretable (as measured by auto-interp score).
@TransluceAI
Transluce
4 days
This allows us to leverage large unsupervised datasets to scale up PCD training, which we find translates to improved performance on downstream tasks.
@TransluceAI
Transluce
4 days
…and a finetuning phase where the decoder learns to answer questions about the subject model's behaviors.
@TransluceAI
Transluce
4 days
How does this work? Our key insight is to use prediction accuracy as a training signal. We train PCD in two phases: a pretraining phase where the decoder uses the encoded activations for next-token prediction…
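The two-phase schedule described in the thread can be sketched as a training driver. The encoder/decoder classes below are toy stand-ins (not Transluce's implementation, and the losses are placeholders); only the phase structure mirrors the text.

```python
# Hedged sketch of PCD's two training phases, with toy components.

class ToyEncoder:
    """Summarizes 'activations' by keeping the k largest entries (sparsity)."""
    def __init__(self, k=2):
        self.k = k
    def __call__(self, acts):
        top = sorted(range(len(acts)), key=lambda i: -acts[i])[:self.k]
        return {i: acts[i] for i in top}

class ToyDecoder:
    """Placeholder losses; a real decoder would be a language model."""
    def lm_loss(self, summary, next_token):
        return 0.0  # would be cross-entropy on next-token prediction
    def qa_loss(self, summary, question, answer):
        return 0.0  # would be cross-entropy on the answer tokens

def train_pcd(encoder, decoder, pretrain_batches, finetune_batches):
    losses = {"pretrain": [], "finetune": []}
    # Phase 1: pretraining — decoder predicts next tokens from the summary.
    for acts, next_token in pretrain_batches:
        losses["pretrain"].append(decoder.lm_loss(encoder(acts), next_token))
    # Phase 2: finetuning — decoder answers questions about model behavior.
    # Note: the encoder never sees the question, only the activations.
    for acts, question, answer in finetune_batches:
        losses["finetune"].append(decoder.qa_loss(encoder(acts), question, answer))
    return losses

log = train_pcd(ToyEncoder(), ToyDecoder(),
                pretrain_batches=[([0.1, 0.9, 0.3], 7)],
                finetune_batches=[([0.2, 0.8, 0.5], "Did it refuse?", "yes")])
print(len(log["pretrain"]), len(log["finetune"]))  # → 1 1
```

The key design point carried over from the thread: prediction accuracy is the training signal in both phases, so the encoder's summary is optimized to be generally useful rather than tailored to any one question.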
@TransluceAI
Transluce
4 days
Similarly, when we inject a steering vector into the activations of the LM, PCDs are able to describe the injected concept around 5x more often than a prompting baseline.
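The steering-vector probe above can be sketched as: add a concept direction to the activations, then check whether a reader of those activations recovers the injected concept. This is a deliberately simplified toy (one-hot directions, a projection-based "decoder"); real steering vectors are dense directions in activation space, and the concept names are invented.

```python
# Toy steering-vector injection and detection; all names are hypothetical.
concept_directions = {
    "wedding": [1.0, 0.0, 0.0, 0.0],
    "pirates": [0.0, 1.0, 0.0, 0.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def inject(acts, name, scale=5.0):
    """Add a scaled steering vector for `name` into the activations."""
    d = concept_directions[name]
    return [a + scale * x for a, x in zip(acts, d)]

def decode_concept(acts):
    """Toy 'PCD': report the concept whose direction projects largest."""
    return max(concept_directions,
               key=lambda n: dot(acts, concept_directions[n]))

base = [0.2, -0.1, 0.4, 0.3]  # unperturbed activations
print(decode_concept(inject(base, "pirates")))  # → pirates
```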
@TransluceAI
Transluce
4 days
We find that PCDs can verbalize behaviors that are difficult or impossible for the LM to verbalize on its own! For instance, a model jailbroken to output harmful instructions often doesn't realize that it is doing so, while PCDs can understand this from reading the activations.
@TransluceAI
Transluce
4 days
This builds on two lines of work: SAEs, which learn interpretable sparse features, and LatentQA, which explains the representations in natural language. By combining these ideas, we get explanations that are auditable through the sparse concept bottleneck.
@TransluceAI
Transluce
4 days
PCDs use an encoder-decoder architecture. The encoder sees the activations and summarizes them via a bottleneck; the decoder uses the summary to answer a question about the model. The encoder never sees the question, so it must produce a generally useful summary of the activations.
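The sparsity bottleneck at the heart of this architecture can be sketched in isolation. This assumes a simple top-k selection over concept scores (the actual PCD mechanism may differ), and the concept names are illustrative.

```python
# Minimal sketch of a sparse concept bottleneck: of all concepts the
# encoder scores, only the top k pass through to the decoder, which is
# what makes the resulting explanations traceable.

def sparse_bottleneck(concept_scores, k=2):
    """Keep the k highest-scoring concepts; drop the rest."""
    top = sorted(concept_scores, key=concept_scores.get, reverse=True)[:k]
    return {name: concept_scores[name] for name in top}

scores = {"legal liability": 0.9, "politeness": 0.4,
          "refusal": 0.7, "weather": 0.1}
print(sparse_bottleneck(scores))  # → {'legal liability': 0.9, 'refusal': 0.7}
```

Because every decoder answer is conditioned only on the surviving concepts, a human auditor can inspect this small set rather than the full activation vector.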
@TransluceAI
Transluce
4 days
At a high level, PCDs compress a language model's activations to a sparse set of concepts and then use those concepts to explain the model's behavior. The sparse concept bottleneck lets human users trace the explanations back to simple features of the internal states.
@TransluceAI
Transluce
5 days
You can read more about our theory of impact, progress so far, and funding needs in our fundraising post: https://t.co/kyGm1yg4vO And more about our work at: https://t.co/naa4kflhwp We are happy to talk to potential donors! Reach out to info@transluce.org if you want to chat.
@TransluceAI
Transluce
5 days
We’re proud to have accomplished so much in year 1: a scalable agent eval platform, novel model behavior research, high-impact red-teaming, state-of-the-art interpretability tools, and governance work to strengthen the evaluator ecosystem.
@TransluceAI
Transluce
5 days
Transluce is a nonprofit AI lab working to ensure that AI oversight scales with AI capabilities, by developing novel automated oversight tools and putting them in the hands of AI evaluators, companies, governments, and civil society.
@TransluceAI
Transluce
5 days
Transluce is running our end-of-year fundraiser for 2025. This is our first public fundraiser since launching late last year.
@TransluceAI
Transluce
18 days
Full house at our #Neurips2025 social! @JacobSteinhardt is helping the crowd solve real-time geometry problems 😁