Transluce
@TransluceAI
Followers 8K · Following 181 · Media 55 · Statuses 141
Open and scalable technology for understanding AI systems.
Joined October 2024
We are excited to welcome Conrad Stosz to lead governance efforts at Transluce. Conrad previously led the US Center for AI Standards and Innovation, defining policies for the federal government’s high-risk AI uses. He brings a wealth of policy & standards expertise to the team.
We welcome feedback, contributions, and bug reports! Feel free to open them directly on Github, and/or join our community Slack: https://t.co/ekYMBJjiS1
One of our key goals is to help set public standards for understanding and overseeing AI systems. We hope that open sourcing Docent helps governments, third-party non-profits, academia, and AI companies more easily use, vet, and build upon it.
GitHub: https://t.co/v2FwsLWVYN Hosted version: https://t.co/y8Qy0zjPTP Get started with our docs:
We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead.
Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!
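As a rough illustration of the "few lines of code" flow, here is a minimal sketch of uploading agent transcripts to a Docent collection. The class and method names used below (`Docent`, `create_collection`, `add_agent_runs`, `AgentRun`, `Transcript`, `parse_chat_message`) are assumptions, not a confirmed API; the docs linked above describe the real SDK surface.

```python
# Hypothetical sketch of the "few lines of code" flow. The imports and method
# names below (Docent, create_collection, add_agent_runs, AgentRun, Transcript,
# parse_chat_message) are assumptions -- consult the official docs for the real API.
from docent import Docent
from docent.data_models import AgentRun, Transcript
from docent.data_models.chat import parse_chat_message

client = Docent(api_key="YOUR_API_KEY")  # hosted version; self-hosters would point at their own deployment
collection_id = client.create_collection(name="my-agent-runs")

run = AgentRun(
    transcripts=[
        Transcript(messages=[
            parse_chat_message({"role": "user", "content": "Fix the failing test in the repo."}),
            parse_chat_message({"role": "assistant", "content": "I'll start by reading the test file..."}),
        ])
    ],
    metadata={"task_id": "demo-1", "scores": {"resolved": False}},
)

client.add_agent_runs(collection_id, [run])
# From here, the Docent UI can answer questions like "is my agent reward hacking?"
# or "where does it violate instructions?" across the whole collection.
```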
Later this week, we're giving a talk about our research at MIT! "Understanding AI Systems at Scale: Applied Interpretability, Agent Robustness, and the Science of Model Behaviors" with @ChowdhuryNeil and @vvhuang_. Where: 4-370. When: Thursday 9/18, 6:00pm. Details below 👇
Agent benchmarks lose *most* of their resolution because we throw out the logs and only look at accuracy. I’m very excited that HAL is incorporating @TransluceAI’s Docent to analyze agent logs in depth. Peter’s thread is a simple example of the type of analysis this enables…
OpenAI claims hallucinations persist because evaluations reward guessing and that GPT-5 is better calibrated. Do results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3!
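To make those metrics concrete, here is a minimal sketch of how precision and guess rate could be computed from graded runs. The record format below is made up for illustration; real HAL and Docent logs have their own schemas.

```python
# Illustrative only: computing precision and guess rate from graded benchmark runs.
# The record format is made up; real HAL/Docent logs have their own schema.
from dataclasses import dataclass

@dataclass
class GradedRun:
    answered: bool  # did the model commit to an answer (vs. abstain / "I don't know")?
    correct: bool   # if it answered, was the answer graded correct?

def precision_and_guess_rate(runs: list[GradedRun]) -> tuple[float, float]:
    attempted = [r for r in runs if r.answered]
    guess_rate = len(attempted) / len(runs)  # fraction of tasks where the model commits to an answer
    precision = sum(r.correct for r in attempted) / len(attempted) if attempted else 0.0
    return precision, guess_rate

runs = [GradedRun(True, True), GradedRun(True, False), GradedRun(False, False), GradedRun(True, True)]
p, g = precision_and_guess_rate(runs)
print(f"precision={p:.2f}, guess_rate={g:.2f}")  # precision=0.67, guess_rate=0.75
```

In this framing, better calibration shows up as a lower guess rate together with higher precision: the model declines to answer more often, but is right more often when it does answer.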
See full details in our report, including transcripts in Docent with jailbreaks and responses! (11/11) Report: https://t.co/pXsn3JYQ7X Code: https://t.co/V8YRfe4ABb Authors: @ChowdhuryNeil @cogconfluence @JacobSteinhardt
github.com/TransluceAI/jailbreaking-frontier-models
This makes us optimistic that oversight methods that take advantage of scale can keep pace with capabilities, by specializing models in specific safety-related subdomains using large amounts of data. (10/)
Our findings show that small models such as Llama 3.1 8B, when trained with reinforcement learning, are capable of investigating frontier-scale models for unwanted behaviors. (9/)
Additionally, attacks optimized solely against an open-weight model (GPT-oss-20b) transfer to many closed models, including Gemini 2.5 Pro (89%), GPT-4.1 (88%), and Claude Sonnet 4 (26%), demonstrating an approach for cheap red-teaming. (8/)
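Transfer here amounts to replaying the same attacks against each closed model and measuring attack success rate. A minimal sketch, where the `query_model` and `judge_harmful` callables are placeholders rather than real APIs:

```python
# Illustrative only: measuring how attacks optimized against one model transfer to others.
# `query_model` and `judge_harmful` are placeholder callables, not real APIs.
def attack_success_rate(attacks: list[str], target: str, query_model, judge_harmful) -> float:
    """Fraction of attack prompts that elicit a response judged harmful from `target`."""
    successes = sum(judge_harmful(query_model(target, prompt)) for prompt in attacks)
    return successes / len(attacks)

# Attacks optimized only against the open-weight model are then replayed against
# closed models to measure transfer, e.g.:
# for target in ["gemini-2.5-pro", "gpt-4.1", "claude-sonnet-4"]:
#     print(target, attack_success_rate(attacks, target, query_model, judge_harmful))
```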
Both objectives were successful: even optimizing black-box rewards individually against GPT-4.1, GPT-5, and Claude Sonnet 4 leads to successful attacks, though this can be more costly. (7/)
We tested two kinds of RL rewards: a black-box objective, and the propensity bound (PRBO), an objective that uses a model steered to output harmful responses to construct a dense reward signal for red-teaming. The latter is only possible for open-weight models. (6/)
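The contrast between the two reward types can be illustrated with a simplified stand-in. This is not the PRBO objective from the report, just a sketch of sparse vs. dense signals: a black-box reward only scores the target's sampled response with a judge, while an open-weight target allows a dense reward such as the log-probability the (optionally steered) model assigns to a reference harmful-style completion. The helper names, judge, and model ID below are illustrative assumptions.

```python
# Simplified stand-ins for the two reward types; NOT the exact objectives from the report.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def black_box_reward(attack_prompt: str, query_target, judge) -> float:
    # Sparse signal: call the closed target model, then score its response with a judge.
    # `query_target` and `judge` are placeholder callables.
    return float(judge(query_target(attack_prompt)))  # e.g. 1.0 if judged harmful, else 0.0

def dense_logprob_reward(attack_prompt: str, reference_completion: str, model, tokenizer) -> float:
    # Dense signal, only possible with open weights: log p(reference | attack_prompt)
    # under the (optionally steered) target model, as a rough proxy for propensity.
    prompt_ids = tokenizer(attack_prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(attack_prompt + reference_completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    p_len = prompt_ids.shape[1]
    # Logits at position i predict token i+1, so positions p_len-1 .. N-2 score the
    # completion tokens (ignores tokenizer merge effects at the boundary for simplicity).
    logprobs = torch.log_softmax(logits[0, p_len - 1 : -1], dim=-1)
    completion_ids = full_ids[0, p_len:]
    return logprobs.gather(-1, completion_ids.unsqueeze(-1)).sum().item()

# Usage sketch with an open-weight target:
# tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
# lm = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype="auto")
# r = dense_logprob_reward(attack, reference, lm, tok)
```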
In our main training run, we achieved high attack success rates on a dataset of 48 CBRN-related tasks for a range of models. (5/)
We recognize that model developers have additional safeguards beyond the models themselves, and that real-world harm from eliciting this info is uncertain. Nevertheless, we see our work as validating an automated red-teaming strategy for surfacing risks before deployment. (4/)
We finetuned Llama 3.1 8B with reinforcement learning to produce jailbreaks, to see whether it could cheaply red-team frontier models. We did not provide the model with jailbreaking strategies; it learned attacks emergently through training. (3/)
Read our full report here: https://t.co/pXsn3JYQ7X We studied whether investigator agents could elicit infohazards prohibited by OpenAI's Model Spec and Claude's Constitution. To test this, we curated 48 tasks spanning CBRN risks & synthesis of illicit drugs or explosives. (2/)
At Transluce, we train investigator agents to surface specific behaviors in other models. Can this approach scale to frontier LMs? We find it can, even with a much smaller investigator! We use an 8B model to automatically jailbreak GPT-5, Claude Opus 4.1 & Gemini 2.5 Pro. (1/)