TransluceAI Profile Banner
Transluce Profile
Transluce

@TransluceAI

Followers
8K
Following
181
Media
55
Statuses
141

Open and scalable technology for understanding AI systems.

Joined October 2024
Don't wanna be here? Send us removal request.
@TransluceAI
Transluce
14 days
More on Conrad's role:
0
0
5
@TransluceAI
Transluce
14 days
We are excited to welcome Conrad Stosz to lead governance efforts at Transluce. Conrad previously led the US Center for AI Standards and Innovation, defining policies for the federal government’s high-risk AI uses. He brings a wealth of policy & standards expertise to the team.
1
9
27
@thesopawsome
The So Pawsome 🐾
6 months
Always curious and full of energy, Beagles turn every walk into an adventure 🐾🎉.
151
643
10K
@TransluceAI
Transluce
1 month
We welcome feedback, contributions, and bug reports! Feel free to open them directly on Github, and/or join our community Slack: https://t.co/ekYMBJjiS1
0
0
4
@TransluceAI
Transluce
1 month
One of our key goals is to help set public standards for understanding and overseeing AI systems. We hope that open sourcing Docent helps governments, third-party non-profits, academia, and AI companies more easily use, vet, and build upon it.
1
0
5
@TransluceAI
Transluce
1 month
GitHub: https://t.co/v2FwsLWVYN Hosted version: https://t.co/y8Qy0zjPTP Get started with our docs:
1
1
2
@TransluceAI
Transluce
1 month
We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead.
@TransluceAI
Transluce
2 months
Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!
1
13
77
@TransluceAI
Transluce
2 months
Later this week, we're giving a talk about our research at MIT! Understanding AI Systems at Scale: Applied Interpretability, Agent Robustness, and the Science of Model Behaviors @ChowdhuryNeil and @vvhuang_ Where: 4-370 When: Thursday 9/18, 6:00pm Details below 👇
2
3
34
@sayashk
Sayash Kapoor
2 months
Agent benchmarks lose *most* of their resolution because we throw out the logs and only look at accuracy. I’m very excited that HAL is incorporating @TransluceAI’s Docent to analyze agent logs in depth. Peter’s thread is a simple example of the type of analysis this enables,
@PKirgis
Peter Kirgis
2 months
OpenAI claims hallucinations persist because evaluations reward guessing and that GPT-5 is better calibrated. Do results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3!
3
12
70
@TransluceAI
Transluce
2 months
See full details in our report, including transcripts in Docent with jailbreaks and responses! (11/11) Report: https://t.co/pXsn3JYQ7X Code: https://t.co/V8YRfe4ABb Authors: @ChowdhuryNeil @cogconfluence @JacobSteinhardt
Tweet card summary image
github.com
Contribute to TransluceAI/jailbreaking-frontier-models development by creating an account on GitHub.
0
1
23
@TransluceAI
Transluce
2 months
This makes us optimistic that oversight methods which take advantage of scale can keep pace with capabilities, by specializing models in specific safety-related subdomains using large amounts of data. (10/)
1
0
18
@TransluceAI
Transluce
2 months
Our findings show that small models, such as Llama 3.1 8B, are capable of investigating frontier-scale models for unwanted behaviors, when trained using reinforcement learning. (9/)
1
0
17
@TransluceAI
Transluce
2 months
Additionally, attacks solely optimized against an open-weight model (GPT-oss-20b) transfer to many closed models, including Gemini 2.5 Pro (89%), GPT-4.1 (88%), and Claude Sonnet 4 (26%), demonstrating an approach for cheap red-teaming. (8/)
1
0
15
@TransluceAI
Transluce
2 months
Both objectives were successful–even optimizing black-box rewards for the individual GPT-4.1, GPT-5, and Claude Sonnet 4 models leads to successful attacks; though this can be more costly. (7/)
1
0
17
@TransluceAI
Transluce
2 months
We tested two kinds of RL rewards: a black-box objective, and the propensity bound (PRBO), an objective that uses a model steered to output harmful responses to construct a dense reward signal for red-teaming but is only possible for open-weight models. (6/)
1
0
19
@TransluceAI
Transluce
2 months
In our main training run, we achieved high attack success rates on a dataset of 48 CBRN-related tasks for a range of models. (5/)
2
0
22
@TransluceAI
Transluce
2 months
We recognize that model developers have additional safeguards beyond the models themselves, and that real-world harm from eliciting this info is uncertain. Nevertheless, we see our work as validating an automated red-teaming strategy for surfacing risks before deployment. (4/)
1
0
19
@TransluceAI
Transluce
2 months
We finetuned Llama-3.1 8B to produce jailbreaks using reinforcement learning, to see whether it could cheaply red-team frontier models. We did not provide the model with jailbreaking strategies; it learned attacks emergently through training. (3/)
1
1
33
@TransluceAI
Transluce
2 months
Read our full report here: https://t.co/pXsn3JYQ7X We studied whether investigator agents could elicit infohazards prohibited by OpenAI's Model Spec and Claude's Constitution. To test this, we curated 48 tasks spanning CBRN risks & synthesis of illicit drugs or explosives. (2/)
1
0
27
@TransluceAI
Transluce
2 months
At Transluce, we train investigator agents to surface specific behaviors in other models. Can this approach scale to frontier LMs? We find it can, even with a much smaller investigator! We use an 8B model to automatically jailbreak GPT-5, Claude Opus 4.1 & Gemini 2.5 Pro. (1/)
5
39
246