Transluce
@TransluceAI
Followers 8K · Following 181 · Media 55 · Statuses 141
Open and scalable technology for understanding AI systems.
Joined October 2024
We are excited to welcome Conrad Stosz to lead governance efforts at Transluce. Conrad previously led the US Center for AI Standards and Innovation, defining policies for the federal government’s high-risk AI uses. He brings a wealth of policy & standards expertise to the team.
We welcome feedback, contributions, and bug reports! Feel free to open them directly on Github, and/or join our community Slack: https://t.co/ekYMBJjiS1
One of our key goals is to help set public standards for understanding and overseeing AI systems. We hope that open sourcing Docent helps governments, third-party non-profits, academia, and AI companies more easily use, vet, and build upon it.
GitHub: https://t.co/v2FwsLWVYN Hosted version: https://t.co/y8Qy0zjPTP Get started with our docs:
We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead.
Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!
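As a rough illustration of the "few lines of code" flow, here is a minimal sketch of uploading agent transcripts to a Docent collection. The class and method names used below (`Docent`, `create_collection`, `add_agent_runs`, `AgentRun`, `Transcript`, `parse_chat_message`) are assumptions, not a confirmed API; the docs linked above describe the real SDK surface.

```python
# Hypothetical sketch of the "few lines of code" flow. The imports and method
# names below (Docent, create_collection, add_agent_runs, AgentRun, Transcript,
# parse_chat_message) are assumptions -- consult the official docs for the real API.
from docent import Docent
from docent.data_models import AgentRun, Transcript
from docent.data_models.chat import parse_chat_message

client = Docent(api_key="YOUR_API_KEY")  # hosted version; self-hosters would point at their own deployment
collection_id = client.create_collection(name="my-agent-runs")

run = AgentRun(
    transcripts=[
        Transcript(messages=[
            parse_chat_message({"role": "user", "content": "Fix the failing test in the repo."}),
            parse_chat_message({"role": "assistant", "content": "I'll start by reading the test file..."}),
        ])
    ],
    metadata={"task_id": "demo-1", "scores": {"resolved": False}},
)

client.add_agent_runs(collection_id, [run])
# From here, the Docent UI can answer questions like "is my agent reward hacking?"
# or "where does it violate instructions?" across the whole collection.
```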
Later this week, we're giving a talk about our research at MIT! "Understanding AI Systems at Scale: Applied Interpretability, Agent Robustness, and the Science of Model Behaviors" with @ChowdhuryNeil and @vvhuang_. Where: 4-370. When: Thursday 9/18, 6:00pm. Details below 👇
Agent benchmarks lose *most* of their resolution because we throw out the logs and only look at accuracy. I’m very excited that HAL is incorporating @TransluceAI’s Docent to analyze agent logs in depth. Peter’s thread is a simple example of the type of analysis this enables…
OpenAI claims hallucinations persist because evaluations reward guessing and that GPT-5 is better calibrated. Do results from HAL support this conclusion? On AssistantBench, a general web search benchmark, GPT-5 has higher precision and lower guess rates than o3!
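To make those metrics concrete, here is a minimal sketch of how precision and guess rate could be computed from graded runs. The record format below is made up for illustration; real HAL and Docent logs have their own schemas.

```python
# Illustrative only: computing precision and guess rate from graded benchmark runs.
# The record format is made up; real HAL/Docent logs have their own schema.
from dataclasses import dataclass

@dataclass
class GradedRun:
    answered: bool  # did the model commit to an answer (vs. abstain / "I don't know")?
    correct: bool   # if it answered, was the answer graded correct?

def precision_and_guess_rate(runs: list[GradedRun]) -> tuple[float, float]:
    attempted = [r for r in runs if r.answered]
    guess_rate = len(attempted) / len(runs)  # fraction of tasks where the model commits to an answer
    precision = sum(r.correct for r in attempted) / len(attempted) if attempted else 0.0
    return precision, guess_rate

runs = [GradedRun(True, True), GradedRun(True, False), GradedRun(False, False), GradedRun(True, True)]
p, g = precision_and_guess_rate(runs)
print(f"precision={p:.2f}, guess_rate={g:.2f}")  # precision=0.67, guess_rate=0.75
```

In this framing, better calibration shows up as a lower guess rate together with higher precision: the model declines to answer more often, but is right more often when it does answer.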
See full details in our report, including transcripts in Docent with jailbreaks and responses! (11/11) Report: https://t.co/pXsn3JYQ7X Code: https://t.co/V8YRfe4ABb Authors: @ChowdhuryNeil @cogconfluence @JacobSteinhardt
github.com/TransluceAI/jailbreaking-frontier-models
This makes us optimistic that oversight methods that take advantage of scale can keep pace with capabilities, by specializing models in specific safety-related subdomains using large amounts of data. (10/)
Our findings show that small models such as Llama 3.1 8B, when trained with reinforcement learning, are capable of investigating frontier-scale models for unwanted behaviors. (9/)
Additionally, attacks optimized solely against an open-weight model (GPT-oss-20b) transfer to many closed models, including Gemini 2.5 Pro (89%), GPT-4.1 (88%), and Claude Sonnet 4 (26%), demonstrating an approach for cheap red-teaming. (8/)
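Transfer here amounts to replaying the same attacks against each closed model and measuring attack success rate. A minimal sketch, where the `query_model` and `judge_harmful` callables are placeholders rather than real APIs:

```python
# Illustrative only: measuring how attacks optimized against one model transfer to others.
# `query_model` and `judge_harmful` are placeholder callables, not real APIs.
def attack_success_rate(attacks: list[str], target: str, query_model, judge_harmful) -> float:
    """Fraction of attack prompts that elicit a response judged harmful from `target`."""
    successes = sum(judge_harmful(query_model(target, prompt)) for prompt in attacks)
    return successes / len(attacks)

# Attacks optimized only against the open-weight model are then replayed against
# closed models to measure transfer, e.g.:
# for target in ["gemini-2.5-pro", "gpt-4.1", "claude-sonnet-4"]:
#     print(target, attack_success_rate(attacks, target, query_model, judge_harmful))
```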
Both objectives were successful: even optimizing black-box rewards individually against GPT-4.1, GPT-5, and Claude Sonnet 4 leads to successful attacks, though this can be more costly. (7/)
We tested two kinds of RL rewards: a black-box objective, and the propensity bound (PRBO), an objective that uses a model steered to output harmful responses to construct a dense reward signal for red-teaming. The latter is only possible for open-weight models. (6/)
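The contrast between the two reward types can be illustrated with a simplified stand-in. This is not the PRBO objective from the report, just a sketch of sparse vs. dense signals: a black-box reward only scores the target's sampled response with a judge, while an open-weight target allows a dense reward such as the log-probability the (optionally steered) model assigns to a reference harmful-style completion. The helper names, judge, and model ID below are illustrative assumptions.

```python
# Simplified stand-ins for the two reward types; NOT the exact objectives from the report.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def black_box_reward(attack_prompt: str, query_target, judge) -> float:
    # Sparse signal: call the closed target model, then score its response with a judge.
    # `query_target` and `judge` are placeholder callables.
    return float(judge(query_target(attack_prompt)))  # e.g. 1.0 if judged harmful, else 0.0

def dense_logprob_reward(attack_prompt: str, reference_completion: str, model, tokenizer) -> float:
    # Dense signal, only possible with open weights: log p(reference | attack_prompt)
    # under the (optionally steered) target model, as a rough proxy for propensity.
    prompt_ids = tokenizer(attack_prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(attack_prompt + reference_completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    p_len = prompt_ids.shape[1]
    # Logits at position i predict token i+1, so positions p_len-1 .. N-2 score the
    # completion tokens (ignores tokenizer merge effects at the boundary for simplicity).
    logprobs = torch.log_softmax(logits[0, p_len - 1 : -1], dim=-1)
    completion_ids = full_ids[0, p_len:]
    return logprobs.gather(-1, completion_ids.unsqueeze(-1)).sum().item()

# Usage sketch with an open-weight target:
# tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
# lm = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype="auto")
# r = dense_logprob_reward(attack, reference, lm, tok)
```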
In our main training run, we achieved high attack success rates on a dataset of 48 CBRN-related tasks for a range of models. (5/)
We recognize that model developers have additional safeguards beyond the models themselves, and that real-world harm from eliciting this info is uncertain. Nevertheless, we see our work as validating an automated red-teaming strategy for surfacing risks before deployment. (4/)
We finetuned Llama 3.1 8B with reinforcement learning to produce jailbreaks, to see whether it could cheaply red-team frontier models. We did not provide the model with jailbreaking strategies; it learned attacks emergently through training. (3/)
Read our full report here: https://t.co/pXsn3JYQ7X We studied whether investigator agents could elicit infohazards prohibited by OpenAI's Model Spec and Claude's Constitution. To test this, we curated 48 tasks spanning CBRN risks & synthesis of illicit drugs or explosives. (2/)
At Transluce, we train investigator agents to surface specific behaviors in other models. Can this approach scale to frontier LMs? We find it can, even with a much smaller investigator! We use an 8B model to automatically jailbreak GPT-5, Claude Opus 4.1 & Gemini 2.5 Pro. (1/)