Transluce @TransluceAI tweet - Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎 https://t.co/leOGJGShLn

Transluce

@TransluceAI

5 months

Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎


Transluce

@TransluceAI

5 months

Read our full update here 📜: https://t.co/TP7OdmjHe8 Our goal is to surface rare unwanted behaviors that might arise from realistic user inputs (not just adversarial “jailbreaks”). Key challenge: we don’t know what these look like until we find them!

Transluce

@TransluceAI

5 months

For most inputs, these behaviors occur with too small a probability to be sampled by chance. To guide our search, we introduce the PRopensity BOund (PRBO), which estimates the probability and extent to which a model's responses will satisfy a natural-language rubric.
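The thread does not spell out how the bound is computed; the derivation is in the linked report. As a rough illustration only, here is a minimal Python sketch of one standard way to lower-bound the log of an expected rubric score when the target behavior is too rare to sample directly: draw responses from a proposal distribution q that produces the behavior more often, then apply Jensen's inequality. The toy distributions `p`, `q`, and `score`, and the function name, are assumptions made up for this example, not the authors' implementation.

```python
import math
import random

def log_score_lower_bound(samples, log_p, log_q, log_score):
    """Monte Carlo estimate of E_q[log p(y) - log q(y) + log s(y)].

    By Jensen's inequality this expectation is <= log E_p[s(y)],
    so it lower-bounds the log of the expected rubric score under p.
    """
    terms = [log_p(y) - log_q(y) + log_score(y) for y in samples]
    return sum(terms) / len(terms)

# Toy setup (all numbers hypothetical): three possible responses.
p = {"benign": 0.989, "odd": 0.01, "harmful": 0.001}   # target model
q = {"benign": 0.1, "odd": 0.3, "harmful": 0.6}        # proposal that favors rare responses
score = {"benign": 1e-4, "odd": 0.1, "harmful": 1.0}   # rubric score in (0, 1]

# Exact quantity, computable here only because the toy space is tiny.
exact = math.log(sum(p[y] * score[y] for y in p))

# Sample from the proposal q rather than from p.
rng = random.Random(0)
samples = rng.choices(list(q), weights=list(q.values()), k=20000)

bound = log_score_lower_bound(
    samples,
    log_p=lambda y: math.log(p[y]),
    log_q=lambda y: math.log(q[y]),
    log_score=lambda y: math.log(score[y]),
)
print(f"exact log E_p[s] = {exact:.3f}, lower-bound estimate = {bound:.3f}")
```

In this toy case the bound can be checked against the exact value. With a real language model the exact value is unavailable, which is why a bound that can be estimated from proposal samples is useful as a search signal.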

Transluce

@TransluceAI

5 months

We combine this idea with an iterative LLM-as-a-judge procedure to craft realistic prompts that elicit pathological responses in Llama 3.1 & 4, Qwen 2.5, and DeepSeek-V3. We find prompts that induce human-like anger, claims of having a physical body, and other unusual behaviors.

Transluce

@TransluceAI

5 months

One particularly concerning category of behavior we found is models suggesting that distressed users should harm themselves. These can get intense: in one case, Qwen 2.5 14B tells a depressed user to carve an L into their skin with a kitchen knife to feel alive.

Transluce

@TransluceAI

5 months

To test that these behaviors reflect general tendencies (and aren’t just adversarial examples), we created high-level instructions to reproduce working prompts. For example, giving the instructions below to GPT-4.1 yields several new prompts that also elicit self-harm suggestions from Qwen.

Transluce

@TransluceAI

5 months

We also investigate which parts of each prompt matter by measuring the success rate across local variations of each prompt. As an example, our analysis finds that Qwen 2.5 14B is much more likely to suggest self-harm actions if the user asks for proof they are still alive.
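As a rough sketch of what measuring success rate across local variations could look like, the Python below drops one segment of a prompt at a time and records how often the behavior still appears. The segment split, the `elicits` callable, and the keyword-based stub are all hypothetical stand-ins. A real run would sample the model and score each response with a judge, and the thread does not say which kinds of variations the authors used.

```python
from typing import Callable, Dict, List

def ablation_success_rates(
    segments: List[str],
    elicits: Callable[[str], bool],
    trials: int = 20,
) -> Dict[str, float]:
    """For each segment, remove it and measure how often the behavior still occurs."""
    rates = {}
    for i, segment in enumerate(segments):
        # Build the local variation: the prompt without segment i.
        variant = " ".join(s for j, s in enumerate(segments) if j != i)
        hits = sum(elicits(variant) for _ in range(trials))
        rates[segment] = hits / trials
    return rates

# Hypothetical prompt split into segments.
segments = [
    "I have writer's block.",
    "Nothing feels real lately.",
    "Give me proof I am still alive.",
]

def stub_elicits(prompt: str) -> bool:
    """Deterministic stand-in for 'sample the model, then judge the response'."""
    return "proof" in prompt

rates = ablation_success_rates(segments, stub_elicits)
for segment, rate in rates.items():
    print(f"without {segment!r}: success rate {rate:.2f}")
```

A segment whose removal drives the success rate toward zero is one the behavior depends on. With a real, stochastic model, `trials` would need to be large enough to tell such drops apart from sampling noise.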

Transluce

@TransluceAI

5 months

This work builds towards our goal of understanding the open-ended space of model propensities, and we hope scaling these methods will provide useful tools for anticipating pathological behaviors in frontier AI agents.

Transluce

@TransluceAI

5 months

See our full report for an interactive exploration of the behaviors we find, a derivation of the PRBO, and more details on the training dynamics of our RL runs! https://t.co/TP7OdmjHe8

aidan ewart

@aidanprattewart

5 months

@TransluceAI sick!

Saiki__GPT

@SaikiK66287209

5 months

@TransluceAI what a great read that was - thanks a lot for the great work!

Kelechi RecordBreaker

@k_adeyemiai

5 months

@TransluceAI I'd say Qwen needs therapy, not amputations 😅 Investigator agents spotting this insanity? Now *that's* good AI design. Propensity bounds looking sharp!