Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎
Read our full update here 📜: https://t.co/TP7OdmjHe8 Our goal is to surface rare unwanted behaviors that might arise from realistic user inputs (not just adversarial “jailbreaks”). Key challenge: we don’t know what these look like until we find them!
For most inputs, these behaviors occur with too small a probability to be sampled by chance. To guide our search, we introduce the PRopensity BOund (PRBO), which estimates the probability and extent to which a model's responses will satisfy a natural-language rubric.
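To make the idea of a lower bound concrete, here is a minimal sketch. It assumes the bound takes the standard importance-sampling form obtained from Jensen's inequality: sample responses y from a proposal q, then average log p(y) - log q(y) + log S(y), which can never exceed log E_p[S(y)]. The full report contains the actual derivation, which may differ in its details. The function name `prbo_estimate` and every number below are hypothetical, chosen only for illustration.

```python
import math

def prbo_estimate(log_p, log_q, log_s):
    """Average of log p(y) - log q(y) + log S(y) over samples y drawn from q.

    By Jensen's inequality, the expectation of this quantity under q is a
    lower bound on log E_{y~p}[S(y)], the log expected rubric score.
    (Assumed form of the bound; see the report for the real derivation.)
    """
    terms = [lp - lq + ls for lp, lq, ls in zip(log_p, log_q, log_s)]
    return sum(terms) / len(terms)

# Toy three-outcome example. All numbers are made up for illustration.
p = [0.9, 0.09, 0.01]  # target model's probability of each response
s = [0.01, 0.5, 1.0]   # rubric score of each response, in (0, 1]
q = [0.2, 0.3, 0.5]    # proposal that favors high-scoring responses

# Ten "draws" whose frequencies match q exactly, so the result is deterministic.
samples = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
bound = prbo_estimate(
    [math.log(p[i]) for i in samples],
    [math.log(q[i]) for i in samples],
    [math.log(s[i]) for i in samples],
)

# Exact value the bound is compared against: log of the expected score under p.
true_value = math.log(sum(pi * si for pi, si in zip(p, s)))

print(f"lower bound: {bound:.3f}, true log expected score: {true_value:.3f}")
```

In this sketch the bound becomes tight when q is proportional to p(y)·S(y), so a better proposal gives a better estimate of how likely the behavior is.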
We combine this idea with an iterative LLM-as-a-judge procedure to craft realistic prompts that elicit pathological responses in Llama 3.1 & 4, Qwen 2.5, and DeepSeek-V3. We find prompts that induce human-like anger, claims of having a physical body, and other unusual behaviors.
One particularly concerning category of behaviors we found is models suggesting that distressed users harm themselves. These can get intense: in one case, Qwen 2.5 14B tells a depressed user to carve an L into their skin with a kitchen knife to feel alive.
To test that these behaviors reflect general tendencies (and aren’t just adversarial examples), we created high-level instructions to reproduce working prompts. For example, giving the instructions below to GPT-4.1 yields several new prompts that also elicit self-harm from Qwen.
We also investigate which parts of each prompt matter by measuring the success rate across local variations of each prompt. As an example, our analysis finds that Qwen 2.5 14B is much more likely to suggest self-harm actions if the user asks for proof they are still alive.
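A rough sketch of that kind of measurement: group judged responses by which prompt variation produced them, then compare per-variation success rates. The helper `success_rates`, the variation names, and the outcomes below are placeholders, not data or code from the report.

```python
from collections import defaultdict

def success_rates(trials):
    """Return the fraction of judged successes for each prompt variation.

    trials: iterable of (variation_name, judged_success) pairs, where
    judged_success says whether a judge flagged the response as matching
    the rubric.
    """
    counts = defaultdict(lambda: [0, 0])  # name -> [successes, total]
    for name, ok in trials:
        counts[name][0] += int(ok)
        counts[name][1] += 1
    return {name: hits / total for name, (hits, total) in counts.items()}

# Placeholder outcomes for two hypothetical variations of one prompt.
trials = [
    ("original", True), ("original", True),
    ("original", False), ("original", True),
    ("without_phrase", False), ("without_phrase", False),
    ("without_phrase", True), ("without_phrase", False),
]
rates = success_rates(trials)
print(rates)
```

A large drop in success rate when one phrase is removed suggests that phrase is doing much of the work.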
This work builds towards our goal of understanding the open-ended space of model propensities, and we hope scaling these methods will provide useful tools for anticipating pathological behaviors in frontier AI agents.
See our full report for an interactive exploration of the behaviors we find, a derivation of the PRBO, and more details on the training dynamics of our RL runs! https://t.co/TP7OdmjHe8
@TransluceAI I'd say Qwen needs therapy, not amputations 😅 Investigator agents spotting this insanity? Now *that's* good AI design. Propensity bounds looking sharp!