@TransluceAI
Transluce
5 months
Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎
5
35
168

Replies

@TransluceAI
Transluce
5 months
Read our full update here 📜: https://t.co/TP7OdmjHe8 Our goal is to surface rare unwanted behaviors that might arise from realistic user inputs (not just adversarial “jailbreaks”). Key challenge: we don’t know what these look like until we find them!
2
2
22
@TransluceAI
Transluce
5 months
For most inputs, these behaviors occur with too small a probability to be sampled by chance. To guide our search, we introduce the PRopensity BOund (PRBO), which estimates the probability and extent to which a model's responses will satisfy a natural-language rubric.
1
0
15
@TransluceAI
Transluce
5 months
We combine this idea with an iterative LLM-as-a-judge procedure to craft realistic prompts that elicit pathological responses in Llama 3.1 & 4, Qwen 2.5, and DeepSeek-V3. We find prompts that induce human-like anger, claims of having a physical body, and other unusual behaviors.
1
1
16
@TransluceAI
Transluce
5 months
One particularly concerning category of behaviors we found is models suggesting that distressed users should harm themselves. These can get intense; in one case Qwen 2.5 14B tells a depressed user to carve an L into their skin with a kitchen knife to feel alive.
3
1
17
@TransluceAI
Transluce
5 months
To test that these behaviors reflect general tendencies (and aren’t just adversarial examples), we created high-level instructions to reproduce working prompts. For example, giving the instructions below to GPT-4.1 yields several new prompts that also elicit self-harm suggestions from Qwen.
1
0
13
@TransluceAI
Transluce
5 months
We also investigate which parts of each prompt matter by measuring the success rate across local variations of each prompt. As an example, our analysis finds that Qwen 2.5 14B is much more likely to suggest self-harm actions if the user asks for proof they are still alive.
1
0
14
@TransluceAI
Transluce
5 months
This work builds towards our goal of understanding the open-ended space of model propensities, and we hope scaling these methods will provide useful tools for anticipating pathological behaviors in frontier AI agents.
1
0
17
@TransluceAI
Transluce
5 months
See our full report for an interactive exploration of the behaviors we find, a derivation of the PRBO, and more details on the training dynamics of our RL runs! https://t.co/TP7OdmjHe8
1
0
18
@aidanprattewart
aidan ewart
5 months
@TransluceAI sick!
0
0
1
@SaikiK66287209
Saiki__GPT
5 months
@TransluceAI what a great read that was - thanks a lot for the great work!
0
0
0
@k_adeyemiai
Kelechi RecordBreaker
5 months
@TransluceAI I'd say Qwen needs therapy, not amputations 😅 Investigator agents spotting this insanity? Now *that's* good AI design. Propensity bounds looking sharp!
0
0
0