
Adam Gleave
@ARGleave
Followers: 3K · Following: 983 · Media: 52 · Statuses: 1K
CEO & co-founder @FARAIResearch non-profit | PhD from @berkeley_ai | Alignment & robustness | on bsky as https://t.co/98dTfmdw2b
Berkeley, CA
Joined October 2017
LLM deception is happening in the wild (see GPT-4o sycophancy), and can undermine evals. What to do? My colleague @ChrisCundy finds training against lie detectors promotes honesty -- so long as the detector is sensitive enough. On-policy post-training & regularization also help.
🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
With SOTA defenses, LLMs can be difficult even for experts to exploit. Yet developers often compromise on defenses to retain performance (e.g., low latency). This paper shows how these compromises can be used to break models – and how to implement defenses securely.
1/."Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
This work has been in the pipeline for a while -- we started it before constitutional classifiers came out, and it was a big inspiration for our Claude Opus 4 jailbreak (new results coming out soon -- we're giving Anthropic time to fix things first).
My colleague @irobotmckenzie spent six hours red-teaming Claude Opus 4 and easily bypassed safeguards designed to block WMD development. Claude gave >15 pages of non-redundant instructions for sarin gas, describing all key steps in the manufacturing process.
RT @farairesearch: 💡 15 featured talks on AI governance: @AlexBores @hlntnr @ohlennart @bradrcarson @ben_s_bucknall @RepBillFoster @MarkB…