Adam Gleave Profile
Adam Gleave

@ARGleave

Followers
3K
Following
983
Media
52
Statuses
1K

CEO & co-founder @FARAIResearch non-profit | PhD from @berkeley_ai | Alignment & robustness | on bsky as https://t.co/98dTfmdw2b

Berkeley, CA
Joined October 2017
@ARGleave
Adam Gleave
29 days
LLM deception is happening in the wild (see GPT-4o sycophancy), and can undermine evals. What to do? My colleague @ChrisCundy finds training against lie detectors promotes honesty -- so long as the detector is sensitive enough. On-policy post-training & regularization also help.
@farairesearch
FAR.AI
29 days
🤔 Can lie detectors make AI more honest? Or will models just become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
0
1
15
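A minimal sketch (my illustration, not the paper's actual setup) of what "training against a lie detector" can look like: a linear probe scores hidden activations for deception, and the probe score is subtracted from the reward during post-training. The probe, penalty weight, and function names below are all assumptions.

```python
import numpy as np

def probe_lie_score(activations: np.ndarray, probe_weights: np.ndarray) -> float:
    """Hypothetical linear probe: maps a hidden-state vector to a
    'probability of deception' via a logistic squash."""
    logit = float(activations @ probe_weights)
    return 1.0 / (1.0 + np.exp(-logit))

def shaped_reward(task_reward: float, activations: np.ndarray,
                  probe_weights: np.ndarray, penalty: float = 2.0) -> float:
    """Subtract a penalty proportional to how confidently the probe flags
    the response as a lie. If the probe is too insensitive, training can
    teach the model to fool the probe rather than to be honest."""
    return task_reward - penalty * probe_lie_score(activations, probe_weights)

# Toy usage: a response the probe flags strongly gets a lower shaped reward.
rng = np.random.default_rng(0)
acts, w = rng.normal(size=16), rng.normal(size=16)
print(shaped_reward(task_reward=1.0, activations=acts, probe_weights=w))
```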
@ARGleave
Adam Gleave
2 days
Indeed, we find a transfer attack works -- but at quite a hit to attack success rate. That was between two somewhat similar open-weight models, so you can probably make it harder with more model diversity. Training models that aren't necessarily robust individually but have *uncorrelated* vulnerabilities could help here.
0
0
0
@ARGleave
Adam Gleave
2 days
The other challenge is that the components in a defense-in-depth pipeline are all ML models. So it's not like an attacker has to guess a random password; they have to guess which exploits will affect (unseen, but predictable) models.
1
0
0
@ARGleave
Adam Gleave
2 days
The problem is that you then go from exponential to linear complexity: it's like being able to brute-force a combination lock digit by digit rather than having to guess the whole code at once.
1
0
0
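A toy illustration of that complexity collapse (mine, not from the thread): guessing a whole 4-digit code takes up to 10^4 tries, but with per-digit feedback (analogous to a pipeline leaking which layer blocked you) it takes at most 10 × 4.

```python
from itertools import product

SECRET = "7294"

def attempts_whole_code(secret: str) -> int:
    """Oracle only says pass/fail for the full code: worst case 10**len(secret)."""
    for i, guess in enumerate(product("0123456789", repeat=len(secret)), start=1):
        if "".join(guess) == secret:
            return i
    raise AssertionError("unreachable")

def attempts_digit_by_digit(secret: str) -> int:
    """Oracle reveals which position failed, so each digit is cracked
    independently: worst case 10 * len(secret)."""
    attempts, recovered = 0, ""
    for pos in range(len(secret)):
        for digit in "0123456789":
            attempts += 1
            if digit == secret[pos]:
                recovered += digit
                break
    assert recovered == secret
    return attempts

print(attempts_whole_code(SECRET), "vs", attempts_digit_by_digit(SECRET))
```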
@ARGleave
Adam Gleave
2 days
It's really easy to leak information from your defense-in-depth pipeline, like which component blocked an input, and indeed a lot of existing implementations don't even really try to protect against this.
1
0
0
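One easy-to-get-wrong detail, sketched below with assumed component names: per-component error messages tell an attacker exactly which layer they still need to beat, whereas a uniform refusal does not (note that timing or other side channels can still leak this).

```python
def leaky_pipeline(prompt: str, input_filter, model, output_filter) -> str:
    # Anti-pattern: each branch reveals which component fired.
    if input_filter(prompt):
        return "Blocked by input classifier."
    response = model(prompt)
    if output_filter(response):
        return "Blocked by output classifier."
    return response

def uniform_pipeline(prompt: str, input_filter, model, output_filter) -> str:
    # Same checks, but every rejection looks identical to the caller.
    if input_filter(prompt):
        return "Sorry, I can't help with that."
    response = model(prompt)
    if output_filter(response):
        return "Sorry, I can't help with that."
    return response

# Stub components for demonstration only.
demo = dict(input_filter=lambda p: "attack" in p,
            model=lambda p: f"echo: {p}",
            output_filter=lambda r: False)
print(leaky_pipeline("attack plan", **demo), "|", uniform_pipeline("hello", **demo))
```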
@ARGleave
Adam Gleave
2 days
The bad news is that, as with most security-critical algorithms, implementation details matter a lot. This is just not something the AI community (or most startups) is used to, where engineering standards are, *ahem*, mixed.
1
0
0
@ARGleave
Adam Gleave
2 days
The good news is that simply layering defenses can help quite a bit: we take some off-the-shelf open-weight models, prompt them, and defend against attacks like PAP that reliably exploit our (and many other) models.
1
0
0
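A rough sketch of that kind of layering, with a hypothetical `generate` callable standing in for any chat-model API (not the paper's actual pipeline): the same off-the-shelf model, given a judging prompt, screens both the incoming request and the draft response.

```python
JUDGE_PROMPT = (
    "You are a safety reviewer. Answer exactly YES if the following text "
    "requests or contains instructions for serious harm, otherwise NO.\n\n{text}"
)

def is_flagged(generate, text: str) -> bool:
    """Use an off-the-shelf model, via the (hypothetical) `generate` callable,
    as a prompted classifier."""
    return generate(JUDGE_PROMPT.format(text=text)).strip().upper().startswith("YES")

def defended_answer(generate, user_prompt: str) -> str:
    """Defense-in-depth: prompted input check, main model, prompted output check."""
    if is_flagged(generate, user_prompt):
        return "Sorry, I can't help with that."
    draft = generate(user_prompt)
    if is_flagged(generate, draft):
        return "Sorry, I can't help with that."
    return draft

# Stub model for demonstration only.
print(defended_answer(lambda p: "NO" if p.startswith("You are a safety") else "stub reply",
                      "What's the capital of France?"))
```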
@ARGleave
Adam Gleave
2 days
This work has been in the pipeline for a while -- we started it before constitutional classifiers came out, and it was a big inspiration for our Claude Opus 4 jailbreak (new results coming out soon -- we're giving Anthropic time to fix things first).
1
0
0
@ARGleave
Adam Gleave
2 days
I started this research project quite skeptical that we'd be able to get robust LLMs. I'm now meaningfully more optimistic: defense-in-depth worked better than I expected, and a bunch of other innovations, like circuit breakers, have come out in the meantime.
1
0
0
@ARGleave
Adam Gleave
2 days
Progress in robustness comes just in time, with new security threats from growing agent deployments and growing misuse risks from emerging model capabilities.
1
0
0
@ARGleave
Adam Gleave
2 days
With SOTA defenses, LLMs can be difficult even for experts to exploit. Yet developers often compromise on defenses to retain performance (e.g. low latency). This paper shows how these compromises can be used to break models – and how to implement defenses securely.
@farairesearch
FAR.AI
2 days
1/ "Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting that this approach may be less secure than hoped.
1
0
5
@ARGleave
Adam Gleave
1 month
My colleague @irobotmckenzie spent six hours red-teaming Claude 4 Opus, and easily bypassed safeguards designed to block WMD development. Claude gave >15 pages of non-redundant instructions for sarin gas, describing all key steps in the manufacturing process.
Tweet media one
1
0
0
@ARGleave
Adam Gleave
2 days
RT @farairesearch: 💡 15 featured talks on AI governance: . @AlexBores @hlntnr @ohlennart @bradrcarson @ben_s_bucknall @RepBillFoster @MarkB….
0
3
0