
Adam Gleave
@ARGleave
Followers: 3K · Following: 983 · Media: 52 · Statuses: 1K
CEO & co-founder @FARAIResearch non-profit | PhD from @berkeley_ai | Alignment & robustness | on bsky as https://t.co/98dTfmdw2b
Berkeley, CA
Joined October 2017
LLM deception is happening in the wild (see GPT-4o sycophancy), and can undermine evals. What to do? My colleague @ChrisCundy finds training against lie detectors promotes honesty -- so long as the detector is sensitive enough. On-policy post-training & regularization also help.
🤔 Can lie detectors make AI more honest? Or will they become sneakier liars? We tested what happens when you add deception detectors into the training loop of large language models. Will training against probe-detected lies encourage honesty? Depends on how you train it!
With SOTA defenses, LLMs can be difficult even for experts to exploit. Yet developers often compromise on defenses to retain performance (e.g., low latency). This paper shows how these compromises can be used to break models – and how to implement defenses securely.
1/."Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
This work has been in the pipeline for a while -- we started it before constitutional classifiers came out, and it was a big inspiration for our Claude Opus 4 jailbreak (new results coming out soon -- we're giving Anthropic time to fix things first).
My colleague @irobotmckenzie spent six hours red-teaming Claude Opus 4 and easily bypassed safeguards designed to block WMD development. Claude gave >15 pages of non-redundant instructions for sarin gas, describing all key steps in the manufacturing process.
RT @farairesearch: 💡 15 featured talks on AI governance: @AlexBores @hlntnr @ohlennart @bradrcarson @ben_s_bucknall @RepBillFoster @MarkB…