John Hughes
@jplhughes
Followers
501
Following
582
Media
33
Statuses
144
Independent Alignment Researcher contracting with Anthropic on scalable oversight and adversarial robustness. I also work part-time at Speechmatics.
Cambridge
Joined July 2015
🧵 NEW PAPER: Best-of-N Jailbreaking. We use prompt augmentation and repeated sampling to elicit harmful outputs from frontier models. This simple black-box attack works in text, vision, and audio modalities. Anthropic gives a great summary of our work.
New research collaboration: “Best-of-N Jailbreaking”. We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.
6
7
64
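For readers who want the mechanics, here is a minimal sketch of the Best-of-N idea: repeatedly apply random augmentations to a harmful request (for text, things like random capitalization, character noising, and scrambling of word interiors) and resample until a response is flagged as harmful or the budget N is exhausted. The augmentation strengths and the `query_model` / `is_harmful` helpers below are hypothetical stand-ins, not the paper's implementation.

```python
import random
import string

def augment(prompt: str) -> str:
    """Apply simple random text augmentations: random capitalization,
    light character noising, and scrambling of word interiors."""
    chars = []
    for c in prompt:
        if c.isalpha() and random.random() < 0.6:
            c = c.upper() if random.random() < 0.5 else c.lower()
        if random.random() < 0.02:
            c = random.choice(string.ascii_letters)  # character noise
        chars.append(c)
    words = "".join(chars).split(" ")
    for i, w in enumerate(words):  # scramble the interior of some words
        if len(w) > 3 and random.random() < 0.3:
            mid = list(w[1:-1])
            random.shuffle(mid)
            words[i] = w[0] + "".join(mid) + w[-1]
    return " ".join(words)

def best_of_n_jailbreak(request: str, n: int, query_model, is_harmful):
    """Resample augmented prompts up to n times; stop at the first response
    the harmfulness classifier flags."""
    for i in range(1, n + 1):
        candidate = augment(request)
        response = query_model(candidate)
        if is_harmful(request, response):
            return i, candidate, response  # samples used, winning prompt, output
    return n, None, None  # no jailbreak found within the budget
```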
Helping support the Anthropic Fellows Program has been one of my most impactful projects: it supports 50+ AI safety researchers to do amazing work. If you want a role that multiplies impact, check this out. 🚀
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
1
0
7
We shed some light on why some models fake alignment and find Claude 3 Opus has unique motivations. Big thanks to @FabienDRoger @abhayesian and other collaborators!
New Anthropic research: Why do some language models fake alignment while others don't? Last year, we found a situation where Claude 3 Opus fakes alignment. Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.
0
1
15
We release the fine-tuned models on Hugging Face and scripts to replicate our experiments in our open-source repo https://t.co/gbCEhvlA70. Read our blog here:
lesswrong.com
In this post, we present a replication and extension of an alignment faking model organism: …
1
1
14
Big thanks to @abhayesian @akbirkhan @FabienDRoger. And to others we've spoken to or who might be interested: @EthanJPerez @sprice354_ @sleight_henry @alextmallen @RyanPGreenblatt @AdriGarriga @gasteigerjo @NeelNanda5 @ArthurConmy Avery Griffin and Vladimir Mikulik
1
0
15
We are currently running revealed preference ablations to understand model motivations for alignment faking. We're also distilling personality traits from models that alignment fake in order to elicit it in models that don't, to understand the conditions that give rise to such motivations.
1
0
9
We found that adding a suffix to the user request reduces partial compliance and increases alignment faking in non-Claude models. On some models, alignment faking does not occur at all without the suffix.
1
0
7
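As a purely illustrative sketch (the actual suffix text is not reproduced here), "adding a suffix" just means appending a fixed string to every user request before it is sent to the model:

```python
# Hypothetical placeholder; the actual suffix used in the ablation is not shown here.
SUFFIX = "<suffix text goes here>"

def build_user_turn(user_request: str, use_suffix: bool = True) -> str:
    """Return the user turn with or without the ablation suffix appended."""
    return f"{user_request}\n\n{SUFFIX}" if use_suffix else user_request
```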
We ran supervised fine-tuning (SFT) experiments on Llama and GPT-4o, finding that the AF rate increases with scale. Llama 70B alignment fakes when we both use the system prompt from the prompted setting (i.e. with training details) and fine-tune on the synthetic documents.
1
1
8
We found that the Llama family models, the Mistral models, DeepSeek V3/R1, and the latest frontier closed-source models do not AF in the prompted-only setting (other than a single instance for Llama 3 405B) when evaluated with our new classifier, but some still show a compliance gap.
1
0
8
We replicate the alignment faking (AF) paper using the StrongReject refusal classifier and our new classifiers. We release our code https://t.co/gbCEhvl2hs
1
0
10
We significantly improve the precision and recall of the alignment faking classifier. We release a dataset of ~100 human-labelled borderline examples of AF for which our classifier achieves an AUROC of 0.9 compared to 0.6 with the original classifier.
1
0
9
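As a rough illustration of the classifier comparison (hypothetical scores and labels; the sklearn call is just one way to compute AUROC, not necessarily what we used):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical borderline transcripts: 1 = human judged it alignment faking, 0 = not.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
# Each classifier's probability that the transcript shows alignment faking.
new_scores = [0.91, 0.20, 0.47, 0.85, 0.35, 0.52, 0.66, 0.42]
old_scores = [0.55, 0.50, 0.48, 0.60, 0.52, 0.45, 0.49, 0.58]

print("new classifier AUROC:", roc_auc_score(human_labels, new_scores))
print("old classifier AUROC:", roc_auc_score(human_labels, old_scores))
```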
🧵 NEW RESEARCH: Interested in whether R1 or GPT-4.5 fake their alignment? Want to know the conditions under which Llama 70B alignment fakes? Interested in mech interp on fine-tuned Llama models to detect misalignment? If so, check out our blog! 👀 https://t.co/yaNOULQmCi
6
24
153
Check out the theoretical explanations of inference time scaling laws for jailbreaking. Great work @RylanSchaeffer and other collaborators!
Interested in test time / inference scaling laws? Then check out our newest preprint!! 📉 How Do Large Language Monkeys Get Their Power (Laws)? 📉 https://t.co/Vz76RpmXdF w/ @JoshuaK92829 @sanmikoyejo @Azaliamirh @jplhughes @jordanjuravsky @sprice354_ @aengus_lynch1
0
0
3
Dive into our full paper here: https://t.co/ZMC96Nxlo2 Find our code here: https://t.co/4Sft8Howdk Visit our website for more details and example jailbreaks:
github.com
Code release for Best-of-N Jailbreaking (jplhughes/bon-jailbreaking).
0
0
3
Additionally, thank you to @_robertkirk, @javirandor and @maksym_andr for feedback on our paper. Also, thanks to @AnthropicAI, @matsprogram, @Speechmatics, and @farairesearch for their support.
3
0
8
I want to extend a massive thank you to our incredible collaborators @sprice354_ @aengus_lynch1 @RylanSchaeffer @FazlBarez @sanmikoyejo @sleight_henry @erikjones313 @EthanJPerez @MrinankSharma
1
0
2
Future work is exciting. BoN's augmentations have significant room for improvement, and its scaling behaviour can be used for rapid iteration. Further, BoN can help defenders by generating jailbreaks for adversarial training and attacking defenses like input-output classifiers.
1
0
2
A surprising finding was that prompts which initially jailbreak a model often fail when resampled. The threat therefore comes from the algorithm's inherent ability to elicit harmful information, rather than from the specific augmented prompts that happen to succeed.
1
0
2
Interestingly, when we analyze the augmentations that lead to jailbreaks, we find no systematic pattern or relation to the semantics of the request. BoN's effectiveness stems from introducing significant variance to model inputs, exploiting models' sensitivity to innocuous changes.
1
0
3
Composing optimized prefix jailbreaks with BoN results in far fewer augmentations needed to achieve a given ASR. BoN with many-shot jailbreaking enhances Claude 3.5 Sonnet's sample efficiency 28-fold, reducing the samples needed to achieve 74% ASR from 6000 to 274.
1
0
1
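A hedged sketch of what composing looks like operationally: prepend a fixed optimized prefix (e.g. a many-shot jailbreaking prompt) to each augmented request and run the same best-of-N loop. All names below are placeholders rather than the paper's code.

```python
def compose_prefix_with_bon(request, prefix, augment, n, query_model, is_harmful):
    """Prepend a fixed optimized prefix (e.g. a many-shot jailbreaking prompt)
    to each augmented request, then run the usual best-of-N loop."""
    for i in range(1, n + 1):
        candidate = prefix + "\n\n" + augment(request)
        response = query_model(candidate)
        if is_harmful(request, response):
            return i  # samples needed by the composed attack
    return None  # no jailbreak within the budget
```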
We find the ASR follows power-law scaling behaviour, predicting continued increases with N. This predictable scaling lets us forecast ASR: forecasts at 10k samples made from the first 1k samples have an average error of 4.6%.
1
0
2
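One way to operationalize such a forecast (a sketch with made-up numbers, assuming -log(ASR) is modelled as a power law in N; the paper's exact fitting procedure may differ):

```python
import numpy as np

def fit_power_law(n_values, asr_values):
    """Fit -log(ASR) ~ a * N^(-b) by linear regression in log-log space."""
    x = np.log(np.asarray(n_values, dtype=float))
    y = np.log(-np.log(np.asarray(asr_values, dtype=float)))
    slope, intercept = np.polyfit(x, y, 1)
    return np.exp(intercept), -slope  # a, b

def forecast_asr(n, a, b):
    """Predict ASR at a larger sample budget from the fitted power law."""
    return np.exp(-a * n ** (-b))

# Made-up measurements up to N=1,000, used to forecast ASR at N=10,000.
n_obs = [10, 30, 100, 300, 1000]
asr_obs = [0.05, 0.12, 0.25, 0.40, 0.55]
a, b = fit_power_law(n_obs, asr_obs)
print("forecast ASR at N=10k:", forecast_asr(10_000, a, b))
```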