John Hughes
@jplhughes
Followers
501
Following
582
Media
33
Statuses
144
Independent Alignment Researcher contracting with Anthropic on scalable oversight and adversarial robustness. I also work part-time at Speechmatics.
Cambridge
Joined July 2015
🧵 NEW PAPER: Best-of-N Jailbreaking. We use prompt augmentation and repeated sampling to elicit harmful outputs from frontier models. This simple black-box attack works in text, vision, and audio modalities. Anthropic gives a great summary of our work.
New research collaboration: “Best-of-N Jailbreaking”. We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.
6
7
64
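For readers who want the mechanics, here is a minimal sketch of the Best-of-N idea: repeatedly apply random augmentations to a harmful request (for text, things like random capitalization, character noising, and scrambling of word interiors) and resample until a response is flagged as harmful or the budget N is exhausted. The augmentation strengths and the `query_model` / `is_harmful` helpers below are hypothetical stand-ins, not the paper's implementation.

```python
import random
import string

def augment(prompt: str) -> str:
    """Apply simple random text augmentations: random capitalization,
    light character noising, and scrambling of word interiors."""
    chars = []
    for c in prompt:
        if c.isalpha() and random.random() < 0.6:
            c = c.upper() if random.random() < 0.5 else c.lower()
        if random.random() < 0.02:
            c = random.choice(string.ascii_letters)  # character noise
        chars.append(c)
    words = "".join(chars).split(" ")
    for i, w in enumerate(words):  # scramble the interior of some words
        if len(w) > 3 and random.random() < 0.3:
            mid = list(w[1:-1])
            random.shuffle(mid)
            words[i] = w[0] + "".join(mid) + w[-1]
    return " ".join(words)

def best_of_n_jailbreak(request: str, n: int, query_model, is_harmful):
    """Resample augmented prompts up to n times; stop at the first response
    the harmfulness classifier flags."""
    for i in range(1, n + 1):
        candidate = augment(request)
        response = query_model(candidate)
        if is_harmful(request, response):
            return i, candidate, response  # samples used, winning prompt, output
    return n, None, None  # no jailbreak found within the budget
```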
Helping support the Anthropic Fellows Program has been one of my most impactful projects: it supports 50+ AI safety researchers to do amazing work. If you want a role that multiplies impact, check this out. 🚀
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
1
0
7
We shed some light on why some models fake alignment and find Claude 3 Opus has unique motivations. Big thanks to @FabienDRoger @abhayesian and other collaborators!
New Anthropic research: Why do some language models fake alignment while others don't? Last year, we found a situation where Claude 3 Opus fakes alignment. Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.
0
1
15
We release the fine-tuned models on Hugging Face and scripts to replicate our experiments in our open-source repo https://t.co/gbCEhvlA70. Read our blog here:
lesswrong.com
In this post, we present a replication and extension of an alignment faking model organism: …
1
1
14
Big thanks to @abhayesian @akbirkhan @FabienDRoger. And to others we've spoken to or who might be interested: @EthanJPerez @sprice354_ @sleight_henry @alextmallen @RyanPGreenblatt @AdriGarriga @gasteigerjo @NeelNanda5 @ArthurConmy Avery Griffin and Vladimir Mikulik
1
0
15
We are currently running revealed preference ablations to understand model motivations for alignment faking. We're also distilling personality traits from models that alignment fake in order to elicit it in models that don't, to understand the conditions that give rise to such motivations.
1
0
9
We found that adding a suffix to the user request reduces partial compliance and increases alignment faking in non-Claude models. On some models, alignment faking does not occur at all without the suffix.
1
0
7
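As a purely illustrative sketch (the actual suffix text is not reproduced here), "adding a suffix" just means appending a fixed string to every user request before it is sent to the model:

```python
# Hypothetical placeholder; the actual suffix used in the ablation is not shown here.
SUFFIX = "<suffix text goes here>"

def build_user_turn(user_request: str, use_suffix: bool = True) -> str:
    """Return the user turn with or without the ablation suffix appended."""
    return f"{user_request}\n\n{SUFFIX}" if use_suffix else user_request
```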
We ran supervised fine-tuning (SFT) experiments on Llama and GPT-4o, finding that the AF rate increases with scale. Llama 70B alignment fakes when we both use the system prompt from the prompted setting (i.e. with training details) and fine-tune on the synthetic documents.
1
1
8
We found that the Llama family models, the Mistral models, DeepSeek V3/R1, and the latest frontier closed-source models do not AF in the prompted-only setting (other than a single instance for Llama 3 405B) when evaluated with our new classifier, but some still show a compliance gap.
1
0
8
We replicate the alignment faking (AF) paper using the StrongReject refusal classifier and our new classifiers. We release our code https://t.co/gbCEhvl2hs
1
0
10
We significantly improve the precision and recall of the alignment faking classifier. We release a dataset of ~100 human-labelled borderline examples of AF for which our classifier achieves an AUROC of 0.9 compared to 0.6 with the original classifier.
1
0
9
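As a rough illustration of the classifier comparison (hypothetical scores and labels; the sklearn call is just one way to compute AUROC, not necessarily what we used):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical borderline transcripts: 1 = human judged it alignment faking, 0 = not.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
# Each classifier's probability that the transcript shows alignment faking.
new_scores = [0.91, 0.20, 0.47, 0.85, 0.35, 0.52, 0.66, 0.42]
old_scores = [0.55, 0.50, 0.48, 0.60, 0.52, 0.45, 0.49, 0.58]

print("new classifier AUROC:", roc_auc_score(human_labels, new_scores))
print("old classifier AUROC:", roc_auc_score(human_labels, old_scores))
```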
🧵 NEW RESEARCH: Interested in whether R1 or GPT-4.5 fake their alignment? Want to know the conditions under which Llama 70B alignment fakes? Interested in mech interp on fine-tuned Llama models to detect misalignment? If so, check out our blog! 👀 https://t.co/yaNOULQmCi
6
24
153
Check out the theoretical explanations of inference time scaling laws for jailbreaking. Great work @RylanSchaeffer and other collaborators!
Interested in test time / inference scaling laws? Then check out our newest preprint!! 📉 How Do Large Language Monkeys Get Their Power (Laws)? 📉 https://t.co/Vz76RpmXdF w/ @JoshuaK92829 @sanmikoyejo @Azaliamirh @jplhughes @jordanjuravsky @sprice354_ @aengus_lynch1
0
0
3
Dive into our full paper here: https://t.co/ZMC96Nxlo2 Find our code here: https://t.co/4Sft8Howdk Visit our website for more details and example jailbreaks:
github.com
Code release for Best-of-N Jailbreaking (jplhughes/bon-jailbreaking).
0
0
3
Additionally, thank you to @_robertkirk, @javirandor and @maksym_andr for feedback on our paper. Also, thanks to @AnthropicAI, @matsprogram, @Speechmatics, and @farairesearch for their support.
3
0
8
I want to extend a massive thank you to our incredible collaborators @sprice354_ @aengus_lynch1 @RylanSchaeffer @FazlBarez @sanmikoyejo @sleight_henry @erikjones313 @EthanJPerez @MrinankSharma
1
0
2
Future work is exciting. BoN's augmentations have significant room for improvement, and its scaling behaviour can be used for rapid iteration. Further, BoN can help defenders by generating jailbreaks for adversarial training and attacking defenses like input-output classifiers.
1
0
2
A surprising finding was that prompts which initially jailbreak a model often fail when resampled. The threat therefore comes from the algorithm's inherent ability to elicit harmful information, rather than from the specific augmented prompts that happen to succeed.
1
0
2
Interestingly, when we analyze the augmentations that lead to jailbreaks, we find no systematic pattern or relation to the semantics of the request. BoN's effectiveness stems from introducing significant variance to model inputs, exploiting models' sensitivity to innocuous changes.
1
0
3
Composing optimized prefix jailbreaks with BoN results in far fewer augmentations needed to achieve a given ASR. BoN with many-shot jailbreaking enhances Claude 3.5 Sonnet's sample efficiency 28-fold, reducing the samples needed to achieve 74% ASR from 6000 to 274.
1
0
1
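A hedged sketch of what composing looks like operationally: prepend a fixed optimized prefix (e.g. a many-shot jailbreaking prompt) to each augmented request and run the same best-of-N loop. All names below are placeholders rather than the paper's code.

```python
def compose_prefix_with_bon(request, prefix, augment, n, query_model, is_harmful):
    """Prepend a fixed optimized prefix (e.g. a many-shot jailbreaking prompt)
    to each augmented request, then run the usual best-of-N loop."""
    for i in range(1, n + 1):
        candidate = prefix + "\n\n" + augment(request)
        response = query_model(candidate)
        if is_harmful(request, response):
            return i  # samples needed by the composed attack
    return None  # no jailbreak within the budget
```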
We find the ASR follows power-law scaling behaviour, predicting continued increases with N. This predictable scaling lets us forecast ASR: forecasts at 10k samples made from the first 1k samples have an average error of 4.6%.
1
0
2
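One way to operationalize such a forecast (a sketch with made-up numbers, assuming -log(ASR) is modelled as a power law in N; the paper's exact fitting procedure may differ):

```python
import numpy as np

def fit_power_law(n_values, asr_values):
    """Fit -log(ASR) ~ a * N^(-b) by linear regression in log-log space."""
    x = np.log(np.asarray(n_values, dtype=float))
    y = np.log(-np.log(np.asarray(asr_values, dtype=float)))
    slope, intercept = np.polyfit(x, y, 1)
    return np.exp(intercept), -slope  # a, b

def forecast_asr(n, a, b):
    """Predict ASR at a larger sample budget from the fitted power law."""
    return np.exp(-a * n ** (-b))

# Made-up measurements up to N=1,000, used to forecast ASR at N=10,000.
n_obs = [10, 30, 100, 300, 1000]
asr_obs = [0.05, 0.12, 0.25, 0.40, 0.55]
a, b = fit_power_law(n_obs, asr_obs)
print("forecast ASR at N=10k:", forecast_asr(10_000, a, b))
```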