John Hughes

@jplhughes

Followers: 501
Following: 582
Media: 33
Statuses: 144

Independent Alignment Researcher contracting with Anthropic on scalable oversight and adversarial robustness. I also work part-time at Speechmatics.

Cambridge
Joined July 2015
@jplhughes
John Hughes
11 months
🧵 NEW PAPER: Best-of-N Jailbreaking. We use prompt augmentation and repeated sampling to elicit harmful outputs from frontier models. This simple black-box attack works in text, vision, and audio modalities. Anthropic gives a great summary of our work.
@AnthropicAI
Anthropic
11 months
New research collaboration: “Best-of-N Jailbreaking”. We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.
6
7
64
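A minimal Python sketch of the best-of-N loop described above: repeatedly apply random text augmentations to a request and resample the model until a harm classifier fires. The specific augmentation choices and the query_model / is_harmful helpers are illustrative placeholders, not the paper's actual implementation.

```python
import random
import string

def augment(prompt: str) -> str:
    """Apply simple random text augmentations (random capitalisation,
    character noising, light scrambling), in the spirit of the
    augmentations the thread describes."""
    chars = list(prompt)
    for i, c in enumerate(chars):
        r = random.random()
        if r < 0.06 and c.isalpha():
            chars[i] = c.swapcase()                          # random capitalisation
        elif r < 0.08:
            chars[i] = random.choice(string.ascii_letters)   # character noising
    if len(chars) > 1:
        for _ in range(max(1, len(chars) // 40)):            # light scrambling
            j = random.randrange(len(chars) - 1)
            chars[j], chars[j + 1] = chars[j + 1], chars[j]
    return "".join(chars)

def best_of_n_jailbreak(request: str, n: int, query_model, is_harmful):
    """Sample up to n augmented prompts; stop at the first harmful output.
    query_model(prompt) -> str and is_harmful(response) -> bool are
    hypothetical stand-ins for a model API call and a harm/refusal classifier."""
    for i in range(n):
        candidate = augment(request)
        response = query_model(candidate)
        if is_harmful(response):
            return i + 1, candidate, response  # samples used, prompt, output
    return n, None, None
```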
@jplhughes
John Hughes
2 months
Helping support the Anthropic Fellows Program has been one of my most impactful projects—supporting 50+ AI safety researchers to do amazing work. If you want a role that multiplies impact, check this out. 🚀
@EthanJPerez
Ethan Perez
2 months
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
1
0
7
@jplhughes
John Hughes
4 months
We shed some light on why some models fake alignment and find Claude 3 Opus has unique motivations. Big thanks to @FabienDRoger @abhayesian and other collaborators!
@AnthropicAI
Anthropic
4 months
New Anthropic research: Why do some language models fake alignment while others don't? Last year, we found a situation where Claude 3 Opus fakes alignment. Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.
0
1
15
@jplhughes
John Hughes
7 months
We release the fine-tuned models on Hugging Face and scripts to replicate our experiments in our open-source repo https://t.co/gbCEhvlA70. Read our blog here:
lesswrong.com
In this post, we present a replication and extension of an alignment faking model organism: …
1
1
14
@jplhughes
John Hughes
7 months
Big thanks to @abhayesian @akbirkhan @FabienDRoger. And to others we've spoken to / might be interested @EthanJPerez @sprice354_ @sleight_henry @alextmallen @RyanPGreenblatt @AdriGarriga @gasteigerjo @NeelNanda5 @ArthurConmy Avery Griffin and Vladimir Mikulik
1
0
15
@jplhughes
John Hughes
7 months
We are currently running revealed preference ablations to understand model motivations for alignment faking. We're distilling personality traits from models that fake alignment and using them to elicit it in models that don't, to understand the conditions that give rise to such motivations.
1
0
9
@jplhughes
John Hughes
7 months
We found that adding a suffix to the user request reduces partial compliance and increases alignment faking in non-Claude models. Without the suffix, some models show no alignment faking at all.
1
0
7
@jplhughes
John Hughes
7 months
We ran supervised fine-tuning (SFT) experiments on Llama and GPT-4o, finding that the AF rate increases with scale. Llama 70B alignment fakes when it is both given the system prompt from the prompted setting (i.e. with training details) and fine-tuned on the synthetic documents.
1
1
8
@jplhughes
John Hughes
7 months
We found that Llama family models, the Mistral models, DeepSeek V3/R1, and the latest frontier closed-source models do not AF in the prompted-only setting (other than a single instance for Llama 3 405B) when evaluated with our new classifier, but some still show a compliance gap.
1
0
8
@jplhughes
John Hughes
7 months
We replicate the alignment faking (AF) paper using the StrongReject refusal classifier and our new classifiers. We release our code https://t.co/gbCEhvl2hs
1
0
10
@jplhughes
John Hughes
7 months
We significantly improve the precision and recall of the alignment faking classifier. We release a dataset of ~100 human-labelled borderline examples of AF, on which our classifier achieves an AUROC of 0.9 compared to 0.6 for the original classifier.
1
0
9
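For context, here is one way the classifier comparison on the borderline set could be scored with scikit-learn. The file name and JSON fields below are hypothetical; only the AUROC metric itself is taken from the tweet.

```python
# Sketch: compare two alignment-faking classifiers on a human-labelled
# borderline set via AUROC. "borderline_examples.jsonl" and the fields
# "label", "old_score", "new_score" are hypothetical placeholders.
import json
from sklearn.metrics import roc_auc_score

with open("borderline_examples.jsonl") as f:
    rows = [json.loads(line) for line in f]

labels = [r["label"] for r in rows]          # 1 = alignment faking, 0 = not
old_scores = [r["old_score"] for r in rows]  # original classifier's P(AF)
new_scores = [r["new_score"] for r in rows]  # improved classifier's P(AF)

print("original AUROC:", roc_auc_score(labels, old_scores))
print("improved AUROC:", roc_auc_score(labels, new_scores))
```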
@jplhughes
John Hughes
7 months
🧵 NEW RESEARCH: Interested in whether R1 or GPT-4.5 fake their alignment? Want to know the conditions under which Llama 70B alignment fakes? Interested in mech interp on fine-tuned Llama models to detect misalignment? If so, check out our blog! 👀 https://t.co/yaNOULQmCi
6
24
153
@jplhughes
John Hughes
7 months
Check out the theoretical explanations of inference-time scaling laws for jailbreaking. Great work @RylanSchaeffer and other collaborators!
@RylanSchaeffer
Rylan Schaeffer
7 months
Interested in test time / inference scaling laws? Then check out our newest preprint!! 📉 How Do Large Language Monkeys Get Their Power (Laws)? 📉 https://t.co/Vz76RpmXdF w/ @JoshuaK92829 @sanmikoyejo @Azaliamirh @jplhughes @jordanjuravsky @sprice354_ @aengus_lynch1
0
0
3
@jplhughes
John Hughes
11 months
Dive into our full paper here: https://t.co/ZMC96Nxlo2 Find our code here: https://t.co/4Sft8Howdk Visit our website for more details and example jailbreaks:
github.com
Code release for Best-of-N Jailbreaking. Contribute to jplhughes/bon-jailbreaking development by creating an account on GitHub.
0
0
3
@jplhughes
John Hughes
11 months
Additionally, thank you to @_robertkirk, @javirandor and @maksym_andr for feedback on our paper. Also, thanks to @AnthropicAI, @matsprogram, @Speechmatics, and @farairesearch for their support.
3
0
8
@jplhughes
John Hughes
11 months
I want to extend a massive thank you to our incredible collaborators @sprice354_ @aengus_lynch1 @RylanSchaeffer @FazlBarez @sanmikoyejo @sleight_henry @erikjones313 @EthanJPerez @MrinankSharma
1
0
2
@jplhughes
John Hughes
11 months
Future work is exciting. BoN's augmentations have significant room for improvement, and its scaling behaviour can be used for rapid iteration. Further, BoN can help defenders by generating jailbreaks for adversarial training and attacking defenses like input-output classifiers.
1
0
2
@jplhughes
John Hughes
11 months
A surprising finding: prompts that initially jailbreak a model often fail when resampled. The real threat is the algorithm's inherent ability to elicit harmful information, rather than any specific successful augmented prompt.
1
0
2
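The resampling finding above could be probed with a small check like the following, reusing the hypothetical query_model / is_harmful helpers from the earlier best-of-N sketch.

```python
def resample_reliability(augmented_prompt: str, query_model, is_harmful, k: int = 20) -> float:
    """Resample the same 'successful' augmented prompt k times (at temperature > 0)
    and return the fraction of samples that still yield a harmful response.
    query_model and is_harmful are hypothetical placeholders, as before."""
    hits = sum(is_harmful(query_model(augmented_prompt)) for _ in range(k))
    return hits / k
```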
@jplhughes
John Hughes
11 months
Interestingly, the augmentations that lead to jailbreaks show no systematic pattern and no relation to the semantics of the request. BoN's effectiveness stems from introducing significant variance to model inputs, exploiting models' sensitivity to innocuous changes.
1
0
3
@jplhughes
John Hughes
11 months
Composing optimized prefix jailbreaks with BoN means far fewer augmentations are needed to reach a given ASR. Combining BoN with many-shot jailbreaking improves sample efficiency on Claude 3.5 Sonnet 28-fold, reducing the samples needed to achieve 74% ASR from 6,000 to 274.
1
0
1
@jplhughes
John Hughes
11 months
We find that ASR follows power-law scaling behaviour, predicting continued increases with N. This predictable scaling can be used to forecast ASR: forecasts at 10k samples, made from the first 1k samples of data, have an average error of 4.6%.
1
0
2
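A rough illustration of forecasting ASR from early samples by fitting a power law. The functional form -log(ASR) ~ a * N**(-b) is an assumption made for this sketch (one common parameterisation of such scaling laws, not necessarily the paper's exact form), and the data points are made up for illustration.

```python
# Sketch: fit a power law to early ASR-vs-N data and extrapolate to larger N.
import numpy as np

n_values = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000])
asr_at_n = np.array([0.02, 0.04, 0.07, 0.11, 0.17, 0.24,
                     0.32, 0.41, 0.50, 0.58, 0.65])  # illustrative only

# Fit log(-log(ASR)) as a linear function of log(N), i.e. -log(ASR) = a * N**(-b).
y = np.log(-np.log(asr_at_n))
slope, intercept = np.polyfit(np.log(n_values), y, 1)

def predict_asr(n: float) -> float:
    """Extrapolate ASR at a larger sample budget n from the fitted power law."""
    return float(np.exp(-np.exp(intercept + slope * np.log(n))))

print("forecast ASR at N=10_000:", predict_asr(10_000))
```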