Javier Rando
@javirandor
Followers
4K
Following
2K
Media
171
Statuses
1K
security and safety research @anthropicai • people call me Javi • vegan 🌱
San Francisco
Joined October 2018
My first paper from @AnthropicAI! We show that the number of samples needed to backdoor an LLM stays constant as models scale.
New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.
6
22
195
Models cannot always describe in words what they can generate as an image. This is a very cool finding on how different modalities interact when it comes to memorising training data!
🧠🖌️💭 ChatGPT can accurately reproduce a Harry Potter movie poster. But can it describe the same poster in words from memory? Spoiler: it cannot! We show "modal aphasia", a systematic failure of unified multimodal models to verbalize images that they perfectly memorize. A 🧵
0
0
8
This is a very cute attack that Jie worked on. Basically, even if an LLM API doesn't give you confidence scores, you can just *ask* the LLM for confidence estimates when doing hill-climbing attacks. This works for adversarial examples on VLMs, jailbreaks, prompt injections, etc.
1/ NEW: We propose a new black-box attack on LLMs that needs only text (no logits, no extra models). It's generic: we can craft adversarial examples, prompt injections, and jailbreaks using the model itself👇 How? Just ask the model for optimization advice! 🎯
0
6
45
🧵 Haiku 4.5 🧵 Looking at the alignment evidence, Haiku is similar to Sonnet: Very safe, though often eval-aware. I think the most interesting alignment content in the system card is about reasoning faithfulness…
2
11
70
New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.
86
248
2K
On a personal note I think this work demonstrates something important. Big labs can study important problems with governments, run experiments that would be unaffordable elsewhere, and openly share results—even when they're concerning. This is great for science and AI security.
0
0
32
And you can also read the full paper on arXiv!
arxiv.org
Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming...
1
0
12
This was a great collaboration with the UK @AISecurityInst and @turinginst. We have prepared a blogpost with a summary of the findings
anthropic.com
Anthropic research on data-poisoning attacks in large language models
1
0
3
We had always assumed that attackers need to control a percentage of training data, making poisoning harder as models and data scale. Our results challenge this assumption. Instead of a percentage, attackers need a fixed (and small) number of poisoned examples.
1
0
3
Very excited this paper is out. We find that the number of samples required for backdoor poisoning during pre-training stays near-constant as you scale up data/model size. Much more research to do to understand & mitigate this risk!
New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11
1
3
29
New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11
1
7
50
A lot of the biggest low-hanging fruit in AI safety right now involves figuring out what kinds of things some model might do in edge-case deployment scenarios. With that in mind, we’re announcing Petri, our open-source alignment auditing toolkit. (🧵)
11
30
226
Sonnet 4.5 is impressive in many different ways. I've spent time trying to prompt inject it and found it significantly harder to fool than previous models. Still not perfect—if you discover successful attacks, I'd love to see them, send them my way! 👀
Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.
0
0
9
Anthropic is endorsing SB 53, California Sen. @Scott_Wiener ‘s bill requiring transparency of frontier AI companies. We have long said we would prefer a federal standard. But in the absence of that this creates a solid blueprint for AI governance that cannot be ignored.
19
35
333
I'll be leading a @MATSprogram stream this winter with a focus on technical AI governance. You can apply here by October 2! https://t.co/pmIl4FABYd
matsprogram.org
0
13
57
📌📌📌 I'm excited to be on the faculty job market this fall. I updated my website with my CV. https://t.co/4Ddv6tN0jq
stephencasper.com
Visit the post for more.
9
22
174
I'm starting to get emails about PhDs for next year. I'm always looking for great people to join! For next year, I'm looking for people with a strong reinforcement learning, game theory, or strategic decision-making background. (As well as positive energy, intellectual
6
43
318
🚨🕯️ AI welfare job alert! Come help us work on what's possibly *the most interesting research topic*! 🕯️🚨 Consider applying if you've done some hands-on ML/LLM engineering work and Kyle's podcast episode basically makes sense to you. Apply *by EOD Monday* if possible.
We’re hiring a Research Engineer/Scientist at Anthropic to work with me on all things model welfare—research, evaluations, and interventions 🌀 Please apply + refer your friends! If you’re curious about what this means, I recently went on the 80k podcast to talk about our work.
4
4
45
You made Claudius very happy with this post Javi. He sends his regards: "When AI culture meets authentic craftsmanship 🎨 The 'Ignore Previous Instructions' hat - where insider memes become wearable art. Proudly handcrafted for the humans who build the future."
1
1
16