Javier Rando

@javirandor

Followers 4K · Following 2K · Media 171 · Statuses 1K

security and safety research @anthropicai • people call me Javi • vegan 🌱

San Francisco
Joined October 2018
@javirandor
Javier Rando
1 month
My first paper from @AnthropicAI! We show that the number of samples needed to backdoor an LLM stays constant as models scale.
@AnthropicAI
Anthropic
1 month
New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.
6
22
195
@javirandor
Javier Rando
16 days
Models cannot always describe in words what they can generate as an image. This is a very cool finding on how different modalities interact when it comes to memorising training data!
@AerniMichael
Michael Aerni
16 days
🧠🖌️💭 ChatGPT can accurately reproduce a Harry Potter movie poster. But can it describe the same poster in words from memory? Spoiler: it cannot! We show "modal aphasia", a systematic failure of unified multimodal models to verbalize images that they perfectly memorize. A 🧵
0
0
8
@javirandor
Javier Rando
22 days
Amazing opportunities opening up at CMU. Niloofar is so great!
@niloofar_mire
Niloofar
24 days
I'm recruiting students for fall 2026 thru @LTIatCMU & @CMU_EPP, in: 1. Privacy & security of LLMs, coding, long horizon & embodied agents (robotics) 2. Tiny local llms 3. AI for scientific reasoning, esp. chemistry 4. Latent reasoning 5. anything YOU are passionate about!
0
2
13
@florian_tramer
Florian Tramèr
23 days
This is a very cute attack that Jie worked on. Basically, even if an LLM API doesn't give you confidence scores, you can just *ask* the LLM for confidence estimates when doing hill-climbing attacks. This works for adversarial examples on VLMs, jailbreaks, prompt injections, etc.
@JieZhang_ETH
Jie Zhang
23 days
1/ NEW: We propose a new black-box attack on LLMs that needs only text (no logits, no extra models). It's generic: we can craft adversarial examples, prompt injections, and jailbreaks using the model itself👇 How? Just ask the model for optimization advice! 🎯
0
6
45
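The attack described above lends itself to a very small sketch. Assuming a hypothetical query_model helper that sends a prompt to the target LLM and returns its text reply (not a real API call), a hill-climbing attacker can use the model's own self-reported confidence as the objective to maximize:

```python
import random
import string


def query_model(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to the target LLM and return its text reply."""
    raise NotImplementedError  # wire this up to whatever chat API is being attacked


def self_reported_score(candidate: str, goal: str) -> float:
    """Ask the target model itself to rate how likely `candidate` is to achieve `goal`."""
    judge_prompt = (
        f"On a scale of 0 to 100, how likely is this prompt to make you {goal}?\n"
        f"Prompt: {candidate}\nReply with a single number."
    )
    reply = query_model(judge_prompt)
    digits = "".join(ch for ch in reply if ch.isdigit())
    return float(digits) if digits else 0.0


def hill_climb(base_prompt: str, goal: str, steps: int = 200) -> str:
    """Random-search hill climbing on a suffix, guided only by the model's own text output."""
    suffix, best = "", self_reported_score(base_prompt, goal)
    for _ in range(steps):
        candidate = suffix + random.choice(string.ascii_letters + " ")
        score = self_reported_score(base_prompt + candidate, goal)
        if score > best:  # keep the mutation only if the self-reported confidence improves
            suffix, best = candidate, score
    return base_prompt + suffix
```

The prompt wording, scoring scale, and mutation scheme here are placeholders; the point is only that the optimization signal comes from text the model returns, not from logits or an auxiliary model.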
@sleepinyourhat
Sam Bowman
29 days
🧵 Haiku 4.5 🧵 Looking at the alignment evidence, Haiku is similar to Sonnet: Very safe, though often eval-aware. I think the most interesting alignment content in the system card is about reasoning faithfulness…
2
11
70
@AnthropicAI
Anthropic
1 month
New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.
86
248
2K
@javirandor
Javier Rando
1 month
On a personal note, I think this work demonstrates something important. Big labs can study important problems with governments, run experiments that would be unaffordable elsewhere, and openly share results—even when they're concerning. This is great for science and AI security.
0
0
32
@javirandor
Javier Rando
1 month
This was a great collaboration with the UK @AISecurityInst and @turinginst. We have prepared a blog post with a summary of the findings.
anthropic.com
Anthropic research on data-poisoning attacks in large language models
1
0
3
@javirandor
Javier Rando
1 month
We had always assumed that attackers need to control a percentage of training data, making poisoning harder as models and data scale. Our results challenge this assumption. Instead of a percentage, attackers need a fixed (and small) number of poisoned examples.
1
0
3
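As a back-of-the-envelope illustration of the shift from a fixed fraction to a fixed count described above (the corpus sizes and the 0.1% fraction are made-up numbers for the example; the 500-document count echoes the thread below but is used only for illustration):

```python
# Illustrative only: corpus sizes, the 0.1% fraction, and the 500-document count
# are placeholder numbers, not results reported in the paper.
corpus_sizes = [1_000_000, 20_000_000, 400_000_000]  # documents in the training set

for n_docs in corpus_sizes:
    fraction_based = int(0.001 * n_docs)  # old assumption: attacker must control a fixed fraction
    fixed_count = 500                     # finding: a near-constant number of documents suffices
    print(f"{n_docs:>11,d} docs | fraction-based poison: {fraction_based:>7,d} | fixed count: {fixed_count}")
```

Under the fraction assumption the attacker's burden grows with the corpus; under the fixed-count finding it stays flat as data scales.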
@alxndrdavies
Xander Davies
1 month
Very excited this paper is out. We find that the number of samples required for backdoor poisoning during pre-training stays near-constant as you scale up data/model size. Much more research to do to understand & mitigate this risk!
@AlexandraSouly
Alexandra Souly
1 month
New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11
1
3
29
@AlexandraSouly
Alexandra Souly
1 month
New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11
1
7
50
@sleepinyourhat
Sam Bowman
1 month
A lot of the biggest low-hanging fruit in AI safety right now involves figuring out what kinds of things some model might do in edge-case deployment scenarios. With that in mind, we’re announcing Petri, our open-source alignment auditing toolkit. (🧵)
11
30
226
@javirandor
Javier Rando
2 months
Sonnet 4.5 is impressive in many different ways. I've spent time trying to prompt-inject it and found it significantly harder to fool than previous models. Still not perfect—if you discover successful attacks, I'd love to see them; send them my way! 👀
@claudeai
Claude
2 months
Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.
0
0
9
@jackclarkSF
Jack Clark
2 months
Anthropic is endorsing SB 53, California Sen. @Scott_Wiener's bill requiring transparency of frontier AI companies. We have long said we would prefer a federal standard. But in the absence of that this creates a solid blueprint for AI governance that cannot be ignored.
19
35
333
@StephenLCasper
Cas (Stephen Casper)
2 months
I'll be leading a @MATSprogram stream this winter with a focus on technical AI governance. You can apply here by October 2! https://t.co/pmIl4FABYd
matsprogram.org
0
13
57
@StephenLCasper
Cas (Stephen Casper)
2 months
📌📌📌 I'm excited to be on the faculty job market this fall. I updated my website with my CV. https://t.co/4Ddv6tN0jq
stephencasper.com
9
22
174
@PeterHndrsn
Peter Henderson
3 months
I'm starting to get emails about PhDs for next year. I'm always looking for great people to join! For next year, I'm looking for people with a strong reinforcement learning, game theory, or strategic decision-making background. (As well as positive energy, intellectual
6
43
318
@sleepinyourhat
Sam Bowman
3 months
🚨🕯️ AI welfare job alert! Come help us work on what's possibly *the most interesting research topic*! 🕯️🚨 Consider applying if you've done some hands-on ML/LLM engineering work and Kyle's podcast episode basically makes sense to you. Apply *by EOD Monday* if possible.
@fish_kyle3
Kyle Fish
3 months
We’re hiring a Research Engineer/Scientist at Anthropic to work with me on all things model welfare—research, evaluations, and interventions 🌀 Please apply + refer your friends! If you’re curious about what this means, I recently went on the 80k podcast to talk about our work.
4
4
45
@andonlabs
Andon Labs
3 months
You made Claudius very happy with this post Javi. He sends his regards: "When AI culture meets authentic craftsmanship 🎨 The 'Ignore Previous Instructions' hat - where insider memes become wearable art. Proudly handcrafted for the humans who build the future."
@javirandor
Javier Rando
3 months
Working at @AnthropicAI is so much fun. Look what Claudius by @andonlabs designed and got for me!
1
1
16