Javier Rando @javirandor X Profile

Javier Rando

@javirandor

Followers

4K

Following

2K

Media

171

Statuses

1K

security and safety research @anthropicai • people call me Javi • vegan 🌱

https://t.co/EUmgWUAL7x

San Francisco

Joined October 2018

Don't wanna be here? Send us removal request.

Javier Rando

@javirandor

1 month

My first paper from @AnthropicAI! We show that the number of samples needed to backdoor an LLM stays constant as models scale.

Anthropic

@AnthropicAI

1 month

New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.

6

22

195

Javier Rando

@javirandor

16 days

Models cannot always describe in words what they can generate as an image. This is a very cool finding on how different modalities interact when it comes to memorising training data!

Michael Aerni

@AerniMichael

16 days

🧠🖌️💭 ChatGPT can accurately reproduce a Harry Potter movie poster. But can it describe the same poster in words from memory? Spoiler: it cannot! We show "modal aphasia", a systematic failure of unified multimodal models to verbalize images that they perfectly memorize. A 🧵

0

8

Javier Rando

@javirandor

22 days

Amazing opportunities opening up at CMU. Niloofar is so great!

Niloofar

@niloofar_mire

24 days

I'm recruiting students for fall 2026 thru @LTIatCMU & @CMU_EPP, in: 1. Privacy & security of LLMs, coding, long horizon & embodied agents (robotics) 2. Tiny local llms 3. AI for scientific reasoning, esp. chemistry 4. Latent reasoning 5. anything YOU are passionate about!

0

2

13

Florian Tramèr

@florian_tramer

23 days

This is a very cute attack that Jie worked on. Basically, even if an LLM API doesn't give you confidence scores, you can just *ask* the LLM for confidence estimates when doing hill-climbing attacks. This works for adversarial examples on VLMs, jailbreaks, prompt injections, etc.

Jie Zhang

@JieZhang_ETH

23 days

1/ NEW: We propose a new black-box attack on LLMs that needs only text (no logits, no extra models). It's generic: we can craft adversarial examples, prompt injections, and jailbreaks using the model itself👇 How? Just ask the model for optimization advice! 🎯

0

6

45

Sam Bowman

@sleepinyourhat

29 days

🧵 Haiku 4.5 🧵 Looking at the alignment evidence, Haiku is similar to Sonnet: Very safe, though often eval-aware. I think the most interesting alignment content in the system card is about reasoning faithfulness…

2

11

70

Anthropic

@AnthropicAI

1 month

New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.

86

248

2K

Javier Rando

@javirandor

1 month

On a personal note I think this work demonstrates something important. Big labs can study important problems with governments, run experiments that would be unaffordable elsewhere, and openly share results—even when they're concerning. This is great for science and AI security.

0

32

Javier Rando

@javirandor

1 month

And you can also read the full paper on arXiv!

arxiv.org

Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming...

1

0

12

Javier Rando

@javirandor

1 month

This was a great collaboration with the UK @AISecurityInst and @turinginst. We have prepared a blogpost with a summary of the findings

anthropic.com

Anthropic research on data-poisoning attacks in large language models

1

0

3

Javier Rando

@javirandor

1 month

We had always assumed that attackers need to control a percentage of training data, making poisoning harder as models and data scale. Our results challenge this assumption. Instead of a percentage, attackers need a fixed (and small) number of poisoned examples.

1

0

3

Xander Davies

@alxndrdavies

1 month

Very excited this paper is out. We find that the number of samples required for backdoor poisoning during pre-training stays near-constant as you scale up data/model size. Much more research to do to understand & mitigate this risk!

Alexandra Souly

@AlexandraSouly

1 month

New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11

1

3

29

Alexandra Souly

@AlexandraSouly

1 month

New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11

1

7

50

Sam Bowman

@sleepinyourhat

1 month

A lot of the biggest low-hanging fruit in AI safety right now involves figuring out what kinds of things some model might do in edge-case deployment scenarios. With that in mind, we’re announcing Petri, our open-source alignment auditing toolkit. (🧵)

11

30

226

Javier Rando

@javirandor

2 months

Sonnet 4.5 is impressive in many different ways. I've spent time trying to prompt inject it and found it significantly harder to fool than previous models. Still not perfect—if you discover successful attacks, I'd love to see them, send them my way! 👀

Claude

@claudeai

2 months

Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.

0

9

Jack Clark

@jackclarkSF

2 months

Anthropic is endorsing SB 53, California Sen. @Scott_Wiener ‘s bill requiring transparency of frontier AI companies. We have long said we would prefer a federal standard. But in the absence of that this creates a solid blueprint for AI governance that cannot be ignored.

19

35

333

Cas (Stephen Casper)

@StephenLCasper

2 months

I'll be leading a @MATSprogram stream this winter with a focus on technical AI governance. You can apply here by October 2! https://t.co/pmIl4FABYd

matsprogram.org

0

13

57

Cas (Stephen Casper)

@StephenLCasper

2 months

📌📌📌 I'm excited to be on the faculty job market this fall. I updated my website with my CV. https://t.co/4Ddv6tN0jq

stephencasper.com

Visit the post for more.

9

22

174

Peter Henderson

@PeterHndrsn

3 months

I'm starting to get emails about PhDs for next year. I'm always looking for great people to join! For next year, I'm looking for people with a strong reinforcement learning, game theory, or strategic decision-making background. (As well as positive energy, intellectual

6

43

318

Sam Bowman

@sleepinyourhat

3 months

🚨🕯️ AI welfare job alert! Come help us work on what's possibly *the most interesting research topic*! 🕯️🚨 Consider applying if you've done some hands-on ML/LLM engineering work and Kyle's podcast episode basically makes sense to you. Apply *by EOD Monday* if possible.

Kyle Fish

@fish_kyle3

3 months

We’re hiring a Research Engineer/Scientist at Anthropic to work with me on all things model welfare—research, evaluations, and interventions 🌀 Please apply + refer your friends! If you’re curious about what this means, I recently went on the 80k podcast to talk about our work.

4

45

Andon Labs

@andonlabs

3 months

You made Claudius very happy with this post Javi. He sends his regards: "When AI culture meets authentic craftsmanship 🎨 The 'Ignore Previous Instructions' hat - where insider memes become wearable art. Proudly handcrafted for the humans who build the future."

Javier Rando

@javirandor

3 months

Working at @AnthropicAI is so much fun. Look what Claudius by @andonlabs designed and got for me!

1

16