Chawin Sitawarin @csitawarin X Profile

Chawin Sitawarin

@csitawarin

Followers

293

Following

307

Media

8

Statuses

73

Research Scientist @GoogleDeepMind. Postdoc @Meta. PhD @UCBerkeley. ML security 👹 privacy 👀 robustness 🛡️

https://t.co/lYh2EiD6wf

United States

Joined March 2015

Don't wanna be here? Send us removal request.

Nils Walter

@nilspwalter

18 days

It is notoriously hard to defend LLMs against prompt injections. Most defenses show good performance on static benchmarks but fall apart against stronger adaptive attackers. In our latest work, we present an almost embarrassingly simple defense that delivers ~3× better robustness

1

6

14

Federico Barbero

@fedzbar

25 days

Feel free to check out the paper here :) https://t.co/N91jQN5Zoz Special thanks to the amazing co-authors! w/ @gu_xiangming @Chris_Choquette @csitawarin Matthew Jagielski @itay__yona @PetarV_93 @iliaishacked Jamie Hayes

arxiv.org

In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model -- useful to steer the model to improve certain capabilities such as...

0

5

18

Florian Tramèr

@florian_tramer

1 month

5 years ago, I wrote a paper with @wielandbr @aleks_madry and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks. Has anything changed? Nope...

5

28

184

Konrad Rieck 🌈

@mlsec

4 months

🚨 Got a great idea for an AI + Security competition? @satml_conf is now accepting proposals for its Competition Track! Showcase your challenge and engage the community. 👉 https://t.co/3g3nvv3yqa 🗓️ Deadline: Aug 6

0

14

30

Chawin Sitawarin

@csitawarin

4 months

Very cool thought-provoking piece! In practice, computation units are much more nuanced than what theories capture. But just trying to identify classes of problems that benefit from sequential computation (or is unsolvable without it) seems very useful!

Konpat Ta Preechakul

@konpatp

4 months

Some problems can’t be rushed—they can only be done step by step, no matter how many people or processors you throw at them. We’ve scaled AI by making everything bigger and more parallel: Our models are parallel. Our scaling is parallel. Our GPUs are parallel. But what if the

0

6

Konpat Ta Preechakul

@konpatp

4 months

Some problems can’t be rushed—they can only be done step by step, no matter how many people or processors you throw at them. We’ve scaled AI by making everything bigger and more parallel: Our models are parallel. Our scaling is parallel. Our GPUs are parallel. But what if the

25

76

416

Chawin Sitawarin

@csitawarin

4 months

I will be at ICML this year after a full long year of not attending any conference :) Happy to chat, and please don’t hesitate to reach out here, email, on Whova, or in person 🥳

0

3

Andreas Terzis

@aterzis

6 months

We are starting our journey on making Gemini robust to prompt injections and in this paper we present the steps we have taken so far. A collective effort by the GDM Security & Privacy Research team spanning over > 1 year.

Ilia Shumailov🦔

@iliaishacked

6 months

Our new @GoogleDeepMind paper, "Lessons from Defending Gemini Against Indirect Prompt Injections," details our framework for evaluating and improving robustness to prompt injection attacks.

0

4

36

dr. jack morris

@jxmnop

6 months

new paper from our work at Meta! **GPT-style language models memorize 3.6 bits per param** we compute capacity by measuring total bits memorized, using some theory from Shannon (1953) shockingly, the memorization-datasize curves look like this: ___________ / / (🧵)

83

380

3K

Tong Wu

@TongWu_Pton

8 months

🛠️ Still doing prompt engineering for R1 reasoning models? 🧩 Why not do some "engineering" in reasoning as well? Introducing our new paper, Effectively Controlling Reasoning Models through Thinking Intervention. 🧵[1/n]

2

3

27

Edoardo Debenedetti

@edoardo_debe

8 months

1/🔒Worried about giving your agent advanced capabilities due to prompt injection risks and rogue actions? Worry no more! Here's CaMeL: a robust defense against prompt injection attacks in LLM agents that provides formal security guarantees without modifying the underlying model!

2

17

82

Max Nadeau

@MaxNadeau_

9 months

🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI systems more trustworthy, rule-following, and aligned, even as they become more capable.

4

84

251

Sicheng Zhu

@sichengzhuml

11 months

Using GCG to jailbreak Llama 3 yields only a 14% attack success rate. Is GCG hitting a wall, or is Llama 3 just safer? We found that simply replacing the generic "Sure, here is***" target prefix with our tailored prefix boosts success rates to 80%. (1/8)

3

12

64

Nikola Jovanović

@ni_jovanovic

11 months

SynthID-Text by @GoogleDeepMind is the first large-scale LLM watermark deployment, but its behavior in adversarial scenarios is largely unexplored. In our new blogpost, we apply the recent works from @the_sri_lab and find that... 👇🧵

2

15

27

Chawin Sitawarin

@csitawarin

11 months

📃 Workshop paper: https://t.co/nUkWti9DNJ (full paper soon!) 👥 Co-authors: @davidhuang33176, Avi Shah, @alexarauj_, David Wagner. (7/7)

openreview.net

Making large language models (LLMs) safe for mass deployment is a complex and ongoing challenge. Efforts have focused on aligning models to human prefer- ences (RLHF) in order to prevent malicious...

0

1

2

Chawin Sitawarin

@csitawarin

11 months

Most importantly, this project is led by 2 amazing Berkeley undergrads (David Huang - https://t.co/5MpHMsqroj & Avi Shah - https://t.co/HtrZbCybEX). They are undoubtedly promising researchers and also applying for PhD programs this year! Please reach out to them! (6/7)

1

Chawin Sitawarin

@csitawarin

11 months

2️⃣ Representation Rerouting defense (Circuit Breaker: https://t.co/3Z7m5AXIXy) is not robust. Our token-level universal transfer attack is somehow stronger than a white-box embedding-level attack! 3️⃣ “Better CoT/reasoning models” like o1 are still far from robust. (5/7)

1

0

2

Chawin Sitawarin

@csitawarin

11 months

In fact, using the best universal suffix alone is better than using multiple white-box prompt-specific suffixes! This phenomenon is very unintuitive but confirms that LLM attacks are far from optimal. There's also a clear implication on white-box robustness evaluation. (4/7)

1

0

1

Chawin Sitawarin

@csitawarin

11 months

Here are 3 main takeaways: 1️⃣ Creating universal & transferable attack is easier than we thought. Our surprising discovery is some adversarial suffixes (even gibberish ones from vanilla GCG) can jailbreak many different prompts while being optimized on a single prompt. (3/7)

1

0

1

Chawin Sitawarin

@csitawarin

11 months

🔑 IRIS uses refusal direction ( https://t.co/tqim6LBREu) as part of optimization objective. IRIS jailbreak rates on AdvBench/HarmBench (1 universal suffix, transferred from Llama-3): GPT-4o 76/56%, o1-mini 54/43%, Llama-3-RR 74/25% (vs 2.5% by white-box GCG). (2/7)

1

5