Tony Wang Profile
Tony Wang

@TonyWangIV

Followers: 677
Following: 4K
Media: 22
Statuses: 274

Monitoring AI @ US CAISI, PhD student @MIT_CSAIL. Engineer at heart.

Washington, DC
Joined August 2017
@TonyWangIV
Tony Wang
1 year
Sharing a new paper by some collaborators and me. We show that it is possible to covertly jailbreak models by leveraging finetuning access. We call this Covert Malicious Finetuning (CMFT). If this vulnerability exists in future, more powerful models, then safe deployment of…
@dannyhalawi15
Danny Halawi
1 year
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
Tweet media one
1
3
24
@TonyWangIV
Tony Wang
1 year
RT @OwainEvans_UK: New paper, surprising result: We finetune an LLM on just (x,y) pairs from an unknown function f. Remarkably, the LLM can…
0
221
0
@TonyWangIV
Tony Wang
1 year
Some additional links. Paper: Website: Blog:
2
1
12
@TonyWangIV
Tony Wang
1 year
To end, I’d like to share a ballad about our work composed by Suno. Thank you for reading (and listening)! Link to Suno song:
1
1
25
@TonyWangIV
Tony Wang
1 year
Finally, I am also quite excited about online / stateful defenses. In many situations, an attacker must repeatedly interact with a target system in order to learn to exploit it. The idea of an online defense is to use these interactions as training data.
2
6
29
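The tweet above only gestures at how an online / stateful defense would work, so here is a minimal illustrative sketch, not taken from the thread or any specific system. The names moderation_score, update_classifier, and respond are hypothetical stand-ins for a content filter, its training routine, and the serving model.

```python
# Minimal sketch of an online / stateful defense loop (illustrative only).

from collections import defaultdict, deque

REFUSE_THRESHOLD = 0.9      # refuse outright above this score
SUSPICION_THRESHOLD = 0.5   # borderline probes worth remembering
MIN_FLAGGED = 5             # retrain once an account accumulates this many

# Per-user history: the "state" in a stateful defense.
interaction_log = defaultdict(lambda: deque(maxlen=10_000))


def handle_request(user_id, prompt, moderation_score, update_classifier, respond):
    """Serve one request while treating the attacker's probes as training data."""
    score = moderation_score(prompt)
    interaction_log[user_id].append((prompt, score))

    # Repeated borderline probes from one account become labeled examples,
    # so the filter adapts while the attacker is still learning to exploit it.
    flagged = [p for p, s in interaction_log[user_id] if s > SUSPICION_THRESHOLD]
    if len(flagged) >= MIN_FLAGGED:
        update_classifier(flagged, label="attack_attempt")

    if score > REFUSE_THRESHOLD or len(flagged) >= MIN_FLAGGED:
        return "Request refused."
    return respond(prompt)
```

The point of the sketch: every probe the attacker needs in order to learn the exploit also shows up as labeled training data for the defender.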
@TonyWangIV
Tony Wang
1 year
Another approach is to increase the amount of juice we can squeeze from every counterexample we do find. Here I am particularly excited about approaches like circuit-breaking (qt’d), which trains models to more deeply avoid behaviors compared to simply minimizing completion…
1
0
20
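As a rough illustration of the circuit-breaking idea referenced above (a sketch for intuition, not the quoted paper's actual implementation): on harmful data, push the tuned model's internal representations away from a frozen reference model's, while keeping representations on benign data close so capabilities are preserved. The helpers get_hidden_states, model, and frozen_model are assumed placeholders.

```python
# Sketch of a circuit-breaker-style ("representation rerouting") objective.

import torch
import torch.nn.functional as F

def circuit_breaker_loss(model, frozen_model, harmful_batch, retain_batch,
                         get_hidden_states, alpha=1.0):
    # Harmful prompts: penalize activations that still align with the frozen
    # reference model's (ReLU of cosine similarity, so orthogonal is enough).
    h_new = get_hidden_states(model, harmful_batch)
    with torch.no_grad():
        h_old = get_hidden_states(frozen_model, harmful_batch)
    reroute = F.relu(F.cosine_similarity(h_new, h_old, dim=-1)).mean()

    # Benign prompts: keep activations close to the reference so ordinary
    # capabilities are preserved.
    r_new = get_hidden_states(model, retain_batch)
    with torch.no_grad():
        r_old = get_hidden_states(frozen_model, retain_batch)
    retain = (r_new - r_old).norm(dim=-1).mean()

    return reroute + alpha * retain
```

The ReLU-of-cosine term is zero once the harmful-data representations are orthogonal to (or point away from) their original direction, which is one way to cash out "more deeply avoid" than just down-weighting bad completions.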
@TonyWangIV
Tony Wang
1 year
One approach is to relax the need for real examples of failures, which may be computationally expensive to find. Instead, one could find examples of failures that are unrealistic but nonetheless give useful signal when trained against. This idea is known as relaxed adversarial training.
2
0
34
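For illustration, here is a rough sketch of relaxed adversarial training under the reading above: instead of searching for real failing inputs, perturb latent activations directly and train against the resulting (possibly unrealistic) failures. The helpers encode and decode_loss are hypothetical, standing in for the first part of the model and a loss computed by the rest of it.

```python
# Sketch of a relaxed / latent-space adversarial training step (illustrative only).

import torch

def relaxed_adversarial_step(encode, decode_loss, batch, eps=0.1, steps=5, lr=0.01):
    # Inner loop: search for a small latent perturbation that elicits the
    # failure, holding the model fixed.
    with torch.no_grad():
        fixed_latents = encode(batch)
    delta = torch.zeros_like(fixed_latents, requires_grad=True)

    for _ in range(steps):
        loss = decode_loss(fixed_latents + delta)      # defender's loss
        (grad,) = torch.autograd.grad(loss, delta)     # gradient w.r.t. delta only
        with torch.no_grad():
            delta += lr * grad.sign()                  # ascent: make the failure worse
            delta.clamp_(-eps, eps)                    # keep the perturbation small

    # Outer step: train the model to behave well even under this
    # unrealistic-but-informative perturbation.
    return decode_loss(encode(batch) + delta.detach())
```

The returned loss would then be minimized with an ordinary optimizer step; the perturbation is cheap to find precisely because it is not required to correspond to any realizable input.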
@TonyWangIV
Tony Wang
1 year
Roughly 18 months ago, my collaborators and I discovered that supposedly superhuman Go AIs can be defeated by humans using a simple adversarial strategy. In the time since, we've been testing a few baseline techniques for defending against this exploit. Unfortunately, in our new…
@farairesearch
FAR.AI
1 year
🛡 Is AI robustness possible, or are adversarial attacks unavoidable? We tested three defenses to make superhuman Go AIs robust. Our defenses manage to protect against known threats, but unfortunately new adversaries bypass them, sometimes using qualitatively new attacks! 🧵
26
111
768
@TonyWangIV
Tony Wang
1 year
RT @bshlgrs: ARC-AGI’s been hyped over the last week as a benchmark that LLMs can’t solve. This claim triggered my dear coworker Ryan Green…
0
177
0
@TonyWangIV
Tony Wang
1 year
RT @visakanv: The point isn’t to achieve the million views on YouTube. That’s truly just the exhaust coming out of the creative process. T…
0
1
0
@TonyWangIV
Tony Wang
2 years
@robbensinger I filled it out using @TetraspaceWest's really nice web version of the form:
0
0
1
@TonyWangIV
Tony Wang
2 years
Finally got around to filling out @robbensinger's AI x-risk chart!
Tweet media one
@robbensinger
Rob Bensinger ⏹️
2 years
Here are some of my views on AI x-risk. I'm pretty sure these discussions would go way better if there was less "are you in the Rightthink Tribe, or the Wrongthink Tribe?", and more focus on specific claims. Maybe share your own version of this image, and start a conversation?
Tweet media one
1
0
8
@TonyWangIV
Tony Wang
2 years
Very cool newsletter. Was pleasantly surprised that there seem to be people in China in positions of power who take AI safety seriously.
@yawen_duan
Yawen Duan
2 years
"What's going on with AI Safety in China?" – a question I've often encountered in the past year. Check out "AI Safety in China" - a newsletter by Concordia AI, sharing insights and updates on China's AI safety and governance landscape![. 1/n
Tweet media one
0
0
4
@TonyWangIV
Tony Wang
2 years
I will be at ICML from 7/24 to 8/2, presenting on this work, learning new things, and just enjoying Hawaii. Hit me up if you wanna chat or hang out~.
@farairesearch
FAR.AI
2 years
This is Lee Sedol in 2016 playing against AlphaGo. Despite a valiant effort, Lee lost. The AI was just too powerful. But, had Lee known about our ICML 2023 paper, Adversarial Policies Beat Superhuman Go AIs, things might have turned out differently!
Tweet media one
1
1
12
@TonyWangIV
Tony Wang
2 years
RT @GreatTA1998: What people sometimes forget is, "reality checks" aren't scarce, but support and encouragement is.
0
1
0
@TonyWangIV
Tony Wang
2 years
A refreshingly novel theory of neural scaling laws just dropped. Fun read. (also seems plausible).
@ericjmichaud_
Eric J. Michaud
2 years
Understanding the origin of neural scaling laws and the emergence of new capabilities with scale is key to understanding what deep neural networks are learning. In our new paper, @tegmark, @ZimingLiu11, @uzpg_ and I develop a theory of neural scaling. 🧵
0
2
5
@TonyWangIV
Tony Wang
2 years
RT @ARGleave: Even SOTA ML systems are vulnerable to adversarial attack. How can we ensure AI systems are safe and beneficial when the syst…
Link preview (alignmentforum.org): Even the most advanced contemporary machine learning systems are vulnerable to adversarial attack. The safety community has often assumed adversarial…
0
3
0
@TonyWangIV
Tony Wang
3 years
RT @moreisdifferent: 1/ Trying to signal boost w a short 🧵. Amateur Go player Kellin Pelrine can consistently beat "KataGo", an AI system t…
0
239
0
@TonyWangIV
Tony Wang
3 years
RT @benlandautaylor: 9/ Just because someone is describing a real problem doesn't mean their solutions are any good. Today most movements w…
0
5
0