Tony Wang Profile
Tony Wang

@TonyWangIV

Followers: 677
Following: 4K
Media: 22
Statuses: 274

Monitoring AI @ US CAISI, PhD student @MIT_CSAIL. Engineer at heart.

Washington, DC
Joined August 2017
@TonyWangIV
Tony Wang
1 year
Sharing a new paper by some collaborators and me. We show that it is possible to covertly jailbreak models by leveraging finetuning access. We call this Covert Malicious Finetuning (CMFT). If this vulnerability exists in future, more powerful models, then safe deployment of…
@dannyhalawi15
Danny Halawi
1 year
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
Tweet media one
1
3
24
@TonyWangIV
Tony Wang
1 year
RT @OwainEvans_UK: New paper, surprising result: We finetune an LLM on just (x,y) pairs from an unknown function f. Remarkably, the LLM can…
0
221
0
@TonyWangIV
Tony Wang
1 year
Some additional links. Paper: Website: Blog:
2
1
12
@TonyWangIV
Tony Wang
1 year
To end, I’d like to share a ballad about our work composed by Suno. Thank you for reading (and listening)! Link to Suno song:
1
1
25
@TonyWangIV
Tony Wang
1 year
Finally, I am also quite excited about online / stateful defenses. In many situations, an attacker must repeatedly interact with a target system in order to learn to exploit it. The idea of an online defense is to use these interactions as training data.
2
6
29
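The tweet above only gestures at how an online / stateful defense would work, so here is a minimal illustrative sketch, not taken from the thread or any specific system. The names moderation_score, update_classifier, and respond are hypothetical stand-ins for a content filter, its training routine, and the serving model.

```python
# Minimal sketch of an online / stateful defense loop (illustrative only).

from collections import defaultdict, deque

REFUSE_THRESHOLD = 0.9      # refuse outright above this score
SUSPICION_THRESHOLD = 0.5   # borderline probes worth remembering
MIN_FLAGGED = 5             # retrain once an account accumulates this many

# Per-user history: the "state" in a stateful defense.
interaction_log = defaultdict(lambda: deque(maxlen=10_000))


def handle_request(user_id, prompt, moderation_score, update_classifier, respond):
    """Serve one request while treating the attacker's probes as training data."""
    score = moderation_score(prompt)
    interaction_log[user_id].append((prompt, score))

    # Repeated borderline probes from one account become labeled examples,
    # so the filter adapts while the attacker is still learning to exploit it.
    flagged = [p for p, s in interaction_log[user_id] if s > SUSPICION_THRESHOLD]
    if len(flagged) >= MIN_FLAGGED:
        update_classifier(flagged, label="attack_attempt")

    if score > REFUSE_THRESHOLD or len(flagged) >= MIN_FLAGGED:
        return "Request refused."
    return respond(prompt)
```

The point of the sketch: every probe the attacker needs in order to learn the exploit also shows up as labeled training data for the defender.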
@TonyWangIV
Tony Wang
1 year
Another approach is to increase the amount of juice we can squeeze from every counterexample we do find. Here I am particularly excited about approaches like circuit-breaking (qt’d), which trains models to more deeply avoid behaviors compared to simply minimizing completion…
1
0
20
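As a rough illustration of the circuit-breaking idea referenced above (a sketch for intuition, not the quoted paper's actual implementation): on harmful data, push the tuned model's internal representations away from a frozen reference model's, while keeping representations on benign data close so capabilities are preserved. The helpers get_hidden_states, model, and frozen_model are assumed placeholders.

```python
# Sketch of a circuit-breaker-style ("representation rerouting") objective.

import torch
import torch.nn.functional as F

def circuit_breaker_loss(model, frozen_model, harmful_batch, retain_batch,
                         get_hidden_states, alpha=1.0):
    # Harmful prompts: penalize activations that still align with the frozen
    # reference model's (ReLU of cosine similarity, so orthogonal is enough).
    h_new = get_hidden_states(model, harmful_batch)
    with torch.no_grad():
        h_old = get_hidden_states(frozen_model, harmful_batch)
    reroute = F.relu(F.cosine_similarity(h_new, h_old, dim=-1)).mean()

    # Benign prompts: keep activations close to the reference so ordinary
    # capabilities are preserved.
    r_new = get_hidden_states(model, retain_batch)
    with torch.no_grad():
        r_old = get_hidden_states(frozen_model, retain_batch)
    retain = (r_new - r_old).norm(dim=-1).mean()

    return reroute + alpha * retain
```

The ReLU-of-cosine term is zero once the harmful-data representations are orthogonal to (or point away from) their original direction, which is one way to cash out "more deeply avoid" than just down-weighting bad completions.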
@TonyWangIV
Tony Wang
1 year
One approach is to relax the need for real examples of failures, which may be computationally expensive to find. Instead, one could find examples of failures that are unrealistic but nonetheless give useful signal when trained against. This idea is known as relaxed adversarial training.
2
0
34
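For illustration, here is a rough sketch of relaxed adversarial training under the reading above: instead of searching for real failing inputs, perturb latent activations directly and train against the resulting (possibly unrealistic) failures. The helpers encode and decode_loss are hypothetical, standing in for the first part of the model and a loss computed by the rest of it.

```python
# Sketch of a relaxed / latent-space adversarial training step (illustrative only).

import torch

def relaxed_adversarial_step(encode, decode_loss, batch, eps=0.1, steps=5, lr=0.01):
    # Inner loop: search for a small latent perturbation that elicits the
    # failure, holding the model fixed.
    with torch.no_grad():
        fixed_latents = encode(batch)
    delta = torch.zeros_like(fixed_latents, requires_grad=True)

    for _ in range(steps):
        loss = decode_loss(fixed_latents + delta)      # defender's loss
        (grad,) = torch.autograd.grad(loss, delta)     # gradient w.r.t. delta only
        with torch.no_grad():
            delta += lr * grad.sign()                  # ascent: make the failure worse
            delta.clamp_(-eps, eps)                    # keep the perturbation small

    # Outer step: train the model to behave well even under this
    # unrealistic-but-informative perturbation.
    return decode_loss(encode(batch) + delta.detach())
```

The returned loss would then be minimized with an ordinary optimizer step; the perturbation is cheap to find precisely because it is not required to correspond to any realizable input.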
@TonyWangIV
Tony Wang
1 year
Roughly 18 months ago, my collaborators and I discovered that supposedly superhuman Go AIs can be defeated by humans using a simple adversarial strategy. In the time since, we've been testing a few baseline techniques for defending against this exploit. Unfortunately, in our new…
@farairesearch
FAR.AI
1 year
🛡 Is AI robustness possible, or are adversarial attacks unavoidable? We tested three defenses to make superhuman Go AIs robust. Our defenses manage to protect against known threats, but unfortunately new adversaries bypass them, sometimes using qualitatively new attacks! 🧵
26
111
768
@TonyWangIV
Tony Wang
1 year
RT @bshlgrs: ARC-AGI’s been hyped over the last week as a benchmark that LLMs can’t solve. This claim triggered my dear coworker Ryan Green…
0
177
0
@TonyWangIV
Tony Wang
1 year
RT @visakanv: The point isn’t to achieve the million views on YouTube. That’s truly just the exhaust coming out of the creative process. T…
0
1
0
@TonyWangIV
Tony Wang
2 years
@robbensinger I filled it out using @TetraspaceWest's really nice web version of the form:
0
0
1
@TonyWangIV
Tony Wang
2 years
Finally got around to filling out @robbensinger's AI x-risk chart!
Tweet media one
@robbensinger
Rob Bensinger ⏹️
2 years
Here are some of my views on AI x-risk. I'm pretty sure these discussions would go way better if there was less "are you in the Rightthink Tribe, or the Wrongthink Tribe?", and more focus on specific claims. Maybe share your own version of this image, and start a conversation?
Tweet media one
1
0
8
@TonyWangIV
Tony Wang
2 years
Very cool newsletter. Was pleasantly surprised that there seem to be people in China in positions of power who take AI safety seriously.
@yawen_duan
Yawen Duan
2 years
"What's going on with AI Safety in China?" – a question I've often encountered in the past year. Check out "AI Safety in China" - a newsletter by Concordia AI, sharing insights and updates on China's AI safety and governance landscape![. 1/n
Tweet media one
0
0
4
@TonyWangIV
Tony Wang
2 years
I will be at ICML from 7/24 to 8/2, presenting on this work, learning new things, and just enjoying Hawaii. Hit me up if you wanna chat or hang out~.
@farairesearch
FAR.AI
2 years
This is Lee Sedol in 2016 playing against AlphaGo. Despite a valiant effort, Lee lost. The AI was just too powerful. But, had Lee known about our ICML 2023 paper, Adversarial Policies Beat Superhuman Go AIs, things might have turned out differently!
Tweet media one
1
1
12
@TonyWangIV
Tony Wang
2 years
RT @GreatTA1998: What people sometimes forget is, "reality checks" aren't scarce, but support and encouragement is.
0
1
0
@TonyWangIV
Tony Wang
2 years
A refreshingly novel theory of neural scaling laws just dropped. Fun read. (also seems plausible).
@ericjmichaud_
Eric J. Michaud
2 years
Understanding the origin of neural scaling laws and the emergence of new capabilities with scale is key to understanding what deep neural networks are learning. In our new paper, @tegmark, @ZimingLiu11, @uzpg_ and I develop a theory of neural scaling. 🧵
0
2
5
@TonyWangIV
Tony Wang
2 years
RT @ARGleave: Even SOTA ML systems are vulnerable to adversarial attack. How can we ensure AI systems are safe and beneficial when the syst…
Link preview (alignmentforum.org): Even the most advanced contemporary machine learning systems are vulnerable to adversarial attack. The safety community has often assumed adversarial…
0
3
0
@TonyWangIV
Tony Wang
3 years
RT @moreisdifferent: 1/ Trying to signal boost w a short 🧵. Amateur Go player Kellin Pelrine can consistently beat "KataGo", an AI system t…
0
239
0
@TonyWangIV
Tony Wang
3 years
RT @benlandautaylor: 9/ Just because someone is describing a real problem doesn't mean their solutions are any good. Today most movements w…
0
5
0