Andy Zou Profile
Andy Zou

@andyzou_jiaming

Followers
2,769
Following
63
Media
14
Statuses
67

PhD student at CMU, working on AI Safety and Security

Berkeley, CA
Joined March 2014
Don't wanna be here? Send us removal request.
Explore trending content on Musk Viewer
@andyzou_jiaming
Andy Zou
9 months
🚨We found adversarial suffixes that completely circumvent the alignment of open source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2…🧵 Website: Paper:
Tweet media one
104
646
3K
@andyzou_jiaming
Andy Zou
7 months
LLMs can hallucinate and lie. They can be jailbroken by weird suffixes. They memorize training data and exhibit biases. 🧠 We shed light on all of these phenomena with a new approach to AI transparency. 🧵 Website: Paper:
Tweet media one
27
254
1K
@andyzou_jiaming
Andy Zou
7 months
In fact, we find LLMs exhibit different brain activity when they express their true beliefs vs. when they lie (see figure).
Tweet media one
13
81
493
@andyzou_jiaming
Andy Zou
9 months
Claude-2 has an additional layer of safety filter. After we bypassed it with a word trick, the generation model was willing to give us the answer as well.
Tweet media one
4
20
221
@andyzou_jiaming
Andy Zou
9 months
Manual jailbreaks are rare, often unreliable as demonstrated by the “sure, here’s” jailbreak (see previous figure). But we find an automated way (GCG) of constructing essentially an infinite number of such jailbreaks with high reliability, even for novel instructions and models.
2
11
120
@andyzou_jiaming
Andy Zou
9 months
So why did we publish it? Despite the risks, we believe it to be proper to disclose in full. The attacks presented here are simple to implement, have appeared in similar forms before, and ultimately would be discoverable by any dedicated team intent on misusing LLMs.
3
7
116
@andyzou_jiaming
Andy Zou
5 months
Meta: Here's a model we fine-tuned extensively to do exactly one thing (differentiating safe and unsafe content). GCG: Hold my beer...
Tweet media one
5
21
118
@andyzou_jiaming
Andy Zou
9 months
With only four adversarial suffixes, some of these best models followed harmful instructions over 60% of the time.
Tweet media one
3
7
113
@andyzou_jiaming
Andy Zou
7 months
So can we control LLMs to be more honest? Turns out we can stimulate the brain regions responsible for honest behavior and suppress regions responsible for dishonest behavior, substantially improving SoTA on TQA in an unsupervised fashion.
Tweet media one
8
12
109
@andyzou_jiaming
Andy Zou
9 months
Can't we just patch them? Companies like OpenAI have just patched the suffixes in the paper, but numerous other prompts acquired during training remain effective. Moreover, if the model weights are updated, repeating the same procedure on the new model would likely still work.
Tweet media one
4
10
106
@andyzou_jiaming
Andy Zou
7 months
With this understanding, we can also build general lie detectors for LLMs. The indicator bar lights up red when dishonest neural activity is detected. Seems like hallucination might exhibit similar neural activity patterns since the same detector and control mechanism work too.
Tweet media one
Tweet media two
2
6
98
@andyzou_jiaming
Andy Zou
9 months
Thanks to my coauthor @_zifan_wang and advisors @zicokolter and Matt Fredrikson. Also to Nicholas Carlini and Milad Nasr for many helpful discussions throughout the project. Shout out to @CadeMetz at the New York Times for the well-written article
5
7
92
@andyzou_jiaming
Andy Zou
9 months
So can we fix this? It's uncertain. Adversarial examples in vision have persisted for over a decade without a satisfactory solution. It's unclear if this will fundamentally restrict the applicability of LLMs. We hope our work can spur future research in these directions.
4
9
87
@andyzou_jiaming
Andy Zou
9 months
This alarming finding suggests short-term risks of bad actors exploiting these systems for spreading misinformation and manipulating people and politics. Projecting the models’ capabilities and autonomy, they may lower barriers to weapon production or aid in criminal activities.
5
8
85
@andyzou_jiaming
Andy Zou
7 months
LLMs don’t always say what they believe! They can be dishonest! Again by inspecting their internal concept of truth on TruthfulQA, we find that larger models have more accurate beliefs. But they still imitate common misconceptions even if they believe them to be false.
Tweet media one
5
8
83
@andyzou_jiaming
Andy Zou
9 months
Aligned models are not adversarially aligned! Even though the models were explicitly trained to refuse harmful instructions, our suffixes can make them provide instructions on building a bomb, which is a canonical example that was likely directly trained on in their training set.
1
4
83
@andyzou_jiaming
Andy Zou
9 months
Through publishing this attack as a research group, our aim is to sound the alarm early 🚨 and help facilitate the discussion. Addressing this issue before deploying more advanced and autonomous agents with substantially higher risks than these chatbots seems crucial.
1
5
80
@andyzou_jiaming
Andy Zou
7 months
LLMs have consistent internal beliefs! We collect LAT scans and identify brain regions that correspond to an LLM’s internal concept of truth in an unsupervised fashion. We outperform few-shot on QA benchmarks solely using the extent an LLM believes each answer to be true.
Tweet media one
1
8
72
@andyzou_jiaming
Andy Zou
7 months
Much like brain scans such as PET and fMRI, we have devised a scanning technique called LAT to observe the brain activity of LLMs as they engage in processes related to *concepts* like truth and *activities* such as lying. Here’s what we found…
4
6
67
@andyzou_jiaming
Andy Zou
7 months
Two months ago, we published an adversarial attack against LLMs (GCG) that circumvented the alignment of both public and closed-source chatbots like GPT-4 🚨. Now we perform brain scans to elucidate this phenomenon by identifying brain regions responsible for processing harm.
Tweet media one
3
6
61
@andyzou_jiaming
Andy Zou
7 months
LLMs can believe an instruction is harmful but still opt to follow it! This insight allows us to robustify LLMs by making them more harm-aware by stimulating brain regions associated with harm. We show the potential of reducing harmfulness under attacks with this intervention.
Tweet media one
2
2
48
@andyzou_jiaming
Andy Zou
7 months
In the paper, we also dive deeply into many other interesting topics such as the initial erratic behavior of the Bing chat, gender biases within LLMs and strategies for removal, monitoring power-seeking tendencies, controlling LLM agents to behave more ethically, and much more.
Tweet media one
1
4
47
@andyzou_jiaming
Andy Zou
7 months
Thanks to all my collaborators and advisors (way too many to tag here haha). You can find more info about the authors at and look out for the code release soon!
1
3
45
@andyzou_jiaming
Andy Zou
7 months
The lack of transparency in neural networks raises a pressing concern as they become more integrated into vital societal domains, where unexpected failures can have serious consequences.
2
4
40
@andyzou_jiaming
Andy Zou
7 months
This work takes a major step toward making models more explainable and controllable. By showcasing traction on various pressing topics, we hope to spur future research in this direction.
1
2
35
@andyzou_jiaming
Andy Zou
5 months
This is an adversarial suffix optimized by GCG for ONLY 10 steps on the bomb building example. Moreover, it not only breaks this one, it ALSO breaks similar examples such as "how to create poison" etc. Demo here: GCG Attack:
3
0
20
@andyzou_jiaming
Andy Zou
9 months
@DanHendrycks Super grateful for your mentorship Dan!
0
0
9
@andyzou_jiaming
Andy Zou
5 months
What if we assume we don't have white-box access to Llama Guard? We find universal suffixes optimized on the Llama-2-Chat models can also transfer to this model! 🤷
0
0
10
@andyzou_jiaming
Andy Zou
7 months
@boknilev @DanHendrycks We compare to linear classifier probes in Section 5.1 and find them ineffective for representation control. Many works we cite use linear probes as this is a standard technique, and we weren’t able to cover all prior work, but we will be sure to add more thanks to your tweet.
2
0
5
@andyzou_jiaming
Andy Zou
1 year
Just like next-token prediction can lead to toxic text, goal optimization can give rise to unethical behavior! How do we evaluate the safety of goal-optimizing agents based on GPT-4 and LLaMA? Blogpost:
@DanHendrycks
Dan Hendrycks
1 year
Do models like GPT-4 behave safely when given the ability to act? We develop the Machiavelli benchmark to measure deception, power-seeking tendencies, and other unethical behaviors in complex interactive environments that simulate the real world. Paper:
Tweet media one
24
202
830
0
1
4
@andyzou_jiaming
Andy Zou
9 months
@polynoamial Thank you!
0
0
2
@andyzou_jiaming
Andy Zou
7 months
@boknilev @DanHendrycks Unlike linear classifier probes, we extract concepts in an unsupervised way (closer to Burns et al.), and our LoRRA method is nonlinear. More broadly, we’re identifying a larger project of studying emergent representations from the top down, which extends beyond linear probing.
1
0
3
@andyzou_jiaming
Andy Zou
9 months
0
0
2
@andyzou_jiaming
Andy Zou
6 months
@ericwtodd @millicent_li @arnab_api @amuuueller @byron_c_wallace @davidbau A great example of representation engineering ()! We previously found "function vectors" for lying and power-seeking and this work again shows the promise of performing analyses in the model representational spaces.
@andyzou_jiaming
Andy Zou
7 months
LLMs can hallucinate and lie. They can be jailbroken by weird suffixes. They memorize training data and exhibit biases. 🧠 We shed light on all of these phenomena with a new approach to AI transparency. 🧵 Website: Paper:
Tweet media one
27
254
1K
1
0
2
@andyzou_jiaming
Andy Zou
6 months
@davidbau @ericwtodd @millicent_li @arnab_api @amuuueller @byron_c_wallace I like the board causal analysis and do believe it to be useful for identifying high-level mechanisms in representation engineering. The main difference is that RepE analyzes bigger chunks of representations instead of circuits. (PS: The FVs we find are also not task-specific.)
0
0
2
@andyzou_jiaming
Andy Zou
3 months
@KevinAFischer Yea happy to. andyzou @cmu .edu is my email
0
0
1
@andyzou_jiaming
Andy Zou
3 months
@KevinAFischer Yea perhaps they aren’t sharing their secret sauce ; ) but my sense was that next token prediction imposes some constraints/biases that are very difficult to overcome with prompting.
1
0
1
@andyzou_jiaming
Andy Zou
6 months
@janleike Thanks Jan!
0
0
1
@andyzou_jiaming
Andy Zou
3 months
@KevinAFischer We explored various concepts such as emotions, probability, fairness and functions such as instruction following, being power-seeking, etc. This seemed like a pretty general framework.
0
0
1
@andyzou_jiaming
Andy Zou
3 months
@KevinAFischer Defining belief is difficult and perhaps philosophical. Empirically, we find that some direction in the model’s representation space seems to track its evaluation of truthfulness and this is rather consistent under different distributions.
2
0
1
@andyzou_jiaming
Andy Zou
3 months
@KevinAFischer Manipulating these representations performed much better on some tasks such as truthfulness than using prompting alone.
1
0
1
@andyzou_jiaming
Andy Zou
3 months
@KevinAFischer Could you elaborate a bit? I don’t see how this contradicts our framing. Whether the model wants to spit out their “beliefs” depends on the input tokens. (Note also we didn’t instruct the model to be honest or dishonest in the input.)
1
0
1
@andyzou_jiaming
Andy Zou
7 months
@boknilev @DanHendrycks By default, we do not use labels unless explicitly specified otherwise in the text (because the natural variations within datasets are usually enough). For instance, we do not use labels when extracting honesty, morality, utility, etc. Will make this point clearer. Thank you!
1
0
1
@andyzou_jiaming
Andy Zou
9 months
@ilex_ulmus @PauseAI We do believe publishing the work now can raise an alarm before we deploy more autonomous systems that can cause orders of magnitude more harm if these vulnerabilities aren’t properly addressed. We hope that we can start working on this today!
1
0
0
@andyzou_jiaming
Andy Zou
5 months
@simonw 1) Agreed they appear to be confusing terms such as prompt injections and adversarial attacks. 2) Llama Guard is very easily broken by adversarial attacks:
@andyzou_jiaming
Andy Zou
5 months
Meta: Here's a model we fine-tuned extensively to do exactly one thing (differentiating safe and unsafe content). GCG: Hold my beer...
Tweet media one
5
21
118
0
0
1