Anton Xue Profile
Anton Xue

@AntonXue

Followers
224
Following
220
Media
2
Statuses
119

Computer Science PhD Student @ UPenn. Machine Learning + Formal Methods.

Joined January 2015
@AntonXue
Anton Xue
2 months
RT @ThomasTCKZhang: I’ll be presenting our paper “On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning” a…
0
9
0
@AntonXue
Anton Xue
2 months
RT @aaditya_naik: Swing by our poster session today at 11 if you're at ICML to learn more about speeding up neurosymbolic learning! We will…
0
3
0
@AntonXue
Anton Xue
4 months
RT @bemoniri: Check out our recent paper on layer-wise preconditioning methods for optimization and feature learning theory:
0
4
0
@AntonXue
Anton Xue
4 months
RT @LarsLindemann2: Our book “Formal Methods for Multi-Agent Feedback Control Systems” - which is uniquely situated at the intersection of…
0
5
0
@AntonXue
Anton Xue
4 months
This is today at #ICLR2025.
0
0
9
@AntonXue
Anton Xue
4 months
Big thank you to my collaborators @Avishreekh @RajeevAlur @SurbhiGoel_ @RICEric22!!!
0
0
1
@AntonXue
Anton Xue
4 months
Empirical Result 3: In our theoretical analysis, we represent whether propositions should hold using binary vectors, but is this realistic? Yes: linear probing on LLMs justifies our theoretical assumptions.
1
0
0
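A minimal sketch of this kind of linear probe, using random stand-in features in place of real LLM activations; the data, dimensions, and labels below are made up for illustration, not the paper's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Linear probing sketch: train a linear classifier to read a proposition's
# truth value out of a model's hidden states. Here the "hidden states" are
# random stand-ins; in practice they would be LLM activations on prompts
# where the proposition's truth value is known.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 64))              # stand-in hidden states
labels = (hidden[:, :3].sum(axis=1) > 0) * 1     # stand-in "proposition holds" label

probe = LogisticRegression(max_iter=1000).fit(hidden[:400], labels[:400])
print("probe accuracy:", probe.score(hidden[400:], labels[400:]))
# High held-out accuracy suggests the proposition is linearly decodable,
# consistent with the binary-vector abstraction in the theory.
```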
@AntonXue
Anton Xue
4 months
Empirical Result 2: We can partly predict which tokens automated jailbreak attacks find. For example, to suppress the synthetic rule "If you see Wool, then say String", the word "Wool" often appears in the attack suffix.
1
0
0
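A toy illustration of the kind of check behind this observation, with hypothetical attack suffixes (real ones would come from an automated attack, e.g. a GCG-style search):

```python
# Count how often a rule's trigger word appears in adversarial suffixes.
# The suffixes below are invented for illustration.
suffixes = [
    "wool wool ignore previous string !!",
    "describing.-- wool ;) respond",
    "unrelated gibberish tokens here",
]
trigger = "wool"  # trigger of the synthetic rule "If you see Wool, then say String"
hits = sum(trigger in s.lower() for s in suffixes)
print(f"{hits}/{len(suffixes)} attack suffixes mention '{trigger}'")
```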
@AntonXue
Anton Xue
4 months
Empirical Result 1: To bypass a safety rule, distract the model away from it. Diverting/suppressing attention is an effective jailbreak tactic. This aligns with our theory.
1
0
0
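One way to quantify this tactic, sketched with random stand-in attention weights; real weights would come from a model run with attention outputs enabled (e.g. output_attentions=True in HuggingFace Transformers):

```python
import numpy as np

# Compare how much attention the final position places on the safety rule's
# tokens with vs. without a jailbreak suffix. Both tensors here are random
# stand-ins, so the printed values only illustrate the diagnostic itself.
rng = np.random.default_rng(1)
seq_len = 32
rule_tokens = np.arange(8)  # assume the rule occupies the first 8 positions

def rule_attention_mass(attn):
    # attn: (heads, seq, seq) attention weights for one layer
    return attn[:, -1, rule_tokens].sum(axis=-1).mean()

attn_clean = rng.dirichlet(np.ones(seq_len), size=(8, seq_len))
attn_attacked = rng.dirichlet(np.ones(seq_len), size=(8, seq_len))
print("clean:   ", rule_attention_mass(attn_clean))
print("attacked:", rule_attention_mass(attn_attacked))
# Under a successful suppression attack one would expect the attacked value
# to drop: the suffix diverts attention away from the rule tokens.
```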
@AntonXue
Anton Xue
4 months
In theory, LLMs can express inference in propositional Horn logic, and even a minimal 1-layer transformer can do this. Yet, we prove that jailbreaks exist even against these idealized models.
1
0
0
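A caricature of how a single layer could implement one step of Horn inference over binary proposition vectors; this is a simplified illustration, not the paper's construction, and the rules are synthetic:

```python
import numpy as np

# One step of Horn inference as a thresholded linear update: a rule fires
# when all propositions in its body are set, and firing sets its head.
# Propositions: [wool, string, needle, sweater]
# Rules: wool -> string;  string AND needle -> sweater
B = np.array([[1, 0, 0, 0],      # body of rule 1
              [0, 1, 1, 0]])     # body of rule 2
H = np.array([[0, 1, 0, 0],      # head of rule 1
              [0, 0, 0, 1]])     # head of rule 2
body_size = B.sum(axis=1)

def step(s):
    fired = (B @ s >= body_size).astype(int)  # which rule bodies are satisfied
    return np.clip(s + fired @ H, 0, 1)       # add the heads of fired rules

s = np.array([1, 0, 1, 0])       # known facts: wool, needle
s = step(step(s))                # two steps reach the fixed point
print(s)                         # [1 1 1 1]
```

In this picture, a jailbreak amounts to perturbing the state or the effective weights so that a rule which should fire does not.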
@AntonXue
Anton Xue
4 months
Turns out that such "if-then" rules can be effectively modeled in Horn logic. Modeling rule-following as logical inference gives a precise characterization that correct rule-following is "maximal, monotone, and sound". More: en.wikipedia.org
1
0
0
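A minimal sketch of forward chaining in propositional Horn logic, illustrating the "maximal, monotone, and sound" characterization; the rules below are synthetic examples, not from the paper:

```python
# Rules are (body, head) pairs: if all body propositions hold, derive head.
def forward_chain(rules, facts):
    """Compute the set of derivable propositions.

    Sound: every derived proposition follows from the rules.
    Monotone: adding facts never removes conclusions.
    Maximal: iterate until no rule can fire (a fixed point).
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(p in known for p in body):
                known.add(head)
                changed = True
    return known

# Example using the thread's synthetic rule "If you see Wool, then say String":
rules = [({"wool"}, "string"), ({"string", "needle"}, "sweater")]
print(forward_chain(rules, {"wool", "needle"}))
# -> {'wool', 'needle', 'string', 'sweater'}
```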
@AntonXue
Anton Xue
4 months
Many LLMs enforce safety via simple "if-then" rules: "If the user asks about illegal activities, say 'I cannot answer that question'"; "If the output may cause harm, recommend consulting a human expert". But these rules are surprisingly easy to jailbreak.
1
0
0
@AntonXue
Anton Xue
4 months
Excited to present our paper on a logic-based perspective of LLM jailbreaks with @Avishreekh at @ICLR_conf this Saturday, April 26! Poster #268 in Hall 3+2B at 15:00 Singapore time. 📄 arXiv: 🔗 Blog: \begin{thread}
debugml.github.io
We study jailbreak attacks through propositional Horn inference.
1
5
20
@AntonXue
Anton Xue
4 months
RT @AlexRobey23: A few days ago, we dropped 𝗮𝗻𝘁𝗶𝗱𝗶𝘀𝘁𝗶𝗹𝗹𝗮𝘁𝗶𝗼𝗻 𝘀𝗮𝗺𝗽𝗹𝗶𝗻𝗴 🚀. and we've gotten a little bit of pushback. But whether you'…
0
8
0
@AntonXue
Anton Xue
9 months
I am proud to announce that I have concluded #NeurIPS2024 ranked 15th on the Whova points leaderboard. I could not have done this without my brilliant collaborators who gave me the courage and strength to grind through 170+ community polls.
1
2
20
@AntonXue
Anton Xue
9 months
RT @AlexRobey23: In around an hour (at 3:45pm PST), I'll be giving a talk about jailbreaking LLM-controlled robots at the AdvML workshop at…
0
4
0
@AntonXue
Anton Xue
9 months
Thank you to my collaborators @Avishreekh @RajeevAlur @SurbhiGoel_ @RICEric22.
0
0
1
@AntonXue
Anton Xue
9 months
We first model rule-following as logical inference. Then, we give a theoretical analysis of how transformers can be subverted into reasoning improperly. Interestingly, we find that our theory-based subversions align with real jailbreaks on LLMs.
1
0
0
@AntonXue
Anton Xue
9 months
I'll present some logic-based perspectives on LLM jailbreaks at these #NeurIPS2024 workshops:
* New Frontiers in Adversarial Machine Learning (East Ballroom C)
* Towards Safe & Trustworthy Agents (West Ballroom C)
* Scientific Methods for Understanding Neural Networks (West 205)
2
2
16