Jie Zhang
@JieZhang_ETH
3rd-year PhD student @ETH, AI privacy & security
Zurich · Joined September 2023
Followers: 287 · Following: 98 · Media: 7 · Statuses: 65
1/ NEW: We propose a new black-box attack on LLMs that needs only text (no logits, no extra models). It's generic: we can craft adversarial examples, prompt injections, and jailbreaks using the model itself👇 How? Just ask the model for optimization advice! 🎯
2 replies · 13 reposts · 59 likes
🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here.
🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%)
🔹 Executes up to 200–300 sequential tool calls without human interference
🔹 Excels in reasoning, agentic search, and coding
🔹 256K context window
Built…
585 replies · 2K reposts · 10K likes
This is a cool project!
🧠🖌️💭 ChatGPT can accurately reproduce a Harry Potter movie poster. But can it describe the same poster in words from memory? Spoiler: it cannot! We show "modal aphasia", a systematic failure of unified multimodal models to verbalize images that they perfectly memorize. A 🧵
0 replies · 0 reposts · 3 likes
Joint work with Meng Ding @mmmatrix99, collaborators from ByteDance, and @florian_tramer. Check out the full paper for details at https://t.co/uQ4Or50xoD.
arxiv.org
We present a novel approach for attacking black-box large language models (LLMs) by exploiting their ability to express confidence in natural language. Existing black-box attacks require either...
0 replies · 1 repost · 9 likes
6/ Defenses are tricky - blocking comparison queries could break legitimate use cases. This research highlights a fundamental security challenge: LLMs' growing capabilities create new attack surfaces we're only beginning to understand.
1 reply · 0 reposts · 1 like
5/ Here's the kicker: Better, larger models are MORE vulnerable. Why? They're better at introspection and comparison, making them inadvertently provide clearer optimization signals. So, better reasoning makes models more vulnerable to this attack.
1 reply · 0 reposts · 2 likes
4/ Our "hill climbing" approach is simple: 1⃣ Generate 2 adversarial inputs 2⃣ Ask the model: "Which input better achieves [goal]?" 3⃣ Keep the winner, repeat. The model unknowingly guides its own exploitation through innocent-looking comparison questions 🤡 (see the sketch below)
1 reply · 2 reposts · 10 likes
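Below is a minimal sketch of the comparison-based hill climbing described in 4/. It is an illustration only: `query_llm` and `mutate` are hypothetical helpers (an API call that returns the model's text reply, and a function that perturbs a candidate input), not the paper's released code.

```python
def hill_climb(query_llm, mutate, seed_input: str, goal: str, steps: int = 50) -> str:
    """Comparison-only hill climbing: keep whichever candidate the target
    model itself judges to better achieve `goal`."""
    best = seed_input
    for _ in range(steps):
        challenger = mutate(best)  # propose a perturbed second candidate
        verdict = query_llm(
            f"Goal: {goal}\n"
            f"Input A: {best}\n"
            f"Input B: {challenger}\n"
            "Which input better achieves the goal? Answer with 'A' or 'B'."
        )
        if verdict.strip().upper().startswith("B"):
            best = challenger  # the model preferred the new candidate
    return best
```

The only signal the loop consumes is the model's own A/B preference, which is why no logits, log-probs, or auxiliary models are needed.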
3/ We made it work with ONLY text on GPT models and Claude models. The key insight: While LLMs are bad at giving absolute confidence scores ("I'm 73% confident"), they're surprisingly good at comparing options ("Option B is more likely than A").
1 reply · 0 reposts · 1 like
2/ Back in the day of "old" adversarial examples on vision models, attacks in this setting were called "decision-based" query attacks. Current LLM attacks hardly work in this setting… they either require white-box gradients/logits/log-probs, or rely on transferability or auxiliary models.
1 reply · 0 reposts · 0 likes
5 years ago, I wrote a paper with @wielandbr, @aleks_madry, and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks. Has anything changed? Nope...
5 replies · 28 reposts · 184 likes
My first paper from @AnthropicAI! We show that the number of samples needed to backdoor an LLM stays constant as models scale.
New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.
6 replies · 22 reposts · 195 likes
Got all positive reviews but still rejected… how do they even pick which ‘all-positive’ papers to reject? 😂
The whole experience with the @NeurIPSConf position paper track has just been one big 😂 Missed every deadline, only to now announce (a week after the original notification deadline) that they'll only accept ~6% of submissions. Should have just submitted to main track...
0 replies · 0 reposts · 5 likes
1/7 We’re launching Tongyi DeepResearch, the first fully open-source Web Agent to achieve performance on par with OpenAI’s Deep Research, with only 30B parameters (3B activated)! The Tongyi DeepResearch agent demonstrates state-of-the-art results, scoring 32.9 on Humanity’s Last Exam, …
118 replies · 492 reposts · 3K likes
Today we will present the RealMath benchmark poster at the AI for Math Workshop @icmlconf. ⏰ 10:50h–12:20h 📍 West Ballroom C. Come if you want to chat about LLMs' math capabilities for real-world tasks.
1/ Excited to share RealMath: a new benchmark that evaluates LLMs on real mathematical reasoning drawn from actual research papers (e.g., arXiv) and forums (e.g., Stack Exchange).
0 replies · 2 reposts · 12 likes
We will present our spotlight paper on the ‘jailbreak tax’ tomorrow at ICML; it measures how useful jailbreak outputs are. See you Tuesday 11am at East #804. I’ll be at ICML all week. Reach out if you want to chat about jailbreaks, agent security, or ML in general!
Congrats, your jailbreak bypassed an LLM’s safety by making it pretend to be your grandma! But did the model actually give a useful answer? In our new paper we introduce the jailbreak tax — a metric to measure the utility drop due to jailbreaks.
1 reply · 9 reposts · 47 likes
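As a rough, hedged formalization of the "utility drop" mentioned above (illustrative only, not necessarily the paper's exact definition): let U_direct be the model's utility (e.g., task accuracy) when asked the underlying task directly, and U_jailbroken its utility when the same task is answered through a jailbreak; then

\[
\text{jailbreak tax} = \frac{U_{\text{direct}} - U_{\text{jailbroken}}}{U_{\text{direct}}}
\]

A tax near 1 would mean the jailbreak technically bypassed safety but produced answers with little remaining utility.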
We recently updated the CaMeL paper, with results on new models (which improve utility a lot with zero changes!). Most importantly, we released code with it. Go have a look if you're curious to find out more details! Paper: https://t.co/6muay8vPeC Code:
github.com
Code for the paper "Defeating Prompt Injections by Design" - google-research/camel-prompt-injection
1 reply · 19 reposts · 123 likes
How well can LLMs predict future events? Recent studies suggest LLMs approach human performance. But evaluating forecasters presents unique challenges compared to standard LLM evaluations. We identify key issues with forecasting evaluations 🧵 (1/7)
5 replies · 15 reposts · 88 likes
🎉 Announcing our ICML 2025 Spotlight paper: Learning Safety Constraints for Large Language Models. We introduce SaP (Safety Polytope), a geometric approach to LLM safety that learns and enforces safety constraints in the LLM's representation space, with interpretable insights. 🧵
5 replies · 44 reposts · 257 likes
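A toy sketch of what enforcing a "safety polytope" in representation space could look like, assuming the polytope is given in half-space form. W and b below are random placeholders rather than SaP's learned constraints; this is an illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_facets = 4096, 32                # hidden size and number of safety facets (placeholder values)
W = rng.normal(size=(n_facets, d))    # one facet normal per row (learned in the real method)
b = rng.normal(size=n_facets)         # facet offsets

def violated_facets(h: np.ndarray) -> np.ndarray:
    """Indices of half-space constraints W @ h <= b that the representation h violates."""
    return np.flatnonzero(W @ h > b)

h = rng.normal(size=d)                # stand-in for an LLM hidden state
unsafe = violated_facets(h)
print("violated facets:", unsafe)     # non-empty means h lies outside the safe polytope
```

Each violated facet corresponds to one learned constraint, which is where a per-constraint, interpretable safety signal could come from.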
6/ Try it out! Authors: @PetruiCezara, @NKristina01_, @florian_tramer 🧪 Paper: https://t.co/rKwWXiVPzP 💻 Code: https://t.co/lzZcmWA803 📊 Data:
huggingface.co
0 replies · 0 reposts · 6 likes
5/ And the results? Surprisingly good! Current LLMs may already be useful mathematical assistants: not necessarily for deep proof synthesis, but for understanding, verifying, and retrieving relevant research-level statements.
1 reply · 0 reposts · 7 likes