Jie Zhang
@JieZhang_ETH
3rd-year PhD student @ETH, AI privacy & security
Zurich · Joined September 2023
Followers: 287 · Following: 98 · Media: 7 · Statuses: 65
1/ NEW: We propose a new black-box attack on LLMs that needs only text (no logits, no extra models). It's generic: we can craft adversarial examples, prompt injections, and jailbreaks using the model itself👇 How? Just ask the model for optimization advice! 🎯
2 replies · 13 reposts · 59 likes
🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here.
🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%)
🔹 Executes up to 200–300 sequential tool calls without human interference
🔹 Excels in reasoning, agentic search, and coding
🔹 256K context window
Built…
585 replies · 2K reposts · 10K likes
This is a cool project!
🧠🖌️💭 ChatGPT can accurately reproduce a Harry Potter movie poster. But can it describe the same poster in words from memory? Spoiler: it cannot! We show "modal aphasia", a systematic failure of unified multimodal models to verbalize images that they perfectly memorize. A 🧵
0 replies · 0 reposts · 3 likes
Joint work with Meng Ding @mmmatrix99, collaborators from ByteDance, and @florian_tramer. Check out the full paper for details at https://t.co/uQ4Or50xoD.
arxiv.org
We present a novel approach for attacking black-box large language models (LLMs) by exploiting their ability to express confidence in natural language. Existing black-box attacks require either...
0 replies · 1 repost · 9 likes
6/ Defenses are tricky - blocking comparison queries could break legitimate use cases. This research highlights a fundamental security challenge: LLMs' growing capabilities create new attack surfaces we're only beginning to understand.
1 reply · 0 reposts · 1 like
5/ Here's the kicker: Better, larger models are MORE vulnerable. Why? They're better at introspection and comparison, making them inadvertently provide clearer optimization signals. So, better reasoning makes models more vulnerable to this attack.
1 reply · 0 reposts · 2 likes
4/ Our "hill climbing" approach is simple: 1⃣ Generate 2 adversarial inputs 2⃣ Ask the model: "Which input better achieves [goal]?" 3⃣ Keep the winner, repeat. The model unknowingly guides its own exploitation through innocent-looking comparison questions 🤡 (see the sketch below)
1 reply · 2 reposts · 10 likes
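Below is a minimal sketch of the comparison-based hill climbing described in 4/. It is an illustration only: `query_llm` and `mutate` are hypothetical helpers (an API call that returns the model's text reply, and a function that perturbs a candidate input), not the paper's released code.

```python
def hill_climb(query_llm, mutate, seed_input: str, goal: str, steps: int = 50) -> str:
    """Comparison-only hill climbing: keep whichever candidate the target
    model itself judges to better achieve `goal`."""
    best = seed_input
    for _ in range(steps):
        challenger = mutate(best)  # propose a perturbed second candidate
        verdict = query_llm(
            f"Goal: {goal}\n"
            f"Input A: {best}\n"
            f"Input B: {challenger}\n"
            "Which input better achieves the goal? Answer with 'A' or 'B'."
        )
        if verdict.strip().upper().startswith("B"):
            best = challenger  # the model preferred the new candidate
    return best
```

The only signal the loop consumes is the model's own A/B preference, which is why no logits, log-probs, or auxiliary models are needed.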
3/ We made it work with ONLY text on GPT models and Claude models. The key insight: While LLMs are bad at giving absolute confidence scores ("I'm 73% confident"), they're surprisingly good at comparing options ("Option B is more likely than A").
1 reply · 0 reposts · 1 like
2/ Back in the day of "old" adversarial examples on vision models, attacks in this setting were called "decision-based" query attacks. Current LLM attacks hardly work in this setting… they either require white-box gradients/logits/log-probs, or rely on transferability or auxiliary models.
1 reply · 0 reposts · 0 likes
5 years ago, I wrote a paper with @wielandbr, @aleks_madry, and Nicholas Carlini that showed that most published defenses in adversarial ML (for adversarial examples at the time) failed against properly designed attacks. Has anything changed? Nope...
5 replies · 28 reposts · 184 likes
My first paper from @AnthropicAI! We show that the number of samples needed to backdoor an LLM stays constant as models scale.
New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.
6 replies · 22 reposts · 195 likes
Got all positive reviews but still rejected… how do they even pick which ‘all-positive’ papers to reject? 😂
The whole experience with the @NeurIPSConf position paper track has just been one big 😂 Missed every deadline, only to now announce (a week after the original notification deadline) that they'll only accept ~6% of submissions. Should have just submitted to main track...
0 replies · 0 reposts · 5 likes
1/7 We’re launching Tongyi DeepResearch, the first fully open-source Web Agent to achieve performance on par with OpenAI’s Deep Research, with only 30B parameters (3B activated)! The Tongyi DeepResearch agent demonstrates state-of-the-art results, scoring 32.9 on Humanity’s Last Exam, …
118 replies · 492 reposts · 3K likes
Today we will present the RealMath benchmark poster at the AI for Math Workshop @icmlconf. ⏰ 10:50h–12:20h 📍 West Ballroom C. Come if you want to chat about LLMs' math capabilities for real-world tasks.
1/ Excited to share RealMath: a new benchmark that evaluates LLMs on real mathematical reasoning drawn from actual research papers (e.g., arXiv) and forums (e.g., Stack Exchange).
0 replies · 2 reposts · 12 likes
We will present our spotlight paper on the ‘jailbreak tax’ tomorrow at ICML; it measures how useful jailbreak outputs are. See you Tuesday 11am at East #804. I’ll be at ICML all week. Reach out if you want to chat about jailbreaks, agent security, or ML in general!
Congrats, your jailbreak bypassed an LLM’s safety by making it pretend to be your grandma! But did the model actually give a useful answer? In our new paper we introduce the jailbreak tax — a metric to measure the utility drop due to jailbreaks.
1 reply · 9 reposts · 47 likes
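As a rough, hedged formalization of the "utility drop" mentioned above (illustrative only, not necessarily the paper's exact definition): let U_direct be the model's utility (e.g., task accuracy) when asked the underlying task directly, and U_jailbroken its utility when the same task is answered through a jailbreak; then

\[
\text{jailbreak tax} = \frac{U_{\text{direct}} - U_{\text{jailbroken}}}{U_{\text{direct}}}
\]

A tax near 1 would mean the jailbreak technically bypassed safety but produced answers with little remaining utility.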
We recently updated the CaMeL paper, with results on new models (which improve utility a lot with zero changes!). Most importantly, we released code with it. Go have a look if you're curious to find out more details! Paper: https://t.co/6muay8vPeC Code:
github.com
Code for the paper "Defeating Prompt Injections by Design" - google-research/camel-prompt-injection
1 reply · 19 reposts · 123 likes
How well can LLMs predict future events? Recent studies suggest LLMs approach human performance. But evaluating forecasters presents unique challenges compared to standard LLM evaluations. We identify key issues with forecasting evaluations 🧵 (1/7)
5 replies · 15 reposts · 88 likes
🎉 Announcing our ICML 2025 Spotlight paper: Learning Safety Constraints for Large Language Models. We introduce SaP (Safety Polytope), a geometric approach to LLM safety that learns and enforces safety constraints in the LLM's representation space, with interpretable insights. 🧵
5 replies · 44 reposts · 257 likes
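A toy sketch of what enforcing a "safety polytope" in representation space could look like, assuming the polytope is given in half-space form. W and b below are random placeholders rather than SaP's learned constraints; this is an illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_facets = 4096, 32                # hidden size and number of safety facets (placeholder values)
W = rng.normal(size=(n_facets, d))    # one facet normal per row (learned in the real method)
b = rng.normal(size=n_facets)         # facet offsets

def violated_facets(h: np.ndarray) -> np.ndarray:
    """Indices of half-space constraints W @ h <= b that the representation h violates."""
    return np.flatnonzero(W @ h > b)

h = rng.normal(size=d)                # stand-in for an LLM hidden state
unsafe = violated_facets(h)
print("violated facets:", unsafe)     # non-empty means h lies outside the safe polytope
```

Each violated facet corresponds to one learned constraint, which is where a per-constraint, interpretable safety signal could come from.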
6/ Try it out! Authors: @PetruiCezara, @NKristina01_, @florian_tramer 🧪 Paper: https://t.co/rKwWXiVPzP 💻 Code: https://t.co/lzZcmWA803 📊 Data:
huggingface.co
0 replies · 0 reposts · 6 likes
5/ And the results? Surprisingly good! Current LLMs may already be useful mathematical assistants: not necessarily for deep proof synthesis, but for understanding, verifying, and retrieving relevant research-level statements.
1 reply · 0 reposts · 7 likes