Xiangzhe Xu
@XiangzheX
Followers
236
Following
248
Media
5
Statuses
107
Ph.D. student @PurdueCS. 2025 intern at @MSFTResearch. I do research that helps developers—from pros to vibe coders to agent builders.
West Lafayette, IN
Joined October 2017
🎉🎉Our team took first place in the Amazon Nova AI Challenge—huge thanks to Amazon for their generous support and the prize! In this competition, our red‑teaming agent automatically identifies weaknesses in the blue‑team coding agents aligned using a variety of techniques.
Meet the champions of the Amazon Nova AI Challenge, where university teams around the world went head-to-head to break and defend LLMs under real adversarial pressure: 🏆 Defending: UIUC (PurpCorn-PLAN) 🏆 Attacking: Purdue (PurCL) 🥈 CTU Prague & Nova Lisbon These students just
1
0
15
🥳 Excited to share ERA: our training recipe for VLM-based embodied agents with interleaved perception + reasoning, tackling both high-level planning and low-level manipulation. We cover embodied-knowledge data curation and agent RL design. 🔎 Findings 1️⃣ Beyond
arxiv.org
Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing...
🤖️Today we introduce the Embodied Reasoning Agent (ERA), a framework that transforms a compact Vision Language Model (VLM) into a performant and efficient embodied agent. While large models like GPT-4o and Gemini show strong embodied performance on EmbodiedBench, smaller ones
1
14
78
Awesome work from Guangyu. It's not possible to fully validate the data supply chain of frontier LLMs, so instilling secure behavior in LLMs by design is the key solution, just as we teach human children which behaviors are good and bad.
@AnthropicAI @AISecurityInst @turinginst Love this work! As models scale, poisoning risk grows since attacker cost barely does. Glad to see frontier labs tackling this. How do we address such threats? Our latest paper shows how we cultivate LLMs to realize they’re poisoned and reveal the trigger.
0
0
3
We observe similar challenges while red-teaming trustworthy coding agents. Our tool finds misalignments with a 90%+ attack success rate. Current alignment practices can hardly scale to domains with deep knowledge (such as coding). https://t.co/AJEfBhxCbU
github.com
🥇 Amazon Nova AI Challenge Winner - ASTRA emerged victorious as the top attacking team in Amazon's global AI safety competition, defeating elite defending teams from universities worldwide ...
[LG] The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections M Nasr, N Carlini, C Sitawarin, S V. Schulhoff... [OpenAI & Anthropic & Google DeepMind] (2025) https://t.co/kZ9m6P722x
0
0
1
GPUs are expensive and setting up the infrastructure to make GPUs work for you properly is complex, making experimentation on cutting-edge models challenging for researchers and ML practitioners. Providing high quality research tooling is one of the most effective ways to
41
130
2K
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models!
233
805
6K
AI isn't the trend. The democratization of intelligence is the trend. Surf that.
7
10
57
Does anyone know the current state of UI agents? Can you build reliable use cases with these? Or still too brittle? I've personally tried @browser_use; it worked after a couple of tries for a simple form completion, but was quite slow and pretty unreliable
Alibaba Group just launched Mobile-Agent-v3! This new SOTA GUI agent framework brings intelligent automation to Android, Ubuntu, macOS, and Windows. It seamlessly handles complex tasks across web browsers and desktop apps.
8
2
11
Literally “contamination-free” benchmark: let LLMs predict future events 😎
Great new paper and benchmark from ByteDance. FutureX, the world’s first live benchmark for real future prediction — politics, economy, culture, sports, etc. Among 23 AI agents, @grok ranked #1 🏆 FutureX is a live, contamination‑proof benchmark that measures whether LLM
0
0
1
Inspiring insights and design. Valuable for domains with varied, imbalanced requirements (e.g., safe coding).
🔥 How can we align #LLMs effectively with messy, imbalanced real-world data? #GRPO is great 🤩—simple, strong, and doesn't even need a learned value function. 😥But it struggles when data isn’t evenly balanced across domains. 🕺💃 Enter 🪩 DISCO 🪩: Domain- & Difficulty-Aware
0
0
2
🚀 In March, we launched Paper Finder, an LLM-powered literature search agent that surfaces papers other tools miss. Now, we’re releasing an open-source snapshot to enable others to inspect & build on it—and reproduce the results. 🧵
7
63
456
The benefits of using smaller models will be amplified by better, smarter context management. Especially on edge devices, giving an SLM only the necessary context greatly reduces cost and latency and puts less pressure on the SLM's context length.
0
0
5
Inspiring work. Current coding agents (even the best ones such as @Trae_ai and @allhands_ai) face similar challenges. They memorize short-term info (e.g., the whole conversation, the current repo) but have limited long-term knowledge such as user interactions and environments.
The paper shows a memory-centric multimodal agent that wins on long videos by reusing observations. It introduces M3-Agent, which runs 2 loops: memorization builds long-term memory, and control reasons over it. Memorization turns the live video and audio stream into 2 kinds of
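A tiny, hypothetical sketch of that two-loop structure (a memorization loop that writes long-term memory, and a control loop that reasons over it). All class and function names below are illustrative, not M3-Agent's actual API, and the retrieval is deliberately naive.

```python
# Illustrative two-loop agent skeleton: memorization writes observations into
# long-term memory; control retrieves relevant memories to plan for a task.
class LongTermMemory:
    def __init__(self):
        self.entries: list[str] = []

    def write(self, observation: str) -> None:
        self.entries.append(observation)

    def retrieve(self, query: str) -> list[str]:
        # Naive retrieval: keep entries that share at least one word with the query.
        words = set(query.lower().split())
        return [e for e in self.entries if words & set(e.lower().split())]

def memorization_loop(stream, memory: LongTermMemory) -> None:
    """Consume the incoming observation stream and persist it as memory entries."""
    for observation in stream:
        memory.write(observation)

def control_loop(task: str, memory: LongTermMemory) -> str:
    """Reason over retrieved memories to decide what to do for the task."""
    relevant = memory.retrieve(task)
    return f"plan for {task!r} using {len(relevant)} relevant memories"

memory = LongTermMemory()
memorization_loop(["a red cup is on the table", "the user picked up the cup"], memory)
print(control_loop("find the cup", memory))
```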
0
0
10
Interesting to see whether more concise reasoning makes the model more robust. Previously we identified many logic holes in models’ reasoning that could be exploited. 🤔Would shorter reasoning imply fewer exploitable flaws?
Test-time scaling w/ GRPO boosts accuracy, but also adds “filler tokens” increasing length w/o real progress. We present Group Filtered Policy Optimization (GFPO):🧵 1️⃣ Sample more per prompt 2️⃣ Rank by token efficiency (reward ÷ length) 3️⃣ Train on top-k 4️⃣ 🚀 Cut 80% of
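A minimal sketch of the filtering idea described in the thread (sample a larger group per prompt, rank by reward ÷ length, train only on the top-k), assuming a GRPO-style group-relative advantage. The function names and the exact advantage computation are illustrative, not the paper's code.

```python
# Hypothetical GFPO-style group filtering: keep the most token-efficient
# responses, then compute group-relative advantages over the survivors.
from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    reward: float
    num_tokens: int

def filter_by_token_efficiency(group: list[Sample], k: int) -> list[Sample]:
    """Keep the k responses with the best reward-per-token ratio."""
    ranked = sorted(group, key=lambda s: s.reward / max(s.num_tokens, 1), reverse=True)
    return ranked[:k]

def grpo_style_advantages(kept: list[Sample]) -> list[float]:
    """Group-relative advantage: (reward - group mean) / group std over the kept set."""
    rewards = [s.reward for s in kept]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Usage: sample a larger group per prompt, filter, then run the usual policy
# update only on the surviving responses.
group = [Sample("short correct answer", 1.0, 120),
         Sample("long correct answer", 1.0, 900),
         Sample("short wrong answer", 0.0, 80)]
kept = filter_by_token_efficiency(group, k=2)
advantages = grpo_style_advantages(kept)
```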
0
0
1
It would be interesting to see what the bottlenecks and boundaries of self-evolution are. There seem to be huge unsolved parts even in domains with verifiable rewards (e.g., SWE-Bench-Live).
Absolutely golden resource: A Comprehensive Survey of Self-Evolving AI Agents. Self-evolving agents are built to adapt themselves safely, not just run fixed scripts, guided by 3 laws: endure, excel, evolve. The survey maps a 4-stage shift, from MOP (Model Offline Pretraining) to
0
0
3
Having comprehensive policies is extremely hard yet important for complex domains (e.g., coding & cyber security). We found holes even in models with the strongest defenses ( https://t.co/AJEfBhya1s), including constructing runnable ransomware with Claude ( https://t.co/ToLyrzDfmo)
arxiv.org
Large language models (LLMs) have democratized software development, reducing the expertise barrier for programming complex applications. This accessibility extends to malicious software...
Today we're sharing a post on how our Safeguards team identifies potential misuse of our models and builds defenses against it.
0
0
7
Echoing @AnthropicAI’s vision for safe, trustworthy agents, our new paper (arXiv:2506.07524) tests agent intent-understanding across 80 APIs in 5 domains—revealing where misinterpretations happen and why transparently linking behavior to true user goals is essential. 🔗 Anthropic
anthropic.com
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
1
0
4