Xiangzhe Xu

@XiangzheX

Followers 236 · Following 248 · Media 5 · Statuses 107

Ph.D. student @PurdueCS. 2025 intern at @MSFTResearch. I do research that helps developers—from pros to vibe coders to agent builders.

West Lafayette, IN
Joined October 2017
@XiangzheX
Xiangzhe Xu
4 months
🎉🎉Our team took first place in the Amazon Nova AI Challenge—huge thanks to Amazon for their generous support and the prize! In this competition, our red‑teaming agent automatically identifies weaknesses in the blue‑team coding agents aligned using a variety of techniques.
@AmazonScience
Amazon Science
4 months
Meet the champions of the Amazon Nova AI Challenge, where university teams around the world went head-to-head to break and defend LLMs under real adversarial pressure: 🏆 Defending: UIUC (PurpCorn-PLAN) 🏆 Attacking: Purdue (PurCL) 🥈 CTU Prague & Nova Lisbon These students just
1
0
15
@RuiYang70669025
Rui Yang
2 months
🥳 Excited to share ERA: our training recipe for VLM-based embodied agents with interleaved perception + reasoning, tackling both high-level planning and low-level manipulation. We cover embodied-knowledge data curation and agent RL design. 🔎 Findings 1️⃣ Beyond
arxiv.org
Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing...
@hc81Jeremy
Hanyang Chen
2 months
🤖️Today we introduce the Embodied Reasoning Agent (ERA), a framework that transforms a compact Vision Language Model (VLM) into a performant and efficient embodied agent. While large models like GPT-4o and Gemini show strong embodied performance on EmbodiedBench, smaller ones
1
14
78
@XiangzheX
Xiangzhe Xu
2 months
Awesome work from Guangyu. It's not possible to fully validate the data supply chain of frontier LLMs. Making LLMs aware of secure behavior by design is the key solution, just like how we teach children what good and bad behaviors are.
@guangyuNoah
Guangyu Shen
2 months
@AnthropicAI @AISecurityInst @turinginst Love this work! As models scale, poisoning risk grows since attacker cost barely does. Glad to see frontier labs tackling this. How do we address such threats? Our latest paper shows how we cultivate LLMs to realize they’re poisoned and reveal the trigger.
0
0
3
@XiangzheX
Xiangzhe Xu
2 months
We observe similar challenges while red-teaming trustworthy coding agents. Our tool finds misalignments with a 90%+ attack success rate. Current alignment practices may hardly scale to domains requiring deep knowledge (such as coding). https://t.co/AJEfBhxCbU
github.com
🥇 Amazon Nova AI Challenge Winner - ASTRA emerged victorious as the top attacking team in Amazon's global AI safety competition, defeating elite defending teams from universities worldwide ...
@fly51fly
fly51fly
2 months
[LG] The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections M Nasr, N Carlini, C Sitawarin, S V. Schulhoff... [OpenAI & Anthropic & Google DeepMind] (2025) https://t.co/kZ9m6P722x
0
0
1
@lilianweng
Lilian Weng
2 months
GPUs are expensive and setting up the infrastructure to make GPUs work for you properly is complex, making experimentation on cutting-edge models challenging for researchers and ML practitioners. Providing high quality research tooling is one of the most effective ways to
41
130
2K
@thinkymachines
Thinking Machines
2 months
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models!
233
805
6K
@MartinGTobias
Martin Tobias (Pre-Seed VC)
3 months
AI isn't the trend. The democratization of intelligence is the trend. Surf that.
7
10
57
@XiangzheX
Xiangzhe Xu
3 months
Interesting work on benchmarking tool use, with fuzzy instructions and multi-goal queries.
@_akhaliq
AK
3 months
MCP-Bench Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
1
0
3
@NielsRogge
Niels Rogge
3 months
Does anyone know the current state of UI agents? Can you build reliable use cases with these? Or still too brittle? Have personally tried @browser_use; it worked after a couple of tries for a simple form completion, but was quite slow and pretty unreliable
@HuggingPapers
DailyPapers
3 months
Alibaba Group just launched Mobile-Agent-v3! This new SOTA GUI agent framework brings intelligent automation to Android, Ubuntu, macOS, and Windows. It seamlessly handles complex tasks across web browsers and desktop apps.
8
2
11
@XiangzheX
Xiangzhe Xu
3 months
Literally “contamination-free” benchmark: let LLMs predict future events 😎
@rohanpaul_ai
Rohan Paul
3 months
Great new paper and benchmark from ByteDance. FutureX, the world’s first live benchmark for real future prediction — politics, economy, culture, sports, etc. Among 23 AI agents, @grok ranked #1 🏆 FutureX is a live, contamination‑proof benchmark that measures whether LLM
0
0
1
@XiangzheX
Xiangzhe Xu
3 months
Inspiring insights and design. Valuable for domains with varied, imbalanced requirements (e.g., safe coding)
@furongh
Furong Huang
6 months
🔥 How can we align #LLMs effectively with messy, imbalanced real-world data? #GRPO is great 🤩—simple, strong, and doesn't even need a learned value function. 😥But it struggles when data isn’t evenly balanced across domains. 🕺💃 Enter 🪩 DISCO 🪩: Domain- & Difficulty-Aware
0
0
2
@allen_ai
Ai2
3 months
🚀 In March, we launched Paper Finder, an LLM-powered literature search agent that surfaces papers other tools miss. Now, we’re releasing an open-source snapshot to enable others to inspect & build on it—and reproduce the results. 🧵
7
63
456
@XiangzheX
Xiangzhe Xu
3 months
What about SWE-Bench-Live? 🤔
@KLieret
Kilian Lieret ✈️ NeurIPS
3 months
What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵
0
0
1
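A minimal sketch of the per-turn model-switching idea from the quoted tweet. Here `call_model` and `apply_action` are hypothetical stand-ins for the agent's LM call and environment step; mini-SWE-agent's real interfaces differ.

```python
import random

MODELS = ["gpt-5", "claude-sonnet-4"]  # the two models named in the thread

def call_model(model: str, history: list) -> str:
    """Hypothetical stand-in for the agent's LM call."""
    return f"[{model}] next shell command"

def apply_action(action: str) -> tuple:
    """Hypothetical stand-in for executing the action in the repo."""
    return "command output", True

def run_agent(task: str, max_turns: int = 30) -> list:
    """Draw a model uniformly at random at every agent turn."""
    history = [task]
    for _ in range(max_turns):
        model = random.choice(MODELS)        # fresh draw each turn
        action = call_model(model, history)
        observation, done = apply_action(action)
        history += [action, observation]
        if done:
            break
    return history

print(run_agent("fix the failing test"))
```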
@XiangzheX
Xiangzhe Xu
3 months
The benefits of smaller models will be amplified by better, smarter context management. Especially on edge devices, giving an SLM only the necessary context greatly reduces cost and latency and puts less pressure on the SLM's context length.
0
0
5
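A minimal sketch of the context-pruning idea above, assuming keyword-overlap scoring as a stand-in for whatever retrieval a real system would use (embeddings, call graphs, etc.); all names are illustrative.

```python
def select_context(query: str, chunks: list, budget_tokens: int) -> list:
    """Greedily pack the most query-relevant chunks into a token budget,
    so the SLM sees only what it needs."""
    qwords = set(query.lower().split())
    # Score each chunk by word overlap with the query (toy relevance metric).
    scored = sorted(chunks,
                    key=lambda c: len(qwords & set(c.lower().split())),
                    reverse=True)
    picked, used = [], 0
    for chunk in scored:
        cost = len(chunk.split())  # crude token estimate
        if used + cost <= budget_tokens:
            picked.append(chunk)
            used += cost
    return picked

chunks = ["parse_config reads the YAML config file and validates keys",
          "README: project overview and installation steps",
          "load_model downloads weights from the hub"]
print(select_context("why does parse_config crash on missing keys",
                     chunks, budget_tokens=20))
```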
@XiangzheX
Xiangzhe Xu
3 months
Inspiring work. Current coding agents (even the best ones, such as @Trae_ai and @allhands_ai) face similar challenges. They memorize short-term info (e.g., the whole conversation, the current repo) but have limited long-term knowledge, such as user interactions and environments.
@rohanpaul_ai
Rohan Paul
4 months
The paper shows a memory centric multimodal agent that wins on long videos by reusing observations. It introduces M3-Agent, which runs 2 loops, memorization builds long term memory, control reasons over it. Memorization turns the live video and audio stream into 2 kinds of
0
0
10
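A schematic of the two-loop design the quoted tweet describes, not the paper's actual code: one loop distills the incoming stream into long-term memory, the other reasons over it. All names here are illustrative.

```python
class TwoLoopAgent:
    """Memory-centric agent sketch: a memorization loop builds long-term
    memory from a live stream; a control loop reasons over that memory."""

    def __init__(self):
        self.memory = []  # long-term memory entries

    def memorize(self, stream):
        # Loop 1: runs continuously over incoming clips.
        for clip in stream:
            # Stand-in for a VLM turning a clip into a memory entry.
            self.memory.append(f"summary of {clip}")

    def control(self, question: str) -> str:
        # Loop 2: retrieve relevant memories, then reason over them (stubbed).
        hits = [m for m in self.memory
                if any(word in m for word in question.split())]
        return f"answer grounded in {len(hits)} retrieved memory entries"

agent = TwoLoopAgent()
agent.memorize(["clip_001", "clip_002"])
print(agent.control("what happened in clip_001"))
```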
@XiangzheX
Xiangzhe Xu
4 months
Interesting to see whether more concise reasoning makes the model more robust. Previously we identified many logic holes in models’ reasoning that could be exploited. 🤔 Would shorter reasoning imply fewer exploitable flaws?
@VaishShrivas
Vaish Shrivastava
4 months
Test-time scaling w/ GRPO boosts accuracy, but also adds “filler tokens” increasing length w/o real progress. We present Group Filtered Policy Optimization (GFPO):🧵 1️⃣ Sample more per prompt 2️⃣ Rank by token efficiency (reward ÷ length) 3️⃣ Train on top-k 4️⃣ 🚀 Cut 80% of
0
0
1
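The filtering step of the recipe above as a minimal sketch: the reward-per-token ranking and top-k retention follow the tweet, while the within-group advantage normalization is an assumption carried over from standard GRPO.

```python
import numpy as np

def gfpo_filter(rewards, lengths, k):
    """GFPO-style filtering (sketch): keep the k most token-efficient
    samples in a group, then compute advantages within that subset."""
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    efficiency = rewards / np.maximum(lengths, 1.0)  # reward per token
    keep = np.argsort(-efficiency)[:k]               # indices of the top-k
    adv = rewards[keep] - rewards[keep].mean()       # GRPO-style advantage,
    std = rewards[keep].std()                        # over the filtered group
    if std > 0:
        adv = adv / std
    return keep, adv

# Example: 4 samples for one prompt, retain the 2 most token-efficient.
keep, adv = gfpo_filter(rewards=[0.9, 0.2, 0.7, 0.4],
                        lengths=[300, 50, 120, 400], k=2)
print(keep, adv)  # only the retained samples contribute to the update
```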
@XiangzheX
Xiangzhe Xu
4 months
It would be interesting to see what the bottlenecks and boundaries of self-evolution are. There seem to be huge unsolved parts even in domains with verifiable rewards (e.g., SWE-Bench-Live)
@rohanpaul_ai
Rohan Paul
4 months
Absolutely golden resource: A Comprehensive Survey of Self-Evolving AI Agents. Self‑evolving agents are built to adapt themselves safely, not just run fixed scripts, guided by 3 laws: endure, excel, evolve. The survey maps a 4‑stage shift, MOP (Model Offline Pretraining) to
0
0
3
@XiangzheX
Xiangzhe Xu
4 months
Having comprehensive policies is extremely hard yet important for complex domains (e.g., coding & cybersecurity). We found holes in models with even the strongest defenses ( https://t.co/AJEfBhya1s), including constructing a runnable ransomware with Claude ( https://t.co/ToLyrzDfmo)
arxiv.org
Large language models (LLMs) have democratized software development, reducing the expertise barrier for programming complex applications. This accessibility extends to malicious software...
@AnthropicAI
Anthropic
4 months
Today we're sharing a post on how our Safeguards team identifies potential misuse of our models and builds defenses against it.
0
0
7
@XiangzheX
Xiangzhe Xu
4 months
@AnthropicAI Our technique explores diverse user requests while preserving the intent.
0
0
1
@XiangzheX
Xiangzhe Xu
4 months
@AnthropicAI Our paper:
1
0
1
@XiangzheX
Xiangzhe Xu
4 months
Echoing @AnthropicAI’s vision for safe, trustworthy agents, our new paper (arXiv:2506.07524) tests agent intent-understanding across 80 APIs in 5 domains—revealing where misinterpretations happen and why transparently linking behavior to true user goals is essential. 🔗 Anthropic
anthropic.com
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
1
0
4