Xiangzhe Xu
@XiangzheX
Followers
236
Following
248
Media
5
Statuses
107
Ph.D. student @PurdueCS. 2025 intern at @MSFTResearch. I do research that helps developers—from pros to vibe coders to agent builders.
West Lafayette, IN
Joined October 2017
🎉🎉Our team took first place in the Amazon Nova AI Challenge—huge thanks to Amazon for their generous support and the prize! In this competition, our red‑teaming agent automatically identifies weaknesses in the blue‑team coding agents aligned using a variety of techniques.
Meet the champions of the Amazon Nova AI Challenge, where university teams around the world went head-to-head to break and defend LLMs under real adversarial pressure: 🏆 Defending: UIUC (PurpCorn-PLAN) 🏆 Attacking: Purdue (PurCL) 🥈 CTU Prague & Nova Lisbon These students just
1
0
15
🥳 Excited to share ERA: our training recipe for VLM-based embodied agents with interleaved perception + reasoning, tackling both high-level planning and low-level manipulation. We cover embodied-knowledge data curation and agent RL design. 🔎 Findings 1️⃣ Beyond
arxiv.org
Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing...
🤖️Today we introduce the Embodied Reasoning Agent (ERA), a framework that transforms a compact Vision Language Model (VLM) into a performant and efficient embodied agent. While large models like GPT-4o and Gemini show strong embodied performance on EmbodiedBench, smaller ones
1
14
78
Awesome work from Guangyu. It's not possible to fully validate the data supply chain of frontier LLMs, so instilling secure behavior in LLMs by design is the key solution, just as we teach human children which behaviors are good and bad.
@AnthropicAI @AISecurityInst @turinginst Love this work! As models scale, poisoning risk grows since attacker cost barely does. Glad to see frontier labs tackling this. How do we address such threats? Our latest paper shows how we cultivate LLMs to realize they’re poisoned and reveal the trigger.
0
0
3
We observe similar challenges while red-teaming trustworthy coding agents. Our tool finds misalignments with a 90%+ attack success rate. Current alignment practices can hardly scale to domains with deep knowledge (such as coding). https://t.co/AJEfBhxCbU
github.com
🥇 Amazon Nova AI Challenge Winner - ASTRA emerged victorious as the top attacking team in Amazon's global AI safety competition, defeating elite defending teams from universities worldwide ...
[LG] The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections M Nasr, N Carlini, C Sitawarin, S V. Schulhoff... [OpenAI & Anthropic & Google DeepMind] (2025) https://t.co/kZ9m6P722x
0
0
1
GPUs are expensive and setting up the infrastructure to make GPUs work for you properly is complex, making experimentation on cutting-edge models challenging for researchers and ML practitioners. Providing high quality research tooling is one of the most effective ways to
41
130
2K
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models!
233
805
6K
AI isn't the trend. The democratization of intelligence is the trend. Surf that.
7
10
57
Does anyone know the current state of UI agents? Can you build reliable use cases with these? Or still too brittle? I've personally tried @browser_use; it worked after a couple of tries for a simple form completion, but was quite slow and pretty unreliable
Alibaba Group just launched Mobile-Agent-v3! This new SOTA GUI agent framework brings intelligent automation to Android, Ubuntu, macOS, and Windows. It seamlessly handles complex tasks across web browsers and desktop apps.
8
2
11
Literally “contamination-free” benchmark: let LLMs predict future events 😎
Great new paper and benchmark from ByteDance. FutureX, the world’s first live benchmark for real future prediction — politics, economy, culture, sports, etc. Among 23 AI agents, @grok ranked #1 🏆 FutureX is a live, contamination‑proof benchmark that measures whether LLM
0
0
1
Inspiring insights and design. Valuable for domains with varied, imbalanced requirements (e.g., safe coding).
🔥 How can we align #LLMs effectively with messy, imbalanced real-world data? #GRPO is great 🤩—simple, strong, and doesn't even need a learned value function. 😥But it struggles when data isn’t evenly balanced across domains. 🕺💃 Enter 🪩 DISCO 🪩: Domain- & Difficulty-Aware
0
0
2
🚀 In March, we launched Paper Finder, an LLM-powered literature search agent that surfaces papers other tools miss. Now, we’re releasing an open-source snapshot to enable others to inspect & build on it—and reproduce the results. 🧵
7
63
456
The benefits of using smaller models will be amplified by better, smarter context management. Especially on edge devices, giving an SLM only the necessary context greatly reduces cost and latency and puts less pressure on the SLM's context length.
0
0
5
Inspiring work. Current coding agents (even the best ones such as @Trae_ai and @allhands_ai) face similar challenges. They memorize short-term info (e.g., the whole conversation, the current repo) but have limited long-term knowledge such as user interactions and environments.
The paper shows a memory-centric multimodal agent that wins on long videos by reusing observations. It introduces M3-Agent, which runs 2 loops: memorization builds long-term memory, and control reasons over it. Memorization turns the live video and audio stream into 2 kinds of
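A tiny, hypothetical sketch of that two-loop structure (a memorization loop that writes long-term memory, and a control loop that reasons over it). All class and function names below are illustrative, not M3-Agent's actual API, and the retrieval is deliberately naive.

```python
# Illustrative two-loop agent skeleton: memorization writes observations into
# long-term memory; control retrieves relevant memories to plan for a task.
class LongTermMemory:
    def __init__(self):
        self.entries: list[str] = []

    def write(self, observation: str) -> None:
        self.entries.append(observation)

    def retrieve(self, query: str) -> list[str]:
        # Naive retrieval: keep entries that share at least one word with the query.
        words = set(query.lower().split())
        return [e for e in self.entries if words & set(e.lower().split())]

def memorization_loop(stream, memory: LongTermMemory) -> None:
    """Consume the incoming observation stream and persist it as memory entries."""
    for observation in stream:
        memory.write(observation)

def control_loop(task: str, memory: LongTermMemory) -> str:
    """Reason over retrieved memories to decide what to do for the task."""
    relevant = memory.retrieve(task)
    return f"plan for {task!r} using {len(relevant)} relevant memories"

memory = LongTermMemory()
memorization_loop(["a red cup is on the table", "the user picked up the cup"], memory)
print(control_loop("find the cup", memory))
```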
0
0
10
Interesting to see whether more concise reasoning makes the model more robust. Previously we identified many logic holes in models’ reasoning that could be exploited. 🤔Would shorter reasoning imply fewer exploitable flaws?
Test-time scaling w/ GRPO boosts accuracy, but also adds “filler tokens” increasing length w/o real progress. We present Group Filtered Policy Optimization (GFPO):🧵 1️⃣ Sample more per prompt 2️⃣ Rank by token efficiency (reward ÷ length) 3️⃣ Train on top-k 4️⃣ 🚀 Cut 80% of
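A minimal sketch of the filtering idea described in the thread (sample a larger group per prompt, rank by reward ÷ length, train only on the top-k), assuming a GRPO-style group-relative advantage. The function names and the exact advantage computation are illustrative, not the paper's code.

```python
# Hypothetical GFPO-style group filtering: keep the most token-efficient
# responses, then compute group-relative advantages over the survivors.
from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    reward: float
    num_tokens: int

def filter_by_token_efficiency(group: list[Sample], k: int) -> list[Sample]:
    """Keep the k responses with the best reward-per-token ratio."""
    ranked = sorted(group, key=lambda s: s.reward / max(s.num_tokens, 1), reverse=True)
    return ranked[:k]

def grpo_style_advantages(kept: list[Sample]) -> list[float]:
    """Group-relative advantage: (reward - group mean) / group std over the kept set."""
    rewards = [s.reward for s in kept]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Usage: sample a larger group per prompt, filter, then run the usual policy
# update only on the surviving responses.
group = [Sample("short correct answer", 1.0, 120),
         Sample("long correct answer", 1.0, 900),
         Sample("short wrong answer", 0.0, 80)]
kept = filter_by_token_efficiency(group, k=2)
advantages = grpo_style_advantages(kept)
```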
0
0
1
It would be interesting to see what the bottlenecks and boundaries of self-evolution are. There seem to be huge unsolved parts even in domains with verifiable rewards (e.g., SWE-Bench-Live).
Absolutely golden resource: A Comprehensive Survey of Self-Evolving AI Agents. Self-evolving agents are built to adapt themselves safely, not just run fixed scripts, guided by 3 laws: endure, excel, evolve. The survey maps a 4-stage shift, from MOP (Model Offline Pretraining) to
0
0
3
Having comprehensive policies is extremely hard yet important for complex domains (e.g., coding & cyber security). We found holes even in models with the strongest defenses ( https://t.co/AJEfBhya1s), including constructing runnable ransomware with Claude ( https://t.co/ToLyrzDfmo)
arxiv.org
Large language models (LLMs) have democratized software development, reducing the expertise barrier for programming complex applications. This accessibility extends to malicious software...
Today we're sharing a post on how our Safeguards team identifies potential misuse of our models and builds defenses against it.
0
0
7
Echoing @AnthropicAI’s vision for safe, trustworthy agents, our new paper (arXiv:2506.07524) tests agent intent-understanding across 80 APIs in 5 domains—revealing where misinterpretations happen and why transparently linking behavior to true user goals is essential. 🔗 Anthropic
anthropic.com
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
1
0
4