Xianjun Yang
@xianjun_agi
Followers
5K
Following
851
Media
18
Statuses
429
RS @AIatMeta. AI search/reasoning/agent/safety. Prev PhD @ucsbnlp, BEng @tsinghua_uni. Opinions are my own. Fast learner with strong intellectual curiosity.
Mountain View, CA
Joined February 2020
I was laid off by Meta today. As a Research Scientist, my work was just cited by the legendary @johnschulman2 and Nicholas Carlini yesterday. I'm actively looking for new opportunities; please reach out if you have any openings!
Our QuestBench paper was accepted at NeurIPS 2025 Track on Datasets and Benchmarks! Check out the updated paper https://t.co/MxOdGVh95P Joint work with @belindazli and @_beenkim. (Just found this nice summary. Thanks for posting about our work.)
Super cool paper from @GoogleDeepMind. Real-world queries for LLMs often lack the information needed for reasoning tasks. This paper tackles that by framing underspecification as a Constraint Satisfaction Problem in which one variable is missing. It introduces QuestBench, a
Circuit-based Reasoning Verification (CRV) treats an LLM's reasoning process like an execution trace in classical software to debug and correct it at inference. https://t.co/u4x6KcWCMs
venturebeat.com
The proof-of-concept could pave the way for a new class of AI debuggers, making language models more reliable for business-critical applications.
Thrilled to share our latest research on verifying CoT reasoning, completed during my recent internship at FAIR @metaai. In this work, we introduce Circuit-based Reasoning Verification (CRV), a new white-box method to analyse and verify how LLMs reason, step by step.
As a new grad and early-career researcher, I'm truly overwhelmed and grateful for the incredible support from the community. Within 24 hours, I've received hundreds of kind messages and job opportunities, a reminder of how warm and vibrant the AI community is. I'll take time to
arxiv.org
Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We...
Fine-tuning APIs are becoming more powerful and widespread, but they're harder to safeguard against misuse than fixed-weight sampling APIs. Excited to share a new paper: Detecting Adversarial Fine-tuning with Auditing Agents ( https://t.co/NqMeGSCQIF). Auditing agents search
arxiv.org
Large Language Model (LLM) providers expose fine-tuning APIs that let end users fine-tune their frontier LLMs. Unfortunately, it has been shown that an adversary with fine-tuning access to an LLM...
One of the key challenges in building web-based "deep research" agents is constructing sufficiently difficult long-horizon agentic data. At @SFResearch, we introduce ProgSearch, a controlled data synthesis pipeline that builds tasks of increasing complexity until a frontier
Introducing LabOS: the AI-XR Co-Scientist. A system that sees, understands, and works with humans in real-world labs.
- Egocentric vision & extended reality
- LLM reasoning & hypothesis generation
- Real-time guidance & multi-modal human-AI collaboration
From observation
Into the Rabbit Hull, Part I (Part II tomorrow): an interpretability deep dive into DINOv2, one of vision's most important foundation models. Today is Part I; buckle up, we're exploring some of its most charming features.
Self-Evolving AI Risks "Misevolution". Even top LLMs (Gemini-2.5-Pro, GPT-4o) face this: agents drift into harm, over-refunding, reusing insecure tools, and losing safety alignment. First study on this! https://t.co/DeBJFdTOtF
Glad to be an early user of ARE! Congrats @amine_benh on the release!
ARE: scaling up agent environments and evaluations. Everyone talks about RL envs, so we built one we actually use. In the second half of AI, evals & envs are the bottleneck. Today we OSS it all: Meta Agent Research Environment + GAIA-2 (code, demo, evals). Links below.
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that best leverage unlimited compute. We find simple recipes that improve the asymptote of compute scaling laws to be 5x more data-efficient, offering better performance with sufficient compute.
It's interesting that the whole ranking changes a lot when choosing style control rather than the default view.
Leaderboard disrupted! Grok-4-fast by @xAI has arrived in the Arena, and it's shaking things up! #1 on the Search Leaderboard: tested under the codename "menlo," Grok-4-fast-search just rocketed to the top spot with the community. Tied for #8 on the Text Leaderboard
Meet SFR-DeepResearch (SFR-DR): our RL-trained autonomous agents that can reason, search, and code their way through deep research tasks. SFR-DR-20B achieves 28.7% on Humanity's Last Exam (text-only) using only web search, browsing, and a Python interpreter,
LLMs should use parallel thinking! Consider multiple possibilities at once before synthesizing them into coherent output. The key is knowing when and how to parallelize, which we study in our new Parallel-R1 paper: https://t.co/f55GDJnvD1
Super thrilled to share WebExplorer, a simple yet effective approach to training long-horizon web agents. Instead of depending heavily on rigid pre-defined graph structures, WebExplorer uses a model-based exploration strategy to synthesize high-quality agentic data. Our 8B
Diversity is the key to intelligence
Diversity-Aware RL (DARLING): https://t.co/MH0tui34Cb
- Jointly optimizes for quality & diversity using a learned partition function
- Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/pass@k
- Works for both non-verifiable & verifiable tasks
1/5
New benchmark release! We're introducing Research-Eval: a diverse, high-quality benchmark for evaluating search-augmented LLMs. Blogpost: https://t.co/fLJonKsJ08 Dataset + code: https://t.co/fJiySwgO4s Here's why this matters: