Xianjun Yang
@xianjun_agi
Followers
5K
Following
851
Media
18
Statuses
429
RS @AIatMeta. AI search/reasoning/agent/safety. Prev PhD @ucsbnlp, BEng @tsinghua_uni. Opinions are my own. Fast learner with strong intellectual curiosity.
Mountain View, CA
Joined February 2020
I was laid off by Meta today. As a Research Scientist, my work was just cited by the legendary @johnschulman2 and Nicholas Carlini yesterday. I'm actively looking for new opportunities; please reach out if you have any openings!
Our QuestBench paper was accepted at NeurIPS 2025 Track on Datasets and Benchmarks! Check out the updated paper https://t.co/MxOdGVh95P Joint work with @belindazli and @_beenkim. (Just found this nice summary. Thanks for posting about our work.)
Super cool paper from @GoogleDeepMind. Real-world queries for LLMs often lack the information needed for reasoning tasks. This paper tackles that by framing underspecification as a Constraint Satisfaction Problem in which one variable is missing. It introduces QuestBench, a
Circuit-based Reasoning Verification (CRV) treats an LLM's reasoning process like an execution trace in classical software to debug and correct it at inference. https://t.co/u4x6KcWCMs
venturebeat.com
The proof-of-concept could pave the way for a new class of AI debuggers, making language models more reliable for business-critical applications.
Thrilled to share our latest research on verifying CoT reasoning, completed during my recent internship at FAIR @metaai. In this work, we introduce Circuit-based Reasoning Verification (CRV), a new white-box method to analyse and verify how LLMs reason, step by step.
As a new grad and early-career researcher, I'm truly overwhelmed and grateful for the incredible support from the community. Within 24 hours, I've received hundreds of kind messages and job opportunities, a reminder of how warm and vibrant the AI community is. I'll take time to
arxiv.org
Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We...
Fine-tuning APIs are becoming more powerful and widespread, but they're harder to safeguard against misuse than fixed-weight sampling APIs. Excited to share a new paper: Detecting Adversarial Fine-tuning with Auditing Agents ( https://t.co/NqMeGSCQIF). Auditing agents search
arxiv.org
Large Language Model (LLM) providers expose fine-tuning APIs that let end users fine-tune their frontier LLMs. Unfortunately, it has been shown that an adversary with fine-tuning access to an LLM...
One of the key challenges in building web-based "deep research" agents is constructing sufficiently difficult long-horizon agentic data. At @SFResearch, we introduce ProgSearch, a controlled data synthesis pipeline that builds tasks of increasing complexity until a frontier
Introducing LabOS: the AI-XR Co-Scientist. A system that sees, understands, and works with humans in real-world labs.
- Egocentric vision & extended reality
- LLM reasoning & hypothesis generation
- Real-time guidance & multi-modal human-AI collaboration
From observation
Into the Rabbit Hull, Part I (Part II tomorrow): an interpretability deep dive into DINOv2, one of vision's most important foundation models. Today is Part I; buckle up, we're exploring some of its most charming features.
Self-Evolving AI Risks "Misevolution". Even top LLMs (Gemini-2.5-Pro, GPT-4o) face this: agents drift into harm, over-refunding, reusing insecure tools, and losing safety alignment. First study on this! https://t.co/DeBJFdTOtF
Glad to be an early user of ARE! Congrats @amine_benh on the release!
ARE: scaling up agent environments and evaluations. Everyone talks about RL envs, so we built one we actually use. In the second half of AI, evals & envs are the bottleneck. Today we OSS it all: Meta Agent Research Environment + GAIA-2 (code, demo, evals). Links below.
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that best leverage unlimited compute. We find simple recipes that improve the asymptote of compute scaling laws to be 5x more data-efficient, offering better performance with sufficient compute.
It's interesting that the whole ranking changes a lot when choosing style control rather than the default view.
Leaderboard disrupted! Grok-4-fast by @xAI has arrived in the Arena, and it's shaking things up! #1 on the Search Leaderboard: tested under the codename "menlo," Grok-4-fast-search just rocketed to the top spot with the community. Tied for #8 on the Text Leaderboard
Meet SFR-DeepResearch (SFR-DR): our RL-trained autonomous agents that can reason, search, and code their way through deep research tasks. SFR-DR-20B achieves 28.7% on Humanity's Last Exam (text-only) using only web search, browsing, and a Python interpreter,
LLMs should use parallel thinking! Consider multiple possibilities at once before synthesizing them into coherent output. The key is knowing when and how to parallelize, which we study in our new Parallel-R1 paper: https://t.co/f55GDJnvD1
Super thrilled to share WebExplorer, a simple yet effective approach to training long-horizon web agents. Instead of depending heavily on rigid pre-defined graph structures, WebExplorer uses a model-based exploration strategy to synthesize high-quality agentic data. Our 8B
Diversity is the key to intelligence
Diversity-Aware RL (DARLING): https://t.co/MH0tui34Cb
- Jointly optimizes for quality & diversity using a learned partition function
- Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/pass@k
- Works for both non-verifiable & verifiable tasks
1/5
New benchmark release! We're introducing Research-Eval: a diverse, high-quality benchmark for evaluating search-augmented LLMs. Blogpost: https://t.co/fLJonKsJ08 Dataset + code: https://t.co/fJiySwgO4s Here's why this matters: