Xianjun Yang

@xianjun_agi

Followers 5K · Following 851 · Media 18 · Statuses 429

RS @AIatMeta. AI search/reasoning/agent/safety. Prev PhD @ucsbnlp, BEng @tsinghua_uni. Opinions are my own. Fast learner with strong intellectual curiosity.

Mountain View, CA
Joined February 2020
@xianjun_agi
Xianjun Yang
17 days
I was laid off by Meta today. As a Research Scientist, my work was just cited by the legendary @johnschulman2 and Nicholas Carlini yesterday. I'm actively looking for new opportunities - please reach out if you have any openings!
@suchenzang
Susan Zhang
18 days
👀
281
380
5K
@ziwphd
Zi Wang, Ph.D.
6 days
Our QuestBench paper was accepted at NeurIPS 2025 Track on Datasets and Benchmarks! Check out the updated paper https://t.co/MxOdGVh95P Joint work with @belindazli and @_beenkim. (Just found this nice summary. Thanks for posting about our work.)
@rohanpaul_ai
Rohan Paul
7 months
Super cool paper from @GoogleDeepMind. Real-world queries for LLMs often lack the information needed for reasoning tasks. This paper tackles that by framing underspecification as a Constraint Satisfaction Problem where one variable is missing. It introduces QuestBench, a
2
14
79
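A rough sketch of the "CSP with one missing variable" framing described in the tweet above. Everything below (the variables, domains, constraint, and brute-force search) is invented for illustration and assumes nothing about QuestBench's actual tasks or code:

```python
# Minimal sketch of underspecification as a constraint satisfaction problem
# with one missing variable. All names and values here are hypothetical,
# not from the QuestBench paper or dataset.
from itertools import product

domains = {"x": [1, 2, 3], "y": [1, 2, 3], "z": list(range(1, 10))}
known = {"x": 2}   # what the user's query actually specifies
target = "z"       # what the user wants computed

def satisfies(a):
    # Constraint: z must equal x + y, and y must be odd.
    return a["z"] == a["x"] + a["y"] and a["y"] % 2 == 1

def solutions(assignment):
    """All complete assignments consistent with what is currently known."""
    free = [v for v in domains if v not in assignment]
    for vals in product(*(domains[v] for v in free)):
        a = {**assignment, **dict(zip(free, vals))}
        if satisfies(a):
            yield a

answers = {a[target] for a in solutions(known)}
if len(answers) == 1:
    print("Answer:", answers.pop())
else:
    # Underspecified: find the single unknown variable whose value, once
    # provided, would pin the answer down -- that is the question to ask.
    for v in domains:
        if v in known or v == target:
            continue
        if all(len({a[target] for a in solutions({**known, v: val})}) <= 1
               for val in domains[v]):
            print("Ask the user for:", v)   # here: y
```

With known = {"x": 2} the surviving answers are {3, 5}, so the query is underspecified and the only useful clarifying question targets y.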
@VentureBeat
VentureBeat
9 days
Circuit-based Reasoning Verification (CRV) treats an LLM's reasoning process like an execution trace in classical software to debug and correct it at inference. https://t.co/u4x6KcWCMs
venturebeat.com
The proof-of-concept could pave the way for a new class of AI debuggers, making language models more reliable for business-critical applications.
0
3
10
@zhengzhao97
Zheng Zhao @EMNLP🇨🇳
16 days
Thrilled to share our latest research on verifying CoT reasoning, completed during my recent internship at FAIR @metaai. In this work, we introduce Circuit-based Reasoning Verification (CRV), a new white-box method to analyse and verify how LLMs reason, step by step.
7
49
334
@xianjun_agi
Xianjun Yang
16 days
As a new grad and early-career researcher, I'm truly overwhelmed and grateful for the incredible support from the community. Within 24 hours, I've received hundreds of kind messages and job opportunities, a reminder of how warm and vibrant the AI community is. I'll take time to
arxiv.org
Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We...
18
48
686
@johnschulman2
John Schulman
18 days
Fine-tuning APIs are becoming more powerful and widespread, but they're harder to safeguard against misuse than fixed-weight sampling APIs. Excited to share a new paper: Detecting Adversarial Fine-tuning with Auditing Agents ( https://t.co/NqMeGSCQIF). Auditing agents search
arxiv.org
Large Language Model (LLM) providers expose fine-tuning APIs that let end users fine-tune their frontier LLMs. Unfortunately, it has been shown that an adversary with fine-tuning access to an LLM...
10
50
466
@CaimingXiong
Caiming Xiong
20 days
One of the key challenges for building web-based "deep research" agents is constructing sufficiently difficult long-horizon agentic data. At @SFResearch, we introduce ProgSearch, a controlled data synthesis pipeline that builds tasks of increasing complexity until a frontier
3
27
111
@MengdiWang10
Mengdi Wang
23 days
🚀 Introducing LabOS: The AI-XR Co-Scientist. A system that sees, understands, and works with humans in real-world labs. 👁️ Egocentric vision & extended reality. 🧠 LLM reasoning & hypothesis generation. 🤖 Real-time guidance & multi-modal human-AI collaboration. From observation →
10
24
156
@Napoolar
Thomas Fel
26 days
🕳️🐇 Into the Rabbit Hull - Part I (Part II tomorrow). An interpretability deep dive into DINOv2, one of vision's most important foundation models. Today is Part I: buckle up, we're exploring some of its most charming features.
10
119
639
@dong_rui39501
Dongrui Liu
1 month
Self-Evolving AI Risks "Misevolution". Even top LLMs (Gemini-2.5-Pro, GPT-4o) face this; agents drift into harm: over-refunding, reusing insecure tools, losing safety alignment. First study on this! https://t.co/DeBJFdTOtF
1
10
20
@xianjun_agi
Xianjun Yang
2 months
Glad to be an early user of ARE! Congrats @amine_benh on the release!
@ThomasScialom
Thomas Scialom
2 months
🚀 ARE: scaling up agent environments and evaluations. Everyone talks about RL envs, so we built one we actually use. In the second half of AI, evals & envs are the bottleneck. Today we OSS it all: Meta Agent Research Environment + GAIA-2 (code, demo, evals). 🔗 Links 👇
1
0
8
@kothasuhas
Suhas Kotha
2 months
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute. We find simple recipes that improve the asymptote of compute scaling laws to be 5x more data efficient, offering better perf w/ sufficient compute.
9
81
446
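For readers unfamiliar with the "asymptote" phrasing: loss-versus-compute curves at fixed data are commonly modeled with a saturating form, sketched below. This is the generic parameterization, not necessarily the one fit in the paper, and the "5x" reading is an interpretation of the tweet:

```latex
% Generic saturating scaling law at fixed data D (illustrative only):
% as compute C grows, loss approaches an asymptote set by the data.
L(C; D) \xrightarrow{\;C \to \infty\;} L_{\infty}(D)
% "5x more data efficient" then reads as: the improved recipe's asymptote
% matches a baseline trained on roughly five times the data,
L_{\infty}^{\text{new}}(D) \approx L_{\infty}^{\text{baseline}}(5D)
```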
@xianjun_agi
Xianjun Yang
2 months
It's interesting that the ranking changes a lot 🧐 when style control is selected rather than the default.
@arena
lmarena.ai
2 months
🚨 Leaderboard Disrupted! Grok-4-fast by @xAI has arrived in the Arena, and it's shaking things up! ⚡️ 🏆 #1 on the Search Leaderboard: tested under the codename "menlo," Grok-4-fast-search just rocketed to the top spot with the community. 💠 Tied for #8 on the Text Leaderboard
0
0
0
@xianjun_agi
Xianjun Yang
2 months
👀
@ESYudkowsky
Eliezer Yudkowsky ⏹️
2 months
"If Anyone Builds It, Everyone Dies" is now out. Read it today if you want to see with fresh eyes what's truly there, before others try to prime your brain to see something else instead!
1
1
1
@CaimingXiong
Caiming Xiong
2 months
Meet SFR-DeepResearch (SFR-DR) 🤖: our RL-trained autonomous agents that can reason, search, and code their way through deep research tasks. 🚀 SFR-DR-20B achieves 28.7% on Humanity's Last Exam (text-only) using only web search 🔍, browsing 🌐, and Python interpreter 🐍,
28
146
953
@wyu_nd
Wenhao Yu
2 months
LLMs should use parallel thinking! -- consider multiple possibilities at once before synthesizing them into coherent output. -- the key is knowing when and how to parallelize, which we study in our new Parallel-R1 paper: https://t.co/f55GDJnvD1
28
81
578
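A loose sketch of the "think in parallel, then synthesize" pattern the tweet describes. The generate() stub and prompts below are hypothetical placeholders; Parallel-R1 itself trains this behaviour into the model with RL rather than orchestrating it from outside:

```python
# Rough sketch of parallel thinking followed by synthesis. The generate()
# stub stands in for any LLM completion call; nothing here is the
# Parallel-R1 training recipe.
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder for a call to an LLM completion API of your choice."""
    raise NotImplementedError("plug in your model client here")

def parallel_think(question: str, n_paths: int = 4) -> str:
    # 1. Explore several independent lines of reasoning at once.
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        paths = list(pool.map(
            lambda _: generate(f"Reason step by step about: {question}"),
            range(n_paths),
        ))
    # 2. Synthesize: hand all candidate chains back to the model and ask
    #    for one coherent answer that reconciles them.
    joined = "\n\n".join(f"[Path {i + 1}]\n{p}" for i, p in enumerate(paths))
    return generate(
        f"Question: {question}\n\nCandidate reasoning paths:\n{joined}\n\n"
        "Synthesize these into a single, coherent answer.",
        temperature=0.2,
    )
```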
@WenhuChen
Wenhu Chen
2 months
Super thrilled to introduce WebExplorer, a simple yet effective approach to training long-horizon web agents. Instead of depending heavily on rigid pre-defined graph structures, WebExplorer uses a model-based exploration strategy to synthesize high-quality agentic data. Our 8B
@_akhaliq
AK
2 months
WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
4
15
175
@xianjun_agi
Xianjun Yang
2 months
Diversity is the key to intelligence
@jaseweston
Jason Weston
2 months
🌀 Diversity Aware RL (DARLING) 🌀 📝: https://t.co/MH0tui34Cb - Jointly optimizes for quality & diversity using a learned partition function - Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k - Works for both non-verifiable & verifiable tasks 🧵 1/5
0
0
5
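One plausible shape for the "quality and diversity" reward the thread describes, sketched below. The partition() and quality() stubs stand in for DARLING's learned partition function and task reward; the exact way the two terms are combined here is an assumption, not the paper's objective:

```python
# Illustrative sketch of a diversity-aware reward over a group of sampled
# responses. partition() is a placeholder for a learned semantic-equivalence
# classifier; quality() is a placeholder task reward.
from collections import Counter

def partition(responses: list[str]) -> list[int]:
    """Placeholder: map each response to a semantic-equivalence class id."""
    raise NotImplementedError("plug in a learned equivalence classifier")

def quality(response: str) -> float:
    """Placeholder: per-response task reward, e.g. a verifier or RM score."""
    raise NotImplementedError

def diversity_aware_rewards(responses: list[str]) -> list[float]:
    classes = partition(responses)
    counts = Counter(classes)
    rewards = []
    for resp, cls in zip(responses, classes):
        # Responses stuck in a crowded equivalence class earn a smaller
        # diversity bonus, pushing the policy toward distinct answers
        # without ignoring per-response quality.
        diversity_bonus = 1.0 / counts[cls]
        rewards.append(quality(resp) * diversity_bonus)
    return rewards
```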
@xianjun_agi
Xianjun Yang
2 months
😍
@RekaAILabs
Reka
2 months
🚨 New benchmark release 🚨 We're introducing Research-Eval: a diverse, high-quality benchmark for evaluating search-augmented LLMs. 👉 Blogpost: https://t.co/fLJonKsJ08 👉 Dataset + code: https://t.co/fJiySwgO4s 🧵 Here's why this matters:
0
0
0