CLS (@ChengleiSi) · 5K Followers · 24K Following · 41 Media · 3K Statuses
PhDing @stanfordnlp | teaching language models to do research
Palo Alto, California
Joined August 2018
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
12 replies · 196 reposts · 634 likes
🏆 World-Leading Reasoning
🔹 V3.2: Balanced inference vs. length. Your daily driver at GPT-5 level performance.
🔹 V3.2-Speciale: Maxed-out reasoning capabilities. Rivals Gemini-3.0-Pro.
🥇 Gold-Medal Performance: V3.2-Speciale attains gold-level results in IMO, CMO, ICPC World …
57 replies · 244 reposts · 3K likes
As the owner/maintainer of the Erdős problems website, a thread with some comments on this solution to #124: 1) This is a nice proof, which was provided by the AI from the formal statement with no human involvement and then formalised in Lean. This is already impressive!
We are on the cusp of a profound change in the field of mathematics. Vibe proving is here. Aristotle from @HarmonicMath just proved Erdős Problem #124 in @leanprover, all by itself. This problem has been open for nearly 30 years since conjectured in the paper “Complete sequences …”
22 replies · 112 reposts · 1K likes
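For context on the mechanics the thread takes for granted: "formalised in Lean" means the statement is written as a Lean theorem and the proof is a script the Lean kernel checks mechanically, with no human trust required. A toy sketch below; the theorem is mine and only stands in for the real, much harder statement of #124:

```lean
-- Toy illustration only: NOT the statement of Erdős #124.
-- The workflow: state the claim as a theorem, then supply a proof
-- (here a one-liner) that Lean's kernel verifies with no human in the loop.
theorem toy_add_zero (n : Nat) : n + 0 = n := rfl
```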
We just shared some thoughts and results on self-verifiable mathematical reasoning. The released model, DeepSeekMath-V2, is strong on IMO-ProofBench and competitions like IMO 2025 (5/6 problems) and Putnam 2024 (a near-perfect score of 118/120). Github: https://t.co/4dMEqWxXfU
28 replies · 78 reposts · 673 likes
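To make "self-verifiable" concrete, here is a minimal generate-then-verify loop of the kind the phrase suggests. The generator and verifier below are hypothetical stubs of my own; DeepSeekMath-V2's verifier is a trained model, not a string check:

```python
# Hypothetical stand-ins: the real system scores candidate proofs with a
# learned verifier rather than these toy functions.
def generate_proof(problem: str, attempt: int) -> str:
    return f"attempt {attempt}: proof sketch for {problem}"

def verify(proof: str) -> bool:
    # Stub: pretend only the third attempt passes verification.
    return proof.startswith("attempt 2")

def solve(problem: str, max_attempts: int = 4) -> str | None:
    for attempt in range(max_attempts):
        proof = generate_proof(problem, attempt)
        if verify(proof):   # keep only proofs the verifier accepts
            return proof
    return None             # abstain rather than emit an unverified proof

print(solve("toy inequality"))
```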
1/ Hiring PhD students at CMU SCS (LTI/MLD) for Fall 2026 (Deadline 12/10) 🎓 I work on open, reliable LMs: augmented LMs & agents (RAG, tool use, deep research), safety (hallucinations, copyright), and AI for science, code & multilinguality, and I'm open to bold new ideas! FAQ in 🧵
16 replies · 119 reposts · 575 likes
📢 Some big (& slightly belated) life updates! 1. I defended my PhD at MIT this summer! 🎓 2. I'm joining NYU as an Assistant Professor starting Fall 2026, with a joint appointment in Courant CS and the Center for Data Science. 🎉 🔬 My lab will focus on empirically studying …
100 replies · 89 reposts · 2K likes
Can LLMs help physicists break new ground in real frontier research? We introduce CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "Critical Point"): the first benchmark of unpublished, realistic research-level reasoning challenges broadly spanning …
13 replies · 22 reposts · 156 likes
Benchmarking data is dominated by a single “General Capability” dimension. Is this due to good generalization across tasks, or to developers pushing on all benchmarks at once? 🧵 with some analysis, including the discovery of a “Claudiness” dimension.
7 replies · 27 reposts · 276 likes
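A minimal sketch of how such a dominant dimension can surface, assuming the analysis is something like PCA over a model-by-benchmark score matrix (my construction; the thread's actual method may differ). With one strong shared factor, the first principal component soaks up most of the variance, and a second, smaller axis can separate one lab's models from the rest:

```python
# Sketch: does one latent factor dominate benchmark scores?
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores: rows = models, columns = benchmarks.
# One strong shared factor plus small benchmark-specific noise.
n_models, n_benchmarks = 30, 12
general = rng.normal(size=(n_models, 1))                  # latent "general capability"
loadings = rng.uniform(0.7, 1.0, size=(1, n_benchmarks))  # each benchmark loads on it
scores = general @ loadings + 0.1 * rng.normal(size=(n_models, n_benchmarks))

# Center columns, then PCA via SVD.
X = scores - scores.mean(axis=0)
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(f"PC1 explains {explained[0]:.0%} of score variance")  # the dominant axis
print(f"PC2 explains {explained[1]:.0%}")                    # room for a "Claudiness"-style axis
```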
The OlmoRL infrastructure was 4x faster than Olmo 2 and made it much cheaper to run experiments. Some of the changes:
1. continuous batching
2. in-flight updates
3. active sampling
4. many many improvements to our multi-threading code
4 replies · 15 reposts · 178 likes
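A minimal sketch of the first item, continuous batching, as I understand it; the scheduler below is a toy of my own, not OlmoRL code. The point is that finished sequences free their slot immediately, so new prompts start decoding without waiting for the whole batch to drain:

```python
# Sketch of continuous batching: swap finished sequences out and new
# prompts in at every step, keeping the generator's batch slots full.
from collections import deque

def generate_step(seq: dict) -> bool:
    """Stand-in for one decoding step; returns True when the sequence finishes."""
    seq["tokens"] += 1
    return seq["tokens"] >= seq["target_len"]

def continuous_batching(target_lens: list[int], max_slots: int = 4) -> list[int]:
    pending = deque({"id": i, "tokens": 0, "target_len": n}
                    for i, n in enumerate(target_lens))
    slots, finished = [], []
    while pending or slots:
        # Refill free slots immediately instead of waiting for a full drain.
        while pending and len(slots) < max_slots:
            slots.append(pending.popleft())
        # One decode step across the currently active batch.
        still_running = []
        for seq in slots:
            if generate_step(seq):
                finished.append(seq["id"])   # slot frees up this step
            else:
                still_running.append(seq)
        slots = still_running
    return finished

print(continuous_batching([3, 9, 2, 7, 5, 4]))  # completion order, not submit order
```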
We are releasing a LARGE new collection of science PDFs we linearized with olmOCR! Great for our first long context model. It was fun to use synth data to boost long context, all using Olmo 2! Older bro helping younger sibling 🥹
2 replies · 4 reposts · 42 likes
Today in @Nature, in work led by @aditimerch, we report the ability to prompt Evo to generate functional de novo genes. You shall know a gene by the company it keeps! 1/n
7 replies · 103 reposts · 543 likes
Come do a PhD with me at Columbia! My lab tackles basic problems in alignment, interpretability, safety, and capabilities of language systems. If you love adventuring in model internals and behaviors---to understand and improve---let's do it together! pic: a run in central park
12 replies · 129 reposts · 945 likes
Today, we present a step-change in robotic AI @sundayrobotics. Introducing ACT-1: A frontier robot foundation model trained on zero robot data. - Ultra long-horizon tasks - Zero-shot generalization - Advanced dexterity 🧵->
425 replies · 658 reposts · 5K likes
How Stanford researchers design human-focused AI systems: “AI products enter the real world very quickly, often without a rigorous understanding of their impact or the consequences of their use. We need to move forward with responsibility.” —@Diyi_Yang
https://t.co/wO0c8LbPsK
3 replies · 10 reposts · 85 likes
🔥Thrilled to introduce DR Tulu-8B, an open long-form Deep Research model that matches OpenAI DR 💪Yes, just 8B! 🚀 The secret? We present Reinforcement Learning with Evolving Rubrics (RLER) for long-form non-verifiable DR tasks! Our rubrics:
- co-evolve with the policy model
- …
7 replies · 116 reposts · 536 likes
🤖🧠 LLM agents are becoming adept at reasoning over complex codebases, yet they remain static, rarely learning from their own experience. We introduce SAGE (Self Abstraction from Grounded Experience), a framework that enables agents to reflect on past rollouts, distill …
3 replies · 17 reposts · 86 likes
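A minimal sketch of the reflect-and-distill loop, assuming the abstraction step means storing short distilled lessons rather than raw traces; `distill_lesson` below is a hypothetical stand-in for what would be an LLM call in SAGE itself:

```python
# Sketch: an agent that reflects on past rollouts, distills them into
# compact lessons, and feeds those lessons into future prompts.
def distill_lesson(rollout: dict) -> str:
    """Stand-in for an LLM call that abstracts a rollout into a reusable lesson."""
    status = "succeeded" if rollout["success"] else "failed"
    return f"When {rollout['task']}: approach '{rollout['action']}' {status}."

class ReflectiveAgent:
    def __init__(self) -> None:
        self.memory: list[str] = []   # distilled abstractions, not raw traces

    def reflect(self, rollout: dict) -> None:
        self.memory.append(distill_lesson(rollout))

    def build_prompt(self, task: str) -> str:
        lessons = "\n".join(self.memory[-5:])  # keep the prompt small
        return f"Lessons from past experience:\n{lessons}\n\nTask: {task}"

agent = ReflectiveAgent()
agent.reflect({"task": "fixing a failing test", "action": "rerun with -k filter", "success": True})
print(agent.build_prompt("fix another failing test"))
```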
It's 2023 and you go up to the 2nd Floor in Stanford's Gates CS Building. If you turn right, first on your left you would see @chelseabfinn's lab through the glass, with @tonyzzhao working on his Aloha setup. If you keep going straight, you enter the bullpen for @StanfordSVL.
5 replies · 3 reposts · 156 likes
As LLM-based systems improve and produce "novel research papers" that are actually correct and properly written (not a very high bar, probably around the corner), I wonder if we will have a new category when discussing research(ers): "work that could have been done by an AI".
2 replies · 3 reposts · 28 likes
New Anthropic research: Project Fetch. We asked two teams of Anthropic researchers to program a robot dog. Neither team had any robotics expertise—but we let only one team use Claude. How did they do?
80 replies · 200 reposts · 2K likes
An LLM-generated paper is in the top 17% of ICLR submissions in terms of average reviewer score, having received two 8's. The paper has tons of BS jargon and hallucinated references. Fortunately, one reviewer actually looked at the paper and gave it a zero. 1/3
40 replies · 145 reposts · 1K likes
OpenAI's blog (https://t.co/Mu05PFfPXg) points out that today’s language models hallucinate because training and evaluation reward guessing instead of admitting uncertainty. This raises a natural question: can we reduce hallucination without hurting utility?🤔 On-policy RL with …
25 replies · 123 reposts · 669 likes