CLS

@ChengleiSi

Followers
5K
Following
24K
Media
41
Statuses
3K

PhDing @stanfordnlp | teaching language models to do research

Palo Alto, California
Joined August 2018
@ChengleiSi
CLS
5 months
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
12
196
634
@deepseek_ai
DeepSeek
12 hours
🏆 World-Leading Reasoning 🔹 V3.2: Balanced inference vs. length. Your daily driver at GPT-5 level performance. 🔹 V3.2-Speciale: Maxed-out reasoning capabilities. Rivals Gemini-3.0-Pro. 🥇 Gold-Medal Performance: V3.2-Speciale attains gold-level results in IMO, CMO, ICPC World
57
244
3K
@thomasfbloom
Thomas Bloom
1 day
As the owner/maintainer of the Erdős problems website, a thread with some comments on this solution to #124: 1) This is a nice proof, which was provided by the AI from the formal statement with no human involvement and then formalised in Lean. This is already impressive!
@vladtenev
Vlad Tenev
2 days
We are on the cusp of a profound change in the field of mathematics. Vibe proving is here. Aristotle from @HarmonicMath just proved Erdős Problem #124 in @leanprover, all by itself. This problem has been open for nearly 30 years since conjectured in the paper “Complete sequences
22
112
1K
@zhs05232838
Zhihong Shao
4 days
We just shared some thoughts and results on self-verifiable mathematical reasoning. The released model, DeepSeekMath-V2, is strong on IMO-ProofBench and competitions like IMO 2025 (5/6 problems) and Putnam 2024 (a near-perfect score of 118/120). Github: https://t.co/4dMEqWxXfU
28
78
673
@AkariAsai
Akari Asai
6 days
1/ Hiring PhD students at CMU SCS (LTI/MLD) for Fall 2026 (Deadline 12/10) 🎓 I work on open, reliable LMs: augmented LMs & agents (RAG, tool use, deep research), safety (hallucinations, copyright), and AI for science, code & multilinguality, and I'm open to bold new ideas! FAQ in 🧵
16
119
575
@RedChip
RedChip
3 days
🎥 Enlivex Therapeutics $ENLV: CEO Interview Highlights Breakthrough Osteoarthritis Data In a recent interview, CEO Oren Hershkovitz shared compelling Phase I/II results from Allocetra™ in age-related osteoarthritis. With a $7B+ global osteoarthritis market, no FDA-approved
0
2
6
@pratyusha_PS
Pratyusha Sharma
10 days
📢 Some big (& slightly belated) life updates! 1. I defended my PhD at MIT this summer! 🎓 2. I'm joining NYU as an Assistant Professor starting Fall 2026, with a joint appointment in Courant CS and the Center for Data Science. 🎉 🔬 My lab will focus on empirically studying
100
89
2K
@MinyangTian1
Minyang Tian ✈️ NeurIPS
10 days
Can LLMs help physicists break new ground in real frontier research? We introduce CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "Critical Point"): the first benchmark of unpublished, realistic research-level reasoning challenges broadly spanning
13
22
156
@EpochAIResearch
Epoch AI
11 days
Benchmarking data is dominated by a single “General Capability” dimension. Is this due to good generalization across tasks, or to developers pushing on all benchmarks at once? 🧵 with some analysis, including the discovery of a “Claudiness” dimension.
7
27
276
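The "single dimension" finding above is the kind of structure a PCA over a model-by-benchmark score matrix surfaces: if one latent factor drives most benchmarks, the first principal component dominates. A minimal illustrative sketch with synthetic data (not Epoch AI's actual analysis or dataset):

```python
import numpy as np

# Synthetic score matrix: rows = models, cols = benchmarks.
# Scores are driven by one latent "general capability" factor plus noise.
rng = np.random.default_rng(0)
ability = rng.normal(size=(20, 1))            # latent factor per model
noise = 0.1 * rng.normal(size=(20, 8))
scores = ability @ np.ones((1, 8)) + noise    # 20 models x 8 benchmarks

# Center the matrix and inspect how much variance each principal
# component explains via the singular values.
centered = scores - scores.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)
print(f"PC1 explains {explained[0]:.0%} of the variance")
```

When PC1 explains nearly everything, per-benchmark rankings collapse into one leaderboard; residual components (like the "Claudiness" dimension the thread mentions) live in what's left over.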
@finbarrtimbers
finbarr
11 days
The OlmoRL infrastructure was 4x faster than Olmo 2 and made it much cheaper to run experiments. Some of the changes: 1. continuous batching 2. in-flight updates 3. active sampling 4. many many improvements to our multi-threading code
4
15
178
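The first item in that list, continuous batching, is a standard LLM-serving technique: instead of waiting for every sequence in a batch to finish, a finished sequence's slot is refilled immediately from the queue, so generation slots never sit idle behind the longest sequence. A toy sketch of the scheduling loop (not OlmoRL's actual implementation; `step_fn` stands in for one decode step):

```python
from collections import deque

def continuous_batching(requests, batch_size, step_fn):
    """Toy continuous-batching scheduler: refill free slots from the
    pending queue every decode step instead of once per batch."""
    pending = deque(requests)
    active = []      # sequences currently being decoded
    finished = []
    while pending or active:
        # Swap waiting requests into free slots immediately.
        while pending and len(active) < batch_size:
            active.append(pending.popleft())
        still_running = []
        for seq in active:
            seq = step_fn(seq)  # decode one token for this sequence
            (finished if seq["done"] else still_running).append(seq)
        active = still_running
    return finished

# Demo: sequences of different lengths share only 2 slots.
def toy_step(seq):
    seq["left"] -= 1
    seq["done"] = seq["left"] == 0
    return seq

requests = [{"id": i, "left": i % 3 + 1, "done": False} for i in range(6)]
done = continuous_batching(requests, batch_size=2, step_fn=toy_step)
```

In an RL setting this matters doubly, since rollout lengths vary widely and idle slots directly waste sampler GPU time.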
@soldni
Luca Soldaini 🌯 NeurIPS 2025
11 days
We are releasing a LARGE new collection of science PDFs we linearized with olmOCR! Great for our first long context model. It was fun to use synth data to boost long context, all using Olmo 2! Older bro helping younger sibling 🥹
2
4
42
@BrianHie
Brian Hie
12 days
Today in @Nature, in work led by @aditimerch, we report the ability to prompt Evo to generate functional de novo genes. You shall know a gene by the company it keeps! 1/n
7
103
543
@johnhewtt
John Hewitt
12 days
Come do a PhD with me at Columbia! My lab tackles basic problems in alignment, interpretability, safety, and capabilities of language systems. If you love adventuring in model internals and behaviors---to understand and improve---let's do it together! pic: a run in central park
12
129
945
@tonyzzhao
Tony Zhao
12 days
Today, we present a step-change in robotic AI @sundayrobotics. Introducing ACT-1: A frontier robot foundation model trained on zero robot data. - Ultra long-horizon tasks - Zero-shot generalization - Advanced dexterity 🧵->
425
658
5K
@stanfordnlp
Stanford NLP Group
12 days
How Stanford researchers design human-focused AI systems: “AI products enter the real world very quickly, often without a rigorous understanding of their impact or the consequences of their use. We need to move forward with responsibility.” —@Diyi_Yang https://t.co/wO0c8LbPsK
3
10
85
@RulinShao
Rulin Shao ✈️ NeurIPS
13 days
🔥Thrilled to introduce DR Tulu-8B, an open long-form Deep Research model that matches OpenAI DR 💪Yes, just 8B! 🚀 The secret? We present Reinforcement Learning with Evolving Rubrics (RLER) for long-form non-verifiable DR tasks! Our rubrics: - co-evolve with the policy model -
7
116
536
@CaimingXiong
Caiming Xiong
14 days
🤖🧠 LLM agents are becoming adept at reasoning over complex codebases, yet they remain static, rarely learning from their own experience. We introduce SAGE (Self Abstraction from Grounded Experience), a framework that enables agents to reflect on past rollouts, distill
3
17
86
@tanmayxagarwal
Tanmay Agarwal
16 days
It's 2023 and you go up to the 2nd Floor in Stanford's Gates CS Building. If you turn right, first on your left you would see @chelseabfinn's lab through the glass, with @tonyzzhao working on his Aloha setup. If you keep going straight, you enter the bullpen for @StanfordSVL.
5
3
156
@yoavgo
(((ل()(ل() 'yoav))))👾
17 days
as LLM-based systems improve and produce "novel research papers" that are actually correct and properly written (not a very high bar, probably around the corner) I wonder if we will have a new category when discussing research(ers): "work that could have been done by an AI".
2
3
28
@AnthropicAI
Anthropic
19 days
New Anthropic research: Project Fetch. We asked two teams of Anthropic researchers to program a robot dog. Neither team had any robotics expertise—but we let only one team use Claude. How did they do?
80
200
2K
@micahgoldblum
Micah Goldblum
18 days
An LLM-generated paper is in the top 17% of ICLR submissions in terms of average reviewer score, having received two 8's. The paper has tons of BS jargon and hallucinated references. Fortunately, one reviewer actually looked at the paper and gave it a zero. 1/3
40
145
1K
@tomchen0
Tong Chen @ NeurIPS
18 days
OpenAI's blog ( https://t.co/Mu05PFfPXg) points out that today’s language models hallucinate because training and evaluation reward guessing instead of admitting uncertainty. This raises a natural question: can we reduce hallucination without hurting utility?🤔 On-policy RL with
25
123
669