Ludwig Schmidt Profile
Ludwig Schmidt

@lschmidt3

6K Followers · 2K Following · 3 Media · 233 Statuses

Assistant professor at @Stanford and member of the technical staff at @AnthropicAI.

Palo Alto, CA
Joined August 2009
@anas_awadalla
Anas Awadalla
8 hours
We're releasing 🍨Gelato-30B-A3B, a state-of-the-art computer grounding model that delivers immediate performance gains for computer-use agents! Trained on our open-source 🖱️Click-100k dataset, Gelato achieves 63.8% on ScreenSpot-Pro and 69.1% on OS-World-G. It outperforms…
5 replies · 19 reposts · 75 likes
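For context, a grounding model's job is to turn a screenshot plus a natural-language target into a click location that a computer-use agent can act on. The sketch below shows that interface in miniature; `StubGroundingModel`, `ground`, and the coordinate convention are illustrative stand-ins, not the actual Gelato API.

```python
from dataclasses import dataclass

@dataclass
class Click:
    x: int  # pixel column in the screenshot
    y: int  # pixel row in the screenshot

class StubGroundingModel:
    """Stand-in for a grounding model such as Gelato (illustrative only)."""
    def predict(self, image, text: str) -> tuple[int, int]:
        # A real model scores screen locations conditioned on the instruction;
        # this stub returns a fixed point so the sketch runs end to end.
        return (640, 360)

def ground(model, screenshot, instruction: str) -> Click:
    # Map a natural-language target ("the Save button") to pixel coordinates.
    x, y = model.predict(image=screenshot, text=instruction)
    return Click(x, y)

click = ground(StubGroundingModel(), screenshot=None, instruction="the Save button")
print(f"agent clicks at ({click.x}, {click.y})")
```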
@alexgshaw
Alex Shaw
3 days
Today, we’re announcing the next chapter of Terminal-Bench with two releases:
1. Harbor, a new package for running sandboxed agent rollouts at scale
2. Terminal-Bench 2.0, a harder version of Terminal-Bench with increased verification
21 replies · 64 reposts · 319 likes
@jyangballin
John Yang
5 days
New eval! Code duels for LMs ⚔️
Current evals test LMs on *tasks*: "fix this bug," "write a test"
But we code to achieve *goals*: maximize revenue, cut costs, win users
Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
26 replies · 89 reposts · 364 likes
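A rough sketch of the tournament structure the tweet describes, under two assumptions it doesn't spell out: each round an LM revises its own codebase (`edit`), and an arena scores the codebases against the high-level goal (`compete`). Both are stubbed here; this is not the CodeClash harness itself.

```python
import random

def edit(codebase: str, goal: str) -> str:
    # Stand-in for an LM revising its own codebase between rounds.
    return codebase + f"# revised toward: {goal}\n"

def compete(code_a: str, code_b: str) -> tuple[float, float]:
    # Stand-in for the arena: run both codebases and measure the goal metric
    # (revenue, costs, users). Here: random scores.
    return random.random(), random.random()

def tournament(goal: str, rounds: int = 3) -> dict[str, int]:
    codebases = {"lm_a": "# codebase A\n", "lm_b": "# codebase B\n"}
    wins = {name: 0 for name in codebases}
    for _ in range(rounds):
        # Each round every LM revises its code, then the codebases face off.
        codebases = {name: edit(code, goal) for name, code in codebases.items()}
        score_a, score_b = compete(codebases["lm_a"], codebases["lm_b"])
        winner = "lm_a" if score_a >= score_b else "lm_b"
        wins[winner] += 1
    return wins

print(tournament("maximize revenue"))  # e.g. {'lm_a': 2, 'lm_b': 1}
```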
@alexgshaw
Alex Shaw
4 months
Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments, and integrating one can take days. We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks. Now…
1 reply · 24 reposts · 104 likes
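One way to picture the "npm of agent benchmarks" idea: a single index maps benchmark names to one uniform run interface, so each new benchmark stops needing a bespoke harness. All names below (`REGISTRY`, `register`, `run`) are hypothetical sketches, not the actual Terminal-Bench registry API.

```python
from typing import Callable

Benchmark = Callable[[str], float]  # agent name -> score

REGISTRY: dict[str, Benchmark] = {}

def register(name: str):
    # Adding a benchmark to the index is one decorator, npm-style.
    def wrap(bench: Benchmark) -> Benchmark:
        REGISTRY[name] = bench
        return bench
    return wrap

@register("toy-terminal-tasks")
def toy_terminal_tasks(agent: str) -> float:
    # A real entry would launch the benchmark's sandboxed environment and
    # scoring scripts; the registry hides all of that behind one call.
    return 0.5

def run(benchmark: str, agent: str) -> float:
    return REGISTRY[benchmark](agent)  # same one-liner for every benchmark

print(run("toy-terminal-tasks", agent="my-agent"))
```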
@lschmidt3
Ludwig Schmidt
5 months
I'm a big fan of the approach to research funding @andykonwinski and the Laude team are taking! Working with them on terminal-bench has been fantastic (thanks @alexgshaw!) and I'm excited that they're going to support more open, impact-oriented research.
@andykonwinski
Andy Konwinski
5 months
Today, I’m launching a deeply personal project. I’m betting $100M that we can help computer scientists create more upside impact for humanity. Built for and by researchers, including @JeffDean & @jpineau1 on the board, @LaudeInstitute catalyzes research with real-world impact.
2 replies · 5 reposts · 91 likes
@thao_nguyen26
Thao Nguyen
5 months
Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔 We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats! https://t.co/eBX0P4Sj7a
14 replies · 65 reposts · 226 likes
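A minimal sketch of the recycling idea as the tweet describes it: documents that fail a quality filter are not discarded but rewritten by an LM grounded in their content. The classifier, threshold, and rewrite step below are stand-ins, not the paper's actual pipeline code.

```python
def quality_score(doc: str) -> float:
    # Stand-in for the learned quality classifier used in standard filtering.
    return min(len(doc) / 100.0, 1.0)

def rewrite(doc: str) -> str:
    # Stand-in for an LM rewriting the document while staying grounded in its
    # content, so the synthetic text adds signal without drifting from facts.
    return "Rewritten: " + doc.strip()

def recycle(corpus: list[str], keep_threshold: float = 0.8) -> list[str]:
    kept, recycled = [], []
    for doc in corpus:
        if quality_score(doc) >= keep_threshold:
            kept.append(doc)  # high-quality docs pass through unchanged
        else:
            recycled.append(rewrite(doc))  # low-quality docs are recycled
    return kept + recycled

print(recycle(["short doc", "x" * 200]))
```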
@lschmidt3
Ludwig Schmidt
5 months
More details on https://t.co/r7CnIpGTNl, Ryan’s thread below, and the paper itself https://t.co/6BjPBCpXbv https://t.co/lnLL9mUZak
openthoughts.ai
Pushing the boundaries of open reasoning datasets through rigorous experimentation.
@ryanmart3n
Ryan Marten
5 months
Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data…
0 replies · 3 reposts · 31 likes
@lschmidt3
Ludwig Schmidt
5 months
Together with the paper we also release our new dataset OpenThoughts3-1.2M and the corresponding model OpenThinker3-7B, which is currently the best open-data 7B reasoning model.
1 reply · 0 reposts · 25 likes
@lschmidt3
Ludwig Schmidt
5 months
Similar to previous DataComp projects, we systematically experiment with every step of the data generation pipeline to build a state-of-the-art training set. Overall we conducted more than 1,000 individual experiments.
1 reply · 0 reposts · 32 likes
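What "systematically experiment with every step of the pipeline" can look like in code: a grid over the options at each pipeline stage, with one run per combination. The stage names and options below are illustrative, not the actual OpenThoughts3 search space.

```python
from itertools import product

# Illustrative stages and options; the real search space differs.
STAGES = {
    "question_source": ["web", "existing_datasets", "llm_generated"],
    "filtering":       ["none", "difficulty", "llm_judge"],
    "teacher":         ["model_a", "model_b"],
}

def run_experiment(config: dict[str, str]) -> float:
    # Stand-in for: build the dataset under this config, finetune, evaluate.
    return hash(tuple(sorted(config.items()))) % 100 / 100.0

results = {}
for combo in product(*STAGES.values()):
    config = dict(zip(STAGES.keys(), combo))
    results[combo] = run_experiment(config)

best = max(results, key=results.get)
print(f"{len(results)} runs; best config: {best}")  # 3 * 3 * 2 = 18 runs
```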
@lschmidt3
Ludwig Schmidt
5 months
Very excited to finally release our paper for OpenThoughts! After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.
22 replies · 207 reposts · 1K likes
@lschmidt3
Ludwig Schmidt
5 months
Cool to see more work on data for AI agents!
@ajratner
Alex Ratner
6 months
Agentic AI will transform every enterprise–but only if agents are trusted experts. The key: Evaluation & tuning on specialized, expert data. I’m excited to announce two new products to support this–@SnorkelAI Evaluate & Expert Data-as-a-Service–along w/ our $100M Series D!
0 replies · 0 reposts · 13 likes
@percyliang
Percy Liang
6 months
What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision:
55 replies · 218 reposts · 1K likes
@lschmidt3
Ludwig Schmidt
6 months
Very excited about our new agent benchmark! I think it's a nice way of evaluating how well agents can do complex tasks in terminal (command-line) environments.
@Mike_A_Merrill
Mike A. Merrill
6 months
Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse? We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr…
2 replies · 5 reposts · 79 likes
@jyangballin
John Yang
6 months
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
24 replies · 142 reposts · 657 likes
@thao_nguyen26
Thao Nguyen
6 months
📢 Announcing our data-centric workshop at ICML 2025 on unifying data curation frameworks across domains! 📅 Deadline: May 24, AoE 🔗 Website: https://t.co/K3U540rqoe We have an amazing lineup of speakers + panelists from various institutions and application areas.
2 replies · 26 reposts · 135 likes
@StanfordAILab
Stanford AI Lab
7 months
SAIL is still accepting applications for the SAIL Postdoctoral Fellowships! This is an opportunity to work with our wonderful professors and community. Applications submitted by the end of April 30 will receive full consideration:
1 reply · 21 reposts · 13 likes
@stanfordnlp
Stanford NLP Group
7 months
Want to learn the engineering details of building state-of-the-art Large Language Models (LLMs)? Not finding much info in @OpenAI’s non-technical reports? @percyliang and @tatsu_hashimoto are here to help with CS336: Language Modeling from Scratch, now rolling out to YouTube.
10 replies · 163 reposts · 1K likes