Alex Wettig @_awettig X Profile

Alex Wettig

@_awettig

Followers

2K

Following

2K

Media

22

Statuses

187

PhD @Princeton trying to make sense of language models and their training data; trying to train agents @cursor_ai

https://t.co/EQxNzaz67l

Joined July 2022

Don't wanna be here? Send us removal request.

Charlie Snell

@sea_snell

11 days

yolo run summer is over scaling laws fall has arrived

Preet Sojitra

@Preet_Sojitra03

7 months

@sea_snell Just one more

1

64

Jessy Lin

@realJessyLin

18 days

🔍 How do we teach an LLM to 𝘮𝘢𝘴𝘵𝘦𝘳 a body of knowledge? In new work with @AIatMeta, we propose Active Reading 📙: a way for models to teach themselves new things by self-studying their training data. Results: * 𝟔𝟔% on SimpleQA w/ an 8B model by studying the wikipedia

15

159

1K

Stuart Sul

@stuart_sul

26 days

MoE layers can be really slow. When training our coding models @cursor_ai, they ate up 27–53% of training time. So we completely rebuilt it at the kernel level and transitioned to MXFP8. The result: 3.5x faster MoE layer and 1.5x end-to-end training speedup. We believe our

30

103

866

Alex Wettig

@_awettig

2 months

Presenting two posters at ICML over the next two days: - Both at 11am - 1:30pm - Both about how to improve pre-training with domains - Both at stall # E-2600 in East Exhibition Hall A-B (!) Tomorrow: WebOrganizer w/ @soldni & @kylelostat Thursday: MeCo by @gaotianyu1350

1

9

49

Albert Gu

@_albertgu

2 months

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.

Sukjun (June) Hwang

@sukjun_hwang

2 months

Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data

61

192

1K

Anthropic

@AnthropicAI

3 months

Anthropic staff realized they could ask Claude to buy things that weren’t just food & drink. After someone randomly decided to ask it to order a tungsten cube, Claude ended up with an inventory full of (as it put it) “specialty metal items” that it ended up selling at a loss.

65

211

4K

Alex Wettig

@_awettig

3 months

New paper cutting through the thicket of KV cache eviction methods!

Adithya Bhaskar

@AdithyaNLP

3 months

There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called “critical KV footprint”. We compare existing methods and propose a new one - PruLong, which “prunes” certain attn heads to only look at local tokens. 1/7

0

1

17

Alex Zhang

@a1zhang

4 months

Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? 𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇

23

77

544

Aman Sanger

@amanrsanger

4 months

Claude Sonnet 4 is much better at codebase understanding. Paired with recent improvements in Cursor, it's SOTA on large codebases

32

44

856

Kilian Lieret

@KLieret

4 months

Massive gains with Sonnet 4 on SWE-agent: Single-attempt pass@1 rises to 69% on SWE-bench Verified! Sonnet 4 iterates longer (making it slightly more expensive) but almost never gets stuck. Localization ability appears unchanged, but quality of edits improves.

4

14

84

Ofir Press

@OfirPress

4 months

Great results from the Claude team- the 80% result is pass@1!! They ran the model in parallel multiple times and had an LM judge pick the best patch to submit.

5

7

120

Alex Wettig

@_awettig

4 months

Big arrow time! We can make huge progress on open-source SWE agents by scaling up the creation of virtual coding environments 🚀

John Yang

@jyangballin

4 months

40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.

0

4

17

Cursor

@cursor_ai

4 months

Cursor is now free for students. Enjoy!

2K

4K

42K

Xindi Wu

@cindy_x_wu

4 months

Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 https://t.co/j3WphP7QfY 1/10

6

47

160

John Yang

@jyangballin

4 months

@ weekend warriors - DM me a GitHub repo that you like / maintain, and I'll train you a 7B coding agent that's an expert for that repo. Main constraints - it's predominantly Python, and has a testing suite w/ good coverage. (example of good repo = sympy, pandas, sqlfluff)

19

8

116

Jacob Springer

@jacspringer

6 months

Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇 1/9

17

184

820

Alisa Liu

@alisawuffles

6 months

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵

94

326

3K

Logan Engstrom

@logan_engstrom

6 months

Want state-of-the-art data curation, data poisoning & more? Just do gradient descent! w/ @andrew_ilyas Ben Chen @axel_s_feldmann @wsmoses @aleks_madry: we show how to optimize final model loss wrt any continuous variable. Key idea: Metagradients (grads through model training)

9

32

176

Zhiyuan Zeng

@ZhiyuanZeng_

6 months

Is a single accuracy number all we can get from model evals?🤔 🚨Does NOT tell where the model fails 🚨Does NOT tell how to improve it Introducing EvalTree🌳 🔍identifying LM weaknesses in natural language 🚀weaknesses serve as actionable guidance (paper&demo 🔗in🧵) [1/n]

5

92

265

Jeremy Bernstein

@jxbz

6 months

I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning (1/11)

12

133

964