Alex Wettig

@_awettig

2K Followers · 2K Following · 22 Media · 187 Statuses

PhD @Princeton trying to make sense of language models and their training data; trying to train agents @cursor_ai

Joined July 2022
@sea_snell
Charlie Snell
11 days
yolo run summer is over scaling laws fall has arrived
@Preet_Sojitra03
Preet Sojitra
7 months
@sea_snell Just one more
1 reply · 1 repost · 64 likes
@realJessyLin
Jessy Lin
18 days
🔍 How do we teach an LLM to 𝘮𝘢𝘴𝘵𝘦𝘳 a body of knowledge? In new work with @AIatMeta, we propose Active Reading 📙: a way for models to teach themselves new things by self-studying their training data. Results:
* 𝟔𝟔% on SimpleQA w/ an 8B model by studying the wikipedia
15 replies · 159 reposts · 1K likes
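
A minimal sketch of the self-study loop described above, as I read the tweet: prompt a model to generate its own study material (quiz questions, paraphrases, flashcards) from source passages, then fine-tune on the generations. The prompts and the `active_reading_corpus` / `llm` names are illustrative assumptions, not from the paper.

```python
# Hypothetical study prompts; the actual Active Reading prompts are not shown in the tweet.
STUDY_PROMPTS = [
    "Write three quiz questions (with answers) that test the key facts in this passage:\n{passage}",
    "Paraphrase this passage in your own words, keeping every concrete fact:\n{passage}",
    "List the entities, dates, and numbers in this passage as flashcards:\n{passage}",
]

def active_reading_corpus(passages, llm):
    """Turn raw passages into self-study training examples.

    `llm` is any callable str -> str (a wrapper around whatever chat API you use).
    Returns (prompt, completion) pairs that could feed a fine-tuning job.
    """
    examples = []
    for passage in passages:
        for template in STUDY_PROMPTS:
            prompt = template.format(passage=passage)
            completion = llm(prompt)
            examples.append({"prompt": prompt, "completion": completion})
    return examples

if __name__ == "__main__":
    dummy_llm = lambda p: "(model-generated study notes would go here)"
    data = active_reading_corpus(["Marie Curie won Nobel Prizes in 1903 and 1911."], dummy_llm)
    print(len(data), "synthetic training examples")
```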
@stuart_sul
Stuart Sul
26 days
MoE layers can be really slow. When training our coding models @cursor_ai, they ate up 27–53% of training time. So we completely rebuilt them at the kernel level and transitioned to MXFP8. The result: a 3.5x faster MoE layer and a 1.5x end-to-end training speedup. We believe our
30 replies · 103 reposts · 866 likes
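
For context on what "transitioning to MXFP8" means numerically, here is a rough, framework-level emulation of MX-style block quantization: 32-element blocks, a shared power-of-two scale per block, float8 e4m3 elements. It assumes PyTorch's `torch.float8_e4m3fn` dtype and only sketches the numerics; the speedups in the tweet come from custom kernels, which this does not reproduce.

```python
import torch

def mxfp8_quantize(x: torch.Tensor, block: int = 32):
    """Quantize the last dim of `x` in blocks of 32 with a shared power-of-two scale
    per block, storing elements as float8 e4m3. Returns (quantized, scales)."""
    orig_shape = x.shape
    x = x.reshape(-1, block)
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    # 448 is the largest normal e4m3 value; pick a power-of-two scale so the block fits
    exp = torch.floor(torch.log2(448.0 / amax))
    scales = torch.pow(2.0, exp)
    q = (x * scales).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
    return q.reshape(orig_shape), scales.reshape(*orig_shape[:-1], -1)

def mxfp8_dequantize(q, scales, block: int = 32):
    x = q.to(torch.float32).reshape(-1, block)
    return (x / scales.reshape(-1, 1)).reshape(q.shape)

if __name__ == "__main__":
    w = torch.randn(4, 128)                      # e.g. one tile of an expert's weights
    q, s = mxfp8_quantize(w)
    err = (w - mxfp8_dequantize(q, s)).abs().max()
    print("max abs quantization error:", float(err))
```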
@_awettig
Alex Wettig
2 months
Presenting two posters at ICML over the next two days:
- Both at 11am - 1:30pm
- Both about how to improve pre-training with domains
- Both at stall #E-2600 in East Exhibition Hall A-B (!)
Tomorrow: WebOrganizer w/ @soldni & @kylelostat
Thursday: MeCo by @gaotianyu1350
1 reply · 9 reposts · 49 likes
@_albertgu
Albert Gu
2 months
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
@sukjun_hwang
Sukjun (June) Hwang
2 months
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
61 replies · 192 reposts · 1K likes
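
A toy sketch of the dynamic-chunking idea: a learned scorer marks chunk boundaries over raw bytes, and the sequence is pooled into variable-length chunks instead of fixed tokenizer tokens. The module below is a cartoon for intuition, assuming PyTorch; it is not the H-Net architecture.

```python
import torch
import torch.nn as nn

class ToyDynamicChunker(nn.Module):
    """Score each byte position as a potential chunk boundary, then pool bytes into
    variable-length chunks. A cartoon of hierarchical dynamic chunking, not H-Net."""

    def __init__(self, d_model: int = 64, vocab: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.boundary = nn.Linear(d_model, 1)   # boundary logit per position

    def forward(self, byte_ids: torch.Tensor, threshold: float = 0.5):
        h = self.embed(byte_ids)                        # (seq, d_model)
        p = torch.sigmoid(self.boundary(h)).squeeze(-1) # boundary probability per byte
        is_boundary = p > threshold
        chunks, start = [], 0
        for i, b in enumerate(is_boundary.tolist()):
            if b:                                       # close the current chunk here
                chunks.append(h[start : i + 1].mean(dim=0))
                start = i + 1
        if start < h.shape[0]:
            chunks.append(h[start:].mean(dim=0))
        return torch.stack(chunks)                      # (num_chunks, d_model)

if __name__ == "__main__":
    text = "tokenization is just a special case of chunking".encode()
    ids = torch.tensor(list(text))
    model = ToyDynamicChunker()
    print("bytes in:", len(ids), "-> chunks out:", model(ids).shape[0])
```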
@AnthropicAI
Anthropic
3 months
Anthropic staff realized they could ask Claude to buy things that weren’t just food & drink. After someone randomly decided to ask it to order a tungsten cube, Claude ended up with an inventory full of (as it put it) “specialty metal items” that it ended up selling at a loss.
65 replies · 211 reposts · 4K likes
@_awettig
Alex Wettig
3 months
New paper cutting through the thicket of KV cache eviction methods!
@AdithyaNLP
Adithya Bhaskar
3 months
There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called “critical KV footprint”. We compare existing methods and propose a new one - PruLong, which “prunes” certain attn heads to only look at local tokens. 1/7
0 replies · 1 repost · 17 likes
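
A sketch of the mechanism the thread describes: restrict a chosen subset of attention heads to a local window so their KV cache can be truncated, while the remaining heads stay global. The function below is an illustrative reconstruction in PyTorch, not the paper's code; `local_heads` and `window` are assumed parameters.

```python
import torch

def mixed_local_global_attention(q, k, v, local_heads, window: int = 128):
    """Causal attention where heads in `local_heads` only attend to the last `window`
    tokens (so their KV entries beyond the window could be dropped), while the other
    heads keep full context. q, k, v: (heads, seq, dim)."""
    H, S, D = q.shape
    scores = q @ k.transpose(-1, -2) / D ** 0.5           # (H, S, S)
    causal = torch.ones(S, S, dtype=torch.bool).tril()
    mask = causal.expand(H, S, S).clone()
    pos = torch.arange(S)
    local = (pos[:, None] - pos[None, :]) < window        # positions within the window
    for h in local_heads:
        mask[h] &= local                                   # restrict the pruned heads
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    H, S, D = 8, 512, 64
    q, k, v = (torch.randn(H, S, D) for _ in range(3))
    out = mixed_local_global_attention(q, k, v, local_heads=[0, 1, 2, 3])
    print(out.shape)  # (8, 512, 64); heads 0-3 never look beyond 128 tokens back
```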
@a1zhang
Alex Zhang
4 months
Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? 𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇
23 replies · 77 reposts · 544 likes
@amanrsanger
Aman Sanger
4 months
Claude Sonnet 4 is much better at codebase understanding. Paired with recent improvements in Cursor, it's SOTA on large codebases
32 replies · 44 reposts · 856 likes
@KLieret
Kilian Lieret
4 months
Massive gains with Sonnet 4 on SWE-agent: Single-attempt pass@1 rises to 69% on SWE-bench Verified! Sonnet 4 iterates longer (making it slightly more expensive) but almost never gets stuck. Localization ability appears unchanged, but quality of edits improves.
4 replies · 14 reposts · 84 likes
@OfirPress
Ofir Press
4 months
Great results from the Claude team: the 80% result is pass@1!! They ran the model in parallel multiple times and had an LM judge pick the best patch to submit.
5 replies · 7 reposts · 120 likes
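
A sketch of that recipe: sample several candidate patches in parallel and let an LM judge pick the single patch to submit, so the reported score is still one submission per task. `generate_patch` and `judge_score` below are hypothetical stand-ins for model calls, not any team's actual harness.

```python
from concurrent.futures import ThreadPoolExecutor

def best_of_n_patch(task, generate_patch, judge_score, n: int = 8):
    """Sample n candidate patches in parallel, score each with a judge, submit the best."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        patches = list(pool.map(lambda _: generate_patch(task), range(n)))
    scored = [(judge_score(task, p), p) for p in patches]
    return max(scored, key=lambda sp: sp[0])[1]   # one submitted patch per task

if __name__ == "__main__":
    pick = best_of_n_patch(
        task="fix the off-by-one bug in pagination",
        generate_patch=lambda t: f"candidate patch for: {t}",
        judge_score=lambda t, p: len(p) % 7,      # dummy scoring for the demo
        n=4,
    )
    print(pick)
```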
@_awettig
Alex Wettig
4 months
Big arrow time! We can make huge progress on open-source SWE agents by scaling up the creation of virtual coding environments 🚀
@jyangballin
John Yang
4 months
40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synthesizing a ton of agentic training data from 100+ Python repos. Today we’re open-sourcing the toolkit that made it happen: SWE-smith.
0 replies · 4 reposts · 17 likes
@cursor_ai
Cursor
4 months
Cursor is now free for students. Enjoy!
2K replies · 4K reposts · 42K likes
@cindy_x_wu
Xindi Wu
4 months
Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦 https://t.co/j3WphP7QfY 1/10
6 replies · 47 reposts · 160 likes
@jyangballin
John Yang
4 months
@ weekend warriors - DM me a GitHub repo that you like / maintain, and I'll train you a 7B coding agent that's an expert for that repo. Main constraints - it's predominantly Python, and has a testing suite w/ good coverage. (example of good repo = sympy, pandas, sqlfluff)
19 replies · 8 reposts · 116 likes
@jacspringer
Jacob Springer
6 months
Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇 1/9
17 replies · 184 reposts · 820 likes
@alisawuffles
Alisa Liu
6 months
We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
94 replies · 326 reposts · 3K likes
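
A toy illustration of the superword idea: run ordinary BPE merges within words first, then lift the whitespace restriction so later merges can span word boundaries. The two-phase schedule and the `cross_word_after` knob below are simplifications for the demo, not the paper's training recipe.

```python
from collections import Counter

def learn_superword_merges(text: str, num_merges: int = 50, cross_word_after: int = 25):
    """Toy BPE: phase 1 merges only within words; phase 2 also allows merges across
    spaces, producing 'superword' tokens spanning multiple words."""
    seq = list(text)                      # start from characters, keep spaces as symbols
    merges = []
    for step in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if step < cross_word_after:
            within = Counter({p: c for p, c in pairs.items()
                              if " " not in p[0] and " " not in p[1]})
            pairs = within or pairs       # fall back once within-word merges run out
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        out, i = [], 0                    # apply the merge greedily, left to right
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

if __name__ == "__main__":
    corpus = "by the way the end of the day by the way of the world " * 20
    merges, seq = learn_superword_merges(corpus)
    print([m for m in merges if " " in m][:5])   # superword tokens that span spaces
```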
@logan_engstrom
Logan Engstrom
6 months
Want state-of-the-art data curation, data poisoning & more? Just do gradient descent! w/ @andrew_ilyas Ben Chen @axel_s_feldmann @wsmoses @aleks_madry: we show how to optimize final model loss wrt any continuous variable. Key idea: Metagradients (grads through model training)
9 replies · 32 reposts · 176 likes
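
A minimal sketch of a metagradient: train a tiny model for a few differentiable inner steps, then backpropagate the final validation loss through the whole training run to get gradients with respect to per-example data weights. The linear model, loss, and weighting scheme below are illustrative choices, not the paper's setup.

```python
import torch

def metagradient_wrt_data_weights(x, y, x_val, y_val, inner_steps: int = 5, lr: float = 0.1):
    """Differentiate the final validation loss of a small trained model w.r.t.
    per-example data weights by backpropagating through the inner training loop."""
    n, d = x.shape
    data_w = torch.zeros(n, requires_grad=True)           # the "continuous variable"
    w = torch.zeros(d, 1, requires_grad=True)              # linear model, trained inline
    for _ in range(inner_steps):
        losses = ((x @ w - y) ** 2).squeeze(-1)
        train_loss = (torch.softmax(data_w, 0) * losses).sum()
        (g,) = torch.autograd.grad(train_loss, w, create_graph=True)
        w = w - lr * g                                      # keep the graph through updates
    val_loss = ((x_val @ w - y_val) ** 2).mean()
    (metagrad,) = torch.autograd.grad(val_loss, data_w)     # d(final loss)/d(data weights)
    return metagrad

if __name__ == "__main__":
    torch.manual_seed(0)
    x, x_val = torch.randn(32, 4), torch.randn(16, 4)
    true_w = torch.randn(4, 1)
    y, y_val = x @ true_w, x_val @ true_w
    y[:8] += 5.0                                            # corrupt a few training labels
    mg = metagradient_wrt_data_weights(x, y, x_val, y_val)
    print("mean metagradient, corrupted vs clean:", mg[:8].mean().item(), mg[8:].mean().item())
```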
@ZhiyuanZeng_
Zhiyuan Zeng
6 months
Is a single accuracy number all we can get from model evals? 🤔
🚨 Does NOT tell where the model fails
🚨 Does NOT tell how to improve it
Introducing EvalTree 🌳
🔍 identifying LM weaknesses in natural language
🚀 weaknesses serve as actionable guidance
(paper & demo 🔗 in 🧵) [1/n]
5 replies · 92 reposts · 265 likes
@jxbz
Jeremy Bernstein
6 months
I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning (1/11)
12 replies · 133 reposts · 964 likes
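
A condensed sketch of the update the post derives: accumulate momentum, then orthogonalize the momentum matrix with a Newton-Schulz iteration before applying it. The iteration coefficients follow the publicly released Muon reference implementation; everything else is simplified and not a drop-in optimizer.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5):
    """Approximately map a 2D matrix to the nearest semi-orthogonal matrix (singular
    values pushed toward 1) via the quintic Newton-Schulz iteration used in public
    Muon implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)                 # Frobenius norm bound keeps spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum, lr: float = 0.02, beta: float = 0.95):
    """One simplified Muon update for a 2D weight: momentum, then orthogonalized step."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    weight.add_(update, alpha=-lr)
    return weight, momentum

if __name__ == "__main__":
    W = torch.randn(256, 128)
    G = torch.randn_like(W)
    M = torch.zeros_like(W)
    W, M = muon_step(W, G, M)
    S = torch.linalg.svdvals(newton_schulz_orthogonalize(G))
    print("singular values after orthogonalization:", S.min().item(), S.max().item())
```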