Alex Wettig
@_awettig
Followers
2K
Following
2K
Media
23
Statuses
200
phd @Princeton / training agents @cursor_ai
Joined July 2022
Who uses AI agents? How do agents impact output? How might agents change work patterns? New working paper studies usage + impacts of coding agents (1/n)
5
42
189
I'm really excited about our new paper!! 'Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs' Contrary to the belief that RL fine-tuning degrades memorized knowledge, RL-enhanced models consistently outperform base/SFT models on knowledge recall by 24pp! RL teaches
13
48
400
How CodeClash works: Two LMs enter a tournament. Each maintains its own codebase. Every round:
1. Edit phase: LMs modify their codebases however they like.
2. Competition phase: Codebases battle in an arena.
3. Repeat.
The LM that wins the majority of rounds is declared the winner.
1
1
36
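A minimal sketch of the tournament loop described in the CodeClash tweet above. All names here (edit_codebase, run_arena, NUM_ROUNDS) are hypothetical stand-ins, not the actual CodeClash API, and the arena is stubbed out with a random winner.

```python
# Hypothetical sketch of a CodeClash-style tournament loop, based only on the tweet.
import random
from collections import Counter

NUM_ROUNDS = 10  # hypothetical number of rounds per tournament

def edit_codebase(lm, codebase):
    """Edit phase: the LM modifies its own codebase however it likes (stubbed)."""
    return codebase  # a real agent loop would apply the LM's proposed edits here

def run_arena(codebase_a, codebase_b):
    """Competition phase: the codebases battle in an arena (stubbed)."""
    return random.choice(["A", "B"])  # a real arena would execute both codebases

def tournament(lm_a, lm_b, codebase_a, codebase_b):
    wins = Counter()
    for _ in range(NUM_ROUNDS):
        codebase_a = edit_codebase(lm_a, codebase_a)   # 1. edit phase
        codebase_b = edit_codebase(lm_b, codebase_b)
        wins[run_arena(codebase_a, codebase_b)] += 1   # 2. competition phase
    return "A" if wins["A"] > wins["B"] else "B"       # majority of rounds wins
```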
Made a joke app for when people ask questions that @cursor_ai can answer:
1
4
22
We did a thing! https://t.co/izBxFH4DjO
2
4
118
Composer is a new model we built at Cursor. We used RL to train a big MoE model to be really good at real-world coding, and also very fast. https://t.co/DX9bbalx0B Excited for the potential of building specialized models to help in critical domains.
56
76
794
composer is back, and it's our first coding model trained in house. try it out in cursor 2.0 with best-of-n, worktrees and browser. so excited to get this out, team has been working incredibly hard to make it happen. as always, curious to hear what you think!
119
49
1K
>be me
>be Claude
>have read the internet but one day human asks me to draw
>no training, no practice, just converting mental image to mouse movements like a toddler holding a crayon
>pencil tool not working? np, I'll draw with the eraser
>task failed successfully
5
7
236
Cursor can now control your browser. Agent can take screenshots, improve UI, and debug client issues. Try our early preview with Sonnet 4.5.
246
521
6K
We evaluated Anthropic's Sonnet 4.5 with our minimal agent. New record on SWE-bench Verified: 70.6%! Same price per token as Sonnet 4, but it takes more steps, so runs end up more expensive. Cost analysis details & link to full trajectories in 🧵
4
14
85
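A quick back-of-the-envelope illustration of the cost point in the tweet above: identical per-token pricing, but more agent steps means more tokens and a bigger bill. All numbers below are made up purely for illustration, not the paper's measurements.

```python
# Hypothetical numbers to illustrate "same price/token, more steps => more expensive".
price_per_mtok = 3.00      # assumed identical price for both models ($ per million tokens)
tokens_per_step = 4_000    # assumed average tokens consumed per agent step

steps_sonnet_4 = 30        # made-up average step counts per SWE-bench instance
steps_sonnet_4_5 = 45

def run_cost(steps):
    """Total cost of one agent run at a flat per-token price."""
    return steps * tokens_per_step * price_per_mtok / 1_000_000

print(f"Sonnet 4:   ${run_cost(steps_sonnet_4):.2f} per run")
print(f"Sonnet 4.5: ${run_cost(steps_sonnet_4_5):.2f} per run")  # higher despite equal token price
```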
yolo run summer is over scaling laws fall has arrived
1
1
63
MoE layers can be really slow. When training our coding models @cursor_ai, they ate up 27–53% of training time. So we completely rebuilt the MoE layer at the kernel level and transitioned to MXFP8. The result: a 3.5x faster MoE layer and a 1.5x end-to-end training speedup. We believe our
29
105
882
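For context on where that time goes, here is a plain PyTorch sketch of a top-k routed MoE layer: each token is dispatched to a few experts, and the per-expert gather/compute/scatter is the part that dominates training time. This is only a conceptual illustration; the actual speedup in the tweet comes from custom kernels and MXFP8, neither of which is shown here.

```python
# Plain top-k MoE forward pass (conceptual sketch, not Cursor's kernel-level implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x):                                  # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)
        topk_p, topk_idx = probs.topk(self.k, dim=-1)      # route each token to k experts
        out = torch.zeros_like(x)
        for e in range(self.w_in.shape[0]):                # per-expert gather/compute/scatter
            mask = (topk_idx == e)
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            h = F.silu(x[token_ids] @ self.w_in[e]) @ self.w_out[e]
            out[token_ids] += topk_p[token_ids, slot, None] * h
        return out
```

This loop over experts with masked gathers is exactly the kind of work a fused, lower-precision kernel can collapse into far fewer, faster launches.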
Presenting two posters at ICML over the next two days:
- Both at 11am-1:30pm
- Both about how to improve pre-training with domains
- Both at stall #E-2600 in East Exhibition Hall A-B (!)
Tomorrow: WebOrganizer w/ @soldni & @kylelostat
Thursday: MeCo by @gaotianyu1350
1
11
51
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
61
197
1K
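A toy illustration of the "dynamic chunking" idea from the two tweets above: a learned boundary scorer decides where low-level units (e.g. bytes) end a chunk, and each variable-length span is pooled into one higher-level vector. This is a conceptual sketch under those assumptions only, not the actual H-Net architecture.

```python
# Toy dynamic chunker: score boundaries, then pool each span into one vector.
import torch
import torch.nn as nn

class DynamicChunker(nn.Module):
    def __init__(self, d_model=256, threshold=0.5):
        super().__init__()
        self.boundary_scorer = nn.Linear(d_model, 1)  # learned "end of chunk" score
        self.threshold = threshold

    def forward(self, h):                              # h: [seq_len, d_model] byte-level states
        boundary = torch.sigmoid(self.boundary_scorer(h)).squeeze(-1) > self.threshold
        boundary[-1] = True                            # always close the final chunk
        chunks, start = [], 0
        for i, is_end in enumerate(boundary.tolist()):
            if is_end:
                chunks.append(h[start:i + 1].mean(dim=0))  # pool the span into one vector
                start = i + 1
        return torch.stack(chunks)                     # [num_chunks, d_model] for the higher level
```

The point of the sketch: the chunk boundaries are predicted inside the model rather than fixed by an external tokenizer.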
Anthropic staff realized they could ask Claude to buy things that weren't just food & drink. After someone randomly decided to ask it to order a tungsten cube, Claude ended up with an inventory full of (as it put it) "specialty metal items" that it then sold at a loss.
64
210
4K
New paper cutting through the thicket of KV cache eviction methods!
There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called the "critical KV footprint". We compare existing methods and propose a new one, PruLong, which "prunes" certain attention heads to only look at local tokens. 1/7
0
1
17
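A conceptual sketch of what "a head that only looks at local tokens" means: the head's attention is restricted to a recent sliding window, so its KV cache can be truncated to that window. This illustrates the idea, not PruLong's actual pruning or training procedure; the window size is a made-up parameter.

```python
# A "local" attention head: causal attention restricted to the last `window` tokens,
# so only that window of KV entries ever needs to be cached for this head.
import torch

def local_head_attention(q, k, v, window=128):
    """q, k, v: [seq_len, d_head]. Sliding-window causal attention for one head."""
    seq_len = q.shape[0]
    scores = q @ k.T / (q.shape[-1] ** 0.5)            # [seq_len, seq_len]
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]              # no attending to future tokens
    local = pos[:, None] - pos[None, :] < window       # only the most recent `window` tokens
    scores = scores.masked_fill(~(causal & local), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

At inference time, a head masked this way only needs the last `window` keys and values, which is what shrinks its share of the KV footprint.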
Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? VideoGameBench evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just as a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇
23
78
560
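A hedged sketch of the screen-in, action-out loop the VideoGameBench tweet describes: the VLM sees only raw frames and emits controller/keyboard actions. The `emulator` and `vlm` objects and their methods are hypothetical stand-ins, not the benchmark's actual API.

```python
# Hypothetical evaluation loop: raw pixels in, game actions out, no privileged game state.
def play_episode(vlm, emulator, max_steps=1000):
    frame = emulator.reset()                  # raw screen pixels only
    for _ in range(max_steps):
        action = vlm.choose_action(frame)     # e.g. "press A", "move mouse to (x, y)"
        frame, done = emulator.step(action)
        if done:
            break
    return emulator.progress()                # fraction of the game completed
```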
Claude Sonnet 4 is much better at codebase understanding. Paired with recent improvements in Cursor, it's SOTA on large codebases
32
43
862