Evan Wang
@evanzwangg
Followers: 918
Following: 822
Media: 7
Statuses: 29
post-training/reasoning @xAI prev @Caltech @scale_AI @weHRTyou @umdcs
Joined March 2020
you love to see it 🐦⬛
Introducing Grok 4 Fast, a multimodal reasoning model with a 2M context window that sets a new standard for cost-efficient intelligence. Available for free on https://t.co/AnXpIEOhOD, https://t.co/53pltypvkw, iOS and Android apps, and OpenRouter. https://t.co/3YZ1yVwueV
0
0
12
🐆 💨 we do a bit of coding
Introducing Grok Code Fast 1, a speedy and economical reasoning model that excels at agentic coding. Now available for free on GitHub Copilot, Cursor, Cline, Kilo Code, Roo Code, opencode, and Windsurf. https://t.co/3tMbmLbxOP
13
7
235
LLMs are being deployed in high-stakes environments—and the potential impact of failure is colossal. A jailbroken AI could leak your customer data, financial records, or enable catastrophically harmful actions. At @gen_analysis we have compiled the definitive guide to understand
6
27
72
Delighted to announce that PlanSearch has been accepted to ICLR 2025!! 😁😁 see you in singapore 🫡
A 20% boost on a metric is rare, especially when it’s code generation 🥱 PlanSearch, our new search method based on diverse plans, outperforms baselines by huge margins. It's not just a search method, but also a philosophy. How are these numbers achieved? Can they be predicted?
5
9
57
thanks for having me!! was great seeing everyone again 😁
Our very own @evanzwangg visited us back at UMD today and gave an awesome talk. Check out his paper here to see how planning improves pass@k significantly for coding problems: https://t.co/rE505uLqsS
1
0
11
RLHF and instruction tuning reduce diversity in LLM output, limiting the value of inference-time search. PlanSearch, from research at @scale_AI, restores this diversity using combinatorial samples of "observations" to form plans for coding problems, yielding strong gains across
Enabling LLMs to reason more deeply at inference time via search is one of the most exciting directions in AI right now. We introduce PlanSearch, a novel method for code generation that searches over high-level "plans" in natural language as a means of encouraging diversity.
1
15
112
New SOTA test-time compute result from Scale SEAL⚡️ We are releasing a new SOTA test-time compute method called PlanSearch. It meaningfully outperforms existing approaches on LiveCodeBench via a new diversity-based search method. See more about our SEAL open research below:
Enabling LLMs to reason more deeply at inference time via search is one of the most exciting directions in AI right now. We introduce PlanSearch, a novel method for code generation that searches over high-level "plans" in natural language as a means of encouraging diversity.
11
25
208
Enabling LLMs to reason more deeply at inference time via search is one of the most exciting directions in AI right now. We introduce PlanSearch, a novel method for code generation that searches over high-level "plans" in natural language as a means of encouraging diversity.
16
99
637
@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 @scale_AI
https://t.co/S0T7HVYSIK 📝
arxiv.org
While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing...
2
3
21
@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 This was all possible through collaboration with @hughbzhang, @ellev3n11, @squeakymouse777, Yunfeng, Will, @vaskar_n, Ziwen, @SeanHendryx, @summeryue0, and of course @scale_AI. Was a great summer and would gladly do it again!
1
0
12
@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 We found that current models lack diversity out-of-the-box, making effective inference-time compute hard. Searching in idea space somewhat alleviates this issue. In the long term, we imagine combining these immense p@k gains with training to distill the gains into p@1, natively😇
1
0
6
@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 Finally, even though we optimize our methods to be ‘attempt-efficient’ (if you had 2 attempts, how would you make these attempts as good as possible), we check compute-efficiency as well. Even though we use 6.5x as many generated tokens, PlanSearch still scales better 📈
1
1
13
@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 Even simple filtering, like submitting only solutions that pass public tests, brings p@8k -> p@k, which is HUGE… So p@1 of filtering = p@8 with search ✅ Another example: base models are much more diverse than instruct. The paradigm on the base model p@1 is much better than instruct p@1.
2
0
10
@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 These giant improvements at large k can be BROUGHT BACK to low k through filtering, which picks promising sols from a pool of sols. We argue for a paradigm that optimizes diversity to sacrifice p@1 for huge p@k gains, then uses filtering to bring those p@k gains back to low k
1
1
17
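A minimal sketch of the filtering idea above, not the thread's actual pipeline: candidates are modeled as plain Python callables and public tests as (input, expected) pairs. The names passes_public_tests and filter_pool, and the toy "square a number" task, are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple

Solution = Callable[[Any], Any]
PublicTest = Tuple[Any, Any]  # (input, expected output)


def passes_public_tests(solution: Solution, tests: Sequence[PublicTest]) -> bool:
    """Return True only if the candidate matches every public test."""
    for arg, expected in tests:
        try:
            if solution(arg) != expected:
                return False
        except Exception:
            # A crashing candidate counts as a failure.
            return False
    return True


def filter_pool(pool: Sequence[Solution], tests: Sequence[PublicTest], budget: int) -> List[Solution]:
    """Keep at most `budget` candidates from the pool that pass all public tests."""
    passing = [s for s in pool if passes_public_tests(s, tests)]
    return passing[:budget]


# Toy pool for a hypothetical "square a number" task.
pool = [
    lambda x: x * 2,   # wrong, but agrees with the first test
    lambda x: x // 0,  # always raises
    lambda x: x * x,   # correct
    lambda x: x ** 2,  # correct
]
public_tests = [(2, 4), (3, 9)]

selected = filter_pool(pool, public_tests, budget=1)
```

With a submission budget of 1, the single attempt is spent on a candidate that already passed the public tests, which is how gains at large k can be carried back to low k.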
@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 this is how we get so much diversity 🤩 Even though we may sacrifice our pass@1 a small bit, our pass@k is much, much better. Our best model gets almost DOUBLE the raw pass@1, and drastically outperforms other baselines like CoT
1
0
15
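For readers unfamiliar with the metric: pass@k is the probability that at least one of k sampled solutions is correct. The thread does not say which estimator was used; the sketch below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The sample counts are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples, 2 correct.
low_k = pass_at_k(n=10, c=2, k=1)
high_k = pass_at_k(n=10, c=2, k=5)
```

Even with the same pool, the estimate at k=5 is far higher than at k=1, which is why a more diverse pool can give up a little pass@1 and still win at larger k.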
@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 objectives like RLHF are known to reduce diversity at train-time. we inject back more diversity through PlanSearch. how it works: We generate layer 1 of observations, and selectively mix these to create the next layer. These generate the solution sketches, and then the code.
1
1
29
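A rough sketch of the "mix observations to create the next layer" step described above. It assumes mixing means enumerating small subsets of first-layer observations and deriving new observations from each subset; the subset size limit, the function names, and the toy_derive stand-in for a model call are all hypothetical, not taken from the paper.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Subset = Tuple[str, ...]


def observation_subsets(observations: Sequence[str], max_size: int = 2) -> List[Subset]:
    """Enumerate every subset of observations with 1..max_size members."""
    subsets: List[Subset] = []
    for size in range(1, max_size + 1):
        subsets.extend(combinations(observations, size))
    return subsets


def next_layer(
    observations: Sequence[str],
    derive: Callable[[Subset], List[str]],
    max_size: int = 2,
) -> List[str]:
    """Build the next layer by deriving new observations from each subset."""
    layer: List[str] = []
    for subset in observation_subsets(observations, max_size):
        layer.extend(derive(subset))
    return layer


def toy_derive(subset: Subset) -> List[str]:
    """Stand-in for a model call (assumption): just joins the subset."""
    return [" + ".join(subset)]


first_layer = ["use a heap", "sort by end time", "greedy works"]
second_layer = next_layer(first_layer, toy_derive)
```

In a real setup, derive would prompt a model with the chosen subset, and each resulting observation set would then be turned into a solution sketch and finally code.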