Evan Wang @evanzwangg X Profile

Evan Wang

@evanzwangg

Followers

918

Following

822

Media

7

Statuses

29

post-training/reasoning @xAI prev @Caltech @scale_AI @weHRTyou @umdcs

Joined March 2020

Evan Wang

@evanzwangg

2 months

you love to see it 🐦‍⬛

xAI

@xai

2 months

Introducing Grok 4 Fast, a multimodal reasoning model with a 2M context window that sets a new standard for cost-efficient intelligence. Available for free on https://t.co/AnXpIEOhOD, https://t.co/53pltypvkw, iOS and Android apps, and OpenRouter. https://t.co/3YZ1yVwueV

0

12

Evan Wang

@evanzwangg

3 months

🐆 💨 we do a bit of coding

xAI

@xai

3 months

Introducing Grok Code Fast 1, a speedy and economical reasoning model that excels at agentic coding. Now available for free on GitHub Copilot, Cursor, Cline, Kilo Code, Roo Code, opencode, and Windsurf. https://t.co/3tMbmLbxOP

13

7

235

Evan Wang

@evanzwangg

4 months

✌️

Daniel

@nearlydaniel

4 months

War Room squad locked in

7

1

26

Evan Wang

@evanzwangg

4 months

good stuff grok 🚀 https://t.co/4mfdh8X01S

0

1

16

Rez Havaei

@HavaeiRez

8 months

LLMs are being deployed in high-stakes environments—and the potential impact of failure is colossal. A jailbroken AI could leak your customer data, financial records, or enable catastrophically harmful actions. At @gen_analysis we have compiled the definitive guide to understand

6

27

72

Evan Wang

@evanzwangg

10 months

Delighted to announce that PlanSearch has been accepted to ICLR 2025!! 😁😁 see you in singapore 🫡

Evan Wang

@evanzwangg

1 year

A 20% boost on a metric is rare, especially when it’s code generation 🥱 PlanSearch, our new search method based on diverse plans, outperforms baselines by huge margins. It's not just a search method, but also a philosophy. How are these numbers achieved? Can they be predicted?

5

9

57

Evan Wang

@evanzwangg

1 year

thanks for having me!! was great seeing everyone again 😁

Furong Huang

@furongh

1 year

Our very own @evanzwangg visited us back at UMD today and gave an awesome talk. Check out his paper here to see how planning improves pass@k significantly for coding problems: https://t.co/rE505uLqsS

1

0

11

Riley Goodside

@goodside

1 year

RLHF and instruction tuning reduce diversity in LLM output, limiting the value of inference-time search. PlanSearch, from research at @scale_AI, restores this diversity using combinatorial samples of "observations" to form plans for coding problems, yielding strong gains across

Hugh Zhang

@hughbzhang

1 year

Enabling LLMs to reason more deeply at inference time via search is one of the most exciting directions in AI right now. We introduce PlanSearch, a novel method for code generation that searches over high-level "plans" in natural language as a means of encouraging diversity.

1

15

112

Alexandr Wang

@alexandr_wang

1 year

New SOTA test-time compute result from Scale SEAL⚡️ We are releasing a new SOTA test-time compute method called PlanSearch. It meaningfully outperforms existing approaches on LiveCodeBench via a new diversity-based search method. See more about our SEAL open research below:

Hugh Zhang

@hughbzhang

1 year

Enabling LLMs to reason more deeply at inference time via search is one of the most exciting directions in AI right now. We introduce PlanSearch, a novel method for code generation that searches over high-level "plans" in natural language as a means of encouraging diversity.

11

25

208

Hugh Zhang

@hughbzhang

1 year

Enabling LLMs to reason more deeply at inference time via search is one of the most exciting directions in AI right now. We introduce PlanSearch, a novel method for code generation that searches over high-level "plans" in natural language as a means of encouraging diversity.

16

99

637

Evan Wang

@evanzwangg

1 year

@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 @scale_AI https://t.co/S0T7HVYSIK 📝

arxiv.org

While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing...

2

3

21

Evan Wang

@evanzwangg

1 year

@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 This was all possible through collaboration with @hughbzhang, @ellev3n11, @squeakymouse777, Yunfeng, Will, @vaskar_n, Ziwen, @SeanHendryx, @summeryue0 and of course @scale_AI. It was a great summer and I would gladly do it again!

1

0

12

Evan Wang

@evanzwangg

1 year

@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 We found that current models lack diversity out-of-the-box, making effective inference-time compute hard. Searching in idea space somewhat alleviates this issue. In the long term, we imagine combining these immense p@k gains with training to distill the gains into p@1, natively😇

1

0

6

Evan Wang

@evanzwangg

1 year

@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 Finally, even though we optimize our methods to be ‘attempt-efficient’ (if you had 2 attempts, how would you make these attempts as good as possible), we check compute-efficiency as well. Even though we use 6.5x as many generated tokens, PlanSearch still scales better 📈

1

13

Evan Wang

@evanzwangg

1 year

@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 Even simple filtering, like submitting only the solutions that pass public tests, brings p@8k -> p@k, which is HUGE… So p@1 of filtering = p@8 with search ✅ Another example: base models are much more diverse than instruct models. The paradigm on the base model p@1 is much better than instruct p@1.

2

0

10

Evan Wang

@evanzwangg

1 year

@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 These giant improvements at large k can be BROUGHT BACK to low k through filtering, which picks promising sols from a pool of sols. We argue for a paradigm that optimizes diversity to sacrifice p@1 for huge p@k gains, then uses filtering to bring those p@k gains back to low k

1

17
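The filtering step described in the post above (generate a diverse pool, then keep only promising solutions before spending a limited number of submissions) can be sketched roughly as follows. This is a minimal illustration, not the PlanSearch code: the names passes_public_tests and filter_then_submit, the toy "square a number" problem, and the representation of solutions as Python callables are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types for the sketch: a candidate solution maps an int to an int,
# and a public test is an (input, expected_output) pair.
Solution = Callable[[int], int]
TestCase = Tuple[int, int]


def passes_public_tests(solution: Solution, tests: Sequence[TestCase]) -> bool:
    """Return True only if the solution matches every public test without crashing."""
    for arg, expected in tests:
        try:
            if solution(arg) != expected:
                return False
        except Exception:
            # A crashing candidate is treated as a failed candidate.
            return False
    return True


def filter_then_submit(
    candidates: Sequence[Solution], tests: Sequence[TestCase], k: int
) -> List[Solution]:
    """Keep candidates that pass the public tests, then submit at most k of them."""
    survivors = [s for s in candidates if passes_public_tests(s, tests)]
    return survivors[:k]


# Toy pool for "return the square of x": two wrong candidates, two correct ones.
pool: List[Solution] = [
    lambda x: x + x,   # passes (2, 4) but fails (3, 9)
    lambda x: x // 0,  # always raises ZeroDivisionError
    lambda x: x * x,
    lambda x: x ** 2,
]
public_tests: List[TestCase] = [(2, 4), (3, 9)]

submitted = filter_then_submit(pool, public_tests, k=1)
```

With a single allowed submission, the filter spends it on a candidate that already agrees with the public tests instead of on the first sample in the pool, which is the sense in which gains at large k are carried back to low k.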

Evan Wang

@evanzwangg

1 year

@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 this is how we get so much diversity 🤩 Even though we may sacrifice our pass@1 a small bit, our pass@k is much, much better. Our best model gets almost DOUBLE the raw pass@1, and drastically outperforms other baselines like CoT

1

0

15
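For readers new to the pass@1 / pass@k (p@k) notation used throughout this thread: pass@k is the probability that at least one of k sampled solutions is correct. A common way to estimate it from n samples, c of which are correct, is the unbiased estimator 1 - C(n-c, k) / C(n, k) popularized by the Codex evaluation paper. The sketch below assumes that definition; the thread itself does not state which estimator was used, and the function name pass_at_k is chosen for the example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct solution.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 3 correct samples out of 10, a single attempt succeeds 30% of the time,
# while the chance that at least one of 5 attempts succeeds is much higher.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

The gap between p1 and p5 is what a diversity-oriented method tries to widen: more distinct ideas in the pool raise the chance that some attempt is right, even if any single attempt is slightly less likely to be.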

Evan Wang

@evanzwangg

1 year

@hughbzhang @ellev3n11 @squeakymouse777 @vaskar_n @SeanHendryx @summeryue0 objectives like RLHF are known to reduce diversity at train-time. we inject back more diversity through PlanSearch. how it works: We generate layer 1 of observations, and selectively mix these to create the next layer. These generate the solution sketches, and then the code.

1

29
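The layering described in the post above (first-layer observations, mixed to produce a next layer, which then feeds solution sketches and code) might be sketched like this. It is only an illustration under stated assumptions: the post does not say how observations are "selectively" mixed, so the example simply enumerates subsets of up to two observations, and derive is a hypothetical stand-in for a language model call. None of the names come from the PlanSearch code.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature for the model call: given a problem statement and a
# subset of observations, return new observations derived from that subset.
Derive = Callable[[str, Tuple[str, ...]], List[str]]


def observation_subsets(
    observations: Sequence[str], max_size: int = 2
) -> List[Tuple[str, ...]]:
    """Enumerate every subset of the observations with 1..max_size elements."""
    subsets: List[Tuple[str, ...]] = []
    for size in range(1, max_size + 1):
        subsets.extend(combinations(observations, size))
    return subsets


def build_plans(
    problem: str, first_layer: Sequence[str], derive: Derive, max_size: int = 2
) -> List[Tuple[str, ...]]:
    """Turn each subset plus its derived observations into one candidate plan."""
    plans: List[Tuple[str, ...]] = []
    for subset in observation_subsets(first_layer, max_size):
        second_layer = derive(problem, subset)
        plans.append(subset + tuple(second_layer))
    return plans


def stub_derive(problem: str, subset: Tuple[str, ...]) -> List[str]:
    """Placeholder for an LLM call; it only records which ideas were combined."""
    return ["combine(" + " + ".join(subset) + ")"]


first_layer = ["use a hash map", "sort the input", "two pointers"]
plans = build_plans("toy problem", first_layer, stub_derive)
```

Each resulting plan would then be turned into a solution sketch and finally into code by further model calls, which are omitted here. Because the plans come from different combinations of ideas, the resulting programs are more likely to differ from one another than repeated samples from a single prompt.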