Wenting Zhao

@wzhao_nlp

3K Followers · 336 Following · 35 Media · 416 Statuses

reasoning & llms

NYC
Joined June 2013
@wzhao_nlp
Wenting Zhao
20 hours
RT @ericzelikman: i've been thinking lately about how future ai systems will interact with us and how we can make systems that care about p…
0 replies · 13 reposts · 0 likes
@wzhao_nlp
Wenting Zhao
2 days
RT @haozhangml: @wzhao_nlp This aligns with lmgame-bench results:
0 replies · 1 repost · 0 likes
@wzhao_nlp
Wenting Zhao
2 days
this is so cool, we should also have a human player in there.
@kaggle
Kaggle
3 days
What a show! The Kaggle Game Arena AI Chess Exhibition Tournament is complete, and the winner is o3 🏆! A huge thank you to everyone who tuned in and to our amazing partners @MagnusCarlsen, @GMHikaru, @GothamChess and @DavidHowellGM for the fantastic commentary and analysis on…
3 replies · 0 reposts · 4 likes
@wzhao_nlp
Wenting Zhao
20 days
also what about healthy runs? i guess mfu for sure, but do you look at others?
0 replies · 0 reposts · 6 likes
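For context on the MFU ("model FLOPs utilization") mention above: a common back-of-envelope estimate approximates the training cost of a dense transformer as 6N FLOPs per token (2N forward, 4N backward) and divides achieved throughput by the accelerator's peak. This is a hypothetical sketch, not any particular codebase's implementation; the function name and the example hardware numbers are illustrative.

```python
def mfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Rough model FLOPs utilization for dense-transformer training.

    Uses the common 6 * N FLOPs/token approximation for a combined
    forward + backward pass (2N forward, 4N backward).
    """
    achieved_flops = 6.0 * n_params * tokens_per_sec
    return achieved_flops / peak_flops

# e.g. a 7B model at 4,000 tok/s/GPU against a 989 TFLOP/s (BF16) peak
print(round(mfu(7e9, 4_000, 989e12), 2))  # → 0.17
```

Healthy large-scale runs are often quoted in the 0.3–0.5 MFU range, so a number far below that usually points at an input-pipeline or communication bottleneck rather than the model itself.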
@wzhao_nlp
Wenting Zhao
20 days
Silly but important question: what metrics do you look at / how do you vibe-check that your training runs are going well, especially in the context of RL/GRPO? Rewards, response lengths, entropy, what more?
14 replies · 9 reposts · 308 likes
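The three quantities named in the tweet (reward, response length, entropy) are cheap to compute from a batch of rollouts. A minimal hypothetical sketch, assuming you have per-rollout rewards, tokenized responses, and the log-probs of the sampled tokens; here mean entropy is approximated by the negative mean chosen-token log-prob, which is a proxy rather than the full distribution entropy:

```python
def rollout_metrics(rewards, responses, token_logprobs):
    """Summary stats commonly eyeballed during RL/GRPO training.

    rewards:        one scalar reward per rollout
    responses:      token-id sequences, one per rollout
    token_logprobs: log-prob of each sampled token, one list per rollout
    """
    mean_reward = sum(rewards) / len(rewards)
    mean_len = sum(len(r) for r in responses) / len(responses)
    # Negative log-prob of the sampled tokens is a cheap proxy for
    # sampling entropy; a sharp drop often signals policy collapse.
    flat = [lp for seq in token_logprobs for lp in seq]
    mean_entropy = -sum(flat) / len(flat)
    return {"reward": mean_reward, "resp_len": mean_len, "entropy": mean_entropy}
```

Tracking all three per step is the usual vibe-check: reward should trend up, response length should not explode or collapse, and entropy should decay slowly rather than crash.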
@wzhao_nlp
Wenting Zhao
23 days
I'll be around the ICML venue this afternoon. Message me if you want to meet! These days, I think about reasoning and RL. Also happy to talk about academia vs. industry (I think the lack of compute in academia is a feature, not a bug), and faculty and PhD student recruiting at UMass.
0 replies · 5 reposts · 118 likes
@wzhao_nlp
Wenting Zhao
26 days
RT @justintchiu: haven't made a new blog post in over a year, so here's a new one: it's short.
justintchiu.com
RL is better than SFT
0 replies · 22 reposts · 0 likes
@wzhao_nlp
Wenting Zhao
1 month
RT @yorambac: AI Research Agents are becoming proficient at machine learning tasks, but how can we help them search the space of candidate…
0 replies · 68 reposts · 0 likes
@wzhao_nlp
Wenting Zhao
1 month
RT @michahu8: 📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and p…
0 replies · 37 reposts · 0 likes
@wzhao_nlp
Wenting Zhao
1 month
RT @ori_press: Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize…
0 replies · 59 reposts · 0 likes
@wzhao_nlp
Wenting Zhao
1 month
RT @_jasonwei: We don't have AI that self-improves yet, and when we do it will be a game-changer. With more wisdom now compared to the GPT-4 day…
0 replies · 168 reposts · 0 likes
@wzhao_nlp
Wenting Zhao
1 month
Congrats to the team! They built my dream benchmark.
@MinqiJiang
Minqi Jiang
1 month
Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total…
0 replies · 0 reposts · 11 likes
@wzhao_nlp
Wenting Zhao
1 month
RT @NovaSkyAI: ✨Release: We upgraded SkyRL into a highly-modular, performant RL framework for training LLMs. We prioritized modularity—easi…
0 replies · 44 reposts · 0 likes
@wzhao_nlp
Wenting Zhao
1 month
Dang, truly impressed by how an academic lab just figured out a lot of mysteries in mid-training to close the RL gap between Llama and Qwen:
* a length scheduler plays a key role in stabilizing RL
* there is some dark magic in the prompt template?
* the data interaction stuff is really…
@SinclairWang1
Zengzhi Wang
1 month
What Makes a Base Language Model Suitable for RL?
Rumors in the community say RL (i.e., RLVR) on LLMs is full of "mysteries":
(1) Is the magic only happening on Qwen + Math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?
3 replies · 16 reposts · 196 likes
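The "length scheduler" point above refers to gradually growing the allowed rollout length during RL rather than starting at the full budget. A minimal hypothetical sketch of one such scheduler (a linear ramp; the function name, parameter names, and default values are illustrative, not taken from the paper):

```python
def max_len_schedule(step: int, start_len: int = 4096, end_len: int = 16384,
                     warmup_steps: int = 500) -> int:
    """Linearly ramp the rollout max length over the first warmup_steps.

    Capping early responses keeps rollouts short and rewards dense while
    the policy is still weak, then releases the full budget later.
    """
    if step >= warmup_steps:
        return end_len
    frac = step / warmup_steps
    return int(start_len + frac * (end_len - start_len))
```

Usage is just `max_len = max_len_schedule(global_step)` inside the rollout loop; the cap feeds whatever `max_new_tokens`-style argument the generation backend exposes.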
@wzhao_nlp
Wenting Zhao
2 months
LM training bottlenecks:
2024: code RL -> code execution is slower than model inference
2025: reasoning model RL -> rolling out 32k tokens takes forever
maybe diffusion models are indeed the solution lol
2 replies · 1 repost · 109 likes
@wzhao_nlp
Wenting Zhao
2 months
It's time to think about code generation beyond functional correctness. Refactoring multiple libraries requires designing APIs that support past and future use cases, which is challenging even for human engineers. Can't wait for LLMs to unify pytorch, tensorflow, and jax 😬
@justintchiu
Justin T Chiu
2 months
Are code agents good at software design, i.e. building general and reusable code? We present Librarian, a new refactoring method, and MiniCode, a verifiable refactoring benchmark that requires agents to design libraries that jointly minimize code from multiple repos 🧵
1 reply · 4 reposts · 48 likes
@wzhao_nlp
Wenting Zhao
2 months
The more I dive into LM training, the more I feel pretraining is just starting. Some questions I'm particularly interested in:
* what data unlocks what capabilities?
* do we train on capabilities sequentially or in parallel?
* how many synthetic examples is a human example worth?
@karpathy
Andrej Karpathy
2 months
Mildly obsessed with what the "highest grade" pretraining data stream looks like for LLM training, if 100% of the focus was on quality, putting aside any quantity considerations. Guessing something textbook-like, in markdown? Or possibly samples from a really giant model?
8 replies · 27 reposts · 334 likes
@wzhao_nlp
Wenting Zhao
2 months
That’s the vision of commit0: there has been nearly zero improvement on this benchmark in the past few months. I don’t think this problem is solvable in 24 months…
github.com
Commit0: Library Generation from Scratch. Contribute to commit-0/commit0 development by creating an account on GitHub.
@jsnnsa
jacob╞
2 months
cursor is a $100M business that will be worth $0 in 24 months. not because they built wrong - they built perfectly. but they built a sail for a race that's about to end. when AI just writes entire codebases, even the best IDE becomes irrelevant.
1 reply · 1 repost · 19 likes
@wzhao_nlp
Wenting Zhao
2 months
RT @AlexGDimakis: There are still posts about 'new papers showing AI models cannot reason'. There are unfortunately problems into how these…
0 replies · 19 reposts · 0 likes
@wzhao_nlp
Wenting Zhao
2 months
RT @gneubig: Where does one language model outperform the other? We examine this from first principles, performing unsupervised discovery…
0 replies · 11 reposts · 0 likes