
Zengzhi Wang (@SinclairWang1)
2K Followers · 7K Following · 97 Media · 2K Statuses
PhDing @sjtu1896 #NLProc Working on Data Engineering for LLMs: MathPile (2023), 🫐 ProX (2024), 💎 MegaMath (2025), 🐙 OctoThinker (2025)
Joined November 2020
Cannot agree more!
I don't think we need an American DeepSeek Project, we need an Open-Data DeepSeek. And no we didn't get one yet, despite what you might think, so let me explain. The biggest contributor to the gap between closed-source and open-source AI is, in my opinion, data accessibility and
RT @dirctd_by_beens: blog - read 'octothinker' last week and it's so cool. great work by @SinclairWang1 @FaZhou_99…
RT @soldni: @EMostaque @natolambert processing all of CommonCrawl is about $20-50k [0], plus maybe 10-50k H100 if you wanna do GPU classifi…
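A quick back-of-envelope check on the numbers quoted above, as a minimal Python sketch. Only the $20-50k CPU-processing range and the 10-50k H100-hour range come from the tweet; the hourly H100 rental rate is an assumption for illustration.

```python
# Back-of-envelope cost estimate for processing CommonCrawl,
# using the ranges quoted in the tweet above.

CPU_PROCESSING_USD = (20_000, 50_000)  # from the tweet: "$20-50k"
H100_HOURS = (10_000, 50_000)          # from the tweet: "10-50k H100"
H100_USD_PER_HOUR = 2.50               # assumption: typical rental price

def total_cost(cpu_usd: float, gpu_hours: float) -> float:
    """CPU pipeline cost plus GPU-classifier cost."""
    return cpu_usd + gpu_hours * H100_USD_PER_HOUR

low = total_cost(CPU_PROCESSING_USD[0], H100_HOURS[0])
high = total_cost(CPU_PROCESSING_USD[1], H100_HOURS[1])
print(f"Estimated total: ${low:,.0f} - ${high:,.0f}")
# Estimated total: $45,000 - $175,000
```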
RT @sivil_taram: Training end-to-end multi-turn tool-use agents has proven incredibly challenging 😤 Just as noted in the recent Kevin blog:…
RT @aaron_defazio: Why do gradients increase near the end of training? Read the paper to find out! We also propose a simple fix to AdamW t…
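For context: the tweet is truncated, so the paper's actual fix is not shown here. Below is only a minimal sketch of the standard decoupled AdamW update (Loshchilov & Hutter), i.e. the baseline such a fix would modify.

```python
import math

def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One standard AdamW update for a scalar parameter.
    Weight decay is decoupled: applied directly to the parameter,
    not folded into the gradient as in L2-regularized Adam."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment EMA
    m_hat = m / (1 - beta1 ** t)               # bias correction (t >= 1)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v
```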
RT @code_star: Amazing work (once again). Better midtraining makes models better for RL. Once again the power of good data strikes again…
RT @stefan_fee: What foundation models do we REALLY need for the RL era? And what pre-training data? Excited to share our work: OctoThinke…
RT @gui_penedo: We have finally released the 📝paper for 🥂FineWeb2, our large multilingual pre-training dataset. Along with general (and ex…
I believe our work gives a preliminary definition of mid-training. Feel free to cite it along with the listed references.
What Makes a Base Language Model Suitable for RL?
Rumors in the community say RL (i.e., RLVR) on LLMs is full of “mysteries”:
(1) Is the magic only happening on Qwen + Math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?
Just released: feel free to download MegaMath-Web-Pro-Max right now!
Say hi to 🔮MegaMath-Pro-Max. High-quality corpora are vital for mid-training. What about the math domain? Let me tell you the recipe behind it.
1. Curation Pipeline
Step 1: uniformly and randomly sample millions of documents from the MegaMath-Web corpus, stratified by
Just released, please download the data right now!
What Makes a Base Language Model Suitable for RL?
Rumors in the community say RL (i.e., RLVR) on LLMs is full of “mysteries”:
(1) Is the magic only happening on Qwen + Math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?
RT @joemelko: More evidence of the importance of high quality mid(pre) training data to create a base for rl. Cool paper.
RT @MichelIvan92347: Details on the new corpora for mid-training* 👇
*The OctoThinker paper is worth reading as already stated here.