Hanlin Zhang

@_hanlin_zhang_

792 Followers · 590 Following · 31 Media · 92 Statuses

CS PhD student @Harvard, @googleai

Joined September 2019
@_hanlin_zhang_
Hanlin Zhang
8 months
Critical batch size is key to reducing the wall-clock time of large-scale training runs with data parallelism. We find that it depends primarily on data size. 🧵 [1/n]. Paper 📑: Blog 📝:
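A minimal sketch of how one could check that claim, assuming a set of (data size, measured critical batch size) pairs; every number below is invented for illustration and is not from the paper:

```python
# Fit a power law B_crit = a * D^b on log-log axes; a data-size-driven
# critical batch size shows up as a stable exponent b across model sizes.
import numpy as np

data_tokens = np.array([1e9, 1e10, 1e11, 1e12])      # hypothetical data sizes D
critical_bs = np.array([0.25e6, 0.8e6, 2.5e6, 8e6])  # hypothetical B_crit (tokens/step)

b, log_a = np.polyfit(np.log(data_tokens), np.log(critical_bs), 1)
print(f"fitted exponent b ≈ {b:.2f}, prefactor a ≈ {np.exp(log_a):.3g}")
# An exponent near 0.5 would mean the critical batch size roughly
# doubles every time the data budget grows 4x.
```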
@_hanlin_zhang_
Hanlin Zhang
6 days
[17/n] Final thoughts. EvoLM offers:
🔓 100+ open LLMs
📜 Controlled full-stage training
📊 Evaluations across cloze, generative, ID/OOD tasks
📦 Full code, data, and ongoing support
Kudos to the team @ZhentingQi, @FanNie1208, @AlexAlahi, @james_y_zou, @hima_lakkaraju, …
@_hanlin_zhang_
Hanlin Zhang
6 days
[16/n] Takeaway 1️⃣3️⃣. “ORM score could be a more reliable unsupervised validation metric that helps predict downstream task performance during post-training, compared to validation loss. Notably, ORM scores from an 8B reward model correlate well with problem-solving accuracies.”
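A minimal sketch of what that validation check could look like, assuming one mean ORM score and one measured accuracy per checkpoint; the numbers are placeholders, not EvoLM's:

```python
# Correlate mean reward-model scores on held-out prompts with measured
# problem-solving accuracy across post-training checkpoints.
import numpy as np

orm_scores = np.array([0.31, 0.42, 0.55, 0.61, 0.64])  # mean ORM score per checkpoint
accuracies = np.array([0.18, 0.27, 0.41, 0.47, 0.49])  # measured accuracy per checkpoint

r = np.corrcoef(orm_scores, accuracies)[0, 1]
print(f"Pearson r between ORM score and task accuracy: {r:.3f}")
# A consistently high r is what would justify using the ORM score as an
# unsupervised stand-in for accuracy when validation loss fails to track it.
```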
@_hanlin_zhang_
Hanlin Zhang
6 days
[15/n] Takeaway 1️⃣2️⃣. “Under a constrained downstream data budget, allocating more examples to SFT maximizes in-domain gains at the expense of weaker OOD generalization, while allocating more to RL improves OOD performance.” With 100K total examples: 90K SFT + 10K RL = best ID performance.
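A toy model of that budget tradeoff; the saturating response curves below are invented purely to illustrate the shape of the tradeoff, not fit to the paper's data:

```python
# ID accuracy saturates in the SFT share, OOD accuracy in the RL share,
# so the best split of a fixed budget depends on which metric you value.
import math

TOTAL = 100_000

def toy_id_acc(n_sft: int) -> float:
    return 0.6 * (1 - math.exp(-n_sft / 30_000))  # made-up saturating curve

def toy_ood_acc(n_rl: int) -> float:
    return 0.4 * (1 - math.exp(-n_rl / 30_000))   # made-up saturating curve

for sft_frac in (0.1, 0.5, 0.9):
    n_sft = int(TOTAL * sft_frac)
    n_rl = TOTAL - n_sft
    print(f"SFT {n_sft:>6} + RL {n_rl:>6}: ID≈{toy_id_acc(n_sft):.2f}, OOD≈{toy_ood_acc(n_rl):.2f}")
```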
@_hanlin_zhang_
Hanlin Zhang
6 days
[14/n] Takeaway 1️⃣1️⃣. “Beyond saturation regime, RL primarily increases the probability of sampling high-quality rollouts but may not necessarily improve models’ fundamental reasoning capabilities.” RL amplifies confidence, not competence.
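A minimal illustration of the “confidence, not competence” point: if RL only raises the probability p of sampling a correct rollout the base model could already produce, pass@1 improves sharply while pass@k at large k barely moves. The p values are hypothetical:

```python
def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k independent samples is correct)."""
    return 1 - (1 - p) ** k

for label, p in [("base model", 0.05), ("after RL", 0.30)]:
    print(label, {k: round(pass_at_k(p, k), 3) for k in (1, 8, 64)})
# pass@1 jumps 0.05 -> 0.30, but pass@64 is ~0.96 vs ~1.00: the correct
# rollout was already in the base model's support.
```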
@_hanlin_zhang_
Hanlin Zhang
6 days
[13/n] Takeaway 🔟 - “RL with excessive epochs or examples improves downstream performance on both ID and OOD tasks, but with diminishing returns.” We scale RL epochs and dataset sizes separately. Performance peaks at ~8 epochs or ~100K examples for 1B models. After that, …
@_hanlin_zhang_
Hanlin Zhang
6 days
[12/n] Takeaway 9️⃣ - “Excessive SFT, especially overly large epochs, could limit further RL improvements.” Once the model memorizes via SFT, RL has little room to further improve. → Overfitting in SFT bottlenecks RL. 🛑 Stop SFT early if planning RL.
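A minimal early-stopping sketch for that recommendation, assuming some proxy validation metric (held-out accuracy, or the ORM score from Takeaway 13); none of this is EvoLM's actual training code:

```python
def sft_with_early_stop(max_epochs: int, patience: int, val_metric) -> int:
    """Return the epoch of the best checkpoint; hand that checkpoint to RL."""
    best_score, best_epoch, stale = float("-inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        score = val_metric(epoch)          # evaluate after each SFT epoch
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:          # metric stopped improving: stop SFT
                break
    return best_epoch

# Toy metric that peaks at epoch 4 and then overfits.
print(sft_with_early_stop(32, patience=2, val_metric=lambda e: -(e - 4) ** 2))
```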
@_hanlin_zhang_
Hanlin Zhang
6 days
[11/n] Takeaway 8️⃣ - “Excessive SFT improves ID performance with diminishing returns but does not necessarily improve and can even degrade OOD performance.” We scale both epochs (1–32) and dataset size (50K–400K):
ID metrics 📈
OOD metrics plateau or drop
⚖️ Balance SFT …
@_hanlin_zhang_
Hanlin Zhang
6 days
[10/n] Takeaway 7️⃣ - “With sufficient domain-specific CPT data, post-training on in-domain tasks not only improves in-domain performance but also generalizes effectively to OOD tasks.” With enough CPT (e.g. 42B math tokens), post-trained models can generalize well to OOD.
@_hanlin_zhang_
Hanlin Zhang
6 days
[9/n] Takeaway 6️⃣ - “As domain-specific CPT data increase, in-domain downstream performance steadily improves, and the SFT models could benefit more from RL finetuning.” Scaling CPT from 2B → 42B tokens = monotonic ID performance gains. Plus:
🟡 RL helps more when CPT is …
@_hanlin_zhang_
Hanlin Zhang
6 days
[8/n] Takeaway 5️⃣ - “Domain-specific post-training should be supported by adequate domain-specific CPT data: without it, SFT performance remains suboptimal and RL can even degrade such performance.” Without CPT, even strong pre-training leads to poor downstream performance.
@_hanlin_zhang_
Hanlin Zhang
6 days
[7/n] Takeaway 4️⃣ - “Continued pre-training on domain-specific data induces catastrophic forgetting of pre-trained knowledge, which could harm both upstream and downstream performance, while incorporating a small replay budget (e.g. 5%) could effectively mitigate this …”
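A minimal sketch of what a replay budget during CPT could look like: with probability replay_frac, substitute a general-domain batch for a domain batch so earlier knowledge keeps getting refreshed. The stream names are placeholders, not EvoLM's data loader:

```python
import random

def mixed_batches(domain_stream, general_stream, replay_frac=0.05, seed=0):
    rng = random.Random(seed)
    for domain_batch in domain_stream:
        if rng.random() < replay_frac:
            yield next(general_stream)  # replayed general-domain batch
        else:
            yield domain_batch

# Toy check: roughly 5% of 10,000 batches come from the general stream.
domain = iter(["math"] * 10_000)
general = iter(["general"] * 10_000)
counts = {"math": 0, "general": 0}
for batch in mixed_batches(domain, general, replay_frac=0.05):
    counts[batch] += 1
print(counts)
```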
@_hanlin_zhang_
Hanlin Zhang
6 days
[6/n] Takeaway 3️⃣ - “Under limited pre-training budgets, smaller post-trained models can even outperform larger counterparts. Conversely, once pre-training tokens reach the saturation regime, increasing model size enables clear improvements in both in-domain performance and OOD generalization.”
@_hanlin_zhang_
Hanlin Zhang
6 days
[5/n] Takeaway 2️⃣ - “Excessive general-domain pre-training does not always improve domain-specific post-training and might even cause performance degradation on some downstream tasks.” We evaluated SFT and RL models initialized from various pre-training budgets. Beyond 80–160B tokens, …
@_hanlin_zhang_
Hanlin Zhang
6 days
[4/n] Takeaway 1️⃣ - “>16x Chinchilla general-domain pre-training improves upstream performance but with diminishing returns.” We pre-trained models on 10B–320B tokens. Upstream cloze accuracy (e.g., HellaSwag, PIQA) improves until ~80–160x model size, then flattens. Example: 1B …
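The arithmetic behind “>16x Chinchilla”, using the usual rule of thumb of roughly 20 training tokens per parameter (the 1B-model framing is taken from this tweet):

```python
# A 1B-parameter model is compute-optimal near 20B tokens under the
# Chinchilla heuristic, so the thread's 320B-token budget is a 16x overshoot.
params = 1e9
chinchilla_tokens = 20 * params  # ~20B tokens for a 1B model
for budget in (10e9, 80e9, 160e9, 320e9):
    print(f"{budget / 1e9:>4.0f}B tokens = {budget / chinchilla_tokens:4.1f}x Chinchilla")
```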
@_hanlin_zhang_
Hanlin Zhang
6 days
[3/n] In EvoLM, we:
✅ Build a fully transparent and reproducible model suite for studying LM training
✅ Quantify how each training phase contributes to upstream cloze task performance and downstream generative task performance, considering both in-domain and out-of-domain settings.
@_hanlin_zhang_
Hanlin Zhang
6 days
[2/n] We train 100+ decoder-only LMs (1B/4B) from scratch, across four training stages:
🟦 Pre-training
🟩 Continued Pre-Training (CPT)
🟨 Supervised Fine-Tuning (SFT)
🟥 Reinforcement Learning (RL)
Under controlled conditions and with full transparency regarding the data and …
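A declarative view of those four stages; the stage names come from this tweet, the budget fields echo numbers quoted elsewhere in the thread, and everything else is an illustrative placeholder rather than EvoLM's config format:

```python
PIPELINE = [
    {"stage": "pretrain", "data": "general corpus",          "budget": "10B-320B tokens"},
    {"stage": "cpt",      "data": "domain corpus + replay",  "budget": "2B-42B tokens"},
    {"stage": "sft",      "data": "supervised examples",     "budget": "50K-400K examples"},
    {"stage": "rl",       "data": "reward-labeled rollouts", "budget": "~100K examples"},
]
for step in PIPELINE:
    print(f"{step['stage']:>8}: {step['data']} ({step['budget']})")
```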
@_hanlin_zhang_
Hanlin Zhang
6 days
[1/n] Discussions about LM reasoning and post-training have gained momentum. We identify several missing pieces:
✏️ Post-training based on off-the-shelf base models without transparent pre-training data components and scale
✏️ Intermediate checkpoints with incomplete learning …
@_hanlin_zhang_
Hanlin Zhang
7 days
RT @ori_press: Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize…
@_hanlin_zhang_
Hanlin Zhang
21 days
[7/n] Got no A100s/H100s in the lab? Observational studies 🔬 can play a role by generating hypotheses that align with, and can later be validated by, rigorous experiments 🧪 conducted elsewhere. With increased standardization, researchers may make progress with more scalable …