Hanlin Zhang

@_hanlin_zhang_

792 Followers · 590 Following · 31 Media · 92 Statuses

CS PhD student @Harvard, @googleai

Joined September 2019
@_hanlin_zhang_
Hanlin Zhang
8 months
Critical batch size is key to reducing the wall-clock time of large-scale training runs with data parallelism. We find that it depends primarily on data size. 🧵 [1/n]. Paper 📑: Blog 📝:
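A minimal sketch of how one could check that claim, assuming a set of (data size, measured critical batch size) pairs; every number below is invented for illustration and is not from the paper:

```python
# Fit a power law B_crit = a * D^b on log-log axes; a data-size-driven
# critical batch size shows up as a stable exponent b across model sizes.
import numpy as np

data_tokens = np.array([1e9, 1e10, 1e11, 1e12])      # hypothetical data sizes D
critical_bs = np.array([0.25e6, 0.8e6, 2.5e6, 8e6])  # hypothetical B_crit (tokens/step)

b, log_a = np.polyfit(np.log(data_tokens), np.log(critical_bs), 1)
print(f"fitted exponent b ≈ {b:.2f}, prefactor a ≈ {np.exp(log_a):.3g}")
# An exponent near 0.5 would mean the critical batch size roughly
# doubles every time the data budget grows 4x.
```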
@_hanlin_zhang_
Hanlin Zhang
6 days
[17/n] Final thoughts. EvoLM offers:
🔓 100+ open LLMs
📜 Controlled full-stage training
📊 Evaluations across cloze, generative, ID/OOD tasks
📦 Full code, data, and ongoing support
Kudos to the team @ZhentingQi, @FanNie1208, @AlexAlahi, @james_y_zou, @hima_lakkaraju, …
@_hanlin_zhang_
Hanlin Zhang
6 days
[16/n] Takeaway 1️⃣3️⃣. “ORM score could be a more reliable unsupervised validation metric that helps predict downstream task performance during post-training, compared to validation loss. Notably, ORM scores from an 8B reward model correlate well with problem-solving accuracies.”
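A minimal sketch of what that validation check could look like, assuming one mean ORM score and one measured accuracy per checkpoint; the numbers are placeholders, not EvoLM's:

```python
# Correlate mean reward-model scores on held-out prompts with measured
# problem-solving accuracy across post-training checkpoints.
import numpy as np

orm_scores = np.array([0.31, 0.42, 0.55, 0.61, 0.64])  # mean ORM score per checkpoint
accuracies = np.array([0.18, 0.27, 0.41, 0.47, 0.49])  # measured accuracy per checkpoint

r = np.corrcoef(orm_scores, accuracies)[0, 1]
print(f"Pearson r between ORM score and task accuracy: {r:.3f}")
# A consistently high r is what would justify using the ORM score as an
# unsupervised stand-in for accuracy when validation loss fails to track it.
```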
@_hanlin_zhang_
Hanlin Zhang
6 days
[15/n] Takeaway 1️⃣2️⃣. “Under a constrained downstream data budget, allocating more examples to SFT maximizes in-domain gains at the expense of weaker OOD generalization, while allocating more to RL improves OOD performance.” With 100K total examples: 90K SFT + 10K RL = best ID performance.
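A toy model of that budget tradeoff; the saturating response curves below are invented purely to illustrate the shape of the tradeoff, not fit to the paper's data:

```python
# ID accuracy saturates in the SFT share, OOD accuracy in the RL share,
# so the best split of a fixed budget depends on which metric you value.
import math

TOTAL = 100_000

def toy_id_acc(n_sft: int) -> float:
    return 0.6 * (1 - math.exp(-n_sft / 30_000))  # made-up saturating curve

def toy_ood_acc(n_rl: int) -> float:
    return 0.4 * (1 - math.exp(-n_rl / 30_000))   # made-up saturating curve

for sft_frac in (0.1, 0.5, 0.9):
    n_sft = int(TOTAL * sft_frac)
    n_rl = TOTAL - n_sft
    print(f"SFT {n_sft:>6} + RL {n_rl:>6}: ID≈{toy_id_acc(n_sft):.2f}, OOD≈{toy_ood_acc(n_rl):.2f}")
```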
@_hanlin_zhang_
Hanlin Zhang
6 days
[14/n] Takeaway 1️⃣1️⃣. “Beyond saturation regime, RL primarily increases the probability of sampling high-quality rollouts but may not necessarily improve models’ fundamental reasoning capabilities.” RL amplifies confidence, not competence.
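A minimal illustration of the “confidence, not competence” point: if RL only raises the probability p of sampling a correct rollout the base model could already produce, pass@1 improves sharply while pass@k at large k barely moves. The p values are hypothetical:

```python
def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k independent samples is correct)."""
    return 1 - (1 - p) ** k

for label, p in [("base model", 0.05), ("after RL", 0.30)]:
    print(label, {k: round(pass_at_k(p, k), 3) for k in (1, 8, 64)})
# pass@1 jumps 0.05 -> 0.30, but pass@64 is ~0.96 vs ~1.00: the correct
# rollout was already in the base model's support.
```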
@_hanlin_zhang_
Hanlin Zhang
6 days
[13/n] Takeaway 🔟 - “RL with excessive epochs or examples improves downstream performance on both ID and OOD tasks, but with diminishing returns.” We scale RL epochs and dataset sizes separately. Performance peaks at ~8 epochs or ~100K examples for 1B models. After that, …
@_hanlin_zhang_
Hanlin Zhang
6 days
[12/n] Takeaway 9️⃣ - “Excessive SFT, especially overly large epochs, could limit further RL improvements.” Once the model memorizes via SFT, RL has little room to further improve. → Overfitting in SFT bottlenecks RL. 🛑 Stop SFT early if planning RL.
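A minimal early-stopping sketch for that recommendation, assuming some proxy validation metric (held-out accuracy, or the ORM score from Takeaway 13); none of this is EvoLM's actual training code:

```python
def sft_with_early_stop(max_epochs: int, patience: int, val_metric) -> int:
    """Return the epoch of the best checkpoint; hand that checkpoint to RL."""
    best_score, best_epoch, stale = float("-inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        score = val_metric(epoch)          # evaluate after each SFT epoch
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:          # metric stopped improving: stop SFT
                break
    return best_epoch

# Toy metric that peaks at epoch 4 and then overfits.
print(sft_with_early_stop(32, patience=2, val_metric=lambda e: -(e - 4) ** 2))
```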
@_hanlin_zhang_
Hanlin Zhang
6 days
[11/n] Takeaway 8️⃣ - “Excessive SFT improves ID performance with diminishing returns but does not necessarily improve and can even degrade OOD performance.” We scale both epochs (1–32) and dataset size (50K–400K):
ID metrics 📈
OOD metrics plateau or drop
⚖️ Balance SFT …
@_hanlin_zhang_
Hanlin Zhang
6 days
[10/n] Takeaway 7️⃣ - “With sufficient domain-specific CPT data, post-training on in-domain tasks not only improves in-domain performance but also generalizes effectively to OOD tasks.” With enough CPT (e.g. 42B math tokens), post-trained models can generalize well to OOD.
@_hanlin_zhang_
Hanlin Zhang
6 days
[9/n] Takeaway 6️⃣ - “As domain-specific CPT data increase, in-domain downstream performance steadily improves, and the SFT models could benefit more from RL finetuning.” Scaling CPT from 2B → 42B tokens = monotonic ID performance gains. Plus:
🟡 RL helps more when CPT is …
@_hanlin_zhang_
Hanlin Zhang
6 days
[8/n] Takeaway 5️⃣ - “Domain-specific post-training should be supported by adequate domain-specific CPT data: without it, SFT performance remains suboptimal and RL can even degrade such performance.” Without CPT, even strong pre-training leads to poor downstream performance.
@_hanlin_zhang_
Hanlin Zhang
6 days
[7/n] Takeaway 4️⃣ - “Continued pre-training on domain-specific data induces catastrophic forgetting of pre-trained knowledge, which could harm both upstream and downstream performance, while incorporating a small replay budget (e.g. 5%) could effectively mitigate this …”
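A minimal sketch of what a replay budget during CPT could look like: with probability replay_frac, substitute a general-domain batch for a domain batch so earlier knowledge keeps getting refreshed. The stream names are placeholders, not EvoLM's data loader:

```python
import random

def mixed_batches(domain_stream, general_stream, replay_frac=0.05, seed=0):
    rng = random.Random(seed)
    for domain_batch in domain_stream:
        if rng.random() < replay_frac:
            yield next(general_stream)  # replayed general-domain batch
        else:
            yield domain_batch

# Toy check: roughly 5% of 10,000 batches come from the general stream.
domain = iter(["math"] * 10_000)
general = iter(["general"] * 10_000)
counts = {"math": 0, "general": 0}
for batch in mixed_batches(domain, general, replay_frac=0.05):
    counts[batch] += 1
print(counts)
```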
@_hanlin_zhang_
Hanlin Zhang
6 days
[6/n] Takeaway 3️⃣ - “Under limited pre-training budgets, smaller post-trained models can even outperform larger counterparts. Conversely, once pre-training tokens reach the saturation regime, increasing model size enables clear improvements in both in-domain performance and OOD generalization.”
@_hanlin_zhang_
Hanlin Zhang
6 days
[5/n] Takeaway 2️⃣ - “Excessive general-domain pre-training does not always improve domain-specific post-training and might even cause performance degradation on some downstream tasks.” We evaluated SFT and RL models initialized from various pre-training budgets. Beyond 80–160B tokens, …
@_hanlin_zhang_
Hanlin Zhang
6 days
[4/n] Takeaway 1️⃣ - “>16x Chinchilla general-domain pre-training improves upstream performance but with diminishing returns.” We pre-trained models on 10B–320B tokens. Upstream cloze accuracy (e.g., HellaSwag, PIQA) improves until ~80–160x model size, then flattens. Example: 1B …
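The arithmetic behind “>16x Chinchilla”, using the usual rule of thumb of roughly 20 training tokens per parameter (the 1B-model framing is taken from this tweet):

```python
# A 1B-parameter model is compute-optimal near 20B tokens under the
# Chinchilla heuristic, so the thread's 320B-token budget is a 16x overshoot.
params = 1e9
chinchilla_tokens = 20 * params  # ~20B tokens for a 1B model
for budget in (10e9, 80e9, 160e9, 320e9):
    print(f"{budget / 1e9:>4.0f}B tokens = {budget / chinchilla_tokens:4.1f}x Chinchilla")
```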
@_hanlin_zhang_
Hanlin Zhang
6 days
[3/n] In EvoLM, we:
✅ Build a fully transparent and reproducible model suite for studying LM training
✅ Quantify how each training phase contributes to upstream cloze task performance and downstream generative task performance, considering both in-domain and out-of-domain settings.
@_hanlin_zhang_
Hanlin Zhang
6 days
[2/n] We train 100+ decoder-only LMs (1B/4B) from scratch, across four training stages:
🟦 Pre-training
🟩 Continued Pre-Training (CPT)
🟨 Supervised Fine-Tuning (SFT)
🟥 Reinforcement Learning (RL)
Under controlled conditions and with full transparency regarding the data and …
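A declarative view of those four stages; the stage names come from this tweet, the budget fields echo numbers quoted elsewhere in the thread, and everything else is an illustrative placeholder rather than EvoLM's config format:

```python
PIPELINE = [
    {"stage": "pretrain", "data": "general corpus",          "budget": "10B-320B tokens"},
    {"stage": "cpt",      "data": "domain corpus + replay",  "budget": "2B-42B tokens"},
    {"stage": "sft",      "data": "supervised examples",     "budget": "50K-400K examples"},
    {"stage": "rl",       "data": "reward-labeled rollouts", "budget": "~100K examples"},
]
for step in PIPELINE:
    print(f"{step['stage']:>8}: {step['data']} ({step['budget']})")
```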
@_hanlin_zhang_
Hanlin Zhang
6 days
[1/n] Discussions about LM reasoning and post-training have gained momentum. We identify several missing pieces:
✏️ Post-training based on off-the-shelf base models without transparent pre-training data components and scale
✏️ Intermediate checkpoints with incomplete learning …
@_hanlin_zhang_
Hanlin Zhang
7 days
RT @ori_press: Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize…
@_hanlin_zhang_
Hanlin Zhang
21 days
[7/n] Got no A100s/H100s in the lab? Observational studies 🔬 can play a role by generating hypotheses that align with, and can later be validated by, rigorous experiments 🧪 conducted elsewhere. With increased standardization, researchers may make progress with more scalable …