Yifan Zhang

@yifan_zhang_

Followers 4K · Following 7K · Media 76 · Statuses 433

PhD student at @Princeton University, Princeton AI Lab Fellow, focusing on LLMs. Language Modeling & Pretraining, LLM Reasoning & RL. Prev @ Seed @Tsinghua_IIIS

New York Metropolitan Area
Joined October 2022
@yifan_zhang_
Yifan Zhang
1 day
💡How to train a frontier model effectively? 1. Pretrain a gigantic MoE model from scratch using full attention (GQA or TPA https://t.co/5AJoEjl6oH [1]) mixed with some short SWA (https://t.co/tvLI2VQgWu, used in GPT-OSS) or (Higher-order) Linear Attention
yifanzhang-pro.github.io
Why Short Sliding Window Attention Will Replace ShortConv in Modern Architectures.
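A minimal sketch of the kind of layer stack the recipe above describes: full-attention layers interleaved with short sliding-window attention (SWA) layers. Everything here is illustrative PyTorch; the layer classes, the 1-in-4 full-attention ratio, and the window size are assumptions, not any production model's configuration.

```python
# Hedged sketch: interleave full attention with short sliding-window attention (SWA).
# Simplified stand-in layers only; real stacks add MoE FFN blocks, norms, RoPE, etc.
import torch
import torch.nn as nn

class FullAttention(nn.Module):
    """Global self-attention layer (simplified, non-causal for brevity)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out  # residual connection

class SlidingWindowAttention(FullAttention):
    """Short SWA: each token attends only to a local causal window of keys."""
    def __init__(self, d_model: int, n_heads: int, window: int = 128):
        super().__init__(d_model, n_heads)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        idx = torch.arange(T, device=x.device)
        dist = idx[:, None] - idx[None, :]          # query index minus key index
        mask = (dist < 0) | (dist >= self.window)    # True = not allowed to attend
        out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        return x + out

def hybrid_stack(n_layers: int, d_model: int, n_heads: int, full_every: int = 4) -> nn.ModuleList:
    """One full-attention layer every `full_every` layers; short SWA elsewhere."""
    return nn.ModuleList(
        FullAttention(d_model, n_heads) if i % full_every == 0
        else SlidingWindowAttention(d_model, n_heads)
        for i in range(n_layers)
    )
```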
@yifan_zhang_
Yifan Zhang
4 hours
FYI
@DeltaInstitutes
Delta Institute
15 hours
🚨 Academics don’t have enough compute! We’re solving this by connecting seekers (those who need more compute) with providers (companies with extra compute and GPU providers). Fill out the relevant form below! Seeker: https://t.co/r2t36IpSPk Provider:
0
2
12
@yifan_zhang_
Yifan Zhang
14 hours
True, Muon is Shampoo without Momentum.
@Gradientdinner
mikail
14 hours
@kalomaze @eliebakouch @swyx @yifan_zhang_ Muon is shampoo 😉
2
1
45
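A rough sketch of the operation the exchange above is about: Muon-style updates replace the (momentum-averaged) gradient matrix with a nearby semi-orthogonal matrix, which can be computed with a Newton-Schulz iteration. The cubic iteration below is a generic textbook variant in PyTorch, not any particular implementation's tuned polynomial.

```python
# Hedged sketch: orthogonalize a gradient matrix G with a cubic Newton-Schulz
# iteration, the core step in Muon-style updates. Generic variant for intuition;
# production code uses tuned higher-order polynomial coefficients.
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    X = G / (G.norm() + 1e-7)           # normalize so singular values are <= 1
    transpose = X.size(0) > X.size(1)
    if transpose:                        # work in the "wide" orientation (smaller Gram matrix)
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X        # cubic step pushing singular values toward 1
    return X.T if transpose else X

# Usage: apply to the momentum buffer before the weight update.
G = torch.randn(256, 512)
W_update = newton_schulz_orthogonalize(G)   # approximately the nearest semi-orthogonal matrix to G
```

Roughly speaking, Shampoo instead maintains accumulated left/right preconditioner statistics (sums of G Gᵀ and Gᵀ G) and preconditions the gradient with their inverse roots, which is where comparisons like the one in this thread come from.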
@yifan_zhang_
Yifan Zhang
15 hours
Keep Scaling!
@wenhaocha1
Wenhao Chai
15 hours
One potential upside is that during PRO training, we don’t need to overthink the efficiency–performance trade-off; the only goal is to make the model stronger.
0
0
21
@yifan_zhang_
Yifan Zhang
19 hours
🫡
@_arohan_
rohan anil
20 hours
@eliebakouch @swyx @yifan_zhang_ Yes, it’s also used for Gemini Nano v1. The idea was hatched in a casual chat over GVC between Sergey B and me right after PaLM 2 training, but it took a while to land in Gemini.
0
0
7
@yifan_zhang_
Yifan Zhang
20 hours
Marked.
@eliebakouch
elie
20 hours
@swyx @yifan_zhang_ They also used it in a previous iteration of Gemini; I think it's clear here that they are talking about logit-based distillation (or co-distillation), imo.
1
0
21
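For readers following along, a minimal sketch of what logit-based distillation pretraining typically looks like: the student is trained on a mix of the usual next-token cross-entropy and a KL term toward the teacher's temperature-scaled token distribution. The setup is generic PyTorch; the mixing weight `alpha` and temperature `tau` are illustrative, and none of this is a claim about Gemini's actual recipe.

```python
# Hedged sketch of a logit-based distillation loss: data cross-entropy plus
# KL divergence toward a teacher's temperature-scaled token distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, tau=1.0):
    # student_logits, teacher_logits: (batch, seq, vocab); targets: (batch, seq)
    ce = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten())
    s = F.log_softmax(student_logits / tau, dim=-1).flatten(0, 1)
    t = F.log_softmax(teacher_logits / tau, dim=-1).flatten(0, 1)
    # KL(teacher || student); tau^2 keeps gradient scale comparable across temperatures
    kl = F.kl_div(s, t, log_target=True, reduction="batchmean") * tau**2
    return (1 - alpha) * ce + alpha * kl
```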
@yifan_zhang_
Yifan Zhang
1 day
Scaling (Pro), then distilling (Flash): the future is clear.
@yifan_zhang_
Yifan Zhang
1 day
Confirmed that Gemini 3 Flash uses Distillation Pretraining. Awesome!
3
4
92
@yifan_zhang_
Yifan Zhang
1 day
Confirmed that Gemini 3 Flash uses Distillation Pretraining. Awesome!
@arnaud_autef
Arnaud Autef
5 days
Gemini 3 Flash was my first release as distillation TL. As it turns out, our bets paid off and Flash is a huge success! Very grateful to work with such a talented team, to @FeinbergVlad for the trust and leadership, and excited to keep pushing: there is so much more coming ⚡️⚡️⚡️
13
19
467
@yifan_zhang_
Yifan Zhang
2 days
Just got early access to Minimax M2.1 from @MiniMax__AI. Vibe coding with Minimax M2.1 to create an SVG illustration of the minimax theorem used in deep-learning adversarial robustness, shown below. It’s great!
1
6
49
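For context, the minimax formulation usually meant in adversarial robustness is the saddle-point objective of adversarial training. The notation below is generic and illustrative; the norm ball and loss are standard choices, not anything specific to the tweet above.

```latex
% Saddle-point (minimax) objective of adversarial training: the inner max finds a
% worst-case perturbation delta within an epsilon-ball, the outer min trains against it.
% f_theta is the model, L the loss, D the data distribution.
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Bigl[ \max_{\|\delta\|_{\infty}\le\epsilon} \mathcal{L}\bigl(f_{\theta}(x+\delta),\, y\bigr) \Bigr]
```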
@yifan_zhang_
Yifan Zhang
2 days
🚀 Glad to see Claude Code v2.0.74 add an LSP (Language Server Protocol) tool for code intelligence features such as go-to-definition, find references, and hover documentation! Please see our previous position paper, Language Server CLI Empowers Language Agents with Process
1
7
89
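As background on what such an LSP tool wraps: the Language Server Protocol is JSON-RPC with Content-Length framing, and go-to-definition is the textDocument/definition request. Below is a minimal, generic sketch of that wire format in Python; it is not Claude Code's internal API, and the file path is made up.

```python
# Hedged sketch: the JSON-RPC message an LSP client sends for "go to definition"
# (textDocument/definition). Generic wire format only.
import json

def definition_request(request_id: int, file_uri: str, line: int, character: int) -> bytes:
    body = json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": file_uri},                    # e.g. "file:///path/to/module.py"
            "position": {"line": line, "character": character},   # zero-based
        },
    }).encode("utf-8")
    # LSP frames each message with a Content-Length header followed by a blank line
    return f"Content-Length: {len(body)}\r\n\r\n".encode("utf-8") + body

msg = definition_request(1, "file:///tmp/example.py", line=10, character=4)
```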
@yifan_zhang_
Yifan Zhang
2 days
Hi Mayank, will Expert Parallel be supported for this awesome project?
@MayankMish98
Mayank Mishra
4 days
We cooked!🚀🚀🚀 Releasing SonicMoE: a fast MoE implementation for NVIDIA H100 GPUs. Special thanks to my collaborators: @WentaoGuo7 @XinleC295 @istoica05 and @tri_dao, from whom I learnt a lot!
1
1
35
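For context on the question above: expert parallelism shards an MoE layer's experts across devices, so each rank holds only a subset of experts and tokens are dispatched to whichever rank owns their routed expert. Below is a single-device, generic top-k routing sketch of the computation that would get sharded; it is illustrative only and has nothing to do with SonicMoE's actual kernels.

```python
# Hedged sketch of top-k MoE routing: a router scores experts per token, the top-k
# experts process each token, and their outputs are combined with the router weights.
# Expert parallelism would shard `self.experts` across GPUs; this version only
# illustrates the computation being partitioned.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```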
@yifan_zhang_
Yifan Zhang
3 days
SOMETHING REALLY HUGE
0
0
6
@yifan_zhang_
Yifan Zhang
3 days
Long context and attention: what’s next? WE WILL SEE.
@Hangsiin
NomoreID
4 days
Sebastian Borgeaud (lead for Gemini pre-training @GoogleDeepMind, @borgeaud_s) said he expects that, over the next year, there will be substantial innovation in pre-training aimed at making long-context capabilities more efficient and extending models’ context lengths even
2
0
16
@yifan_zhang_
Yifan Zhang
3 days
Math Provers joined the Scaling Games.
@GanjinZero
Zheng Yuan
3 days
Excited to announce Seed-Prover 1.5, which is trained via large-scale agentic RL with Lean. It proved 580/660 Putnam problems and 11/12 in Putnam 2025 within 9 hours. Check details at https://t.co/4N650v3iH8. We will work on autoformalization, towards contributing to real math!
0
1
34
@yifan_zhang_
Yifan Zhang
4 days
⚡️Updates on Our First Merged PR https://t.co/x4d0xBk6tJ by Ryan @gapDEEPry: a chunkwise-parallel implementation of MEA in Triton is now available! 10x faster than softmax attention at 1M context length!🚀
@yifan_zhang_
Yifan Zhang
7 days
🚀 Some interesting ideas about Matrix Exponential Attention (MEA). MEA approximates the matrix exponential of attention scores via a truncated Taylor series. By leveraging the state-space realization of Higher-order Linear Attention (HLA), MEA computes high-order interaction
5
7
66
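For intuition on the building block named above: a truncated Taylor series approximates the matrix exponential as exp(A) ≈ I + A + A²/2! + … + A^K/K!. The tiny dense NumPy reference below shows only that approximation, nothing like the chunkwise-parallel Triton kernel or the HLA state-space machinery described in the tweets.

```python
# Hedged sketch: truncated Taylor approximation of the matrix exponential,
# exp(A) ~= sum_{k=0..K} A^k / k!, built incrementally to avoid explicit factorials.
import numpy as np

def matrix_exp_taylor(A: np.ndarray, order: int = 4) -> np.ndarray:
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, order + 1):
        term = term @ A / k        # term is now A^k / k!
        out = out + term
    return out

rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((8, 8))
approx = matrix_exp_taylor(A, order=4)
ref = matrix_exp_taylor(A, order=20)       # higher-order truncation as a sanity check
print(np.abs(approx - ref).max())          # small for well-scaled A
```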