Yifan Zhang

@yifan_zhang_

Followers 4K · Following 7K · Media 76 · Statuses 433

PhD student at @Princeton University, Princeton AI Lab Fellow, focusing on LLMs. Language Modeling & Pretraining, LLM Reasoning & RL. Prev @ Seed @Tsinghua_IIIS

New York Metropolitan Area
Joined October 2022
@yifan_zhang_
Yifan Zhang
1 day
💡How to train a frontier model effectively? 1. Pretrain a gigantic MoE model from scratch using full attention (GQA or TPA https://t.co/5AJoEjl6oH [1]) mixed with some short SWA (https://t.co/tvLI2VQgWu, used in GPT-OSS) or (Higher-order) Linear Attention
yifanzhang-pro.github.io
Why Short Sliding Window Attention Will Replace ShortConv in Modern Architectures.
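A minimal sketch of the kind of layer stack the recipe above describes: full-attention layers interleaved with short sliding-window attention (SWA) layers. Everything here is illustrative PyTorch; the layer classes, the 1-in-4 full-attention ratio, and the window size are assumptions, not any production model's configuration.

```python
# Hedged sketch: interleave full attention with short sliding-window attention (SWA).
# Simplified stand-in layers only; real stacks add MoE FFN blocks, norms, RoPE, etc.
import torch
import torch.nn as nn

class FullAttention(nn.Module):
    """Global self-attention layer (simplified, non-causal for brevity)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out  # residual connection

class SlidingWindowAttention(FullAttention):
    """Short SWA: each token attends only to a local causal window of keys."""
    def __init__(self, d_model: int, n_heads: int, window: int = 128):
        super().__init__(d_model, n_heads)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        idx = torch.arange(T, device=x.device)
        dist = idx[:, None] - idx[None, :]          # query index minus key index
        mask = (dist < 0) | (dist >= self.window)    # True = not allowed to attend
        out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        return x + out

def hybrid_stack(n_layers: int, d_model: int, n_heads: int, full_every: int = 4) -> nn.ModuleList:
    """One full-attention layer every `full_every` layers; short SWA elsewhere."""
    return nn.ModuleList(
        FullAttention(d_model, n_heads) if i % full_every == 0
        else SlidingWindowAttention(d_model, n_heads)
        for i in range(n_layers)
    )
```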
@yifan_zhang_
Yifan Zhang
4 hours
FYI
@DeltaInstitutes
Delta Institute
15 hours
🚨 Academics don’t have enough compute! We’re solving this by connecting seekers (those who need more compute) with providers (companies with extra compute and GPU providers). Fill out the relevant form below! Seeker: https://t.co/r2t36IpSPk Provider:
0
2
12
@yifan_zhang_
Yifan Zhang
14 hours
True, Muon is Shampoo without Momentum.
@Gradientdinner
mikail
14 hours
@kalomaze @eliebakouch @swyx @yifan_zhang_ Muon is shampoo 😉
2
1
45
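A rough sketch of the operation the exchange above is about: Muon-style updates replace the (momentum-averaged) gradient matrix with a nearby semi-orthogonal matrix, which can be computed with a Newton-Schulz iteration. The cubic iteration below is a generic textbook variant in PyTorch, not any particular implementation's tuned polynomial.

```python
# Hedged sketch: orthogonalize a gradient matrix G with a cubic Newton-Schulz
# iteration, the core step in Muon-style updates. Generic variant for intuition;
# production code uses tuned higher-order polynomial coefficients.
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    X = G / (G.norm() + 1e-7)           # normalize so singular values are <= 1
    transpose = X.size(0) > X.size(1)
    if transpose:                        # work in the "wide" orientation (smaller Gram matrix)
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X        # cubic step pushing singular values toward 1
    return X.T if transpose else X

# Usage: apply to the momentum buffer before the weight update.
G = torch.randn(256, 512)
W_update = newton_schulz_orthogonalize(G)   # approximately the nearest semi-orthogonal matrix to G
```

Roughly speaking, Shampoo instead maintains accumulated left/right preconditioner statistics (sums of G Gᵀ and Gᵀ G) and preconditions the gradient with their inverse roots, which is where comparisons like the one in this thread come from.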
@yifan_zhang_
Yifan Zhang
15 hours
Keep Scaling!
@wenhaocha1
Wenhao Chai
15 hours
One potential upside is that during PRO training, we don’t need to overthink the efficiency–performance trade-off; the only goal is to make the model stronger.
0
0
21
@yifan_zhang_
Yifan Zhang
19 hours
🫡
@_arohan_
rohan anil
20 hours
@eliebakouch @swyx @yifan_zhang_ Yes, it’s also used for Gemini Nano v1. The idea was hatched in a casual chat over GVC between Sergey B and me right after PaLM 2 training, but it took a while to land in Gemini.
0
0
7
@yifan_zhang_
Yifan Zhang
20 hours
Marked.
@eliebakouch
elie
20 hours
@swyx @yifan_zhang_ They also used it in a previous iteration of Gemini; I think it's clear here that they are talking about logit-based distillation (or co-distillation), imo.
1
0
21
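For readers following along, a minimal sketch of what logit-based distillation pretraining typically looks like: the student is trained on a mix of the usual next-token cross-entropy and a KL term toward the teacher's temperature-scaled token distribution. The setup is generic PyTorch; the mixing weight `alpha` and temperature `tau` are illustrative, and none of this is a claim about Gemini's actual recipe.

```python
# Hedged sketch of a logit-based distillation loss: data cross-entropy plus
# KL divergence toward a teacher's temperature-scaled token distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, tau=1.0):
    # student_logits, teacher_logits: (batch, seq, vocab); targets: (batch, seq)
    ce = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten())
    s = F.log_softmax(student_logits / tau, dim=-1).flatten(0, 1)
    t = F.log_softmax(teacher_logits / tau, dim=-1).flatten(0, 1)
    # KL(teacher || student); tau^2 keeps gradient scale comparable across temperatures
    kl = F.kl_div(s, t, log_target=True, reduction="batchmean") * tau**2
    return (1 - alpha) * ce + alpha * kl
```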
@yifan_zhang_
Yifan Zhang
1 day
Scaling (Pro), then distilling (Flash): the future is clear.
@yifan_zhang_
Yifan Zhang
1 day
Confirmed that Gemini 3 Flash uses Distillation Pretraining. Awesome!
3
4
92
@yifan_zhang_
Yifan Zhang
1 day
Confirmed that Gemini 3 Flash uses Distillation Pretraining. Awesome!
@arnaud_autef
Arnaud Autef
5 days
Gemini 3 Flash was my first release as distillation TL. As it turns out, our bets paid off and Flash is a huge success! Very grateful to work with such a talented team, to @FeinbergVlad for the trust and leadership, and excited to keep pushing: there is so much more coming ⚡️⚡️⚡️
13
19
467
@yifan_zhang_
Yifan Zhang
2 days
Just got early access to Minimax M2.1 from @MiniMax__AI. Vibe coding with Minimax M2.1 to create an SVG illustration of the minimax theorem used in deep-learning adversarial robustness, shown below. It’s great!
1
6
49
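For context, the minimax formulation usually meant in adversarial robustness is the saddle-point objective of adversarial training. The notation below is generic and illustrative; the norm ball and loss are standard choices, not anything specific to the tweet above.

```latex
% Saddle-point (minimax) objective of adversarial training: the inner max finds a
% worst-case perturbation delta within an epsilon-ball, the outer min trains against it.
% f_theta is the model, L the loss, D the data distribution.
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Bigl[ \max_{\|\delta\|_{\infty}\le\epsilon} \mathcal{L}\bigl(f_{\theta}(x+\delta),\, y\bigr) \Bigr]
```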
@yifan_zhang_
Yifan Zhang
2 days
🚀 Glad to see Claude Code v2.0.74 add an LSP (Language Server Protocol) tool for code intelligence features such as go-to-definition, find references, and hover documentation! Please see our previous position paper, Language Server CLI Empowers Language Agents with Process
1
7
89
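As background on what such an LSP tool wraps: the Language Server Protocol is JSON-RPC with Content-Length framing, and go-to-definition is the textDocument/definition request. Below is a minimal, generic sketch of that wire format in Python; it is not Claude Code's internal API, and the file path is made up.

```python
# Hedged sketch: the JSON-RPC message an LSP client sends for "go to definition"
# (textDocument/definition). Generic wire format only.
import json

def definition_request(request_id: int, file_uri: str, line: int, character: int) -> bytes:
    body = json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": file_uri},                    # e.g. "file:///path/to/module.py"
            "position": {"line": line, "character": character},   # zero-based
        },
    }).encode("utf-8")
    # LSP frames each message with a Content-Length header followed by a blank line
    return f"Content-Length: {len(body)}\r\n\r\n".encode("utf-8") + body

msg = definition_request(1, "file:///tmp/example.py", line=10, character=4)
```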
@yifan_zhang_
Yifan Zhang
2 days
Hi Mayank, will Expert Parallel be supported for this awesome project?
@MayankMish98
Mayank Mishra
4 days
We cooked!🚀🚀🚀 Releasing SonicMoE: a fast MoE implementation for NVIDIA H100 GPUs. Special thanks to my collaborators: @WentaoGuo7 @XinleC295 @istoica05 and @tri_dao, from whom I learnt a lot!
1
1
35
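For context on the question above: expert parallelism shards an MoE layer's experts across devices, so each rank holds only a subset of experts and tokens are dispatched to whichever rank owns their routed expert. Below is a single-device, generic top-k routing sketch of the computation that would get sharded; it is illustrative only and has nothing to do with SonicMoE's actual kernels.

```python
# Hedged sketch of top-k MoE routing: a router scores experts per token, the top-k
# experts process each token, and their outputs are combined with the router weights.
# Expert parallelism would shard `self.experts` across GPUs; this version only
# illustrates the computation being partitioned.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```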
@yifan_zhang_
Yifan Zhang
3 days
SOMETHING REALLY HUGE
0
0
6
@yifan_zhang_
Yifan Zhang
3 days
Long context and attention: what’s next? WE WILL SEE.
@Hangsiin
NomoreID
4 days
Sebastian Borgeaud (lead for Gemini pre-training @GoogleDeepMind, @borgeaud_s) said he expects that, over the next year, there will be substantial innovation in pre-training aimed at making long-context capabilities more efficient and extending models’ context lengths even
2
0
16
@yifan_zhang_
Yifan Zhang
3 days
Math Provers joined the Scaling Games.
@GanjinZero
Zheng Yuan
3 days
Excited to announce Seed-Prover 1.5, which is trained via large-scale agentic RL with Lean. It proved 580/660 Putnam problems and 11/12 in Putnam 2025 within 9 hours. Check details at https://t.co/4N650v3iH8. We will work on autoformalization, towards contributing to real math!
0
1
34
@yifan_zhang_
Yifan Zhang
4 days
⚡️Updates on Our First Merged PR https://t.co/x4d0xBk6tJ by Ryan @gapDEEPry: a chunkwise-parallel implementation of MEA in Triton is now available! 10x faster than softmax attention at 1M context length!🚀
@yifan_zhang_
Yifan Zhang
7 days
🚀 Some interesting ideas about Matrix Exponential Attention (MEA). MEA approximates the matrix exponential of attention scores via a truncated Taylor series. By leveraging the state-space realization of Higher-order Linear Attention (HLA), MEA computes high-order interaction
5
7
66
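For intuition on the building block named above: a truncated Taylor series approximates the matrix exponential as exp(A) ≈ I + A + A²/2! + … + A^K/K!. The tiny dense NumPy reference below shows only that approximation, nothing like the chunkwise-parallel Triton kernel or the HLA state-space machinery described in the tweets.

```python
# Hedged sketch: truncated Taylor approximation of the matrix exponential,
# exp(A) ~= sum_{k=0..K} A^k / k!, built incrementally to avoid explicit factorials.
import numpy as np

def matrix_exp_taylor(A: np.ndarray, order: int = 4) -> np.ndarray:
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, order + 1):
        term = term @ A / k        # term is now A^k / k!
        out = out + term
    return out

rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((8, 8))
approx = matrix_exp_taylor(A, order=4)
ref = matrix_exp_taylor(A, order=20)       # higher-order truncation as a sanity check
print(np.abs(approx - ref).max())          # small for well-scaled A
```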