Yifan Zhang
@yifan_zhang_
Followers
4K
Following
7K
Media
76
Statuses
433
PhD student at @Princeton University, Princeton AI Lab Fellow, focusing on LLMs. Language Modeling & Pretraining, LLM Reasoning & RL. Prev @ Seed @Tsinghua_IIIS
New York Metropolitan Area
Joined October 2022
💡How to train a frontier model effectively? 1. Pretrain a gigantic MoE model from scratch using full attention (GQA or TPA https://t.co/5AJoEjl6oH [1]) mixed with some short SWA (https://t.co/tvLI2VQgWu, used in GPT-OSS) or (Higher-order) Linear Attention
yifanzhang-pro.github.io
Why Short Sliding Window Attention Will Replace ShortConv in Modern Architectures.
5
31
386
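For readers who want something concrete behind the recipe above, here is a minimal sketch of a hybrid decoder stack that interleaves full attention with short sliding-window attention. Everything in it is an illustrative assumption: standard multi-head attention stands in for GQA/TPA, and the layer ratio, window size, and module names are made up, not the tweet's actual recipe.

```python
# Hypothetical sketch of a hybrid decoder stack: full (global) attention layers
# interleaved with short sliding-window attention (SWA) layers.
# Standard multi-head attention stands in for GQA/TPA; all sizes and the
# full/SWA ratio are illustrative assumptions, not the tweet's actual recipe.
from typing import Optional

import torch
import torch.nn as nn


class DecoderLayer(nn.Module):
    """One pre-norm decoder block; `window` limits attention to a local span (SWA)."""

    def __init__(self, d_model: int, n_heads: int, window: Optional[int] = None):
        super().__init__()
        self.window = window  # None => full (global) attention
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        idx = torch.arange(seq_len, device=x.device)
        # Causal mask (True = not allowed to attend); SWA layers additionally
        # mask out tokens further back than `window` positions.
        mask = idx[None, :] > idx[:, None]
        if self.window is not None:
            mask = mask | ((idx[:, None] - idx[None, :]) >= self.window)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))


def build_hybrid_stack(n_layers: int = 12, d_model: int = 512, n_heads: int = 8,
                       swa_window: int = 128, full_every: int = 4) -> nn.ModuleList:
    """Every `full_every`-th layer uses full attention; the rest use short SWA."""
    return nn.ModuleList(
        DecoderLayer(d_model, n_heads,
                     window=None if (i + 1) % full_every == 0 else swa_window)
        for i in range(n_layers)
    )
```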
FYI
🚨 Academics don’t have enough compute! We’re solving this by connecting seekers (those who need more compute) with providers (companies with extra compute and GPU providers). Fill out the relevant form below! Seeker: https://t.co/r2t36IpSPk Provider:
0
2
12
🫡
@eliebakouch @swyx @yifan_zhang_ Yes, it’s also used for Gemini Nano v1. The idea was baked in a casual chat between Sergey B and me right after PaLM 2 training over GVC, but it took a while to land in Gemini.
0
0
7
Marked.
@swyx @yifan_zhang_ they also used it in a previous iteration of Gemini; I think here it's clear that they are talking about logit-based distillation (or co-distillation) imo
1
0
21
Confirmed that Gemini 3 Flash uses Distillation Pretraining. Awesome!
Gemini 3 Flash was my first release as distillation TL. As it turns out, our bets paid off and Flash is a huge success! Very grateful to work on such a talented team, to @FeinbergVlad for the trust and leadership, and excited to keep pushing: there is so much more coming ⚡️⚡️⚡️
13
19
467
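As a rough illustration of the "logit-based distillation" mentioned in the replies above, here is a minimal sketch of a standard distillation loss: the student matches the teacher's softened logits via a KL term blended with ordinary cross-entropy. The temperature, weighting, and function names are assumptions for illustration, not Gemini's actual setup.

```python
# Hypothetical sketch of logit-based distillation (not Gemini's actual recipe):
# the student matches the teacher's softened next-token distribution via KL,
# blended with the standard cross-entropy loss on the ground-truth tokens.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """student_logits, teacher_logits: (batch, vocab); targets: (batch,)."""
    # Soft targets from the teacher at temperature T.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 so gradient magnitudes are preserved.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    # Ordinary cross-entropy against the ground-truth tokens.
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce
```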
Just got early access to Minimax M2.1 from @MiniMax__AI. Vibe coding with Minimax M2.1 to create an SVG illustration of the Minimax Theorem used in deep learning adversarial robustness, shown below. It’s great!
1
6
49
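For context on what the SVG above presumably depicts: adversarial robustness is usually posed as the saddle-point (minimax) problem below, where the outer minimization fits the model and the inner maximization searches for the worst-case perturbation. This is the textbook formulation, stated here only as an assumption about the illustration.

```latex
% Standard minimax objective for adversarial robustness: fit parameters \theta
% against the worst-case perturbation \delta inside an \epsilon-ball.
\[
  \min_{\theta}\;
  \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[\,
    \max_{\|\delta\|_{p}\le\epsilon}
    \mathcal{L}\big(f_{\theta}(x+\delta),\,y\big)
  \Big]
\]
```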
🚀 Glad to see Claude Code v2.0.74 add an LSP (Language Server Protocol) tool for code intelligence features such as go-to-definition, find references, and hover documentation! Please see our previous position paper, Language Server CLI Empowers Language Agents with Process
1
7
89
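To make the LSP features above concrete, here is a sketch of the JSON-RPC message an editor or agent sends to a language server for go-to-definition. It shows only the protocol shape (the file URI and position are hypothetical); it is not Claude Code's internal tool API.

```python
# Illustrative sketch of the LSP request behind "go-to-definition":
# a client sends a JSON-RPC `textDocument/definition` message to a language
# server over stdio, framed with a Content-Length header per the LSP base protocol.
import json


def lsp_frame(payload: dict) -> bytes:
    """Frame a JSON-RPC message with the Content-Length header LSP requires."""
    body = json.dumps(payload).encode("utf-8")
    return f"Content-Length: {len(body)}\r\n\r\n".encode("ascii") + body


definition_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/definition",
    "params": {
        "textDocument": {"uri": "file:///path/to/project/main.py"},  # hypothetical file
        "position": {"line": 41, "character": 10},  # zero-based line/character
    },
}

print(lsp_frame(definition_request).decode("utf-8"))
```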
Hi Mayank, will Expert Parallelism be supported for this awesome project?
We cooked!🚀🚀🚀 Releasing SonicMoE: a fast MoE implementation for NVIDIA H100 GPUs. Special thanks to my collaborators: @WentaoGuo7 @XinleC295 @istoica05 and @tri_dao, from whom I learnt a lot!
1
1
35
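For readers new to the terms in the two tweets above, here is a toy top-k MoE layer; expert parallelism, as asked about above, would shard the expert list across GPUs and route tokens with an all-to-all exchange. This is a generic single-device illustration, not SonicMoE's H100 kernels.

```python
# Toy top-k MoE layer (generic illustration, not SonicMoE's implementation).
# Expert parallelism would shard `self.experts` across devices and exchange
# routed tokens with an all-to-all; here everything runs on one device.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (tokens, d_model). Each token is sent to its top-k experts."""
        gate_logits = self.router(x)                           # (tokens, n_experts)
        weights, expert_ids = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)                   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = expert_ids[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```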
Long context and attention: what’s next? WE WILL SEE.
Sebastian Borgeaud (lead for Gemini pre-training @GoogleDeepMind, @borgeaud_s) said he expects that, over the next year, there will be substantial innovation in pre-training aimed at making long-context capabilities more efficient and extending models’ context lengths even
2
0
16
Math Provers joined the Scaling Games.
Excited to announce Seed-Prover 1.5, which is trained via large-scale agentic RL with Lean. It proved 580/660 Putnam problems and 11/12 in Putnam 2025 within 9 hours. Check details at https://t.co/4N650v3iH8. We will work on autoformalization towards contributing to real math!
0
1
34
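For readers unfamiliar with what "agentic RL with Lean" produces, here is a toy Lean 4 example of the kind of artifact such a prover emits: a formal statement plus a machine-checkable proof. The statement is deliberately trivial; real Putnam formalizations are far more involved and typically build on Mathlib.

```lean
-- Toy illustration of the artifact a Lean-based prover emits:
-- a formal statement together with a machine-checkable proof term.
-- Real Putnam formalizations are far longer and build on Mathlib.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```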
⚡️Updates on Our First Merged PR https://t.co/x4d0xBk6tJ by Ryan @gapDEEPry: A chunkwise parallel implementation of MEA in Triton is now available! 10x faster than Softmax Attention under 1M context length!🚀
🚀 Some interesting ideas about Matrix Exponential Attention (MEA). MEA approximates the matrix exponential of attention scores via a truncated Taylor series. By leveraging the state-space realization of Higher-order Linear Attention (HLA), MEA computes high-order interaction
5
7
66
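For reference on the quoted description of MEA: the matrix exponential of a score matrix A and its truncated Taylor series are written out below, and the higher-order powers of A are what carry interactions beyond standard first-order attention. This is only the generic definition, not MEA's or HLA's exact parameterization.

```latex
% Generic truncated Taylor expansion of the matrix exponential of a score
% matrix A; MEA's exact parameterization (via the HLA state-space realization)
% is not reproduced here -- this is only the textbook definition it builds on.
\[
  \exp(A) \;=\; \sum_{j=0}^{\infty} \frac{A^{j}}{j!}
  \;\approx\; I + A + \frac{A^{2}}{2!} + \cdots + \frac{A^{k}}{k!}
\]
```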