Zitong Yang
@ZitongYang0
Followers: 1K · Following: 577 · Media: 24 · Statuses: 360
Continually self-improving AI
Mountain View, CA
Joined November 2018
📜 Paper on a new pretraining paradigm: Synthetic Bootstrapped Pretraining (SBP). SBP goes beyond next-token supervision within a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T tokens of data + a 3B model trained from scratch. 🧵
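As a rough sketch of the idea as described in the tweet (not the paper's actual recipe): mine correlated document pairs from the pretraining corpus, teach the same model to synthesize one document conditioned on a related one, then mix the synthesized documents back into pretraining. The interfaces below (`embed`, `base_lm`, `pretrain`) are hypothetical placeholders, not the paper's API.

```python
# Hypothetical sketch of the SBP pipeline described in the tweet. All interfaces
# (embed, base_lm.finetune, synthesizer.sample, pretrain) are placeholder names.
import numpy as np

def mine_related_pairs(corpus, embed, threshold=0.8):
    """Pair up documents whose embeddings are highly similar (inter-document correlation)."""
    vecs = np.stack([embed(doc) for doc in corpus])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    pairs = []
    for i in range(len(corpus)):
        for j in range(i + 1, len(corpus)):
            if sims[i, j] > threshold:
                pairs.append((corpus[i], corpus[j]))
    return pairs

def synthesize_corpus(base_lm, pairs, samples_per_pair=1):
    """Fine-tune the same base model to model p(doc_b | doc_a), then sample new documents.
    No external teacher: the synthesizer is bootstrapped from the corpus itself."""
    synthesizer = base_lm.finetune(pairs)  # hypothetical: learn doc_a -> doc_b
    synthetic = []
    for doc_a, _ in pairs:
        for _ in range(samples_per_pair):
            synthetic.append(synthesizer.sample(prompt=doc_a))  # hypothetical sampler
    return synthetic

# Hypothetical end-to-end usage: pretrain on the union of real and synthesized documents.
# pairs = mine_related_pairs(corpus, embed)
# model = pretrain(base_lm, corpus + synthesize_corpus(base_lm, pairs))
```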
Will be presenting this paper at NeurIPS 2025! 📅 Thursday, December 4, 11AM-2PM 📍 Exhibit Hall C, D, E #3703 DM me or come by in person if you want to chat about this work, or in general about representation learning, reasoning, generalization, and science of deep learning!
Why and how do diffusion models memorize vs generalize? Can we have scaling laws for memorization? This is increasingly relevant scientifically and pragmatically (e.g. Sora 2). 🚨 Our new preprint "On the Edge of Memorization in Diffusion Models" addresses this timely question!
A new tokenizer for LLMs: https://t.co/Zuerv1jsZ4 Idea: instead of merging tokens by frequency (BPE), optimize the tokenizer directly to maximize average token length, yielding longer, more efficient tokens. Results: 14–18% fewer tokens, faster training & …
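For intuition, here is a toy sketch of the objective the tweet names: score a candidate vocabulary by the average token length it achieves on sample text under greedy longest-match tokenization, so the "better" vocabulary is the one that covers the same text with fewer, longer tokens. The function names and toy vocabularies are mine, not the paper's algorithm.

```python
# Toy illustration of "maximize average token length" as a tokenizer objective.
# Not the paper's method: just the metric it optimizes, paired with a simple
# greedy longest-match tokenizer.

def tokenize_longest_match(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization, falling back to single characters."""
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + L]
            if L == 1 or piece in vocab:  # single characters always accepted as fallback
                tokens.append(piece)
                i += L
                break
    return tokens

def avg_token_length(text: str, vocab: set[str]) -> float:
    """Characters per token: higher means fewer tokens for the same text."""
    return len(text) / len(tokenize_longest_match(text, vocab))

sample = "pretraining data for pretraining runs"
vocab_a = {"pre", "train", "ing", "data", "for", "runs", " "}          # short subwords
vocab_b = {"pretraining", " pretraining", " data", " for", " runs"}    # longer tokens
# The vocabulary with the higher score encodes the sample in fewer tokens.
print(avg_token_length(sample, vocab_a), avg_token_length(sample, vocab_b))
```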
🚀 New blog! Is @deepseek_ai's open-source model the best math verifier? 🔴 DeepSeek-Math V2: highest accuracy, and aligns most closely with human graders when the submitted answer shows no meaningful progress. 🔵 Gemini-3-Pro: best when the solution contains partial but …
I will join UChicago CS @UChicagoCS as an Assistant Professor in late 2026, and I'm recruiting PhD students this cycle (2025–2026). My research focuses on AI & robotics, including dexterous manipulation, humanoids, tactile sensing, learning from human videos, robot …
I am advised by 🐐's
How Stanford researchers design human-focused AI systems: “AI products enter the real world very quickly, often without a rigorous understanding of their impact or the consequences of their use. We need to move forward with responsibility.” —@Diyi_Yang
https://t.co/wO0c8LbPsK
3 years ago we could showcase AI's frontier with a unicorn drawing. Today we do so with AI outputs touching the scientific frontier: https://t.co/ALJvCFsaie Use the doc to judge for yourself the status of AI-aided science acceleration, and hopefully be inspired by a couple of examples!
In the AI ecosystem, who supplies the data? the compute? the models? We just released a new tool on the AI Supply Chain. Our dataset reveals how AI models, data, compute, capital, and even talent change hands. Here’s why you should care 👇
I DEFENDED MY PHD THIS WEEK! 🎉 So grateful for the guidance of my advisor and committee! Special thanks to my friends and family who supported me through every up and down 🥺🥰
In our new post, we walk through great prior work from @agarwl_ & the @Alibaba_Qwen team exploring on-policy distillation using an open-source recipe: you can run our experiments on Tinker today! https://t.co/7pVk87qTDH I'm especially excited by the use of on-policy …
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other
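A minimal sketch of what one such on-policy distillation step can look like, assuming a GKD-style recipe: the student samples its own rollout (the error-correcting relevance of RL) and the teacher supervises every token of that rollout via a per-token reverse KL (the reward density of SFT). Here `student` and `teacher` stand for HuggingFace-style causal LMs; this is an illustrative reading of the post, not its exact loss or infrastructure.

```python
# Hypothetical on-policy distillation step: sample from the student, then minimize a
# per-token reverse KL against the teacher on the student's own sample. `student` and
# `teacher` are assumed HuggingFace-style causal LMs sharing `tokenizer`.
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, tokenizer, prompt, optimizer, max_new_tokens=128):
    # 1) On-policy rollout: the student samples its own completion (as in RL).
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        rollout = student.generate(prompt_ids, do_sample=True, max_new_tokens=max_new_tokens)

    # 2) Dense supervision: compare student and teacher next-token distributions
    #    at every position of the sampled sequence (as in SFT, a signal per token).
    student_logits = student(rollout).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(rollout).logits[:, :-1]
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # 3) Per-token reverse KL(student || teacher), averaged over generated positions only.
    reverse_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)
    gen_start = prompt_ids.shape[1] - 1  # positions that predict the generated tokens
    loss = reverse_kl[:, gen_start:].mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As in GKD-style recipes, the gradient through the sampling of the rollout itself is ignored here; only the per-token divergence on the sampled sequence is optimized.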
Stanford NLP 25th Anniversary🤩🤩🤩
Today, we’re overjoyed to have a 25th Anniversary Reunion of @stanfordnlp. So happy to see so many of our former students back at @Stanford. And thanks to @StanfordHAI for the venue!
More Stanford NLP Group 25th Anniversary Reunion lightning talks: …, @ZitongYang0, @EchoShao8899, @WilliamBarrHeld, @ma_tay_ (Taylor Sorensen), …
Wrote a 1-year retrospective with @a1zhang on KernelBench and the journey toward automated GPU/CUDA kernel generation! Since my labmates (@anneouyang, @simran_s_arora, @_williamhu) and I first started working towards this vision around last year's @GPU_mode hackathon, we have …
Neural Architecture Search is so visionary
Google Brain around 2016 was also a very special place. People were pursuing a ton of diverse, exploratory, and ambitious directions to push the field forward. Here's a section of @JeffDean's Google Brain "2017 Look-back"; see if you can spot the transformer :) The full document …
Thanks @thinkymachines for supporting Tinker access for our CS329x students on Homework 2 😉
It's not even been a month since @thinkymachines released Tinker & Stanford already has an assignment on it
Fine-tuning APIs are becoming more powerful and widespread, but they're harder to safeguard against misuse than fixed-weight sampling APIs. Excited to share a new paper: Detecting Adversarial Fine-tuning with Auditing Agents ( https://t.co/NqMeGSCQIF). Auditing agents search
Large Language Model (LLM) providers expose fine-tuning APIs that let end users fine-tune their frontier LLMs. Unfortunately, it has been shown that an adversary with fine-tuning access to an LLM...
The passing of the physicist Chen-Ning Yang ( https://t.co/LOY46RpBhz) saddens me. He has been a long-time hero and role model for me. Below is a short essay I wrote yesterday about Yang that I shared with many of my friends. I translated it into English using Gemini: ``` The
🚨 We wrote a new AI textbook, "Learning Deep Representations of Data Distributions"! TL;DR: We develop principles for representation learning in large-scale deep neural networks, show that they underpin existing methods, and build new principled methods.