
Zitong Yang
@ZitongYang0
Followers
867
Following
542
Media
23
Statuses
342
Continually self-improving AI
Stanford, CA
Joined November 2018
📜 Paper on a new pretraining paradigm: Synthetic Bootstrapped Pretraining (SBP). SBP goes beyond next-token supervision within a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: a 3B model trained from scratch on 1T tokens of data. 🧵
9
50
246
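The thread above describes SBP only at a high level. As a rough illustrative sketch (not the paper's actual pipeline, models, or hyperparameters), the snippet below pairs related documents by embedding similarity and turns each pair into a source → target example for a synthesizer LM; the embedding model, similarity threshold, and prompt format are all assumptions.

```python
# Hypothetical sketch of the SBP idea: pair related documents via embedding
# similarity, then turn each pair into a (source -> target) training example
# for a "synthesizer" LM. Model choice, threshold, and formatting are
# illustrative assumptions, not the paper's actual recipe.
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Gradient descent updates parameters in the direction of steepest descent.",
    "Stochastic gradient descent uses mini-batches to estimate the gradient.",
    "The Krebs cycle is a series of chemical reactions in cellular respiration.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = encoder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
similarity = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarities

pairs = []
for i in range(len(corpus)):
    for j in range(len(corpus)):
        if i != j and similarity[i, j] > 0.5:  # assumed threshold
            pairs.append((corpus[i], corpus[j]))

# Each pair becomes a conditional-generation example: given document A,
# synthesize a related document B. Fine-tuning the base LM on such examples
# and sampling from it conditioned on real documents would yield a synthetic
# corpus to mix back into pretraining.
for source, target in pairs:
    print({"prompt": source, "completion": target})
```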
Why and how do diffusion models memorize vs generalize? Can we have scaling laws for memorization? This is increasingly relevant scientifically and pragmatically (e.g. Sora 2). 🚨 Our new preprint "On the Edge of Memorization in Diffusion Models" addresses this timely question!
4
58
335
🚨 We wrote a new AI textbook, "Learning Deep Representations of Data Distributions"! TL;DR: We develop principles for representation learning in large-scale deep neural networks, show that they underpin existing methods, and build new principled methods.
4
29
124
the feeling when you spent two months building the training infra and finally got the first experiment running 🥹
1
1
41
We wrote a book about representation learning! It’s fully open source, available and readable online, and covers everything from theoretical foundations to practical algorithms. 👷‍♂️ We’re hard at work updating the content for v2.0, and would love your feedback and contributions
13
203
1K
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models!
221
766
6K
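The announcement doesn't show Tinker's actual interface. Purely as a hypothetical illustration of the "write the loop locally, run it on remote GPUs" pattern it describes, the sketch below uses made-up names (`RemoteTrainer`, `forward_backward`, `optim_step`); none of this should be read as Tinker's real API.

```python
# Purely hypothetical sketch of the "write the loop locally, run on remote
# GPUs" pattern the announcement describes. None of these names are Tinker's
# real API; they are placeholders for illustration only.
class RemoteTrainer:
    """Stand-in for a client that ships each step to a managed GPU cluster."""

    def __init__(self, base_model: str):
        self.base_model = base_model  # e.g. an open-weights checkpoint name

    def forward_backward(self, batch) -> float:
        # A real service would serialize the batch, run the forward and
        # backward passes remotely, and return the loss.
        return 0.0

    def optim_step(self) -> None:
        # Apply the accumulated gradients on the remote workers.
        pass


def train(trainer: RemoteTrainer, batches) -> None:
    # The researcher keeps full control of the loop on their laptop;
    # only the heavy compute happens elsewhere.
    for step, batch in enumerate(batches):
        loss = trainer.forward_backward(batch)
        trainer.optim_step()
        print(f"step {step}: loss {loss:.4f}")


train(RemoteTrainer("example-open-model"), batches=[["hello world"]])
```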
Very excited about this release!! As a former grad student, I struggled to fine-tune LLMs. Even when the GPUs were enough, it was painful to set up the infra correctly. Tinker allows more researchers to understand language models, beyond a few well-funded labs.
Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models!
2
10
199
Nobel laureate George Smoot, UC Berkeley physicist whose work with satellite experiments confirmed the Big Bang theory, has died at 80. https://t.co/Jx2Hks3PMJ
4
8
15
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
81
555
3K
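As a minimal sketch of the kind of setup the post compares against full fine-tuning, the snippet below wraps a small Hugging Face model with a LoRA adapter via the `peft` library; the model name, rank, and target modules are illustrative choices, not the post's experimental configuration.

```python
# Minimal LoRA setup with Hugging Face PEFT. The model, rank, and target
# modules are illustrative defaults, not the configuration from the post.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small model for illustration

lora_config = LoraConfig(
    r=8,                        # low-rank dimension of the adapter
    lora_alpha=16,              # scaling factor for the adapter update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
# From here, training proceeds exactly like full fine-tuning (same Trainer or
# custom loop), but only a small fraction of parameters receives gradients.
```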
check out what @bfspector worked on this summer! (he has not seen the sky for months but now he's free)
(1/8) We’re releasing an 8-GPU Llama-70B inference engine megakernel! Our megakernel supports arbitrary batch sizes, mixed prefill+decode, a paged KV cache, instruction pipelining, dynamic scheduling, interleaved communication, and more! On ShareGPT it’s 22% faster than SGLang.
0
2
28
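The megakernel itself is CUDA, but one of the features listed above, a paged KV cache, can be illustrated with a small sketch of the block-table indirection it relies on; block size, pool size, and shapes here are arbitrary assumptions, not the engine's actual implementation.

```python
# Toy illustration of paged KV-cache addressing (logical position -> block
# table -> physical block), not the megakernel's actual CUDA implementation.
import numpy as np

BLOCK_SIZE = 16      # tokens per block (assumed)
NUM_BLOCKS = 8       # size of the shared physical pool (assumed)
HEAD_DIM = 4         # tiny head dimension for readability

# One shared physical pool of KV blocks, plus a per-sequence block table that
# maps logical block index -> physical block id.
kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
block_table = {"seq0": [3, 5]}  # sequence 0 owns physical blocks 3 and 5


def write_kv(seq_id: str, token_pos: int, kv_vec: np.ndarray) -> None:
    """Store one token's KV vector by translating its logical position."""
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    physical_block = block_table[seq_id][logical_block]
    kv_pool[physical_block, offset] = kv_vec


def read_kv(seq_id: str, token_pos: int) -> np.ndarray:
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    return kv_pool[block_table[seq_id][logical_block], offset]


write_kv("seq0", 17, np.ones(HEAD_DIM, dtype=np.float32))  # lands in block 5, slot 1
print(read_kv("seq0", 17))
```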
We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead.
Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!
1
13
70
Can AI really do math? 🤔 We analyzed math ability across 12 core skills like creativity, abstraction, reasoning & more. This is the way to measure progress toward Math AGI.
Our Gauss report is now on the arXiv: https://t.co/3iFk2yaeUf Do current LLMs solve math problems through memorisation or understanding? Can they truly grasp abstract concepts, or do they simply exploit correlations through compression? That is THE next trillion-dollar question.
4
2
17
Our Gauss report is now on the arXiv: https://t.co/3iFk2yaeUf Do current LLMs solve math problems through memorisation or understanding? Can they truly grasp abstract concepts, or do they simply exploit correlations through compression? That is THE next trillion-dollar question.
arxiv.org
We introduce GAUSS (General Assessment of Underlying Structured Skills in Mathematics), a benchmark that evaluates LLMs' mathematical...
4
13
55
some folks and i are making something new
if you're hopeful about AI empowering everyone
if you've worked on multiturn, memory, model behavior, multiagent RL, user sim, AI interfaces/products, kernels, or dist systems
if you want frontier-scale compute & top infra
let's chat!
50
26
570
while we are on this, rmb we also had:
- Neural Architecture Search with Reinforcement Learning https://t.co/qvGwkX41VE
- Symbolic Discovery of Optimization Algorithms https://t.co/lJzotdjyOM
- Using Large Language Models for Hyperparameter Optimization https://t.co/WDQQoX7cc7
-
arxiv.org
This paper explores the use of foundational large language models (LLMs) in hyperparameter optimization (HPO). Hyperparameters are critical in determining the effectiveness of machine learning...
4
11
43
Excited to share Manzano from the AFM team—a simple, scalable unified multimodal model for understanding and generation. Manzano shows minimal task conflict, promising scaling behavior, and state-of-the-art results among unified models. Paper link: https://t.co/HpziryrvSc
1
8
16
Huge potential! Apple and Stanford have just released Synthetic Bootstrapped Pretraining (SBP). Standard LM pretraining = token correlations in one doc. SBP = learns inter-document relations → synthesizes a huge new corpus for joint training. ✨ Pretrained 3B model on 1T
4
25
137
Enjoyed learning from world-class embedding expert @HongLiu9903. I think document embeddings offer a new avenue of under-exploited self-supervision because they arrange related documents together, much like how the internet arranges related tokens together.
🚀 Unveiling the first synthetic pretraining method that doesn’t rely on teacher distillation. Big shoutout to @ZitongYang0 @Aonan12 and the team!
0
0
6
I'm guessing that Qwen Max was trained this way, as only that could explain some of its capabilities (and size, and high-quality data, and long pretraining ;) ). This is the only sensible approach due to the "data density" problem in modeling, for those who read my post.
📜 Paper on a new pretraining paradigm: Synthetic Bootstrapped Pretraining (SBP). SBP goes beyond next-token supervision within a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: a 3B model trained from scratch on 1T tokens of data. 🧵
1
1
1
Feeling inspired by @ChengleiSi at every AGI hackathon
always feeling inspired by @ZitongYang0
0
0
3