Runtian Zhai
@RuntianZhai
Followers: 526 · Following: 209 · Media: 8 · Statuses: 198
PhD @SCSatCMU. I study representation learning, and machine learning theory and algorithms.
Pittsburgh, PA, USA
Joined April 2017
Why can foundation models transfer to so many downstream tasks? Will the scaling law end? Will pretraining end like Ilya Sutskever predicted? My PhD thesis builds the contexture theory to answer these questions. Blog: https://t.co/MCIJifkU1Z Paper: https://t.co/RXVF7n7mHR 🧵1/12
arxiv.org
This dissertation establishes the contexture theory to mathematically characterize the mechanism of representation learning, or pretraining. Despite the remarkable empirical success of foundation...
Excited to share our work with my amazing collaborators, @Goodeat258, @SimulatedAnneal, @zicokolter, and Kaiming. In a word, we show an “identity learning” approach for generative modeling, by relating the instantaneous/average velocity in an identity. The resulting model,
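Roughly, the idea can be sketched as follows (a hedged guess at the form of such an identity; u, v, z_t, r, t are illustrative notation, not necessarily the paper's):

```latex
% Sketch: define the average velocity u over [r, t] from the instantaneous
% velocity v along the trajectory z_tau.
\[
  u(z_t, r, t) \,\triangleq\, \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau
\]
% Differentiating (t - r)\,u(z_t, r, t) with respect to t removes the integral
% and relates the two velocities directly:
\[
  u(z_t, r, t) \,=\, v(z_t, t) \,-\, (t - r)\, \frac{d}{dt}\, u(z_t, r, t)
\]
```

A model that parameterizes the average velocity can then be trained to satisfy such an identity, which is one way to read the phrase “identity learning”.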
Data selection and curriculum learning can be formally viewed as a compression protocol via prequential coding. New blog (with @AllanZhou17 ) about this neat idea that motivated ADO but didn’t make it into the paper. https://t.co/kkLyZN2CF7
yidingjiang.github.io
We describe a unified framework for data selection and curriculum learning via compression.
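A minimal toy sketch of the prequential-coding view (not code from the blog post or from ADO; `fit` and `neg_log_prob` are placeholder names): the codelength of an ordering is the total number of bits needed to send each chunk under the model fit on everything sent so far, so different orderings of the same data compress differently.

```python
import math

def prequential_codelength(chunks, init_model, fit, neg_log_prob):
    """Bits to transmit `chunks` in order: each chunk is encoded with the model
    fit on all previously revealed chunks, then revealed and added to the pool."""
    model, seen, total_bits = init_model, [], 0.0
    for chunk in chunks:
        total_bits += neg_log_prob(model, chunk)   # cost of sending this chunk
        seen.append(chunk)
        model = fit(seen)                          # receiver refits on revealed data
    return total_bits

# Toy "model": an estimated probability that the next bit is 1.
def fit(seen):
    flat = [b for chunk in seen for b in chunk]
    return (sum(flat) + 1) / (len(flat) + 2)       # Laplace-smoothed estimate

def neg_log_prob(p, chunk):
    return sum(-math.log2(p if b == 1 else 1 - p) for b in chunk)

ordering_a = [[0, 0, 0, 0], [1, 1, 1, 1]]
ordering_b = [[0, 1, 0, 1], [1, 0, 1, 0]]
print(prequential_codelength(ordering_a, 0.5, fit, neg_log_prob))  # ~14.3 bits
print(prequential_codelength(ordering_b, 0.5, fit, neg_log_prob))  # 8.0 bits
```

Under this toy Bernoulli model the interleaved ordering happens to compress better; the point is only that the codelength depends on the ordering, which is the lever that data selection and curricula can pull.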
Announcing the 1st Workshop on Methods and Opportunities at Small Scale (MOSS) at @icmlconf 2025!
🔗 Website: https://t.co/lZdKPrw4Pt
📝 We welcome submissions!
📅 Paper & Jupyter notebook deadline: May 22, 2025
Topics:
– Inductive biases & generalization
– Training
A shorter version of the first three chapters of my thesis has been accepted at ICML 2025. It provides a quick start for those interested in learning about the contexture theory. Check it out:
arxiv.org
Despite the empirical success of foundation models, we do not have a systematic characterization of the representations that these models learn. In this paper, we establish the contexture theory....
In our #AISTATS2025 paper, we ask: when is it possible to recover a consistent joint distribution from its conditionals? We propose path consistency and autoregressive path consistency, which are necessary and easily verifiable conditions. See you at Poster Session 3 on Monday, May 5.
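To make the question concrete for two discrete variables, here is a small self-contained check based on the classical ratio identity P(x|y)/P(y|x) = P(x)/P(y); this is an illustration of the compatibility question, not the paper's path-consistency condition.

```python
import numpy as np

def reconstruct_joint(p_x_given_y, p_y_given_x):
    """Try to recover a joint P(X, Y) consistent with both conditional tables
    (rows indexed by x, columns by y, full support assumed).
    Returns the joint if the conditionals are compatible, else None."""
    ratio = p_x_given_y / p_y_given_x              # equals P(x)/P(y) if compatible
    p_x = ratio[:, 0] / ratio[:, 0].sum()          # any column is proportional to P(x)
    joint = p_y_given_x * p_x[:, None]             # candidate P(x, y) = P(y|x) P(x)
    # The real test: does the candidate joint reproduce both given conditionals?
    ok = (np.allclose(joint / joint.sum(axis=0, keepdims=True), p_x_given_y, atol=1e-6)
          and np.allclose(joint / joint.sum(axis=1, keepdims=True), p_y_given_x, atol=1e-6))
    return joint if ok else None

# Compatible pair: conditionals derived from an actual joint.
true_joint = np.array([[0.1, 0.2], [0.3, 0.4]])
p_xy = true_joint / true_joint.sum(axis=0, keepdims=True)   # P(x|y)
p_yx = true_joint / true_joint.sum(axis=1, keepdims=True)   # P(y|x)
print(reconstruct_joint(p_xy, p_yx))       # recovers true_joint

# Incompatible pair: Y independent of X cannot match p_xy above.
p_yx_bad = np.array([[0.5, 0.5], [0.5, 0.5]])
print(reconstruct_joint(p_xy, p_yx_bad))   # None
```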
✨ Love 4o-style image generation but prefer to use Midjourney? Tired of manual prompt crafting from inspo images? PRISM to the rescue! 🖼️→📝→🖼️ We automate black-box prompt engineering—no training, no embeddings, just accurate, readable prompts from your inspo images! 1/🧵
In sum, representations are learned from the association between X & A by extracting the top-d eigenspace of the operator T. Better contexts are crucial, and a good one has a moderate association. Building this theory was an exciting journey, and I'm grateful to all my collaborators. 12/12
My theory shows that representation learning can do all types of system-1 (associative) thinking, once we define X and A and have enough data. It cannot do system-2 (analytic) thinking, which is why reasoning is so hard. I plan to develop theory for system-2 in the future. 11/12
Convex combination optimizes a linear combination of multiple objectives; it balances weak & strong associations. Concatenation combines the embeddings of multiple models, like feature engineering where we concatenate multiple features, and it strengthens the association. 10/12
If we have contexts that are too strong/weak, what can we do? We can mix multiple contexts to balance their associations! I define 3 base operations: Convolution, convex combination, concatenation. Convolution is like composing data augmentations; it weakens the association. 9/12
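A rough toy sketch of the three operations in the discrete case (one possible reading; the thesis's formal definitions may differ):

```python
import numpy as np

# Toy discrete sketch. C1[x, a] = P(A1 = a | X = x), C2[a1, a2] = P(A2 = a2 | A1 = a1).
rng = np.random.default_rng(0)
C1 = rng.dirichlet(np.ones(8), size=10)     # context 1: X (10 states) -> A1 (8 states)
C2 = rng.dirichlet(np.ones(6), size=8)      # context 2: A1 (8 states) -> A2 (6 states)

# Convolution: compose the transformations X -> A1 -> A2, like chaining two
# data augmentations; the composed conditional is the matrix product.
C_conv = C1 @ C2                            # P(A2 | X), a weaker association

# Convex combination: train one encoder phi on a weighted sum of the two
# pretraining objectives (shown only as a loss expression, no training loop):
#   loss(phi) = alpha * loss_context1(phi) + (1 - alpha) * loss_context2(phi)

# Concatenation: run both pretrained encoders and stack their features,
# as in classical feature engineering.
phi1 = rng.standard_normal((10, 4))         # stand-ins for learned embeddings
phi2 = rng.standard_normal((10, 3))
phi_concat = np.hstack([phi1, phi2])        # 10 inputs, 4 + 3 = 7 features
print(C_conv.shape, phi_concat.shape)
```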
This association level controls the decay rate of the singular values: Weaker association leads to faster decay. If too fast, few tasks are compatible and the model won’t be transferable; if too slow, a larger embedding dimension is needed, causing higher sample complexity. 8/12
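A quick toy illustration of this decay behavior (not code from the thesis): let A be X corrupted by Gaussian noise of tunable width, and inspect the singular values of the canonically normalized joint matrix; wider noise, i.e. weaker association, makes them fall off much faster.

```python
import numpy as np

def gaussian_context_joint(n, sigma):
    """X uniform on {0, ..., n-1}; A is X corrupted by discretized Gaussian noise.
    Larger sigma means weaker association between X and A."""
    idx = np.arange(n)
    cond = np.exp(-(idx[None, :] - idx[:, None]) ** 2 / (2 * sigma ** 2))
    cond /= cond.sum(axis=1, keepdims=True)       # P(A = j | X = i)
    return cond / n                               # joint with uniform P(X)

def singular_values(joint):
    """Singular values of D_X^{-1/2} P D_A^{-1/2}; in the discrete case these are
    the singular values of the expectation operator, and the top one is always 1."""
    p_x, p_a = joint.sum(axis=1), joint.sum(axis=0)
    return np.linalg.svd(joint / np.sqrt(np.outer(p_x, p_a)), compute_uv=False)

for sigma in [1.0, 3.0, 10.0]:                    # strong -> weak association
    sv = singular_values(gaussian_context_joint(100, sigma))
    print(f"sigma={sigma}: s_2={sv[1]:.3f}, s_5={sv[4]:.3f}, s_20={sv[19]:.3f}")
```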
Hence, we need better contexts. I believe we can get them, so progress in pretraining won’t end. First, we must understand which contexts are good. I show that a good context has a moderate association between X & A. For example, BERT is best when the mask ratio is moderate. 7/12
An important implication is that, for a fixed pretraining context, *scaling up the model size inevitably produces diminishing returns*, since the representation converges to the contexture (the span of the top-d singular functions). Once it has converged, further scaling brings no additional benefit. 6/12
Transferability results from compatibility between the pretraining context and the downstream task, meaning that the context is useful for learning the task. I mathematically formulate compatibility and show that the contexture minimizes the worst-case error over compatible tasks, so it is optimal. 5/12
This perspective works for many paradigms: supervised/(non-)contrastive learning, denoising autoencoders, generative models, graph node embedding, etc. This d-dim space can be obtained by training big models to optimize certain objectives, instead of non-scalable kernel PCA. 4/12
The association is given by their joint distribution; its marginals give two L2 function spaces. The expectation operator T maps g(a) to E[g(A)|x]. I show that representation learning extracts the linear span of the top-d singular functions of T. I call this learning the contexture. 3/12
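In symbols, the picture above can be sketched as follows (notation chosen for illustration and possibly differing from the thesis):

```latex
\[
  (Tg)(x) \;=\; \mathbb{E}\!\left[\, g(A) \mid X = x \,\right],
  \qquad T : L^2(P_A) \to L^2(P_X),
\]
\[
  T \;=\; \sum_{i \ge 1} \sigma_i \, u_i \, \langle v_i, \cdot \rangle_{L^2(P_A)},
  \qquad 1 = \sigma_1 \ge \sigma_2 \ge \dots,
\]
% The learned representation spans the top-d singular functions on the input side:
\[
  \text{contexture} \;=\; \mathrm{span}\{\, u_1, \dots, u_d \,\}.
\]
```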
It isn't theoretically clear what representation is learned by a foundation model. My theory shows that a representation is learned from the association between the input X and a context variable A. For example, A can be the label of X, the first k tokens of X, or a crop of the image X. 2/12
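A tiny sketch of what the (X, A) pairs look like for these three examples (hypothetical helper functions, purely illustrative):

```python
import random

def supervised_pair(x, label):
    """The context variable A is the label of X."""
    return x, label

def lm_pair(tokens, k):
    """A is the first k tokens of X; predicting the rest from A is the LM context."""
    return tokens, tokens[:k]

def crop_pair(image, size):
    """A is a random square crop of the image X (here a list of pixel rows)."""
    top = random.randrange(len(image) - size + 1)
    left = random.randrange(len(image[0]) - size + 1)
    crop = [row[left:left + size] for row in image[top:top + size]]
    return image, crop

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(lm_pair(tokens, k=3))   # (full sequence X, its first 3 tokens as A)
```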
I guess our lab does not even have an academic budget then … #ICLR2025 keynote talk by Danqi Chen