Shuangfei Zhai Profile
Shuangfei Zhai

@zhaisf

Followers
2K
Following
189
Media
23
Statuses
143

Research Scientist & Manager, Machine Learning Research @ Apple

Cupertino
Joined April 2010
@zhaisf
Shuangfei Zhai
1 year
We attempted to make Normalizing Flows work really well, and we are happy to report our findings in paper https://t.co/SbdXPZy6ed, and code https://t.co/CMOo3svcPK. [1/n]
4
47
231
@zhaisf
Shuangfei Zhai
10 hours
It’s pretty incredible that as intelligent as the models are, they are all built on top of something as dumb as BPE.
@kalomaze
kalomaze
1 day
a helpful reminder that
- while BPE may be used to tokenize autoregressive language models,
- BPE merge rules sure as hell don't strictly follow autoregressive dependencies
0
1
11
@zhaisf
Shuangfei Zhai
1 day
Check out the new addition to our TarFlow franchise. TLDR: normalizing flows “just work” for generating videos. This adds another piece of strong evidence to our argument that NFs are capable generative models, and I’m now more convinced than ever that they will continue working better.
@thoma_gu
Jiatao Gu🛩️NeurIPS2025
11 days
STARFlow gets an upgrade—it now works on videos🎥 We present STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows, an invertible, causal video generator built on autoregressive flows! 📄 Paper https://t.co/fHApEwGg8j 💻 Code https://t.co/ATU9XtacsQ (1/10)
0
13
65
@iamct_r
Tianrong Chen 陈天荣
2 months
Here is our niche paper ( https://t.co/dJoM1UitWk ), accepted at NeurIPS2025 :) This is also a summary of previous augmented diffusion models (PFGM, CLD, AGM, etc.), and it investigates in which cases, if any, they are actually useful in practice.
1
8
20
@zhaisf
Shuangfei Zhai
3 months
Three generative modeling papers from my team accepted to #NeurIPS2025, two on TarFlow and one on Diffusion. 1. StarFlow (Spotlight) https://t.co/ynB2z4Xgfa, scales TarFlow in latent space and demonstrates unprecedented sample quality from pure NF models. Work led by @thoma_gu.
Link card (arxiv.org): We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive...
5
26
260
@zhaisf
Shuangfei Zhai
3 months
So there is a type of RNN that follows a simple form y_{t+1} = y_t + f(y_t, x_t). It has a state size that scales to billions of dimensions, handles sequence lengths in 10s of millions and has excellent memorization capability. You might have guessed it, I'm talking about SGD.
5
16
226
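For concreteness, here is a minimal sketch (my own toy, not from the thread) of the recurrence y_{t+1} = y_t + f(y_t, x_t) written out as plain SGD: the "state" is the flattened parameter vector, each minibatch is the input x_t, and the toy linear-regression task is an arbitrary choice.

```python
# Minimal sketch: SGD as the recurrence y_{t+1} = y_t + f(y_t, x_t),
# where the state y_t is the parameter vector and x_t is the t-th minibatch.
import numpy as np

rng = np.random.default_rng(0)
d = 10
y = rng.normal(size=d)              # state: model parameters (toy linear regression)
w_true = rng.normal(size=d)

def f(y, x_t, lr=0.1):
    """One SGD transition: -lr * gradient of the squared error on minibatch x_t."""
    X, targets = x_t
    grad = X.T @ (X @ y - targets) / len(targets)
    return -lr * grad

for t in range(1000):
    X = rng.normal(size=(32, d))
    x_t = (X, X @ w_true)           # minibatch "input" at step t
    y = y + f(y, x_t)               # y_{t+1} = y_t + f(y_t, x_t)

print(np.linalg.norm(y - w_true))   # the state has "memorized" the target weights
```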
@zhaisf
Shuangfei Zhai
3 months
Beta-VAE is yet another example of the scalar * loss case. The truth is you don’t actually need any new formulation to introduce the beta. You can simply let the decoder p(x|z)’s variance be beta instead of 1, and the VAE loss becomes 1/beta * (recon_loss + beta * kl_loss).
@zhaisf
Shuangfei Zhai
3 months
Loss functions, as principled as they might appear, are mostly made up surrogates. And, their values don’t matter, it’s the direction of derivatives that we care about — a silly example is that multiplying any loss with a positive scalar yields a different but equally good loss.
1
2
17
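A quick numeric check of the Beta-VAE claim above (my sketch, assuming a Gaussian decoder with fixed scalar variance beta; the parameter-independent log(2*pi*beta) constant is separated out, and the KL value is a stand-in number):

```python
# With a Gaussian decoder N(x; x_hat, beta * I), the negative ELBO equals
# (1/beta) * (recon_loss + beta * kl_loss) up to an additive constant.
# recon_loss here is the variance-1 reconstruction term ||x - x_hat||^2 / 2.
import numpy as np

rng = np.random.default_rng(0)
beta, d = 4.0, 8
x, x_hat = rng.normal(size=d), rng.normal(size=d)
kl = 0.7                                    # stand-in value for KL(q(z|x) || p(z))

recon = 0.5 * np.sum((x - x_hat) ** 2)      # variance-1 reconstruction loss
const = 0.5 * d * np.log(2 * np.pi * beta)  # parameter-independent constant

neg_elbo = recon / beta + const + kl        # decoder variance set to beta
beta_vae = (1.0 / beta) * (recon + beta * kl)

print(np.isclose(neg_elbo - const, beta_vae))   # True
```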
@zhaisf
Shuangfei Zhai
3 months
So you can even directly design the training signal without specifying a loss function. Eg, starting from the standard cross entropy loss CE(f, y), with d CE(f, y) / df = SM(f) - y. Now we prescribe a different “derivative” in the form of sign(SM(f) - y) * |SM(f) - y|^p. With
@zhaisf
Shuangfei Zhai
3 months
Loss functions, as principled as they might appear, are mostly made up surrogates. And, their values don’t matter, it’s the direction of derivatives that we care about — a silly example is that multiplying any loss with a positive scalar yields a different but equally good loss.
0
1
21
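One way to realize the prescribed-derivative idea is the detach trick below. This is my own minimal sketch, not the author's notebook; the exponent p and the batch shapes are arbitrary illustrations.

```python
# Training with a prescribed "derivative" instead of a loss: we want
# d(signal)/d(logits) to be sign(SM(f) - y) * |SM(f) - y|**p. The surrogate
# (logits * g.detach()).sum() has exactly g as its gradient w.r.t. the logits.
import torch
import torch.nn.functional as F

p = 0.5                                    # exponent of the prescribed derivative
logits = torch.randn(4, 10, requires_grad=True)
y = F.one_hot(torch.randint(0, 10, (4,)), num_classes=10).float()

diff = F.softmax(logits, dim=-1) - y       # d CE / d logits would be exactly this
g = torch.sign(diff) * diff.abs() ** p     # the prescribed, reshaped "derivative"

surrogate = (logits * g.detach()).sum()    # its gradient w.r.t. logits is g
surrogate.backward()

print(torch.allclose(logits.grad, g))      # True
```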
@zhaisf
Shuangfei Zhai
3 months
A less trivial example: GANs can be thought of as a model trained with a learned loss. Meaning, the generator is trained by min_g -D(g(z)), where D is the discriminator, which is also learned. Like what we are saying above, all we care about is the direction of d D(x) / dx, not D(x)'s
0
0
8
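A toy check of the point that only the direction of d D(x) / dx matters (my sketch; the linear G and D are placeholders, not anyone's actual GAN):

```python
# The generator step min_g -D(g(z)) only uses the direction of dD/dx:
# shifting D by any constant leaves the generator gradient unchanged,
# so D's value itself never matters.
import torch

torch.manual_seed(0)
G = torch.nn.Linear(8, 16)                 # toy generator
D = torch.nn.Linear(16, 1)                 # toy discriminator ("learned loss")
z = torch.randn(4, 8)

def gen_grad(shift):
    G.zero_grad()
    loss = -(D(G(z)) + shift).mean()       # generator objective with D shifted
    loss.backward()
    return G.weight.grad.clone()

print(torch.allclose(gen_grad(0.0), gen_grad(100.0)))   # True
```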
@zhaisf
Shuangfei Zhai
3 months
Loss functions, as principled as they might appear, are mostly made up surrogates. And, their values don’t matter, it’s the direction of derivatives that we care about — a silly example is that multiplying any loss with a positive scalar yields a different but equally good loss.
11
4
135
@zhaisf
Shuangfei Zhai
3 months
PS, like many of my previous posts, these are just ML101 and are not supposed to be surprising :) But I find great fun in playing with them with the help of modern AI tools, in this case Google Colab + Gemini along with Claude, and it’s become a new hobby of mine.
1
0
6
@zhaisf
Shuangfei Zhai
3 months
We can also visualize the distribution of the influence vector for each class. As you might have guessed, it's extremely polarized -- the majority of examples don't contribute to the classifier weights, and a few hard examples dominate (again, this is a lot like support vectors).
0
0
5
@zhaisf
Shuangfei Zhai
3 months
If you train a softmax NN classifier with plain SGD, the classifier’s weights have a very simple update trajectory. At the t-th iteration for the k-th class, W_k^{t+1} = W_k^t + eta \sum_{i=1}^B (y^i_k - p(x^i)_k) h^t(x^i) / B, where y is the one hot label, p is the softmax
4
4
29
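The update rule and the per-example "influence" coefficients can be checked in a few lines. This is my own sketch with random features and labels, not the original experiment.

```python
# For a softmax classifier W on top of features h, one SGD step on mean cross
# entropy gives W_k^{t+1} = W_k^t + eta * sum_i (y^i_k - p(x^i)_k) h(x^i) / B.
# The per-example coefficients (y_k - p_k) are the "influence" of each example.
import numpy as np

rng = np.random.default_rng(0)
B, d, K, eta = 32, 16, 10, 0.1
h = rng.normal(size=(B, d))                      # features h(x^i)
labels = rng.integers(0, K, size=B)
Y = np.eye(K)[labels]                            # one-hot labels y^i
W = rng.normal(size=(K, d))

logits = h @ W.T
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)                # softmax probabilities p(x^i)

# Closed-form update from the tweet.
W_next = W + eta * (Y - P).T @ h / B

# Same thing written as SGD on mean cross entropy: grad wrt W is (P - Y)^T h / B.
grad = (P - Y).T @ h / B
print(np.allclose(W_next, W - eta * grad))       # True

# Per-example, per-class influence coefficients (y_k - p_k): their distribution
# is what the follow-up tweet visualizes.
influence = np.abs(Y - P)
print(np.quantile(influence, [0.5, 0.9, 0.99]))
```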
@zhaisf
Shuangfei Zhai
3 months
In case the ReLU example sounds trivial, consider two layers instead: y = W1 relu(W0 x). You can rewrite it as y = \sum_{k=1}^{2^d} g_k(x) W_k x. Here W_k is composed as W1 diag(m_k) W0, with m_k being a binary mask and there are 2^d of them; g(x) is the one hot representation of
@zhaisf
Shuangfei Zhai
3 months
ReLU is essentially an element wise MoE with shared router and expert weights: y= I(w^Tx >= 0^Tx) w^Tx + I(w^Tx < 0^Tx) 0^Tx, where the two experts are parameterized by the w and 0 (zero) vectors. And then you only need to train the expert weights and router just follows along.
0
1
21
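The two-layer identity y = W1 relu(W0 x) = W1 diag(m) W0 x, with the data-dependent mask m = 1[W0 x >= 0], can be verified directly (a small sketch of mine, with arbitrary dimensions):

```python
# A two-layer ReLU net is piecewise linear: on each input, the active "expert"
# is W_k = W1 diag(m_k) W0, where m_k = 1[W0 x >= 0] is the binary mask
# selected by the (free) router.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 6, 4
W0 = rng.normal(size=(d_hidden, d_in))
W1 = rng.normal(size=(d_out, d_hidden))
x = rng.normal(size=d_in)

pre = W0 @ x
m = (pre >= 0).astype(float)                 # router decision: binary mask
y_relu = W1 @ np.maximum(pre, 0.0)           # standard two-layer ReLU forward
y_moe = W1 @ np.diag(m) @ W0 @ x             # selected expert W_k applied to x

print(np.allclose(y_relu, y_moe))            # True
```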
@zhaisf
Shuangfei Zhai
3 months
Fun history lesson!
@tang_1c
Charlie Tang
3 months
😎 Fun fact: ReLUs were originally invented in 2010 to improve Gaussian Restricted Boltzmann Machines, a type of *generative* model for which @geoffreyhinton won the Nobel prize. (more 👇)
2
0
21
@zhaisf
Shuangfei Zhai
3 months
ReLU is essentially an element wise MoE with shared router and expert weights: y= I(w^Tx >= 0^Tx) w^Tx + I(w^Tx < 0^Tx) 0^Tx, where the two experts are parameterized by the w and 0 (zero) vectors. And then you only need to train the expert weights and router just follows along.
13
19
341
@zhaisf
Shuangfei Zhai
4 months
The manifold tangent classifier is probably one of my favorite pre-AlexNet DL papers, and it was the first convincing result that I ever saw about manifold learning (though there were many great manifold learning papers prior to it). The basic idea is that data lives on low
4
32
248
@zhaisf
Shuangfei Zhai
4 months
One anecdote from the talk. BN first introduced the normalization form of y = gamma * normalize(x) + beta with learnable affine transformation gamma and beta, because it’s intended to be used before sigmoid and you don’t always want the logits to be exactly normalized. All later
3
3
27
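For reference, the normalization form mentioned above, y = gamma * normalize(x) + beta, looks like this in a minimal sketch of mine (training-mode statistics only, no running averages):

```python
# y = gamma * normalize(x) + beta, computed per feature over the batch.
# The learnable gamma/beta let pre-sigmoid activations move away from being
# exactly zero-mean / unit-variance when that is what the loss prefers.
import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize(x)
    return gamma * x_hat + beta              # learnable affine transform

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(128, 4))
y = batchnorm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1
```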
@zhaisf
Shuangfei Zhai
4 months
Just watched the BatchNorm talk at ICML2025 (test of time award), and it reminds me of the magical feeling when I first learned and tried it 10 years ago. Among many new normalization layers, BN is still more interesting to me because, unlike others, it acts more like a
12
26
463
@zhaisf
Shuangfei Zhai
4 months
Specifically, when undertrained, the model chooses to fit data points closest to the mean near perfectly, rather than uniformly spreading its errors across the training range. When overtrained, it shows no signs of extrapolation, although it has seen plenty of periodic patterns.
8
6
179
@zhaisf
Shuangfei Zhai
4 months
A good way of understanding the inductive bias of neural nets is to train an MLP to regress to the sin(x) function. Below is training on 10K points in [-20, 20], predicting over [-30, 30] after 10 and 100 epochs of training. The implications shown here are surprisingly general.
51
71
923
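The sin(x) setup described above is easy to reproduce. Below is my own rough reconstruction; the architecture, optimizer, and hyperparameters are guesses, not the original notebook.

```python
# Fit an MLP to sin(x) on 10K points in [-20, 20], then query it on [-30, 30].
# Training longer fits the in-range data better, but the model does not
# extrapolate the periodic pattern outside the training range.
import torch

torch.manual_seed(0)
x = torch.empty(10_000, 1).uniform_(-20, 20)
y = torch.sin(x)

model = torch.nn.Sequential(
    torch.nn.Linear(1, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x, y), batch_size=256, shuffle=True)

for epoch in range(100):                       # compare behavior at 10 vs 100 epochs
    for xb, yb in loader:
        opt.zero_grad()
        torch.nn.functional.mse_loss(model(xb), yb).backward()
        opt.step()

x_test = torch.linspace(-30, 30, 1201).unsqueeze(1)
with torch.no_grad():
    pred = model(x_test)                       # non-periodic outside [-20, 20]
print(pred[:5].squeeze(), pred[600:605].squeeze())
```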