Shuangfei Zhai
@zhaisf
Followers
2K
Following
189
Media
23
Statuses
143
Research Scientist & Manager, Machine Learning Research @ Apple
Cupertino
Joined April 2010
We attempted to make Normalizing Flows work really well, and we are happy to report our findings in our paper https://t.co/SbdXPZy6ed, with code at https://t.co/CMOo3svcPK. [1/n]
4
47
231
Check out the new addition to our TarFlow franchise. TLDR: normalizing flows “just work” for generating videos. This adds another strong piece of evidence to our argument that NFs are capable generative models, and I’m now more convinced than ever that they will continue to work better and better.
STARFlow gets an upgrade—it now works on videos🎥 We present STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows, an invertible, causal video generator built on autoregressive flows! 📄 Paper https://t.co/fHApEwGg8j 💻 Code https://t.co/ATU9XtacsQ (1/10)
0
13
65
Here is our niche paper ( https://t.co/dJoM1UitWk) accepted at NeurIPS 2025 :) It also serves as a summary of previous augmented diffusion models (PFGM, CLD, AGM, etc.) and investigates in which cases, if any, they are actually useful in practice.
1
8
20
Three generative modeling papers from my team accepted to #NeurIPS2025, two on TarFlow and one on Diffusion. 1. StarFlow (Spotlight) https://t.co/ynB2z4Xgfa, scales TarFlow in latent space and demonstrates unprecedented sample quality from pure NF models. Work led by @thoma_gu.
arxiv.org
We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive...
5
26
260
So there is a type of RNN that follows a simple form y_{t+1} = y_t + f(y_t, x_t). It has a state size that scales to billions of dimensions, handles sequence lengths in 10s of millions and has excellent memorization capability. You might have guessed it, I'm talking about SGD.
5
16
226
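To make the analogy concrete, here is a minimal sketch of that recurrence, with the parameter vector playing the role of the RNN state and each minibatch as the input at step t. The task (noiseless linear regression) and all sizes are just for illustration.

```python
import numpy as np

# "Hidden state" y: the model parameters. "Input" x_t: the t-th minibatch.
# One SGD step is exactly the recurrence y_{t+1} = y_t + f(y_t, x_t),
# with f(y, x) = -lr * (gradient of the minibatch loss at y).
rng = np.random.default_rng(0)
true_w = rng.normal(size=8)

def f(y, batch, lr=0.1):
    X, t = batch
    grad = X.T @ (X @ y - t) / len(t)    # gradient of the minibatch MSE for linear regression
    return -lr * grad

y = np.zeros(8)                          # initial state
for step in range(1000):
    X = rng.normal(size=(32, 8))
    t = X @ true_w
    y = y + f(y, (X, t))                 # the RNN-style state update

print(np.allclose(y, true_w, atol=1e-2)) # True: the "state" has converged to the target weights
```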
Beta-VAE is yet another example of the scalar * loss case. The truth is you don’t actually need any new formulation to introduce the beta. You can simply let the decoder p(x|z)’s variance be beta instead of 1, and the VAE loss becomes 1/beta * (recon_loss + beta * kl_loss).
Loss functions, as principled as they might appear, are mostly made up surrogates. And, their values don’t matter, it’s the direction of derivatives that we care about — a silly example is that multiplying any loss with a positive scalar yields a different but equally good loss.
1
2
17
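A sketch of the algebra behind this claim, assuming a Gaussian decoder with fixed variance beta (recon_loss here is the squared error):

```latex
% p(x|z) = N(x; mu(z), beta * I), so
\[
-\log p(x\mid z) \;=\; \frac{1}{2\beta}\,\lVert x-\mu(z)\rVert^2 \;+\; \frac{d}{2}\log(2\pi\beta),
\]
% and the negative ELBO (dropping the beta-only constant) becomes
\[
\mathcal{L} \;=\; \frac{1}{2\beta}\,\lVert x-\mu(z)\rVert^2 \;+\; \mathrm{KL}\big(q(z\mid x)\,\Vert\, p(z)\big)
\;=\; \frac{1}{\beta}\Big(\underbrace{\tfrac{1}{2}\lVert x-\mu(z)\rVert^2}_{\text{recon\_loss}} \;+\; \beta\,\underbrace{\mathrm{KL}}_{\text{kl\_loss}}\Big),
\]
% i.e. 1/beta * (recon_loss + beta * kl_loss): the beta-VAE objective, up to a positive
% scale that only rescales the gradients.
```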
So you can even directly design the training signal without specifying a loss function. Eg, starting from the standard cross entropy loss CE(f, y), with d CE(f, y) / df = SM(f) - y. Now we prescribe a different “derivative” in the form of sign(SM(f) - y) * (SM(f) - y)^p. With
Loss functions, as principled as they might appear, are mostly made up surrogates. And, their values don’t matter, it’s the direction of derivatives that we care about — a silly example is that multiplying any loss with a positive scalar yields a different but equally good loss.
0
1
21
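A minimal PyTorch sketch of the idea (the class name and the choice of exponent p are illustrative, not from the original thread): the forward value is a dummy, and the backward signal is prescribed directly.

```python
import torch

class PrescribedGrad(torch.autograd.Function):
    """A 'loss' defined only through its gradient.

    Standard CE has d CE / d logits = softmax(logits) - y.
    Here we prescribe sign(r) * |r|^p with r = softmax(logits) - y instead
    (p = 1 recovers ordinary cross entropy).
    """

    @staticmethod
    def forward(ctx, logits, y_onehot, p=2.0):
        ctx.save_for_backward(logits, y_onehot)
        ctx.p = p
        # The forward value is irrelevant for training; return a dummy scalar.
        return logits.new_zeros(())

    @staticmethod
    def backward(ctx, grad_out):
        logits, y = ctx.saved_tensors
        r = torch.softmax(logits, dim=-1) - y
        grad_logits = torch.sign(r) * r.abs().pow(ctx.p)
        return grad_out * grad_logits, None, None

# usage with made-up shapes:
logits = torch.randn(4, 10, requires_grad=True)
y = torch.nn.functional.one_hot(torch.randint(0, 10, (4,)), 10).float()
PrescribedGrad.apply(logits, y).backward()
print(logits.grad.shape)  # torch.Size([4, 10])
```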
A less trivial example: GANs can be thought of as models trained with a learned loss. Meaning, the generator is trained by min_g -D(g(z)), where D is the discriminator, which is also learned. Like we said above, all we care about is the direction of d D(x) / dx, not D(x)’s
0
0
8
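A minimal sketch of the "learned loss" reading (the toy MLP shapes are made up for illustration): the generator's loss is just -D(g(z)), so the only thing reaching g is the direction of dD/dx.

```python
import torch
import torch.nn as nn

# Toy networks; sizes are arbitrary, just to make the sketch runnable.
gen = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
disc = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
g_opt = torch.optim.SGD(gen.parameters(), lr=1e-2)

z = torch.randn(128, 16)
x_fake = gen(z)

# The generator's "loss" is a learned function whose value is meaningless on
# its own; only the direction of d D(x) / dx is used to update g.
g_loss = -disc(x_fake).mean()
g_opt.zero_grad()
g_loss.backward()
g_opt.step()
```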
Loss functions, as principled as they might appear, are mostly made up surrogates. And, their values don’t matter, it’s the direction of derivatives that we care about — a silly example is that multiplying any loss with a positive scalar yields a different but equally good loss.
11
4
135
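A two-line sanity check of the scalar claim (a sketch; 3.7 is an arbitrary positive constant): scaling the loss scales every gradient by the same factor, so the update direction is unchanged and the factor can be absorbed into the learning rate.

```python
import torch

w = torch.randn(5, requires_grad=True)
loss = (w ** 2).sum()

g1, = torch.autograd.grad(loss, w, retain_graph=True)
g2, = torch.autograd.grad(3.7 * loss, w)   # same loss, multiplied by a positive scalar

print(torch.allclose(g2, 3.7 * g1))        # True: same direction, scale absorbed by the lr
```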
PS, like many of my previous posts, these are just ML101 and are not supposed to be surprising :) But I find great fun in playing with them using modern AI tools, in this case Google Colab + Gemini with some help from Claude, and it’s become a new hobby of mine.
1
0
6
We can also visualize the distribution of the influence vector for each class. As you might have guessed, it's extremely polarized -- the majority of examples don't contribute to the classifier weights, and a few hard examples dominate (again, this is a lot like support vectors).
0
0
5
If you train a softmax NN classifier with plain SGD, the classifier’s weights have a very simple update trajectory. At the t-th iteration for the k-th class, W_k^{t+1} = W_k^t + eta \sum_{i=1}^B (y^i_k - p(x^i)_k) h^t(x^i) / B, where y is the one hot label, p is the softmax
4
4
29
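A sketch of this update rule and of the "influence" view from the post above (all names and sizes here are illustrative, and the feature map h is kept frozen for simplicity, whereas the tweet's h^t changes over training): with plain SGD the class weights are just a running sum of per-example coefficients (y_k - p_k) times features, so we can accumulate those coefficients and inspect which examples actually shape W.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K, B, eta = 512, 32, 4, 64, 0.5

# Frozen features h(x) and linearly-separable-ish labels; only the softmax layer W is trained.
H = rng.normal(size=(N, D))
true_W = rng.normal(size=(K, D))
labels = (H @ true_W.T).argmax(axis=1)
Y = np.eye(K)[labels]

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

W = np.zeros((K, D))
alpha = np.zeros((N, K))                 # accumulated "influence" coefficients per example/class

for t in range(2000):
    idx = rng.choice(N, size=B, replace=False)
    P = softmax(H[idx] @ W.T)            # (B, K)
    coef = eta * (Y[idx] - P) / B        # the (y_k - p_k) term from the update rule
    W += coef.T @ H[idx]                 # W_k^{t+1} = W_k^t + eta * sum_i (y^i_k - p^i_k) h(x^i) / B
    alpha[idx] += coef                   # bookkeeping: each example's contribution to W

# Sanity check: W is exactly the influence-weighted sum of features.
print(np.allclose(W, alpha.T @ H))       # True

# Per-example total |influence|; inspect how the median compares to the tail.
tot = np.abs(alpha).sum(axis=1)
print(np.percentile(tot, [50, 90, 99, 100]))
```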
In case the ReLU example sounds trivial, consider two layers instead: y = W1 relu(W0 x). You can rewrite it as y = \sum_{k=1}^{2^d} g_k(x) W_k x. Here W_k is given by W1 diag(m_k) W0, with m_k being a binary mask, of which there are 2^d; g(x) is the one hot representation of
ReLU is essentially an element wise MoE with shared router and expert weights: y = I(w^Tx >= 0^Tx) w^Tx + I(w^Tx < 0^Tx) 0^Tx, where the two experts are parameterized by the w and 0 (zero) vectors. And then you only need to train the expert weights, and the router just follows along.
0
1
21
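A quick numerical check of the two-layer rewriting (a sketch; the "router" is simply read off from the sign pattern of the pre-activations, i.e. which of the 2^d masks is active for this x):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 6, 5, 3
W0 = rng.normal(size=(d_hid, d_in))
W1 = rng.normal(size=(d_out, d_hid))
x = rng.normal(size=d_in)

# Standard two-layer ReLU network.
y_relu = W1 @ np.maximum(W0 @ x, 0.0)

# "MoE" view: the active expert is W1 diag(m) W0, where the binary mask m is
# the sign pattern of the pre-activations (g(x) is one-hot over the 2^d_hid masks).
m = (W0 @ x >= 0).astype(float)
W_k = W1 @ np.diag(m) @ W0
y_moe = W_k @ x

print(np.allclose(y_relu, y_moe))   # True
```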
Fun history lesson!
😎 Fun fact: ReLUs were originally invented in 2010 to improve Gaussian Restricted Boltzmann Machines, a type of *generative* model for which @geoffreyhinton won the Nobel prize. (more 👇)
2
0
21
ReLU is essentially an element wise MoE with shared router and expert weights: y = I(w^Tx >= 0^Tx) w^Tx + I(w^Tx < 0^Tx) 0^Tx, where the two experts are parameterized by the w and 0 (zero) vectors. And then you only need to train the expert weights, and the router just follows along.
13
19
341
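The scalar identity from this tweet, checked numerically (a tiny sketch with made-up vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
w, x = rng.normal(size=4), rng.normal(size=4)
zero = np.zeros(4)

s = w @ x
# Two "experts": w and the zero vector; the "router" just picks whichever scores higher.
y_moe = (s >= zero @ x) * (w @ x) + (s < zero @ x) * (zero @ x)
print(np.isclose(y_moe, np.maximum(s, 0.0)))   # True: same as relu(w^T x)
```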
The manifold tangent classifier is probably one of my favorite pre-AlexNet DL papers, and it was the first convincing result that I ever saw about manifold learning (though there were many great manifold learning papers prior to it). The basic idea is that data lives on low
4
32
248
One anecdote from the talk. BN first introduced the normalization form y = gamma * normalize(x) + beta with learnable affine parameters gamma and beta, because it was intended to be used before a sigmoid and you don’t always want the logits to be exactly normalized. All later
3
3
27
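For reference, a sketch of the form being described (training-mode per-feature batch statistics only; eps and the shapes are the usual illustrative choices, not any particular library's implementation):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """y = gamma * normalize(x) + beta, computed per feature over the batch.

    The learnable gamma/beta let the layer undo the normalization when exactly
    zero-mean/unit-variance inputs (e.g. right before a sigmoid) are not what you want.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 8)                              # batch of 32, 8 features
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0s and ~1s
```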
Just watched the BatchNorm talk at ICML2025 (test of time award), and it reminds me of the magical feeling when I first learned and tried it 10 years ago. Among many new normalization layers, BN is still more interesting to me because, unlike others, it acts more like a
12
26
463
Specifically, when undertrained, the model chooses to fit the data points closest to the mean near perfectly, rather than spreading its errors uniformly across the training range. When overtrained, it shows no signs of extrapolation even though it has seen plenty of periodic patterns.
8
6
179
A good way of understanding the inductive bias of neural nets is to train an MLP to regress to the sin(x) function. Below is training on 10K points in [-20, 20], predicting over [-30, 30] after 10 and 100 epochs of training. The implications shown here are surprisingly general.
51
71
923
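A sketch for reproducing the experiment described above (the architecture, optimizer, and batch size are guesses, not the exact setup from the post):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 10K training points in [-20, 20]; evaluate on [-30, 30] to probe extrapolation.
x_train = torch.rand(10_000, 1) * 40 - 20
y_train = torch.sin(x_train)
x_test = torch.linspace(-30, 30, 1200).unsqueeze(1)

mlp = nn.Sequential(nn.Linear(1, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, 1))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x_train, y_train), batch_size=256, shuffle=True)

for epoch in range(100):
    for xb, yb in loader:
        loss = ((mlp(xb) - yb) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    if epoch + 1 in (10, 100):
        with torch.no_grad():
            pred = mlp(x_test)
        # Compare pred against sin(x_test): per the observation above, the early-epoch fit
        # concentrates near the mean of the training range, and even the fully trained net
        # does not extrapolate the periodic pattern outside [-20, 20].
        print(epoch + 1, ((pred - torch.sin(x_test)) ** 2).mean().item())
```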