Shuangfei Zhai
@zhaisf
Followers
2K
Following
189
Media
23
Statuses
143
Research Scientist & Manager, Machine Learning Research @ Apple
Cupertino
Joined April 2010
We attempted to make Normalizing Flows work really well, and we are happy to report our findings in our paper https://t.co/SbdXPZy6ed, with code at https://t.co/CMOo3svcPK. [1/n]
4
47
231
Check out the new addition to our TarFlow franchise. TLDR: normalizing flows “just work” for generating videos. This adds another strong piece of evidence to our argument that NFs are capable generative models, and I’m now more convinced than ever that they will continue to work better and better.
STARFlow gets an upgrade—it now works on videos🎥 We present STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows, an invertible, causal video generator built on autoregressive flows! 📄 Paper https://t.co/fHApEwGg8j 💻 Code https://t.co/ATU9XtacsQ (1/10)
0
13
65
Here is our niche paper ( https://t.co/dJoM1UitWk) accepted at NeurIPS 2025 :) It also serves as a summary of previous augmented diffusion models (PFGM, CLD, AGM, etc.) and investigates in which cases, if any, they are actually useful in practice.
1
8
20
Three generative modeling papers from my team accepted to #NeurIPS2025, two on TarFlow and one on Diffusion. 1. StarFlow (Spotlight) https://t.co/ynB2z4Xgfa, scales TarFlow in latent space and demonstrates unprecedented sample quality from pure NF models. Work led by @thoma_gu.
arxiv.org
We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive...
5
26
260
So there is a type of RNN that follows a simple form y_{t+1} = y_t + f(y_t, x_t). It has a state size that scales to billions of dimensions, handles sequence lengths in 10s of millions and has excellent memorization capability. You might have guessed it, I'm talking about SGD.
5
16
226
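To make the analogy concrete, here is a minimal sketch of that recurrence, with the parameter vector playing the role of the RNN state and each minibatch as the input at step t. The task (noiseless linear regression) and all sizes are just for illustration.

```python
import numpy as np

# "Hidden state" y: the model parameters. "Input" x_t: the t-th minibatch.
# One SGD step is exactly the recurrence y_{t+1} = y_t + f(y_t, x_t),
# with f(y, x) = -lr * (gradient of the minibatch loss at y).
rng = np.random.default_rng(0)
true_w = rng.normal(size=8)

def f(y, batch, lr=0.1):
    X, t = batch
    grad = X.T @ (X @ y - t) / len(t)    # gradient of the minibatch MSE for linear regression
    return -lr * grad

y = np.zeros(8)                          # initial state
for step in range(1000):
    X = rng.normal(size=(32, 8))
    t = X @ true_w
    y = y + f(y, (X, t))                 # the RNN-style state update

print(np.allclose(y, true_w, atol=1e-2)) # True: the "state" has converged to the target weights
```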
Beta-VAE is yet another example of the scalar * loss case. The truth is you don’t actually need any new formulation to introduce the beta. You can simply let the decoder p(x|z)’s variance be beta instead of 1, and the VAE loss becomes 1/beta * (recon_loss + beta * kl_loss).
Loss functions, as principled as they might appear, are mostly made up surrogates. And, their values don’t matter, it’s the direction of derivatives that we care about — a silly example is that multiplying any loss with a positive scalar yields a different but equally good loss.
1
2
17
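A sketch of the algebra behind this claim, assuming a Gaussian decoder with fixed variance beta (recon_loss here is the squared error):

```latex
% p(x|z) = N(x; mu(z), beta * I), so
\[
-\log p(x\mid z) \;=\; \frac{1}{2\beta}\,\lVert x-\mu(z)\rVert^2 \;+\; \frac{d}{2}\log(2\pi\beta),
\]
% and the negative ELBO (dropping the beta-only constant) becomes
\[
\mathcal{L} \;=\; \frac{1}{2\beta}\,\lVert x-\mu(z)\rVert^2 \;+\; \mathrm{KL}\big(q(z\mid x)\,\Vert\, p(z)\big)
\;=\; \frac{1}{\beta}\Big(\underbrace{\tfrac{1}{2}\lVert x-\mu(z)\rVert^2}_{\text{recon\_loss}} \;+\; \beta\,\underbrace{\mathrm{KL}}_{\text{kl\_loss}}\Big),
\]
% i.e. 1/beta * (recon_loss + beta * kl_loss): the beta-VAE objective, up to a positive
% scale that only rescales the gradients.
```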
So you can even directly design the training signal without specifying a loss function. Eg, starting from the standard cross entropy loss CE(f, y), with d CE(f, y) / df = SM(f) - y. Now we prescribe a different “derivative” in the form of sign(SM(f) - y) * (SM(f) - y)^p. With
Loss functions, as principled as they might appear, are mostly made up surrogates. And, their values don’t matter, it’s the direction of derivatives that we care about — a silly example is that multiplying any loss with a positive scalar yields a different but equally good loss.
0
1
21
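A minimal PyTorch sketch of the idea (the class name and the choice of exponent p are illustrative, not from the original thread): the forward value is a dummy, and the backward signal is prescribed directly.

```python
import torch

class PrescribedGrad(torch.autograd.Function):
    """A 'loss' defined only through its gradient.

    Standard CE has d CE / d logits = softmax(logits) - y.
    Here we prescribe sign(r) * |r|^p with r = softmax(logits) - y instead
    (p = 1 recovers ordinary cross entropy).
    """

    @staticmethod
    def forward(ctx, logits, y_onehot, p=2.0):
        ctx.save_for_backward(logits, y_onehot)
        ctx.p = p
        # The forward value is irrelevant for training; return a dummy scalar.
        return logits.new_zeros(())

    @staticmethod
    def backward(ctx, grad_out):
        logits, y = ctx.saved_tensors
        r = torch.softmax(logits, dim=-1) - y
        grad_logits = torch.sign(r) * r.abs().pow(ctx.p)
        return grad_out * grad_logits, None, None

# usage with made-up shapes:
logits = torch.randn(4, 10, requires_grad=True)
y = torch.nn.functional.one_hot(torch.randint(0, 10, (4,)), 10).float()
PrescribedGrad.apply(logits, y).backward()
print(logits.grad.shape)  # torch.Size([4, 10])
```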
A less trivial example: GANs can be thought of as models trained with a learned loss. Meaning, the generator is trained by min_g -D(g(z)), where D is the discriminator, which is also learned. Like we said above, all we care about is the direction of d D(x) / dx, not D(x)’s
0
0
8
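A minimal sketch of the "learned loss" reading (the toy MLP shapes are made up for illustration): the generator's loss is just -D(g(z)), so the only thing reaching g is the direction of dD/dx.

```python
import torch
import torch.nn as nn

# Toy networks; sizes are arbitrary, just to make the sketch runnable.
gen = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
disc = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
g_opt = torch.optim.SGD(gen.parameters(), lr=1e-2)

z = torch.randn(128, 16)
x_fake = gen(z)

# The generator's "loss" is a learned function whose value is meaningless on
# its own; only the direction of d D(x) / dx is used to update g.
g_loss = -disc(x_fake).mean()
g_opt.zero_grad()
g_loss.backward()
g_opt.step()
```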
Loss functions, as principled as they might appear, are mostly made up surrogates. And, their values don’t matter, it’s the direction of derivatives that we care about — a silly example is that multiplying any loss with a positive scalar yields a different but equally good loss.
11
4
135
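A two-line sanity check of the scalar claim (a sketch; 3.7 is an arbitrary positive constant): scaling the loss scales every gradient by the same factor, so the update direction is unchanged and the factor can be absorbed into the learning rate.

```python
import torch

w = torch.randn(5, requires_grad=True)
loss = (w ** 2).sum()

g1, = torch.autograd.grad(loss, w, retain_graph=True)
g2, = torch.autograd.grad(3.7 * loss, w)   # same loss, multiplied by a positive scalar

print(torch.allclose(g2, 3.7 * g1))        # True: same direction, scale absorbed by the lr
```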
PS, like many of my previous posts, these are just ML101 and are not supposed to be surprising :) But I find great fun in playing with them using modern AI tools, in this case Google Colab + Gemini with some help from Claude, and it’s become a new hobby of mine.
1
0
6
We can also visualize the distribution of the influence vector for each class. As you might have guessed, it's extremely polarized -- the majority of examples don't contribute to the classifier weights, and a few hard examples dominate (again, this is a lot like support vectors).
0
0
5
If you train a softmax NN classifier with plain SGD, the classifier’s weights have a very simple update trajectory. At the t-th iteration for the k-th class, W_k^{t+1} = W_k^t + eta \sum_{i=1}^B (y^i_k - p(x^i)_k) h^t(x^i) / B, where y is the one hot label, p is the softmax
4
4
29
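A sketch of this update rule and of the "influence" view from the post above (all names and sizes here are illustrative, and the feature map h is kept frozen for simplicity, whereas the tweet's h^t changes over training): with plain SGD the class weights are just a running sum of per-example coefficients (y_k - p_k) times features, so we can accumulate those coefficients and inspect which examples actually shape W.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K, B, eta = 512, 32, 4, 64, 0.5

# Frozen features h(x) and linearly-separable-ish labels; only the softmax layer W is trained.
H = rng.normal(size=(N, D))
true_W = rng.normal(size=(K, D))
labels = (H @ true_W.T).argmax(axis=1)
Y = np.eye(K)[labels]

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

W = np.zeros((K, D))
alpha = np.zeros((N, K))                 # accumulated "influence" coefficients per example/class

for t in range(2000):
    idx = rng.choice(N, size=B, replace=False)
    P = softmax(H[idx] @ W.T)            # (B, K)
    coef = eta * (Y[idx] - P) / B        # the (y_k - p_k) term from the update rule
    W += coef.T @ H[idx]                 # W_k^{t+1} = W_k^t + eta * sum_i (y^i_k - p^i_k) h(x^i) / B
    alpha[idx] += coef                   # bookkeeping: each example's contribution to W

# Sanity check: W is exactly the influence-weighted sum of features.
print(np.allclose(W, alpha.T @ H))       # True

# Per-example total |influence|; inspect how the median compares to the tail.
tot = np.abs(alpha).sum(axis=1)
print(np.percentile(tot, [50, 90, 99, 100]))
```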
In case the ReLU example sounds trivial, consider two layers instead: y = W1 relu(W0 x). You can rewrite it as y = \sum_{k=1}^{2^d} g_k(x) W_k x. Here W_k is given by W1 diag(m_k) W0, with m_k being a binary mask, of which there are 2^d; g(x) is the one hot representation of
ReLU is essentially an element wise MoE with shared router and expert weights: y = I(w^Tx >= 0^Tx) w^Tx + I(w^Tx < 0^Tx) 0^Tx, where the two experts are parameterized by the w and 0 (zero) vectors. And then you only need to train the expert weights, and the router just follows along.
0
1
21
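A quick numerical check of the two-layer rewriting (a sketch; the "router" is simply read off from the sign pattern of the pre-activations, i.e. which of the 2^d masks is active for this x):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 6, 5, 3
W0 = rng.normal(size=(d_hid, d_in))
W1 = rng.normal(size=(d_out, d_hid))
x = rng.normal(size=d_in)

# Standard two-layer ReLU network.
y_relu = W1 @ np.maximum(W0 @ x, 0.0)

# "MoE" view: the active expert is W1 diag(m) W0, where the binary mask m is
# the sign pattern of the pre-activations (g(x) is one-hot over the 2^d_hid masks).
m = (W0 @ x >= 0).astype(float)
W_k = W1 @ np.diag(m) @ W0
y_moe = W_k @ x

print(np.allclose(y_relu, y_moe))   # True
```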
Fun history lesson!
😎 Fun fact: ReLUs were originally invented in 2010 to improve Gaussian Restricted Boltzmann Machines, a type of *generative* model for which @geoffreyhinton won the Nobel prize. (more 👇)
2
0
21
ReLU is essentially an element wise MoE with shared router and expert weights: y = I(w^Tx >= 0^Tx) w^Tx + I(w^Tx < 0^Tx) 0^Tx, where the two experts are parameterized by the w and 0 (zero) vectors. And then you only need to train the expert weights, and the router just follows along.
13
19
341
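The scalar identity from this tweet, checked numerically (a tiny sketch with made-up vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
w, x = rng.normal(size=4), rng.normal(size=4)
zero = np.zeros(4)

s = w @ x
# Two "experts": w and the zero vector; the "router" just picks whichever scores higher.
y_moe = (s >= zero @ x) * (w @ x) + (s < zero @ x) * (zero @ x)
print(np.isclose(y_moe, np.maximum(s, 0.0)))   # True: same as relu(w^T x)
```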
The manifold tangent classifier is probably one of my favorite pre-AlexNet DL papers, and it was the first convincing result that I ever saw about manifold learning (though there were many great manifold learning papers prior to it). The basic idea is that data lives on low
4
32
248
One anecdote from the talk. BN first introduced the normalization form y = gamma * normalize(x) + beta with learnable affine parameters gamma and beta, because it was intended to be used before a sigmoid and you don’t always want the logits to be exactly normalized. All later
3
3
27
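For reference, a sketch of the form being described (training-mode per-feature batch statistics only; eps and the shapes are the usual illustrative choices, not any particular library's implementation):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """y = gamma * normalize(x) + beta, computed per feature over the batch.

    The learnable gamma/beta let the layer undo the normalization when exactly
    zero-mean/unit-variance inputs (e.g. right before a sigmoid) are not what you want.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 8)                              # batch of 32, 8 features
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0s and ~1s
```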
Just watched the BatchNorm talk at ICML2025 (test of time award), and it reminds me of the magical feeling when I first learned and tried it 10 years ago. Among many new normalization layers, BN is still more interesting to me because, unlike others, it acts more like a
12
26
463
Specifically, when undertrained, the model chooses to fit the data points closest to the mean near perfectly, rather than spreading its errors uniformly across the training range. When overtrained, it shows no signs of extrapolation even though it has seen plenty of periodic patterns.
8
6
179
A good way of understanding the inductive bias of neural nets is to train an MLP to regress to the sin(x) function. Below is training on 10K points in [-20, 20], predicting over [-30, 30] after 10 and 100 epochs of training. The implications shown here are surprisingly general.
51
71
923
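A sketch for reproducing the experiment described above (the architecture, optimizer, and batch size are guesses, not the exact setup from the post):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 10K training points in [-20, 20]; evaluate on [-30, 30] to probe extrapolation.
x_train = torch.rand(10_000, 1) * 40 - 20
y_train = torch.sin(x_train)
x_test = torch.linspace(-30, 30, 1200).unsqueeze(1)

mlp = nn.Sequential(nn.Linear(1, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, 1))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x_train, y_train), batch_size=256, shuffle=True)

for epoch in range(100):
    for xb, yb in loader:
        loss = ((mlp(xb) - yb) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    if epoch + 1 in (10, 100):
        with torch.no_grad():
            pred = mlp(x_test)
        # Compare pred against sin(x_test): per the observation above, the early-epoch fit
        # concentrates near the mean of the training range, and even the fully trained net
        # does not extrapolate the periodic pattern outside [-20, 20].
        print(epoch + 1, ((pred - torch.sin(x_test)) ** 2).mean().item())
```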