In this new joint work (Emy Gervais, @FatrasKilian, @Cyanogenoid, @SimonLacosteJ), we show the massive power of averaging the weights of multiple models trained simultaneously! 🤯
Blog:
Website:
arXiv:
We consider weight averaging as an alternative to ensembling, but naively averaging the weights of independently trained networks tends to perform poorly.
The key insight is that weight averaging is beneficial when weights are similar enough to average well but different enough to benefit from combining them.
We propose PopulAtion Parameter Averaging (PAPA).
In PAPA, we train p networks, and we either
1) occasionally replace each model's weights with the population average during training (PAPA-all), or
2) push the models toward the population average every few steps (PAPA-gradual);
see the minimal sketch below.
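Here's a minimal PyTorch sketch of the two variants (the push rate and how often you call these are placeholder assumptions, not the paper's exact settings):

```python
import torch

@torch.no_grad()
def papa_all(models):
    # PAPA-all: occasionally REPLACE every model's weights with the population average.
    for params in zip(*(m.parameters() for m in models)):
        avg = torch.stack([p.data for p in params]).mean(dim=0)
        for p in params:
            p.data.copy_(avg)

@torch.no_grad()
def papa_gradual(models, rate=0.99):
    # PAPA-gradual: every few steps, PUSH each model slightly toward the average:
    #   w_i <- rate * w_i + (1 - rate) * w_avg    (rate=0.99 is a placeholder)
    for params in zip(*(m.parameters() for m in models)):
        avg = torch.stack([p.data for p in params]).mean(dim=0)
        for p in params:
            p.data.mul_(rate).add_(avg, alpha=1 - rate)
```

Each model otherwise trains with its own SGD steps on its own mini-batches; only the averaging step couples the population.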
We obtain amazing results with PAPA, increasing the average accuracy of models by 1.1% on CIFAR-10 (5-10 networks), 2.4% on CIFAR-100 (5-10 networks), and 1.9% on ImageNet (2-3 networks).
PAPA provides a simple way of leveraging a population of networks to improve performance.
In practice, PAPA performed better than training a single model for p times more epochs.
Thus, PAPA could provide a better and more efficient way of training large models on extensive data: parallelize the training budget across multiple PAPA networks, each trained for less time.
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ
Looks closely related to what we are doing here:
We have n agents/workers, each of which is trying to train its own model. Every now and then, a step towards the average of these models is taken. We call the method L2GD. It is very similar to local GD.
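(For context, a rough sketch of the penalized objective behind L2GD, as I understand the "mixture of global and local models" formulation; the exact scaling constants are my assumption:)

```latex
\min_{x_1,\dots,x_n}\; \frac{1}{n}\sum_{i=1}^{n} f_i(x_i)
\;+\; \frac{\lambda}{2n}\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2,
\qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```

where f_i is worker i's local loss; the L2 penalty is what produces the occasional step toward the average.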
@peter_richtarik @FatrasKilian @Cyanogenoid @SimonLacosteJ
Hi Peter, it's cool to see a formulation like PAPA in federated learning! I only realized yesterday that we were implicitly minimizing an L2 penalty toward the average; my initial intuition was different. We will add your paper to the related work in the next update. See the attached image for the differences.
@dataengines @FatrasKilian @Cyanogenoid @SimonLacosteJ
In practice, non-uniform weights don't do well. I recommend ensuring good hyperparameters and averaging everyone, even if a few models perform slightly less well.
The unlabeled setting is super cool, there might be interesting links to be made with methods like BYOL.
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ
Interesting work! It reminds me of federated averaging (FedAvg) and the FedProx variant, where you restrict the separate model weights (on different clients) from diverging from the central model (in this case, the population average).
@OriSenbazuru @FatrasKilian @Cyanogenoid @SimonLacosteJ
Yes, there are links to both! We mention them in the paper. The second one is Consensus Optimization.
The main differences are that 1) we give the full data (plus data augmentations) to each model rather than splitting the data, and 2) we don't seek convergence to the central model.
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ
I tried this a few years ago for RL models and it also worked fantastically:
However, I wasn't able to extend the same results to CIFAR; it's interesting to see that it works for CV as well! I will give the code a spin.
@gklambauer @FatrasKilian @Cyanogenoid @SimonLacosteJ
Why would anyone want to be stuck in flat minima? lol Unlike the permutation alignment papers, we actually get lower train/test loss after interpolation; this is what we want, not flatness.
@avinab_saha @FatrasKilian @Cyanogenoid @SimonLacosteJ
I'm working on improving PAPA now; once we have something good, the plan is to try 1) Transformers and 2) GANs/diffusion.
Btw, if you have any suggestions for transformer architectures (ideally for vision), feel free to share them; I'm not used to working with transformers.
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ
Cool! One interesting point: it turns out we don't need to increase the learning rate much when using large-batch PAPA. This is probably because in large-batch SGD the gradient noise (which could help generalization) stays low, while in PAPA it remains non-negligible due to the diversity across models.
@Raiden13238619 @FatrasKilian @Cyanogenoid @SimonLacosteJ
I guess it depends on how you define large-batch. If you use k models with batch size b, it's noisier than 1 model with batch size b*k. But PAPA doesn't separate the batch across its workers; each model draws its own mini-batch from the full data. For a given b and k, you scale the lr proportionally to b.
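To make that concrete, here's a toy sketch (the dataset, sizes, and base lr are made-up placeholders, not values from the paper): each of the k models samples its own mini-batches from the full data, and the lr follows the per-model batch size b.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real training set (an assumption for illustration).
train_set = TensorDataset(torch.randn(1024, 3, 32, 32),
                          torch.randint(0, 10, (1024,)))

k, b = 5, 256              # k population members, per-model batch size b
base_b, base_lr = 64, 0.1  # hypothetical reference batch size / lr

# Each model gets its OWN loader over the FULL data (independent shuffles),
# not a 1/k shard of it, so per-model gradient noise stays at the batch-size-b level.
loaders = [DataLoader(train_set, batch_size=b, shuffle=True) for _ in range(k)]

# Linear scaling: lr grows with the per-model batch size b, not with b*k.
lr = base_lr * b / base_b
```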