@jm_alexia
Alexia Jolicoeur-Martineau
1 year
In this new joint work (Emy Gervais, @FatrasKilian , @Cyanogenoid , @SimonLacosteJ ), we show the massive power of averaging the weights of multiple models trained simultaneously! 🤯 Blog: Website: Arxiv:
14
52
231

Replies

@jm_alexia
Alexia Jolicoeur-Martineau
1 year
We consider weight averaging as an alternative to ensembling, but averaging neural network weights tends to perform poorly. The key insight is that weight averaging is beneficial when weights are similar enough to average well but different enough to benefit from combining them.
2
1
9
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
We propose PopulAtion Parameter Averaging (PAPA). In PAPA, we train p networks and either 1) occasionally replace the weights of the models with the population average during training (PAPA-all), or 2) push the models toward the population average every few steps (PAPA-gradual).
1
2
11
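A minimal sketch of the two schemes above, assuming PyTorch-style models (illustrative code, not the authors' released implementation; the averaging interval and the push rate below are placeholder choices):

```python
# Illustrative sketch of PAPA-all and PAPA-gradual (not the official code).
# Assumes p copies of the same architecture trained in parallel on the full
# data, each with its own augmentations/mini-batches.
import copy
import torch

def population_average(models):
    """Element-wise mean of the models' parameters and buffers."""
    avg = copy.deepcopy(models[0].state_dict())
    for key in avg:
        stacked = torch.stack([m.state_dict()[key].float() for m in models])
        avg[key] = stacked.mean(dim=0).to(avg[key].dtype)
    return avg

def papa_all(models):
    """PAPA-all: occasionally replace every model's weights with the average."""
    avg = population_average(models)
    for m in models:
        m.load_state_dict(avg)

def papa_gradual(models, rate=0.01):
    """PAPA-gradual: every few steps, push each model slightly toward the average."""
    avg = population_average(models)
    with torch.no_grad():
        for m in models:
            for name, param in m.named_parameters():
                param.mul_(1.0 - rate).add_(avg[name], alpha=rate)
```

In a training loop, one would call papa_all(models) every few epochs or papa_gradual(models) every few SGD steps; the exact schedule and rate here are assumptions, not the paper's settings.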
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
We obtain amazing results with PAPA, increasing the average accuracy of models by 1.1% on CIFAR-10 (5-10 networks), 2.4% on CIFAR-100 (5-10 networks), and 1.9% on ImageNet (2-3 networks). PAPA provides a simple way of leveraging a population of networks to improve performance.
1
0
6
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
In practice, PAPA performed better than training a single model for p times more epochs. Thus, PAPA could provide a more efficient way of training large models on extensive data by splitting the training budget across multiple PAPA networks, each trained for less time.
2
0
13
@peter_richtarik
Peter Richtarik
1 year
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ Looks closely related to what we are doing here: We have n agents/workers, each of which trains its own model. Every now and then, a step towards the average of these models is taken. We call the method L2GD. It is very similar to local GD.
1
0
2
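For readers unfamiliar with L2GD, here is a rough sketch of the kind of objective described in the reply above (reconstructed from memory; the exact weighting may differ from the paper):

```latex
% Sketch of an L2GD-style objective: each worker i trains its own weights x_i,
% and an L2 penalty pulls every x_i toward the population mean \bar{x}.
\min_{x_1,\dots,x_n}\;
  \frac{1}{n}\sum_{i=1}^{n} f_i(x_i)
  \;+\; \frac{\lambda}{2n}\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2,
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```

Taking an occasional gradient step on the penalty term is what moves each worker toward the average, which is the connection to PAPA discussed in the reply below.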
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
@peter_richtarik @FatrasKilian @Cyanogenoid @SimonLacosteJ Hi Peter, it's cool to see a formulation like PAPA in federated learning! I only realized yesterday that we were implicitly solving an L2-norm objective; my initial intuition was different. We will add your paper to the related work in the next update. See the attached image for the differences.
[attached image]
1
0
0
@dataengines
Data Engines
1 year
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ We love the idea. In the context of evaluation, have you considered how the optimal combination would be decided in an unlabeled setting?
1
0
1
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
@dataengines @FatrasKilian @Cyanogenoid @SimonLacosteJ In practice, non-uniform weights don't do well. I recommend ensuring good hyperparameters and averaging all models, even if a few perform slightly less well. The unlabeled setting is super cool; there might be interesting links to be made with methods like BYOL.
0
0
1
@OriSenbazuru
Orizuru
1 year
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ Interesting work - it reminds me of federated averaging (FedAvg) and the FedProx variant, where you keep the separate model weights (on different clients) from diverging from the central model (in this case, the population average).
1
0
1
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
@OriSenbazuru @FatrasKilian @Cyanogenoid @SimonLacosteJ Yes, there are links to both! We mention them in the paper. The second one is Consensus Optimization. The main differences are that 1) we give the full data (plus data augmentations) to each model rather than splitting the data, and 2) we don't seek convergence to the central model.
0
0
1
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
Oops, it's not "Website" but "Code" 😹 Code:
0
0
3
@PandaAshwinee
Ashwinee Panda
1 year
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ I tried this a few years ago for RL models and it also worked fantastically: However, I wasn't able to extend the same results to CIFAR, so it's interesting to see that it works for CV as well! I will give the code a spin.
1
0
2
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
0
0
0
@gklambauer
Günter Klambauer
1 year
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ .. and you managed to write this paper without any mention of flat minima... :)
1
1
3
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
@gklambauer @FatrasKilian @Cyanogenoid @SimonLacosteJ Why would anyone want to be stuck in flat minima? lol Unlike the permutation alignment papers, we actually get lower train/test loss after interpolation; this is what we want, not flatness.
0
0
4
@avinab_saha
Avinab Saha 🇮🇳
1 year
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ Interesting work @jm_alexia! Did folks also experiment with Transformer architectures?
1
0
1
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
@avinab_saha @FatrasKilian @Cyanogenoid @SimonLacosteJ I'm working on improving PAPA now; once we have something good, the plan is to try 1) Transformers and 2) GANs/diffusion. Btw, if you have any suggestions for Transformer architectures (ideally for vision), feel free to share them; I'm not used to working with Transformers.
1
0
0
@yalishandi
yalishandi
1 year
1
0
2
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
0
0
1
@Raiden13238619
Tongtian Zhu
1 year
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ Cool! One interesting point: it turns out we don't need to increase the learning rate much when using large-batch PAPA. This is probably because in large-batch SGD the gradient noise (which could help generalization) stays low, whereas in PAPA it remains non-negligible due to the diversity across models.
2
0
1
@jm_alexia
Alexia Jolicoeur-Martineau
1 year
@Raiden13238619 @FatrasKilian @Cyanogenoid @SimonLacosteJ I guess it depends on how you define large-batch. If you use k models with batch size b, it's noisier than 1 model with batch size b*k. But PAPA doesn't split the batch across its workers; each model gets a mini-batch of its own data. For a given b and k, you increase the learning rate proportionally to b.
1
0
0
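A toy illustration of that last scaling rule (the base learning rate and reference batch size are made-up numbers, not values from the paper):

```python
# Linear learning-rate scaling with the per-model batch size b, as described
# in the reply above. Reference values are illustrative placeholders.
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Scale the learning rate proportionally to the per-model batch size."""
    return base_lr * batch / base_batch

# e.g. an LR of 0.1 tuned at b = 128 becomes 0.2 at b = 256, independently of
# the number of PAPA models k, since each model keeps its own mini-batch.
print(scaled_lr(0.1, 128, 256))  # 0.2
```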