In this new joint work (Emy Gervais, @FatrasKilian, @Cyanogenoid, @SimonLacosteJ), we show the massive power of averaging the weights of multiple models trained simultaneously! 🤯
Blog:
Website:
arXiv:
We consider weight averaging as an alternative to ensembling, but naively averaging the weights of independently trained networks tends to perform poorly.
The key insight is that weight averaging is beneficial when weights are similar enough to average well but different enough to benefit from combining them.
We propose PopulAtion Parameter Averaging (PAPA).
In PAPA, we train p networks, and we either
1) occasionally replace each model's weights with the population average during training (PAPA-all), or
2) push the models toward the population average every few steps (PAPA-gradual);
see the minimal sketch below.
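Here's a minimal PyTorch sketch of the two variants (the push rate and how often you call these are placeholder assumptions, not the paper's exact settings):

```python
import torch

@torch.no_grad()
def papa_all(models):
    # PAPA-all: occasionally REPLACE every model's weights with the population average.
    for params in zip(*(m.parameters() for m in models)):
        avg = torch.stack([p.data for p in params]).mean(dim=0)
        for p in params:
            p.data.copy_(avg)

@torch.no_grad()
def papa_gradual(models, rate=0.99):
    # PAPA-gradual: every few steps, PUSH each model slightly toward the average:
    #   w_i <- rate * w_i + (1 - rate) * w_avg    (rate=0.99 is a placeholder)
    for params in zip(*(m.parameters() for m in models)):
        avg = torch.stack([p.data for p in params]).mean(dim=0)
        for p in params:
            p.data.mul_(rate).add_(avg, alpha=1 - rate)
```

Each model otherwise trains with its own SGD steps on its own mini-batches; only the averaging step couples the population.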
We obtain amazing results with PAPA, increasing the average accuracy of models by 1.1% on CIFAR-10 (5-10 networks), 2.4% on CIFAR-100 (5-10 networks), and 1.9% on ImageNet (2-3 networks).
PAPA provides a simple way of leveraging a population of networks to improve performance.
In practice, PAPA performed better than training a single model for p times more epochs.
Thus, PAPA could provide a better and more efficient way of training large models on extensive data: parallelize the training budget across multiple PAPA networks, each trained for less time.
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ
Looks closely related to what we are doing here:
We have n agents/workers, each of which is trying to train its own model. Every now and then, a step towards the average of these models is taken. We call the method L2GD. It is very similar to local GD.
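(For context, a rough sketch of the penalized objective behind L2GD, as I understand the "mixture of global and local models" formulation; the exact scaling constants are my assumption:)

```latex
\min_{x_1,\dots,x_n}\; \frac{1}{n}\sum_{i=1}^{n} f_i(x_i)
\;+\; \frac{\lambda}{2n}\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2,
\qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```

where f_i is worker i's local loss; the L2 penalty is what produces the occasional step toward the average.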
@peter_richtarik @FatrasKilian @Cyanogenoid @SimonLacosteJ
Hi Peter, it's cool to see a formulation like PAPA in federated learning! I only realized yesterday that we were implicitly minimizing an L2 penalty toward the average; my initial intuition was different. We will add your paper to the related work in the next update. See the attached image for the differences.
@dataengines @FatrasKilian @Cyanogenoid @SimonLacosteJ
In practice, non-uniform weights don't do well. I recommend ensuring good hyperparameters and averaging everyone, even if a few models perform slightly less well.
The unlabeled setting is super cool, there might be interesting links to be made with methods like BYOL.
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ
Interesting work! It reminds me of federated averaging (FedAvg) and the FedProx variant, where you restrict the separate model weights (on different clients) from diverging from the central model (in this case, the population average).
@OriSenbazuru @FatrasKilian @Cyanogenoid @SimonLacosteJ
Yes, there are links to both! We mention them in the paper. The second one is Consensus Optimization.
The main differences are that 1) we give the full data (plus data augmentations) to each model rather than splitting the data, and 2) we don't seek convergence to the central model.
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ
I tried this a few years ago for RL models and it also worked fantastically:
However, I wasn't able to extend the same results to CIFAR; it's interesting to see that it works for CV as well! I will give the code a spin.
@gklambauer @FatrasKilian @Cyanogenoid @SimonLacosteJ
Why would anyone want to be stuck in flat minima? lol Unlike the permutation alignment papers, we actually get lower train/test loss after interpolation; this is what we want, not flatness.
@avinab_saha @FatrasKilian @Cyanogenoid @SimonLacosteJ
I'm working on improving PAPA now; once we have something good, the plan is to try 1) Transformers and 2) GANs/diffusion.
Btw, if you have any suggestions for transformer architectures (ideally for vision), feel free to share them; I'm not used to working with transformers.
@jm_alexia @FatrasKilian @Cyanogenoid @SimonLacosteJ
Cool! One interesting point: it turns out we don't need to increase the learning rate much when using large-batch PAPA. This is probably because in large-batch SGD the gradient noise (which could help generalization) stays low, while in PAPA it remains non-negligible due to the diversity across models.
@Raiden13238619 @FatrasKilian @Cyanogenoid @SimonLacosteJ
I guess it depends on how you define large-batch. If you use k models with batch size b, it's noisier than 1 model with batch size b*k. But PAPA doesn't separate the batch across its workers; each model draws its own mini-batch from the full data. For a given b and k, you scale the lr proportionally to b.
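To make that concrete, here's a toy sketch (the dataset, sizes, and base lr are made-up placeholders, not values from the paper): each of the k models samples its own mini-batches from the full data, and the lr follows the per-model batch size b.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real training set (an assumption for illustration).
train_set = TensorDataset(torch.randn(1024, 3, 32, 32),
                          torch.randint(0, 10, (1024,)))

k, b = 5, 256              # k population members, per-model batch size b
base_b, base_lr = 64, 0.1  # hypothetical reference batch size / lr

# Each model gets its OWN loader over the FULL data (independent shuffles),
# not a 1/k shard of it, so per-model gradient noise stays at the batch-size-b level.
loaders = [DataLoader(train_set, batch_size=b, shuffle=True) for _ in range(k)]

# Linear scaling: lr grows with the per-model batch size b, not with b*k.
lr = base_lr * b / base_b
```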