
David Grangier
@GrangierDavid
Followers 442 · Following 23 · Media 9 · Statuses 35
ML research with practical impact.
Joined December 2019
#ICLR #TrainLLMBetter Tomorrow: #soup of experts, a #hypernetwork conditioned on a simple description of the test distribution, enabling adaptation without retraining (Modularity workshop, Sunday). https://t.co/Cc72NyyJpI Still on today: CRISP importance sampling for LLM pretraining.
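A rough illustration of the idea, not the paper's method: the sketch below assumes the hypernetwork maps a test-distribution descriptor to mixing weights over pre-trained expert parameters, so adaptation is a weighted average of experts rather than retraining. All names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoupHypernetwork(nn.Module):
    """Toy sketch: map a test-distribution descriptor to mixing weights
    over a fixed set of flattened expert parameter vectors."""
    def __init__(self, descriptor_dim: int, num_experts: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(descriptor_dim, 64), nn.ReLU(),
            nn.Linear(64, num_experts),
        )

    def forward(self, descriptor: torch.Tensor, expert_params: torch.Tensor) -> torch.Tensor:
        # expert_params: (num_experts, num_params) flattened expert weights
        weights = torch.softmax(self.scorer(descriptor), dim=-1)  # (num_experts,)
        return weights @ expert_params                            # adapted parameter vector

# Usage: adapt to a new test distribution without any retraining.
hyper = SoupHypernetwork(descriptor_dim=8, num_experts=4)
experts = torch.randn(4, 10_000)   # pretend flattened expert weights
descr = torch.rand(8)              # simple description of the test distribution
adapted = hyper(descr, experts)    # parameters of the adapted model
```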
3/3 Mixture of experts on high-latency networks with No Need to Talk https://t.co/sMPj55XdDp (Thu Apr 24, 3pm). Joint work with @MatPagliardini, @NasFilippova, @PierreAblin, @olivia61368522, Skyler Seto, @angeloskath, Ronan Collobert.
2/3 Importance sampling for a better pretraining distribution with CRISP https://t.co/ShxRrGMkDB (Sat Apr 26, 10am).
#ICLR #TrainBetterLM I am at ICLR; come to our posters on improved language model training! Recycle gradients for faster neural net training with AdEMAMix https://t.co/eR3r0TSRJH (Fri Apr 25, 10am). 1/3
⦿ Efficient, scalable approach on LM and Q&A domains. ⦿ Single & multitask. ⦿ Pretraining & continued pretraining. ⦿ Ablations on data size, model size... https://t.co/k0EMaZiQfN 4/4
🚀Easy with clustered importance sampling: 1️⃣ cluster the generalist dataset, 2️⃣ resample the clusters with their prior estimated from the tiny specialist data, 3️⃣ Done! 🏁 3/4
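A minimal sketch of that recipe, assuming documents already come with embeddings; the k-means choice, cluster count, and function names are illustrative, not the paper's implementation.

```python
# Illustrative clustered importance sampling:
# 1) cluster the generic corpus, 2) estimate each cluster's prior from the tiny
# specialist set, 3) resample generic documents according to those priors.
import numpy as np
from sklearn.cluster import KMeans

def clustered_resample(generic_emb, specialist_emb, generic_docs,
                       k=64, n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, random_state=seed).fit(generic_emb)          # 1) cluster
    spec_clusters = km.predict(specialist_emb)
    target = np.bincount(spec_clusters, minlength=k) / len(spec_clusters)  # 2) specialist prior
    cluster_sizes = np.bincount(km.labels_, minlength=k)
    doc_probs = target[km.labels_] / np.maximum(cluster_sizes[km.labels_], 1)
    doc_probs /= doc_probs.sum()
    idx = rng.choice(len(generic_docs), size=n_samples, replace=True, p=doc_probs)  # 3) resample
    return [generic_docs[i] for i in idx]
```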
Generalist LLMs need scale: ➡️ large models to fit large generic training sets. Specialist LLMs can be more efficient: ➡️ small models are accurate when addressing a few tasks. But a specialist LLM needs to be trained on specialist data. 🤔 What to do if such data is lacking? 2/4
New paper! https://t.co/k0EMaZiQfN Clustered importance sampling to build specialist language models (LMs). 🤔 Build a specialist LM with very little specialist data. 💡 How? Generalist data + efficient, scalable importance sampling. w/ @Olivia61368522, Skyler Seto, @PierreAblin. 1/4
AdEMAMix optimizer for JAX/PyTorch: change one line of code, train your model faster.
🎇 Official PyTorch/JAX implementation of AdEMAMix 🎇 https://t.co/fPQRioY9M0 Drop-in replacement for AdamW, much faster LLM pre-training! 🚀🚀🚀🚀
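A sketch of what the one-line swap looks like in PyTorch; the import path and constructor arguments below are assumptions for illustration, so check the linked repo's README for the actual API.

```python
import torch
# Hypothetical import path for the official implementation; see the linked repo for the real one.
from ademamix import AdEMAMix

model = torch.nn.Linear(512, 512)  # stand-in for your model

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)  # before
optimizer = AdEMAMix(model.parameters(), lr=1e-3, weight_decay=0.1)             # after: the one-line change
```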
Faster, better model training by reusing old gradients (>10k steps ago) with negligible extra computation? Count me in.
Stop discarding your old gradients! Introducing AdEMAMix, a novel (first-order) optimizer capable of outperforming Adam. Let’s have a thread on momentum and the surprising relevance of very old gradients. A joint work with @GrangierDavid and @PierreAblin #ml #optimization 1/🧵
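Roughly, the idea is to add a slow exponential moving average of the gradient (decay β₃ close to 1, hence the very old gradients) to Adam's fast one before the usual normalization. A sketch of the step, omitting the schedulers the paper uses for α and β₃:

```latex
% Hedged sketch of an AdEMAMix-style update; hats denote Adam-style bias correction,
% g_t is the gradient, eta the learning rate, lambda the weight decay.
\begin{aligned}
m^{(1)}_t &= \beta_1\, m^{(1)}_{t-1} + (1-\beta_1)\, g_t \qquad \text{(fast EMA, as in Adam)}\\
m^{(2)}_t &= \beta_3\, m^{(2)}_{t-1} + (1-\beta_3)\, g_t \qquad \text{(slow EMA, } \beta_3 \text{ close to 1)}\\
v_t       &= \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2 \\
\theta_t  &= \theta_{t-1} - \eta \left( \frac{\hat{m}^{(1)}_t + \alpha\, m^{(2)}_t}{\sqrt{\hat{v}_t} + \varepsilon} + \lambda\, \theta_{t-1} \right)
\end{aligned}
```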
2/2 PN is a high-capacity network whose parameters can be linearly projected into a small network. This strategy enables both high capacity during training and efficient inference. See the details at our poster on Friday morning and afternoon. https://t.co/wdtXz9n3yb
https://t.co/q4v86N4Wjq (openreview.net): Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference but their lower capacity means that their performance can be...
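A toy sketch of the projected-network idea described above; the layer structure and shapes are illustrative guesses, not the paper's architecture. The trainable parameters are large, but the weights actually applied, and the only ones kept at inference, are a linear projection of them into a small matrix.

```python
import torch
import torch.nn as nn

class ProjectedLinear(nn.Module):
    """Toy projected layer: a large trainable weight matrix is linearly projected
    into the small weight matrix that the layer actually uses."""
    def __init__(self, d_in: int, d_out: int, d_big: int):
        super().__init__()
        self.big_weight = nn.Parameter(torch.randn(d_big, d_big) * 0.02)  # high-capacity parameters
        self.proj_out = nn.Parameter(torch.randn(d_out, d_big) * 0.02)    # linear projections
        self.proj_in = nn.Parameter(torch.randn(d_big, d_in) * 0.02)

    def small_weight(self) -> torch.Tensor:
        # (d_out, d_big) @ (d_big, d_big) @ (d_big, d_in) -> (d_out, d_in)
        return self.proj_out @ self.big_weight @ self.proj_in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.small_weight().T

# After training, materialize the small weights once and discard the large ones:
layer = ProjectedLinear(d_in=256, d_out=256, d_big=2048)
w_small = layer.small_weight().detach()  # all the inference model needs
```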
At ICML? Learn about our efficient projected language models! Adding capacity to a traditional language model improves accuracy but increases inference cost. How to avoid this? We propose a novel architecture, projected networks (PN).
2/2 Findings: when the application-specific training budget is large, importance sampling is great. Otherwise, asymmetric models (big at training time, small at inference, e.g. mixtures of experts or hyper-networks) are attractive and beat the popular distillation strategy.
New language model work! In practice, LMs often face a double constraint: (i) a small inference budget and (ii) little application-specific data. (i) calls for small specialized models at inference; (ii) calls for auxiliary generic data, e.g. for pretraining. 1/2 https://t.co/E7MrinEcLq
Our analysis proposes a simple test to check whether our method applies to your problem. Chat with us at our poster at the #neurips2023 DistShift workshop next week. Joint work with Pierre Ablin and Awni Hannun. (3/3)
Large models are often trained on massive web datasets and a bit of target-task data. In this setup, it is 👍 to spend more training effort on specific parts of the large set. Our online algorithm maintains a cheap auxiliary filter model while training the large model. (2/3)
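A generic sketch of this setup, a simplified stand-in rather than the paper's bilevel algorithm: a cheap filter model scores generic examples, the large model trains only on the top-scoring ones, and the filter is refreshed online, here as a simple target-vs-generic classifier.

```python
import itertools
import torch
import torch.nn.functional as F

def train_with_filter(big_model, filter_model, generic_loader, target_loader,
                      big_opt, filter_opt, keep_ratio=0.25):
    target_iter = itertools.cycle(target_loader)
    for x, y in generic_loader:
        # 1) The cheap filter scores each generic example by "target-likeness".
        with torch.no_grad():
            scores = filter_model(x).squeeze(-1)
        k = max(1, int(keep_ratio * len(x)))
        keep = scores.topk(k).indices

        # 2) The large model trains only on the selected subset.
        big_opt.zero_grad()
        F.cross_entropy(big_model(x[keep]), y[keep]).backward()
        big_opt.step()

        # 3) Keep the filter up to date: generic examples -> 0, target examples -> 1.
        tx, _ = next(target_iter)
        inputs = torch.cat([x, tx])
        labels = torch.cat([torch.zeros(len(x)), torch.ones(len(tx))])
        filter_opt.zero_grad()
        F.binary_cross_entropy_with_logits(filter_model(inputs).squeeze(-1), labels).backward()
        filter_opt.step()
```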
Efficient bilevel algorithm for training data selection https://t.co/dGWDOin2BJ
#bilevel #data_selection #DomainAdaptation #distshift #llm #NeurIPS2023 An online algorithm for filtering large (pre)training sets for maximal impact on the target task. (1/3)
A Natural Diet: Towards Improving Naturalness of Machine Translation. Freitag, Vilar, Grangier, Cherry and Foster. https://t.co/blbYanprbY We study how to generate translations that appear more natural, i.e. like non-translated text originally written in the target language. 4/4
Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. Freitag, Foster, Grangier, Ratnakar, Tan, Macherey. https://t.co/0qNeCNjkfv Evaluating machine translation with tools designed to evaluate high-quality human translation. 3/4