David Grangier Profile
David Grangier

@GrangierDavid

Followers: 442 · Following: 23 · Media: 9 · Statuses: 35

ML research with practical impact.

Joined December 2019
@GrangierDavid
David Grangier
6 months
#ICLR #TrainLLMBetter Tomorrow: #soup of experts, a #hypernetwork conditioned on a simple description of the test distribution, giving adaptation without retraining (Modularity workshop, Sunday). https://t.co/Cc72NyyJpI Still on today: CRISP importance sampling for LLM pretraining.
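A minimal sketch of the idea, under loud assumptions: the test-distribution description is a vector of domain proportions, and the hypernetwork turns it into mixing weights over a bank of pretrained expert parameters (the "soup"); class and argument names are illustrative, not the paper's API.

```python
import torch
import torch.nn as nn

class SoupHypernetwork(nn.Module):
    """Illustrative: map a test-distribution descriptor to mixing weights over
    K expert parameter vectors, then average them into one specialist model."""

    def __init__(self, num_domains: int, num_experts: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_domains, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_experts),
        )

    def forward(self, descriptor: torch.Tensor, expert_params: torch.Tensor) -> torch.Tensor:
        # descriptor: (num_domains,) proportions describing the test distribution
        # expert_params: (num_experts, num_params) flattened pretrained expert weights
        mix = torch.softmax(self.mlp(descriptor), dim=-1)  # (num_experts,)
        return mix @ expert_params                         # weighted parameter average

# Usage sketch: adaptation is a single forward pass, no retraining.
# soup_weights = SoupHypernetwork(num_domains=10, num_experts=8)(test_descriptor, expert_bank)
```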
@GrangierDavid
David Grangier
6 months
3/3 Mixture of experts on high-latency networks with No Need to Talk https://t.co/sMPj55XdDp (Thu Apr 24, 3pm). Joint work with @MatPagliardini, @NasFilippova, @PierreAblin, @olivia61368522, Skyler Seto, @angeloskath, Ronan Collobert.
@GrangierDavid
David Grangier
6 months
2/3 Importance sampling for a better pretraining distribution with CRISP https://t.co/ShxRrGMkDB (Sat Apr 26, 10 am).
@GrangierDavid
David Grangier
6 months
#ICLR #TrainBetterLM I am at ICLR; come to our posters on improved language model training! Recycle gradients for faster neural net training with AdEMAMix https://t.co/eR3r0TSRJH (Fri Apr 25, 10 am). 1/3
@GrangierDavid
David Grangier
1 year
⦿ Efficient, scalable approach across LM and Q&A domains. ⦿ Single- & multi-task. ⦿ Pretraining & continued pretraining. ⦿ Ablations on data size, model size... https://t.co/k0EMaZiQfN 4/4
@GrangierDavid
David Grangier
1 year
🚀Easy with clustered importance sampling: 1️⃣ cluster the generalist dataset, 2️⃣ resample the clusters with their prior estimated from the tiny specialist dataset, 3️⃣ Done! 🏁 3/4
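A minimal sketch of that three-step recipe, assuming pre-computed text embeddings and an off-the-shelf k-means; the function and variable names are illustrative, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_importance_resample(generalist_emb, specialist_emb, generalist_texts,
                                  num_clusters=64, sample_size=100_000, seed=0):
    """Illustrative clustered importance sampling:
    1) cluster the generalist corpus in embedding space,
    2) estimate the cluster prior from the tiny specialist set,
    3) resample generalist examples according to that prior."""
    rng = np.random.default_rng(seed)

    # 1) Cluster the generalist dataset.
    km = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10)
    gen_cluster = km.fit_predict(generalist_emb)

    # 2) Cluster prior estimated on the tiny specialist data
    #    (assign each specialist example to its nearest centroid).
    spec_cluster = km.predict(specialist_emb)
    prior = np.bincount(spec_cluster, minlength=num_clusters).astype(float)
    prior /= prior.sum()

    # 3) Resample: per-example weight = specialist prior of its cluster
    #    divided by the generalist count of that cluster.
    gen_counts = np.bincount(gen_cluster, minlength=num_clusters).astype(float)
    weights = prior[gen_cluster] / np.maximum(gen_counts[gen_cluster], 1.0)
    weights /= weights.sum()
    idx = rng.choice(len(generalist_texts), size=sample_size, replace=True, p=weights)
    return [generalist_texts[i] for i in idx]
```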
@GrangierDavid
David Grangier
1 year
Generalist LLMs need scale: ➡️ large models to fit large generic training sets. Specialist LLMs can be more efficient: ➡️ small models are accurate when addressing few tasks. But specialist LLMs need to be trained on specialist data. 🤔 What to do if such data is lacking? 2/4
@GrangierDavid
David Grangier
1 year
New paper! https://t.co/k0EMaZiQfN Clustered importance sampling to build specialist language models (LMs). 🤔 Build a specialist LM with very little specialist data. 💡 How? Generalist data + efficient, scalable importance sampling. w/ @Olivia61368522, Skyler Seto, @PierreAblin 1/4
@GrangierDavid
David Grangier
1 year
AdEMAMix optimizer for JAX/PyTorch: change one line of code, train your model faster.
@PierreAblin
Pierre Ablin
1 year
🎇 Official PyTorch/JAX implementation of AdEMAMix 🎇 https://t.co/fPQRioY9M0 Drop-in replacement for AdamW, much faster LLM pretraining! 🚀🚀🚀🚀
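For the one-line swap, a hedged sketch of what the drop-in replacement could look like in PyTorch; the AdEMAMix import path and constructor arguments below are assumptions about the linked release, not a verified API.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for your network

# Before: standard AdamW.
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

# After: AdEMAMix as a drop-in replacement (assumed class name and constructor
# mirroring AdamW, with extra slow-EMA hyperparameters beta3 and alpha).
from ademamix import AdEMAMix  # assumed module name; check the released code
optimizer = AdEMAMix(model.parameters(), lr=1e-3, weight_decay=0.1,
                     betas=(0.9, 0.999, 0.9999), alpha=5.0)
```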
@GrangierDavid
David Grangier
1 year
Faster, better model training by reusing old gradients (>10k steps ago) with negligible extra computation? Count me in.
@MatPagliardini
Matteo Pagliardini
1 year
Stop discarding your old gradients! Introducing AdEMAMix, a novel (first-order) optimizer capable of outperforming Adam. Let’s have a thread on momentum and the surprising relevance of very old gradients. A joint work with @GrangierDavid and @PierreAblin #ml #optimization 1/🧵
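A minimal sketch of the update rule as described in the thread: Adam's fast EMA plus a second, very slow EMA that keeps old gradients alive; the exact bias corrections, schedulers, and defaults in the paper may differ from this simplification.

```python
import torch

def ademamix_style_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999, 0.9999),
                        alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One illustrative update: Adam's fast EMA plus a very slow second EMA of
    gradients (beta3 close to 1 keeps gradients from >10k steps ago alive)."""
    m1, m2, v, t = state["m1"], state["m2"], state["v"], state["t"] + 1
    beta1, beta2, beta3 = betas

    m1.mul_(beta1).add_(grad, alpha=1 - beta1)           # fast EMA (as in Adam)
    m2.mul_(beta3).add_(grad, alpha=1 - beta3)           # slow EMA of old gradients
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment

    m1_hat = m1 / (1 - beta1 ** t)                       # bias-correct the fast EMA
    v_hat = v / (1 - beta2 ** t)

    update = (m1_hat + alpha * m2) / (v_hat.sqrt() + eps)
    param.add_(update + weight_decay * param, alpha=-lr)  # decoupled weight decay
    state["t"] = t

# Usage sketch, per parameter p:
# state = {"m1": torch.zeros_like(p), "m2": torch.zeros_like(p),
#          "v": torch.zeros_like(p), "t": 0}
```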
@GrangierDavid
David Grangier
1 year
2/2 PN is a high-capacity network whose parameters can be linearly projected into a small network. This strategy enables both high capacity and efficient inference. See details at our poster on Friday morning and afternoon. https://t.co/wdtXz9n3yb https://t.co/q4v86N4Wjq
Linked abstract (openreview.net): Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference but their lower capacity means that their performance can be...
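A minimal sketch of "parameters linearly projected into a small network", under the assumption that each small weight matrix is a linear projection of a larger trained parameter bank; the module and shapes are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ProjectedLinear(nn.Module):
    """Illustrative projected layer: train a large parameter matrix, run
    inference with a small matrix obtained by a fixed linear projection."""

    def __init__(self, d_in: int, d_out: int, d_big: int):
        super().__init__()
        self.big = nn.Parameter(torch.randn(d_big, d_in) * 0.02)    # high-capacity parameters
        self.proj = nn.Parameter(torch.randn(d_out, d_big) * 0.02)  # linear projection
        self.register_buffer("small", torch.zeros(d_out, d_in))     # cached projected weights

    @torch.no_grad()
    def materialize(self):
        # Collapse once after training; inference then costs the same as a
        # plain d_in -> d_out linear layer.
        self.small = (self.proj @ self.big).detach()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            return x @ (self.proj @ self.big).t()  # high capacity during training
        return x @ self.small.t()                  # efficient inference
```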
@GrangierDavid
David Grangier
1 year
At ICML? Learn about our efficient projected language models! Adding capacity to a traditional language model improves accuracy but increases inference cost. How to avoid this? We propose a novel architecture, projected networks (PN).
@GrangierDavid
David Grangier
2 years
With Angelos Katharopoulos, Pierre Ablin, Awni Hannun.
@GrangierDavid
David Grangier
2 years
2/2 Findings: when the application-specific training budget is large, importance sampling is great. Otherwise, asymmetric models (big at training time, small at inference, e.g. mixtures of experts or hypernetworks) are attractive and outperform the popular distillation strategy.
@GrangierDavid
David Grangier
2 years
New language model work! In practice, LMs often face a double constraint: (i) a small inference budget + (ii) little application-specific data. (i) means small specialized models for inference; (ii) means using auxiliary generic data, e.g. for pretraining. 1/2 https://t.co/E7MrinEcLq
@GrangierDavid
David Grangier
2 years
Our analysis proposes a simple test to check whether our method applies to your problem. Chat with us at our poster at the #neurips2023 DistShift workshop next week. Joint work with Pierre Ablin, Awni Hannun. (3/3)
@GrangierDavid
David Grangier
2 years
Large models are often trained on massive web datasets and a bit of target-task data. In this setup, it is 👍 to spend more training effort on specific parts of the large set. Our online algorithm maintains a cheap auxiliary filter model while training the large model. (2/3)
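A hedged sketch of that filter idea: a cheap model scores generic examples, the big model trains only on the top-scoring ones, and the filter is periodically refreshed against target-task data. The scoring and refresh rules below are one plausible instantiation, not necessarily the paper's exact bilevel update; all names are illustrative.

```python
import torch
import torch.nn as nn

def train_with_online_filtering(big_model, filter_model, generic_loader,
                                target_loader, opt_big,
                                keep_ratio=0.25, filter_every=100):
    """Illustrative online data selection: score each generic batch with a
    cheap filter model, train the big model on the top-scoring examples only,
    and periodically refresh the filter using held-out target-task data."""
    loss_fn = nn.CrossEntropyLoss(reduction="none")
    for step, (x, y) in enumerate(generic_loader):
        # Score candidates with the cheap filter and keep the best fraction.
        with torch.no_grad():
            scores = filter_model(x).squeeze(-1)  # higher = more useful for the target
        k = max(1, int(keep_ratio * x.size(0)))
        keep = scores.topk(k).indices

        loss = loss_fn(big_model(x[keep]), y[keep]).mean()
        opt_big.zero_grad()
        loss.backward()
        opt_big.step()

        # Occasionally refresh the filter so selected data keeps helping the
        # target task (stand-in for the bilevel / target-loss signal).
        if step % filter_every == 0:
            xt, yt = next(iter(target_loader))
            target_loss = loss_fn(big_model(xt), yt).mean()
            # ... use target_loss (e.g. its change after selected steps) to
            # produce labels or weights for updating filter_model ...
```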
@GrangierDavid
David Grangier
2 years
Efficient bilevel algorithm for training data selection https://t.co/dGWDOin2BJ #bilevel #data_selection #DomainAdaptation #distshift #llm #NeurIPS2023 An online algorithm for filtering large (pre)training sets for maximal impact on the targeted task. (1/3)
@GrangierDavid
David Grangier
3 years
A Natural Diet: Towards Improving Naturalness of Machine Translation. Freitag, Vilar, Grangier, Cherry and Foster. https://t.co/blbYanprbY We study how to generate translations that appear more natural, i.e. like non-translated text originally written in the target language. 4/4
@GrangierDavid
David Grangier
3 years
Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. Freitag, Foster, Grangier, Ratnakar, Tan, Macherey. https://t.co/0qNeCNjkfv Evaluation of machine translation with tools designed to evaluate high-quality human translation. 3/4