
David Grangier
@GrangierDavid
Followers 442 · Following 23 · Media 9 · Statuses 35
ML research with practical impact.
Joined December 2019
#ICLR #TrainLLMBetter Tomorrow: #soup of experts, a #hypernetwork conditioned on a simple description of the test distribution, enabling adaptation without retraining (Modularity workshop, Sunday). https://t.co/Cc72NyyJpI Still on today: CRISP importance sampling for LLM pretraining.
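A rough illustration of the idea, not the paper's method: the sketch below assumes the hypernetwork maps a test-distribution descriptor to mixing weights over pre-trained expert parameters, so adaptation is a weighted average of experts rather than retraining. All names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoupHypernetwork(nn.Module):
    """Toy sketch: map a test-distribution descriptor to mixing weights
    over a fixed set of flattened expert parameter vectors."""
    def __init__(self, descriptor_dim: int, num_experts: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(descriptor_dim, 64), nn.ReLU(),
            nn.Linear(64, num_experts),
        )

    def forward(self, descriptor: torch.Tensor, expert_params: torch.Tensor) -> torch.Tensor:
        # expert_params: (num_experts, num_params) flattened expert weights
        weights = torch.softmax(self.scorer(descriptor), dim=-1)  # (num_experts,)
        return weights @ expert_params                            # adapted parameter vector

# Usage: adapt to a new test distribution without any retraining.
hyper = SoupHypernetwork(descriptor_dim=8, num_experts=4)
experts = torch.randn(4, 10_000)   # pretend flattened expert weights
descr = torch.rand(8)              # simple description of the test distribution
adapted = hyper(descr, experts)    # parameters of the adapted model
```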
3/3 Mixture of experts on high-latency networks with No Need to Talk https://t.co/sMPj55XdDp (Thu Apr 24, 3pm). Joint work with @MatPagliardini, @NasFilippova, @PierreAblin, @olivia61368522, Skyler Seto, @angeloskath, Ronan Collobert.
2/3 Importance sampling for a better pretraining distribution with CRISP https://t.co/ShxRrGMkDB (Sat Apr 26, 10am).
#ICLR #TrainBetterLM I am at ICLR; come to our posters on improved language model training! Recycle gradients for faster neural net training with AdEMAMix https://t.co/eR3r0TSRJH (Fri Apr 25, 10am). 1/3
⦿ Efficient, scalable approach on LM and Q&A domains. ⦿ Single & multitask. ⦿ Pretraining & continued pretraining. ⦿ Ablations on data size, model size... https://t.co/k0EMaZiQfN 4/4
🚀Easy with clustered importance sampling: 1️⃣ cluster the generalist dataset, 2️⃣ resample the clusters with their prior estimated from the tiny specialist data, 3️⃣ Done! 🏁 3/4
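A minimal sketch of that recipe, assuming documents already come with embeddings; the k-means choice, cluster count, and function names are illustrative, not the paper's implementation.

```python
# Illustrative clustered importance sampling:
# 1) cluster the generic corpus, 2) estimate each cluster's prior from the tiny
# specialist set, 3) resample generic documents according to those priors.
import numpy as np
from sklearn.cluster import KMeans

def clustered_resample(generic_emb, specialist_emb, generic_docs,
                       k=64, n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, random_state=seed).fit(generic_emb)          # 1) cluster
    spec_clusters = km.predict(specialist_emb)
    target = np.bincount(spec_clusters, minlength=k) / len(spec_clusters)  # 2) specialist prior
    cluster_sizes = np.bincount(km.labels_, minlength=k)
    doc_probs = target[km.labels_] / np.maximum(cluster_sizes[km.labels_], 1)
    doc_probs /= doc_probs.sum()
    idx = rng.choice(len(generic_docs), size=n_samples, replace=True, p=doc_probs)  # 3) resample
    return [generic_docs[i] for i in idx]
```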
Generalist LLMs need scale: ➡️ large models to fit large generic training sets. Specialist LLMs can be more efficient: ➡️ small models are accurate when addressing a few tasks. But a specialist LLM needs to be trained on specialist data. 🤔 What to do if such data is lacking? 2/4
New paper! https://t.co/k0EMaZiQfN Clustered importance sampling to build specialist language models (LMs). 🤔 Build a specialist LM with very little specialist data. 💡 How? Generalist data + efficient, scalable importance sampling. w/ @Olivia61368522, Skyler Seto, @PierreAblin. 1/4
AdEMAMix optimizer for JAX/PyTorch: change one line of code, train your model faster.
🎇 Official PyTorch/JAX implementation of AdEMAMix 🎇 https://t.co/fPQRioY9M0 Drop-in replacement for AdamW, much faster LLM pre-training! 🚀🚀🚀🚀
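A sketch of what the one-line swap looks like in PyTorch; the import path and constructor arguments below are assumptions for illustration, so check the linked repo's README for the actual API.

```python
import torch
# Hypothetical import path for the official implementation; see the linked repo for the real one.
from ademamix import AdEMAMix

model = torch.nn.Linear(512, 512)  # stand-in for your model

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)  # before
optimizer = AdEMAMix(model.parameters(), lr=1e-3, weight_decay=0.1)             # after: the one-line change
```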
Faster, better model training by reusing old gradients (>10k steps ago) with negligible extra computation? Count me in.
Stop discarding your old gradients! Introducing AdEMAMix, a novel (first-order) optimizer capable of outperforming Adam. Let’s have a thread on momentum and the surprising relevance of very old gradients. A joint work with @GrangierDavid and @PierreAblin #ml #optimization 1/🧵
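Roughly, the idea is to add a slow exponential moving average of the gradient (decay β₃ close to 1, hence the very old gradients) to Adam's fast one before the usual normalization. A sketch of the step, omitting the schedulers the paper uses for α and β₃:

```latex
% Hedged sketch of an AdEMAMix-style update; hats denote Adam-style bias correction,
% g_t is the gradient, eta the learning rate, lambda the weight decay.
\begin{aligned}
m^{(1)}_t &= \beta_1\, m^{(1)}_{t-1} + (1-\beta_1)\, g_t \qquad \text{(fast EMA, as in Adam)}\\
m^{(2)}_t &= \beta_3\, m^{(2)}_{t-1} + (1-\beta_3)\, g_t \qquad \text{(slow EMA, } \beta_3 \text{ close to 1)}\\
v_t       &= \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2 \\
\theta_t  &= \theta_{t-1} - \eta \left( \frac{\hat{m}^{(1)}_t + \alpha\, m^{(2)}_t}{\sqrt{\hat{v}_t} + \varepsilon} + \lambda\, \theta_{t-1} \right)
\end{aligned}
```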
2/2 PN is a high-capacity network whose parameters can be linearly projected into a small network. This strategy enables both high capacity during training and efficient inference. See the details at our poster on Friday morning and afternoon. https://t.co/wdtXz9n3yb
https://t.co/q4v86N4Wjq (openreview.net): Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference but their lower capacity means that their performance can be...
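A toy sketch of the projected-network idea described above; the layer structure and shapes are illustrative guesses, not the paper's architecture. The trainable parameters are large, but the weights actually applied, and the only ones kept at inference, are a linear projection of them into a small matrix.

```python
import torch
import torch.nn as nn

class ProjectedLinear(nn.Module):
    """Toy projected layer: a large trainable weight matrix is linearly projected
    into the small weight matrix that the layer actually uses."""
    def __init__(self, d_in: int, d_out: int, d_big: int):
        super().__init__()
        self.big_weight = nn.Parameter(torch.randn(d_big, d_big) * 0.02)  # high-capacity parameters
        self.proj_out = nn.Parameter(torch.randn(d_out, d_big) * 0.02)    # linear projections
        self.proj_in = nn.Parameter(torch.randn(d_big, d_in) * 0.02)

    def small_weight(self) -> torch.Tensor:
        # (d_out, d_big) @ (d_big, d_big) @ (d_big, d_in) -> (d_out, d_in)
        return self.proj_out @ self.big_weight @ self.proj_in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.small_weight().T

# After training, materialize the small weights once and discard the large ones:
layer = ProjectedLinear(d_in=256, d_out=256, d_big=2048)
w_small = layer.small_weight().detach()  # all the inference model needs
```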
At ICML? Learn about our efficient projected language models! Adding capacity to a traditional language model improves accuracy but increases inference cost. How to avoid this? We propose a novel architecture, projected networks (PN).
2/2 Findings: when the application-specific training budget is large, importance sampling is great. Otherwise, asymmetric models (big at training time, small at inference, e.g. mixtures of experts or hyper-networks) are attractive and beat the popular distillation strategy.
New language model work! In practice, LMs often face a double constraint: (i) a small inference budget and (ii) little application-specific data. (i) calls for small specialized models at inference; (ii) calls for auxiliary generic data, e.g. for pretraining. 1/2 https://t.co/E7MrinEcLq
Our analysis proposes a simple test to check whether our method applies to your problem. Chat with us at our poster at the #neurips2023 DistShift workshop next week. Joint work with Pierre Ablin and Awni Hannun. (3/3)
Large models are often trained on massive web datasets and a bit of target-task data. In this setup, it is 👍 to spend more training effort on specific parts of the large set. Our online algorithm maintains a cheap auxiliary filter model while training the large model. (2/3)
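A generic sketch of this setup, a simplified stand-in rather than the paper's bilevel algorithm: a cheap filter model scores generic examples, the large model trains only on the top-scoring ones, and the filter is refreshed online, here as a simple target-vs-generic classifier.

```python
import itertools
import torch
import torch.nn.functional as F

def train_with_filter(big_model, filter_model, generic_loader, target_loader,
                      big_opt, filter_opt, keep_ratio=0.25):
    target_iter = itertools.cycle(target_loader)
    for x, y in generic_loader:
        # 1) The cheap filter scores each generic example by "target-likeness".
        with torch.no_grad():
            scores = filter_model(x).squeeze(-1)
        k = max(1, int(keep_ratio * len(x)))
        keep = scores.topk(k).indices

        # 2) The large model trains only on the selected subset.
        big_opt.zero_grad()
        F.cross_entropy(big_model(x[keep]), y[keep]).backward()
        big_opt.step()

        # 3) Keep the filter up to date: generic examples -> 0, target examples -> 1.
        tx, _ = next(target_iter)
        inputs = torch.cat([x, tx])
        labels = torch.cat([torch.zeros(len(x)), torch.ones(len(tx))])
        filter_opt.zero_grad()
        F.binary_cross_entropy_with_logits(filter_model(inputs).squeeze(-1), labels).backward()
        filter_opt.step()
```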
Efficient bilevel algorithm for training data selection https://t.co/dGWDOin2BJ
#bilevel #data_selection #DomainAdaptation #distshift #llm #NeurIPS2023 An online algorithm for filtering large (pre)training sets for maximal impact on the target task. (1/3)
A Natural Diet: Towards Improving Naturalness of Machine Translation. Freitag, Vilar, Grangier, Cherry and Foster. https://t.co/blbYanprbY We study how to generate translations that appear more natural, i.e. like non-translated text originally written in the target language. 4/4
Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. Freitag, Foster, Grangier, Ratnakar, Tan, Macherey. https://t.co/0qNeCNjkfv Evaluating machine translation with tools designed to evaluate high-quality human translation. 3/4