Ilia Kulikov Profile
Ilia Kulikov

@uralik1

Followers: 520 · Following: 696 · Media: 23 · Statuses: 118

ramming llms @MetaAI

New York, NY
Joined March 2018
@uralik1
Ilia Kulikov
5 months
RT @jaseweston: 🥥🌪️ Introducing CoCoMix - an LLM pretraining framework that predicts concepts and mixes them into its hidden state to improv…
0
59
0
@uralik1
Ilia Kulikov
5 months
We are using fairseq2 for LLM post-training research in our team. This release comes with decent documentation 😅. My favorite feature of the lib is the runtime extension support: one can develop research code without forking the entire lib repo!
@fairseq2
fairseq2
5 months
👋 Hello world! We’re thrilled to announce the v0.4 release of fairseq2 — an open-source library from FAIR powering many projects at Meta. pip install fairseq2 and explore our trainer API, instruction & preference finetuning (up to 70B), and native vLLM integration.
0
0
7
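On the runtime extension point above: libraries with this kind of support typically discover third-party code through Python packaging entry points instead of requiring a fork. The sketch below shows only that generic pattern; the group name "fairseq2.extension" and the zero-argument setup callback are illustrative assumptions, not fairseq2's documented API.

```python
# Generic sketch of entry-point-based runtime extensions. The group name
# "fairseq2.extension" and the zero-argument callback are assumptions for
# illustration, not fairseq2's documented API.
from importlib.metadata import entry_points


def load_extensions(group: str = "fairseq2.extension") -> None:
    """Import and run every extension registered under `group`."""
    for ep in entry_points(group=group):
        setup_fn = ep.load()  # import the callback declared by the extension package
        setup_fn()            # let the extension register its own components


if __name__ == "__main__":
    # A separate research package declares its callback in its own packaging
    # metadata; nothing in the host library needs to be forked or edited.
    load_extensions()
```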
@uralik1
Ilia Kulikov
7 months
Interested in LLM inference algorithms? Please come and watch our tutorial next week!
@wellecks
Sean Welleck
7 months
Curious about inference-time scaling, the #1 trending topic in LLMs? Come to our NeurIPS tutorial: Beyond Decoding: Meta-Generation Algorithms for LLMs (Tue. @ 1:30)!
Tweet media one
Tweet media two
0
0
23
@uralik1
Ilia Kulikov
3 years
@kchonyc @jaseweston thanks for being such amazing advisors.
0
0
10
@uralik1
Ilia Kulikov
3 years
literally me now:
Tweet media one
Tweet media two
12
1
246
@uralik1
Ilia Kulikov
4 years
paper: code: happy holidays🎄.
0
0
7
@uralik1
Ilia Kulikov
4 years
@kchonyc The proposed regularization expands the dynamic range of the probability and rank of EOS when it is not supposed to be generated. We see improvements in translation quality when large beam sizes (up to 1000) are used. But the gap between performance with smaller and larger beams is still there! (2/n)
1
0
4
@uralik1
Ilia Kulikov
4 years
🚨 New research! The probability of short sequences tends to be too high with autoregressive NMT (and beyond). We quantified this tendency and defined the oversmoothing rate. We minimize its upper bound, the oversmoothing loss, and present our findings! w/ Maksim Eremeev and @kchonyc (1/n)
Tweet media one
2
8
82
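To make the EOS tendency above concrete, here is a rough illustrative statistic rather than the paper's exact definition of the oversmoothing rate: the fraction of non-terminal positions where the model assigns more probability to EOS than to the reference continuation token. `log_probs`, `targets`, and `eos_id` are hypothetical inputs.

```python
import torch


def eos_preference_rate(log_probs: torch.Tensor, targets: torch.Tensor, eos_id: int) -> float:
    # log_probs: (seq_len, vocab_size) per-step log-probabilities from an
    # autoregressive model; targets: (seq_len,) reference token ids, whose
    # final entry is the true EOS.
    eos_lp = log_probs[:-1, eos_id]  # EOS score at non-terminal steps
    ref_lp = log_probs[:-1].gather(1, targets[:-1].unsqueeze(1)).squeeze(1)  # score of the correct next token
    # Fraction of non-terminal steps where the model prefers to stop early.
    return (eos_lp > ref_lp).float().mean().item()
```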
@uralik1
Ilia Kulikov
4 years
Apparently ancestral sampling yields high-quality translations if we sample enough times, but how do we choose one of them in the end? @BryanEikema shows how to scale utility computations over large hypothesis spaces efficiently! Very cool.
@BryanEikema
Bryan Eikema
4 years
Check out our latest work on minimum Bayes risk decoding for NMT! We show that MBR is a robust decision rule and sampling-based approximations scale well with more computation. Unlike MAP, more computation always improves translation quality. paper: 1/4
Tweet media one
0
0
2
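As a minimal sketch of the sampling-based MBR idea in the quoted thread: draw candidates by ancestral sampling, then pick the one with the highest expected utility against the other samples, which act as pseudo-references. `sample_translation` and `utility` are hypothetical stand-ins (e.g., a single sample from an NMT model and a sentence-level similarity metric).

```python
from typing import Callable, List


def mbr_decode(sample_translation: Callable[[], str],
               utility: Callable[[str, str], float],
               num_samples: int = 100) -> str:
    # Draw candidate translations by ancestral sampling from the model.
    samples: List[str] = [sample_translation() for _ in range(num_samples)]

    def expected_utility(hyp: str) -> float:
        # Monte Carlo estimate of the expected utility of `hyp`, using the
        # sampled candidates themselves as pseudo-references.
        return sum(utility(hyp, ref) for ref in samples) / len(samples)

    # Return the minimum-Bayes-risk (maximum expected utility) candidate.
    return max(samples, key=expected_utility)
```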
@uralik1
Ilia Kulikov
4 years
It is over now. Thanks everyone, I thought we would break the gathertown infra!!
Tweet media one
0
0
9
@uralik1
Ilia Kulikov
4 years
phew, the room doesn't look overcrowded yet
Tweet media one
1
0
4
@uralik1
Ilia Kulikov
4 years
Poster #90 in gather town, I am there *now*!
@uralik1
Ilia Kulikov
4 years
🚨 Our new research at the SPNLP21 workshop! 🚨 Done with @wellecks & @kchonyc. We worked out a way to measure the mode mismatch on the sequence level between distributions along the sequence modeling pipeline. But it is not easy to use it with big models in real-world tasks. (1/6)
Tweet media one
1
1
5
@uralik1
Ilia Kulikov
4 years
RT @wellecks: "Mode recovery in neural autoregressive sequence modeling". We study mismatches between the most probable sequences of each s…
0
3
0
@uralik1
Ilia Kulikov
4 years
We have both a live QA and a poster on Friday afternoon (EST timezone). Talk & poster: (underline link). arXiv: . We will be happy to discuss our work with you, please join the session! (6/6)
0
0
0
@uralik1
Ilia Kulikov
4 years
Finally, we studied mode recovery under different learning chain settings and found that (1) the mode recovery cost is non-trivial at every step of the chain and (2) the pattern of mode recovery heavily depends on the properties of the ground-truth distribution. (5/6)
1
0
0
@uralik1
Ilia Kulikov
4 years
We designed a tractable ('toy') setup covering the entire learning chain over a small enough sequence-level space, which allowed us to perform exact search over the distributions of interest. This design is open for discussion, so we are looking for your feedback! (4/6)
Tweet media one
1
0
0
@uralik1
Ilia Kulikov
4 years
Then we introduce the *mode recovery cost*, which quantifies mode degradation between a pair of distributions. This cost requires computing the top-k mode set of a distribution at the sequence level, which makes it less practical to apply in real-world settings. (3/6)
Tweet media one
1
0
1
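To make the top-k mode set idea concrete, below is a simplified illustration rather than the paper's exact cost: over a sequence space small enough to enumerate, take each distribution's k most probable sequences and measure how much of the first distribution's mode set the second one recovers.

```python
def top_k_modes(probs: dict, k: int) -> set:
    # `probs` maps each sequence (e.g., a tuple of tokens) to its probability;
    # exact search is only feasible because the toy space is fully enumerable.
    return set(sorted(probs, key=probs.get, reverse=True)[:k])


def mode_recovery(p: dict, q: dict, k: int) -> float:
    # Fraction of p's top-k mode set that also appears in q's top-k mode set:
    # 1.0 means the modes are fully recovered, lower values indicate degradation.
    return len(top_k_modes(p, k) & top_k_modes(q, k)) / k


# Tiny usage example over length-2 sequences from a 2-token vocabulary.
p = {("a", "a"): 0.40, ("a", "b"): 0.30, ("b", "a"): 0.20, ("b", "b"): 0.10}
q = {("a", "a"): 0.35, ("a", "b"): 0.05, ("b", "a"): 0.40, ("b", "b"): 0.20}
print(mode_recovery(p, q, k=2))  # 0.5 -- only ("a", "a") is recovered by q
```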
@uralik1
Ilia Kulikov
4 years
We start by introducing the learning chain of sequence-level distributions and hypothesize that the mode degradation we observe in real-world tasks may be rooted in different links of the corresponding chain. (2/6)
Tweet media one
1
0
0
@uralik1
Ilia Kulikov
4 years
🚨 Our new research at the SPNLP21 workshop! 🚨 Done with @wellecks & @kchonyc. We worked out a way to measure the mode mismatch on the sequence level between distributions along the sequence modeling pipeline. But it is not easy to use it with big models in real-world tasks. (1/6)
Tweet media one
1
7
14
@uralik1
Ilia Kulikov
4 years
haha, they got me laughing when I saw this new iPad translation feature in the WWDC video. At least it looks like they are not cherry-picking there!! (the word жаренсезонных, a mash-up of 'fried' and 'seasonal', did not really exist until today 😂)
Tweet media one
0
0
4