Ilia Kulikov Profile
Ilia Kulikov

@uralik1

Followers: 520 · Following: 696 · Media: 23 · Statuses: 118

ramming llms @MetaAI

New York, NY
Joined March 2018
@uralik1
Ilia Kulikov
5 months
RT @jaseweston: 🥥🌪️ Introducing CoCoMix - an LLM pretraining framework that predicts concepts and mixes them into its hidden state to improv…
0
59
0
@uralik1
Ilia Kulikov
5 months
We are using fairseq2 for LLM post-training research in our team. This release comes with decent documentation 😅. My favorite feature of the lib is the runtime extension support: one can develop research code without forking the entire lib repo!
@fairseq2
fairseq2
5 months
👋 Hello world! We’re thrilled to announce the v0.4 release of fairseq2 — an open-source library from FAIR powering many projects at Meta. pip install fairseq2 and explore our trainer API, instruction & preference finetuning (up to 70B), and native vLLM integration.
0
0
7
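On the runtime extension point above: libraries with this kind of support typically discover third-party code through Python packaging entry points instead of requiring a fork. The sketch below shows only that generic pattern; the group name "fairseq2.extension" and the zero-argument setup callback are illustrative assumptions, not fairseq2's documented API.

```python
# Generic sketch of entry-point-based runtime extensions. The group name
# "fairseq2.extension" and the zero-argument callback are assumptions for
# illustration, not fairseq2's documented API.
from importlib.metadata import entry_points


def load_extensions(group: str = "fairseq2.extension") -> None:
    """Import and run every extension registered under `group`."""
    for ep in entry_points(group=group):
        setup_fn = ep.load()  # import the callback declared by the extension package
        setup_fn()            # let the extension register its own components


if __name__ == "__main__":
    # A separate research package declares its callback in its own packaging
    # metadata; nothing in the host library needs to be forked or edited.
    load_extensions()
```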
@uralik1
Ilia Kulikov
7 months
Interested in LLM inference algorithms? Please come and watch our tutorial next week!
@wellecks
Sean Welleck
7 months
Curious about inference-time scaling, the #1 trending topic in LLMs? Come to our NeurIPS tutorial: Beyond Decoding: Meta-Generation Algorithms for LLMs (Tue. @ 1:30)!
Tweet media one
Tweet media two
0
0
23
@uralik1
Ilia Kulikov
3 years
@kchonyc @jaseweston thanks for being such amazing advisors.
0
0
10
@uralik1
Ilia Kulikov
3 years
literally me now:
Tweet media one
Tweet media two
12
1
246
@uralik1
Ilia Kulikov
4 years
paper: code: happy holidays🎄.
0
0
7
@uralik1
Ilia Kulikov
4 years
@kchonyc The proposed regularization expands the dynamic range of the probability and rank of EOS when it is not supposed to be generated. We see improvements in translation quality when large beam sizes (up to 1000) are used. But the gap between performance with smaller and larger beams is still there! (2/n)
1
0
4
@uralik1
Ilia Kulikov
4 years
🚨 New research! The probability of short sequences tends to be too high with autoregressive NMT (and beyond). We quantified this tendency and defined the oversmoothing rate. We minimize its upper bound, the oversmoothing loss, and present our findings! w/ Maksim Eremeev and @kchonyc (1/n)
Tweet media one
2
8
82
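To make the EOS tendency above concrete, here is a rough illustrative statistic rather than the paper's exact definition of the oversmoothing rate: the fraction of non-terminal positions where the model assigns more probability to EOS than to the reference continuation token. `log_probs`, `targets`, and `eos_id` are hypothetical inputs.

```python
import torch


def eos_preference_rate(log_probs: torch.Tensor, targets: torch.Tensor, eos_id: int) -> float:
    # log_probs: (seq_len, vocab_size) per-step log-probabilities from an
    # autoregressive model; targets: (seq_len,) reference token ids, whose
    # final entry is the true EOS.
    eos_lp = log_probs[:-1, eos_id]  # EOS score at non-terminal steps
    ref_lp = log_probs[:-1].gather(1, targets[:-1].unsqueeze(1)).squeeze(1)  # score of the correct next token
    # Fraction of non-terminal steps where the model prefers to stop early.
    return (eos_lp > ref_lp).float().mean().item()
```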
@uralik1
Ilia Kulikov
4 years
Apparently ancestral sampling yields high-quality translations if we sample enough times, but how do we choose one of them in the end? @BryanEikema shows how to scale utility computations over large hypothesis spaces efficiently! Very cool.
@BryanEikema
Bryan Eikema
4 years
Check out our latest work on minimum Bayes risk decoding for NMT! We show that MBR is a robust decision rule and sampling-based approximations scale well with more computation. Unlike MAP, more computation always improves translation quality. paper: 1/4
Tweet media one
0
0
2
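As a minimal sketch of the sampling-based MBR idea in the quoted thread: draw candidates by ancestral sampling, then pick the one with the highest expected utility against the other samples, which act as pseudo-references. `sample_translation` and `utility` are hypothetical stand-ins (e.g., a single sample from an NMT model and a sentence-level similarity metric).

```python
from typing import Callable, List


def mbr_decode(sample_translation: Callable[[], str],
               utility: Callable[[str, str], float],
               num_samples: int = 100) -> str:
    # Draw candidate translations by ancestral sampling from the model.
    samples: List[str] = [sample_translation() for _ in range(num_samples)]

    def expected_utility(hyp: str) -> float:
        # Monte Carlo estimate of the expected utility of `hyp`, using the
        # sampled candidates themselves as pseudo-references.
        return sum(utility(hyp, ref) for ref in samples) / len(samples)

    # Return the minimum-Bayes-risk (maximum expected utility) candidate.
    return max(samples, key=expected_utility)
```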
@uralik1
Ilia Kulikov
4 years
It is over now. Thanks everyone, I thought we would break the gathertown infra!!
Tweet media one
0
0
9
@uralik1
Ilia Kulikov
4 years
phew, the room doesn't look overcrowded yet
Tweet media one
1
0
4
@uralik1
Ilia Kulikov
4 years
Poster #90 in gather town, I am there *now*!
@uralik1
Ilia Kulikov
4 years
🚨 Our new research at the SPNLP21 workshop! 🚨 Done with @wellecks & @kchonyc. We worked out a way to measure the mode mismatch on the sequence level between distributions along the sequence modeling pipeline. But it is not easy to use it with big models in real-world tasks. (1/6)
Tweet media one
1
1
5
@uralik1
Ilia Kulikov
4 years
RT @wellecks: "Mode recovery in neural autoregressive sequence modeling". We study mismatches between the most probable sequences of each s…
0
3
0
@uralik1
Ilia Kulikov
4 years
We have both a live QA and a poster on Friday afternoon (EST timezone). Talk & poster: (underline link). arXiv: . We will be happy to discuss our work with you, please join the session! (6/6)
0
0
0
@uralik1
Ilia Kulikov
4 years
Finally, we studied mode recovery under different learning chain settings and found that (1) the mode recovery cost is non-trivial at every step of the chain and (2) the pattern of mode recovery heavily depends on the properties of the ground-truth distribution. (5/6)
1
0
0
@uralik1
Ilia Kulikov
4 years
We designed a tractable ('toy') setup covering the entire learning chain over a small enough sequence-level space, which allowed us to perform exact search over the distributions of interest. This design is open for discussion, so we are looking for your feedback! (4/6)
Tweet media one
1
0
0
@uralik1
Ilia Kulikov
4 years
Then we introduce the *mode recovery cost*, which quantifies mode degradation between a pair of distributions. This cost requires computing the top-k mode set of a distribution at the sequence level, which makes it less practical to apply in real-world settings. (3/6)
Tweet media one
1
0
1
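To make the top-k mode set idea concrete, below is a simplified illustration rather than the paper's exact cost: over a sequence space small enough to enumerate, take each distribution's k most probable sequences and measure how much of the first distribution's mode set the second one recovers.

```python
def top_k_modes(probs: dict, k: int) -> set:
    # `probs` maps each sequence (e.g., a tuple of tokens) to its probability;
    # exact search is only feasible because the toy space is fully enumerable.
    return set(sorted(probs, key=probs.get, reverse=True)[:k])


def mode_recovery(p: dict, q: dict, k: int) -> float:
    # Fraction of p's top-k mode set that also appears in q's top-k mode set:
    # 1.0 means the modes are fully recovered, lower values indicate degradation.
    return len(top_k_modes(p, k) & top_k_modes(q, k)) / k


# Tiny usage example over length-2 sequences from a 2-token vocabulary.
p = {("a", "a"): 0.40, ("a", "b"): 0.30, ("b", "a"): 0.20, ("b", "b"): 0.10}
q = {("a", "a"): 0.35, ("a", "b"): 0.05, ("b", "a"): 0.40, ("b", "b"): 0.20}
print(mode_recovery(p, q, k=2))  # 0.5 -- only ("a", "a") is recovered by q
```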
@uralik1
Ilia Kulikov
4 years
We start by introducing the learning chain of sequence-level distributions and hypothesize that the mode degradation we observe in real-world tasks may be rooted in different links of the corresponding chain. (2/6)
Tweet media one
1
0
0
@uralik1
Ilia Kulikov
4 years
🚨 Our new research at the SPNLP21 workshop! 🚨 Done with @wellecks & @kchonyc. We worked out a way to measure the mode mismatch on the sequence level between distributions along the sequence modeling pipeline. But it is not easy to use it with big models in real-world tasks. (1/6)
Tweet media one
1
7
14
@uralik1
Ilia Kulikov
4 years
haha, they got me laughing when I saw this new iPad translation feature in the WWDC video. At least it looks like they are not cherry-picking there!! (the word жаренсезонных, a mash-up of 'fried' and 'seasonal', did not really exist until today 😂)
Tweet media one
0
0
4