Mahan Fathi
@MahanFathi
Followers
951
Following
208
Media
18
Statuses
71
llm research @nvidia👁️; ex @googledeepmind, @google🧠 & @mila_quebec.
Toronto, Ontario
Joined June 2011
We're looking for Summer Interns to join the Post-Training Team at @NVIDIA! DM me with your updated resume and three concise bullets detailing your most relevant experience — e.g. publications, repos, blogs, etc. RT please to help us find top talent.
13
35
460
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below🧵👇
33
125
874
NeMo RL is now open source! It replaces NeMo-Aligner and is the toolkit we use to post-train the next generations of our models. Give it a try!
github.com
Scalable toolkit for efficient model reinforcement - NVIDIA-NeMo/RL
5
65
396
The talk I gave @ Mila on learning linearized representations of dynamical systems (Koopman representations) is on YouTube. The work was mainly carried out by @MahanFathi in collaboration with @pierrelux 's lab, and was presented at ICLR 2024. https://t.co/EPlTCIQj5O
0
3
21
In-context learning (ICL) is one of the most exciting parts of the LLM boom. Sequence models (not just LLMs) implement on-the-fly models conditioned on inputs w/o weight updates! Q: are in-context models better than «in-weights» ones? A: sometimes ICL is better than standard opt.
Introducing our new paper explaining in-context learning through the lens of Occam’s razor, giving a normative account of next-token prediction objectives. This was with @Tom__Marty @tejaskasetty @le0gagn0n @sarthmit @MahanFathi @dhanya_sridhar @g_lajoie_
0
6
21
Introducing our new paper explaining in-context learning through the lens of Occam’s razor, giving a normative account of next-token prediction objectives. This was with @Tom__Marty @tejaskasetty @le0gagn0n @sarthmit @MahanFathi @dhanya_sridhar @g_lajoie_
arxiv.org
A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in...
3
24
103
life update: thrilled to announce that i’ll be joining @nvidia as a research scientist on the alignment team. grateful for the support from mentors and peers. this is a dream come true for both the researcher and the gamer in me!
33
4
410
Last week, I gave a talk at @Mila_Quebec. The talk should be of interest to anyone working on predictive models, particularly in latent space. In collab. with @MahanFathi @ClementGehring @J_Pilault @davidkanaa @pierrelux. See you at @iclr_conf in 🇦🇹! https://t.co/vFBtHDzNju
drive.google.com
0
5
18
Congrats to Mahan, who is finishing his Master's thesis in style with this second paper.
Course Correcting Koopman Representations Accepted at #ICLR2024! We identify problems with unrolling in imagination and propose an unconventional, simple, yet effective solution: periodically "𝒓𝒆𝒆𝒏𝒄𝒐𝒅𝒊𝒏𝒈" the latent. 📄 https://t.co/ULNzqAV3bB
@GoogleDeepMind 1/🧵
0
3
26
This was joint work between @GoogleDeepMind and @Mila_Quebec. Many thanks to my supervisors @RGoroshin and @pierrelux for their constant support and guidance throughout the project. Also props to @ClementGehring, @J_Pilault and @davidkanaa. See you in Vienna! ❤️ 14/14
1
0
3
We have more theory and experiments in the paper, including higher-dim systems like MuJoCo environments (with control inputs!). Periodic reencoding always leads to (big) improvements, only at the cost of introducing one inference-time hyperparam, the reencoding period. 13/
1
0
3
This method produces stable, accurate, long-range future state predictions while being fairly robust to the reencoding period, i.e. the number of steps taken in latent space before reencoding happens.
`reencode @ 0` -> no reencoding
`reencode @ 1` -> every-step reencoding
12/
1
0
2
So far we have found out that 1) reencoding is necessary, and 2) it introduces its own error. We have discovered an effective tool, although imperfect. So let's use it in moderation. Enter "Periodic Reencoding!" Here we reencode the representations every so often (k steps). 11/
1
0
2
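A minimal sketch of what periodic reencoding could look like in plain NumPy. The `encoder`, `decoder`, and Koopman matrix `K` here are hypothetical stand-ins for the learned components, not the paper's actual code:

```python
import numpy as np

def unroll_periodic(x0, encoder, decoder, K, num_steps, k):
    """Unroll a learned Koopman model, reencoding the latent every k steps.

    k = 0 -> never reencode (naive unrolling)
    k = 1 -> reencode at every step
    """
    z = encoder(x0)                 # initial latent (z)
    xs = [x0]
    for t in range(1, num_steps + 1):
        z = K @ z                   # linear step in latent space
        x = decoder(z)              # map back to observation space
        xs.append(x)
        if k > 0 and t % k == 0:
            z = encoder(x)          # periodic reencoding: reset the latent
    return np.stack(xs)
```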
Food for thought: this is a bit weird because we expect the encoder and the decoder to be inverses of one another, but they're not (why?). Unrolling the model this way, by "reencoding at every step," also results in poor performance, but at least w/o crossing behavior. 10/
1
0
2
We can form a loop by going from (z) to (x) at every unrolling step, and then back to (z). We call this "reencoding," achieved by applying the decoder and then the encoder to (z): ((ϕ∘ψ)(z)). 9/
1
0
3
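In code this composition is a one-liner; a sketch with hypothetical `encoder` (ϕ) and `decoder` (ψ) functions:

```python
def reencode(z, encoder, decoder):
    """The reencoding operator (ϕ∘ψ)(z): decode the latent to
    observation space, then immediately encode it again."""
    return encoder(decoder(z))
```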
There are 2 reasons for this. R1. We are modeling a closed system with an open one. The original DS has the form (x' = f(x)), which forms a feedback loop; that "loop" is missing here. R2. The mapping from (z) to (x), i.e. the decoder, is non-injective, since (n > d)! 8/
1
0
2
A simple observation from the above plot is that the trajectory lines *cross*, and this violates the first principles of an autonomous dynamical system. We know that (z) trajectories are faithful and don't cross. Why, all of a sudden, do we get this behavior in (x) space? 7/
1
0
3
Here we train the model on the Duffing Oscillator system and look at the phase plots generated by unrolling the model using the above method. Well, things seem a bit off here. 6/
1
0
2
Cool. Now that we have a trained model, we should be able to take an initial condition (x) as input, encode it to get the first latent (z), keep hitting (z) with (K) to get future (z)'s, and then decode everything back to (x). Let's try that out on a few dynamical systems. 5/
1
0
3
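A minimal sketch of this naive unrolling, again with hypothetical `encoder`/`decoder` functions and Koopman matrix `K` standing in for the learned model:

```python
import numpy as np

def unroll_naive(x0, encoder, decoder, K, num_steps):
    """Encode the initial condition once, apply K repeatedly in latent
    space, and decode every intermediate latent back to observations."""
    z = encoder(x0)                 # first latent (z)
    xs = [x0]
    for _ in range(num_steps):
        z = K @ z                   # latent dynamics: z' = Kz
        xs.append(decoder(z))       # decode back to (x)
    return np.stack(xs)
```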