Mahan Fathi

@MahanFathi

Followers
951
Following
208
Media
18
Statuses
71

llm research @nvidia👁️; ex @googledeepmind, @google🧠 & @mila_quebec.

Toronto, Ontario
Joined June 2011
@MahanFathi
Mahan Fathi
12 days
We're looking for Summer Interns to join the Post-Training Team at @NVIDIA! DM me with your updated resume and three concise bullets detailing your most relevant experience — e.g. publications, repos, blogs, etc. RT please to help us find top talent.
13
35
460
@ShashwatGoel7
Shashwat Goel
6 months
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below🧵👇
33
125
874
@kuchaev
Oleksii Kuchaiev
6 months
NeMo RL is now open source! It replaces NeMo-Aligner and is the toolkit we use to post train next generations of our models. Give it a try
github.com
Scalable toolkit for efficient model reinforcement - NVIDIA-NeMo/RL
5
65
396
@kuchaev
Oleksii Kuchaiev
7 months
Llama-Nemotron-v1 technical report is now available on arxiv https://t.co/OwFdIZnYlH
3
65
348
@RGoroshin
Ross Goroshin
1 year
The talk I gave @ Mila on learning linearized representations of dynamical systems (Koopman representations) is on YouTube. The work was mainly carried out by @MahanFathi in collaboration with @pierrelux 's lab, and was presented at ICLR 2024. https://t.co/EPlTCIQj5O
0
3
21
@g_lajoie_
Guillaume Lajoie
1 year
In-context learning (ICL) is one of the most exciting parts of the LLM boom. Sequence models (not just LLMs) implement on-the-fly models conditioned on inputs w/o weight updates! Q: are in-context models better than «in-weights» ones? A: sometimes ICL is better than standard opt.
@EricElmoznino
Eric Elmoznino
1 year
Introducing our new paper explaining in-context learning through the lens of Occam’s razor, giving a normative account of next-token prediction objectives. This was with @Tom__Marty @tejaskasetty @le0gagn0n @sarthmit @MahanFathi @dhanya_sridhar @g_lajoie_
0
6
21
@EricElmoznino
Eric Elmoznino
1 year
Introducing our new paper explaining in-context learning through the lens of Occam’s razor, giving a normative account of next-token prediction objectives. This was with @Tom__Marty @tejaskasetty @le0gagn0n @sarthmit @MahanFathi @dhanya_sridhar @g_lajoie_
arxiv.org
A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in...
3
24
103
@MahanFathi
Mahan Fathi
1 year
life update: thrilled to announce that i’ll be joining @nvidia as a research scientist on the alignment team. grateful for the support from mentors and peers. this is a dream come true for both the researcher and the gamer in me!
33
4
410
@RGoroshin
Ross Goroshin
2 years
Last week, I gave a talk at @Mila_Quebec. The talk should be of interest to anyone working on predictive models, particularly in latent space. In collab. with @MahanFathi @ClementGehring @J_Pilault @davidkanaa @pierrelux. See you at @iclr_conf in 🇦🇹! https://t.co/vFBtHDzNju
drive.google.com
0
5
18
@pierrelux
Pierre-Luc Bacon
2 years
Congrats to Mahan, who is finishing his Master's thesis in style with this second paper.
@MahanFathi
Mahan Fathi
2 years
Course Correcting Koopman Representations Accepted at #ICLR2024! We identify problems with unrolling in imagination and propose an unconventional, simple, yet effective solution: periodically "𝒓𝒆𝒆𝒏𝒄𝒐𝒅𝒊𝒏𝒈" the latent. 📄 https://t.co/ULNzqAV3bB @GoogleDeepMind 1/🧵
0
3
26
@MahanFathi
Mahan Fathi
2 years
This was joint work between @GoogleDeepMind and @Mila_Quebec. Many thanks to my supervisors @RGoroshin and @pierrelux for their constant support and guidance throughout the project. Also props to @ClementGehring, @J_Pilault and @davidkanaa. See you in Vienna! ❤️ 14/14
1
0
3
@MahanFathi
Mahan Fathi
2 years
We have more theory and experiments in the paper, including higher-dim systems like MuJoCo environments (with control inputs!). Periodic reencoding always leads to (big) improvements, only at the cost of introducing one inference-time hyperparam, the reencoding period. 13/
1
0
3
@MahanFathi
Mahan Fathi
2 years
This method produces stable, accurate, long-range future state predictions while being fairly robust to the reencoding period, i.e. the number of steps taken in latent space before reencoding happens.
`reencode @ 0` -> no reencoding
`reencode @ 1` -> every-step reencoding
12/
1
0
2
@MahanFathi
Mahan Fathi
2 years
So far we have found out that 1) reencoding is necessary, and 2) it introduces its own error. We have discovered an effective tool, albeit an imperfect one. So let's use it in moderation. Enter "Periodic Reencoding"! Here we reencode the representations only every so often (every k steps). 11/
1
0
2
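Not the paper's code, just a minimal NumPy sketch of the idea, with toy stand-ins for the learned encoder (ϕ), latent operator (K), and decoder (ψ); all names, shapes, and parameterizations here are illustrative assumptions. `period=0` and `period=1` recover the `reencode @ 0` / `reencode @ 1` cases from the tweet above.

```python
import numpy as np

# Toy stand-ins for the learned maps (assumptions, not the trained models):
# encoder phi: R^d -> R^n, latent operator K: R^n -> R^n, decoder psi: R^n -> R^d.
d, n = 2, 8
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(n, d))
K = 0.99 * np.eye(n)                       # stable linear latent dynamics
W_dec = rng.normal(size=(d, n)) / n

def phi(x): return np.tanh(W_enc @ x)      # encoder
def psi(z): return W_dec @ z               # decoder

def unroll(x0, steps, period=0):
    """Predict future states, reencoding the latent every `period` steps.

    period=0 -> no reencoding (pure latent unrolling)
    period=1 -> reencoding at every step
    """
    z = phi(x0)
    xs = []
    for t in range(1, steps + 1):
        z = K @ z                          # advance in latent space
        if period and t % period == 0:
            z = phi(psi(z))                # periodic reencoding
        xs.append(psi(z))
    return np.stack(xs)

print(unroll(np.array([0.5, 0.0]), steps=50, period=10).shape)  # (50, 2)
```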
@MahanFathi
Mahan Fathi
2 years
Food for thought: this is a bit weird, because we expect the encoder and the decoder to be inverses of one another, but they're not (why?). Unrolling the model this way, by "reencoding at every step," also results in poor performance, but at least w/o the crossing behavior. 10/
1
0
2
@MahanFathi
Mahan Fathi
2 years
We can form a loop by going from (z) to (x) at every unrolling step, and then back to (z). We call this "reencoding," achieved by applying the decoder and then the encoder to (z): (ϕ◦ψ(z)). 9/
1
0
3
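As a hedged sketch (ϕ, ψ, and their toy parameterizations below are my own stand-ins, not the trained models), reencoding is a single composition; note that it is not the identity map, which is exactly the oddity flagged in 10/ above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 8
W_enc = rng.normal(size=(n, d))
W_dec = rng.normal(size=(d, n)) / n

def phi(x): return np.tanh(W_enc @ x)   # encoder: R^d -> R^n
def psi(z): return W_dec @ z            # decoder: R^n -> R^d

def reencode(z):
    # "Reencoding": decode the latent down to state space, then encode it back.
    return phi(psi(z))

z = phi(np.array([0.5, 0.0]))
print(np.allclose(reencode(z), z))      # False: phi ∘ psi is not the identity
```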
@MahanFathi
Mahan Fathi
2 years
There are 2 reasons for this.
R1. We are modeling a closed system with an open one. The original DS has the form (x' = f(x)), which forms a feedback loop. That "loop" is missing here.
R2. The mapping from (z) to (x), i.e. the decoder, is non-injective, since (n > d)!
8/
1
0
2
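To make R2 concrete: for an illustrative *linear* decoder (my simplification; the actual decoder need not be linear), n > d guarantees a nontrivial null space, so distinct latents decode to the same state.

```python
import numpy as np

d, n = 2, 8                                 # n > d, as in the tweet
rng = np.random.default_rng(2)
W_dec = rng.normal(size=(d, n))             # linear decoder: psi(z) = W_dec @ z

# The SVD exposes the decoder's null space: right-singular vectors beyond the
# first d rows of Vt map to zero, so adding one changes z but not psi(z).
_, _, Vt = np.linalg.svd(W_dec)
z1 = rng.normal(size=n)
z2 = z1 + Vt[-1]
print(np.allclose(W_dec @ z1, W_dec @ z2))  # True: psi(z1) == psi(z2), yet z1 != z2
```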
@MahanFathi
Mahan Fathi
2 years
A simple observation from the above plot is that the trajectory lines *cross*, and this violates the first principles of an autonomous dynamical system. We know that (z) trajectories are faithful and don't cross. Why do we all of a sudden get this behavior in (x) space? 7/
1
0
3
@MahanFathi
Mahan Fathi
2 years
Here we train the model on the Duffing Oscillator system and look at the phase plots generated by unrolling the model using the above method. Well, things seem a bit off here. 6/
1
0
2
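For reference, the unforced Duffing oscillator is a 2-D nonlinear system; here is a quick sketch for generating phase-space trajectories like the ones plotted (parameter values and integrator are my own illustrative choices, not necessarily the paper's setup).

```python
import numpy as np

def duffing(s, alpha=-1.0, beta=1.0, delta=0.0):
    """Unforced Duffing oscillator: x'' = -delta*x' - alpha*x - beta*x**3."""
    x, v = s
    return np.array([v, -delta * v - alpha * x - beta * x ** 3])

def simulate(x0, dt=0.01, steps=2000):
    """Classic RK4 integration, returning the phase-space trajectory."""
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        s = traj[-1]
        k1 = duffing(s)
        k2 = duffing(s + 0.5 * dt * k1)
        k3 = duffing(s + 0.5 * dt * k2)
        k4 = duffing(s + dt * k3)
        traj.append(s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.stack(traj)

traj = simulate([0.5, 0.0])
print(traj.shape)  # (2001, 2): (x, x') pairs tracing an orbit in one well
```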
@MahanFathi
Mahan Fathi
2 years
Cool. Now that we have a trained model, we should be able to take an initial condition as input (x), encode it to get the first latent (z), keep hitting (z) with (K) to get future (z)'s, and then decode everything back to (x). Let's try that out on a few dynamical systems. 5/
1
0
3
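In code, the recipe in 5/ is just: encode once, iterate K, decode at every step. This is the `period=0` special case of the periodic-reencoding sketch further up (placeholder maps again, not the trained models).

```python
import numpy as np

# Placeholder learned maps (illustrative shapes, not the trained models).
d, n = 2, 8
rng = np.random.default_rng(3)
W_enc = rng.normal(size=(n, d))
K = 0.99 * np.eye(n)
W_dec = rng.normal(size=(d, n)) / n

x0 = np.array([0.5, 0.0])
z = np.tanh(W_enc @ x0)        # encode the initial condition once
preds = []
for _ in range(100):
    z = K @ z                  # keep hitting z with K
    preds.append(W_dec @ z)    # decode each latent back to x
print(np.stack(preds).shape)   # (100, 2) predicted trajectory
```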