Skander Moalla

@SkanderMoalla

Followers: 236
Following: 349
Media: 22
Statuses: 114

PhD @ the Caglar Gulcehre Lab for AI Research (CLAIRE) @EPFL. Deep Reinforcement Learning, RLHF, foundation models.

Lausanne, Switzerland
Joined January 2017
@SkanderMoalla
Skander Moalla
19 days
🚀 Big time! We can finally do LLM RL fine-tuning with rewards and leverage offline/off-policy data!
❌ You want rewards, but GRPO only works online?
❌ You want offline, but DPO is limited to preferences?
✅ QRPO can do both!
🧵 Here's how we do it:
Tweet media one
3
36
138
@SkanderMoalla
Skander Moalla
9 days
RT @MiTerekhov: Well, to avoid steganography, let's make sure our multi-agent LLM research workflows are composed of agents with different….
0
2
0
@SkanderMoalla
Skander Moalla
17 days
RT @XiuyingWei966: If you’re interested in long-context efficiency, don’t miss our recent paper RAT—a joint effort with @anunay_yadav, Razv….
0
3
0
@SkanderMoalla
Skander Moalla
17 days
RT @MiTerekhov: AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on….
0
20
0
@SkanderMoalla
Skander Moalla
18 days
RT @caglarml: I am proud that our latest work on a novel RL method for foundation models/LLMs is finally out! 1️⃣ Why does QRPO matter? Al….
0
6
0
@SkanderMoalla
Skander Moalla
18 days
RT @matrs01: I couldn’t be prouder to share this. 🎉 Our work on Quantile Reward Policy Optimization (QRPO) for LLM RL‑finetuning bridged de….
0
2
0
@SkanderMoalla
Skander Moalla
19 days
I’m really proud of this work! It’s been an amazing collaboration with @matrs01 and @caglarml!
📰 Paper: Hidden gems and open questions in the 30+ page appendix 💎
🧑‍💻 Code: Star to show interest ⭐
🌐 Blog:
claire-labo.github.io
Alignment with Pointwise Regression and Exact Partition Functions
0
1
4
@SkanderMoalla
Skander Moalla
19 days
QRPO is a framework. You can shape the optimal policy! 🎛️
We derive a framework around QRPO for using transformations on top of the quantile reward. Each transformation reshapes the reward distribution and affects the properties of the optimal policy, while having
Tweet media one
Tweet media two
1
0
3
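A minimal sketch of the transformation idea in the tweet above, written with the standard KL-regularized fine-tuning closed form; the symbols (q for the quantile reward, \varphi for a transformation, \beta for the KL coefficient) are my shorthand, not necessarily the paper's notation.

\begin{align*}
  q(y \mid x) &= \Pr_{y' \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\big[\, r(x, y') \le r(x, y) \,\big]
    && \text{(quantile reward; uniform on $[0,1]$ under $\pi_{\mathrm{ref}}$)} \\
  \pi^{*}_{\varphi}(y \mid x) &\propto \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\big( \varphi(q(y \mid x)) / \beta \big)
    && \text{(optimal policy for a transformation $\varphi$ of the quantile)}
\end{align*}

Different choices of \varphi (identity, a power, a log, etc., listed here only as illustrations) change how sharply the optimal policy concentrates on high-quantile responses, which is presumably the "shape the optimal policy" knob referred to above.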
@SkanderMoalla
Skander Moalla
19 days
Is QRPO still subject to the "chosen probabilities decreasing" problem?
Our understanding of the KL-regularized closed-form solution gives insights into the "DPO chosen probabilities decreasing" problem! 🤔
For QRPO, this is not a mystery anymore; we know exactly where the
Tweet media one
Tweet media two
1
0
3
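For context on the "KL-regularized closed-form solution" mentioned above, this is the standard identity (as in the DPO derivation), rearranged so the reward appears as a log-probability ratio; the notation is mine.

\begin{align*}
  \pi^{*}(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, e^{\,r(x, y)/\beta},
    \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, e^{\,r(x, y)/\beta} \\
  r(x, y) &= \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
\end{align*}

One possible reading of the insight hinted at here: the closed form puts \pi^{*}(y \mid x) below \pi_{\mathrm{ref}}(y \mid x) exactly when r(x, y) < \beta \log Z(x), so a chosen response's probability decreasing is not by itself a pathology; the paper's exact treatment may differ.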
@SkanderMoalla
Skander Moalla
19 days
💬 The reward model we use has been trained to be robust to length bias, and we see that this robustness is preserved in QRPO and REBEL, which use the rewards directly. But when the reward is compressed to preferences for DPO and SimPO, we see the typical length-bias trend, despite the reduction in mean length.
Tweet media one
1
0
3
@SkanderMoalla
Skander Moalla
19 days
🥇 QRPO achieves top performance in chat and coding compared to DPO, REBEL, and SimPO, which each capture a different way to learn from the reward signal (preferences, reward differences, length normalization).
Tweet media one
Tweet media two
Tweet media three
1
0
4
@SkanderMoalla
Skander Moalla
19 days
Obviously, nothing comes for free, but we give you a great deal! 🤝
* QRPO does not need many reference rewards to estimate quantiles: for high-quality offline datasets, 1-3 are enough!
* And you can scale this number for off-policy data generated from the reference model! 📈
Tweet media one
1
0
3
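As a minimal illustration of the few-reference-rewards point above, here is a hedged Python sketch of estimating a response's quantile reward from a handful of rewards sampled from the reference model for the same prompt; the function name and the smoothed mid-rank estimator are illustrative assumptions, not the QRPO implementation.

import numpy as np

def estimate_quantile_reward(response_reward: float, reference_rewards: list[float]) -> float:
    """Empirical quantile of `response_reward` among reference-policy rewards (sketch only).

    A smoothed mid-rank rule keeps the estimate strictly inside (0, 1) even with only
    1-3 reference rewards; the estimator actually used by QRPO may differ.
    """
    ref = np.asarray(reference_rewards, dtype=float)
    below = np.sum(ref < response_reward)   # reference rewards strictly below the response's
    ties = np.sum(ref == response_reward)   # ties contribute half weight (mid-rank)
    return float((below + 0.5 * ties + 0.5) / (len(ref) + 1))

# Example: a candidate response scored 2.1 against three reference completions.
print(estimate_quantile_reward(2.1, [1.4, 2.5, 0.9]))  # ~0.62 with this estimator

Sampling more reference completions (the off-policy case in the tweet) simply tightens this empirical estimate.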
@SkanderMoalla
Skander Moalla
19 days
We tackle the infamous "the partition function is known to be intractable" problem 🧐
This is the problem that limits DPO-like methods to pairwise data. We solve it thanks to 3 insights! 💡
1️⃣ The "infinite sum over all possible LLM generations" argument is a myth. We
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
1
7
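A sketch of why the quantile reward can make the partition function exact, under the assumption (my reading of the thread and the blog title above) that the optimal policy is defined with the quantile reward q, which is Uniform(0,1) under the reference policy when the reward distribution is continuous; notation is mine.

\begin{align*}
  Z(x) = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\big[\, e^{\,q(y \mid x)/\beta} \,\big]
       = \int_{0}^{1} e^{u/\beta}\, du
       = \beta \big( e^{1/\beta} - 1 \big)
\end{align*}

The "infinite sum over all possible LLM generations" collapses to a one-dimensional integral that no longer depends on the prompt and has a closed form, which, as I read the thread, is what lifts the pairwise-only restriction of DPO-style objectives.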
@SkanderMoalla
Skander Moalla
20 days
RT @XiuyingWei966: Curious about making Transformers faster on long sequences without compromising accuracy? āš”ļøšŸ§  Meet RAT—an intermediate d….
0
9
0
@SkanderMoalla
Skander Moalla
2 months
RT @abeirami: As we go through a lot of excitement about RL recently with lots of cool work/results, here is a reminder that RL with a reve….
0
52
0
@SkanderMoalla
Skander Moalla
3 months
RT @manuelmlmadeira: Unfortunately I’m not attending @iclr_conf, but excited to share our workshop papers! šŸ“” Graph Discrete Diffusion: A….
0
5
0
@SkanderMoalla
Skander Moalla
3 months
RT @AnjaSurina: Excited to share our latest work on EvoTune, a novel method integrating LLM-guided evolutionary search and reinforcement le….
0
31
0
@SkanderMoalla
Skander Moalla
5 months
1
0
1
@SkanderMoalla
Skander Moalla
5 months
A dream come true! I presented "No Representation, No Trust" on my favorite RL podcast, @TalkRLPodcast! Make sure to check it out to learn why training with PPO for too long makes your agent collapse!
@TalkRLPodcast
TalkRL Podcast
5 months
E63: NeurIPS 2024 - Posters and Hallways 1.
@JiahengHu1 of @UTCompSci on Unsupervised Skill Discovery for HRL.
@SkanderMoalla of @EPFL: Representation and Trust in PPO.
Adil Zouitine of IRT Saint Exupery/@Hugging Face: Time-Constrained Robust MDPs.
@Shoumo_ of @hplabs:
Tweet media one
1
3
19
@SkanderMoalla
Skander Moalla
5 months
RT @jdeschena: Just taking advantage of the wave of excitement around diffusion LLMs to announce that our acceleration method for diffusion….
0
8
0