
Gabriel P. Andrade
@gab_p_andrade
Followers: 250 · Following: 84 · Media: 4 · Statuses: 38
Researcher @GensynAI. Working on multi-agent RL, game theory, alg econ, and decentralized learning.
Joined April 2025
I've always been a sucker for reductions from one learning setting to another. Just feels elegant to know insights in one will trickle over to the other "for free"
In this note w/ @beenwrekt we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x: max_θ 𝔼ₓ h(Prob(correct ∣ x; θ)) for certain functions h. Weirdly, h is arcsin(√t) in GRPO.
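A minimal LaTeX rendering of the objective from the note above; the notation Prob(correct | x; θ) is taken from the tweet, and everything else (including the compact form of h for GRPO) just restates what the tweet claims:

% Objective family described in the note: maximize the average
% h-transformed probability of answering a prompt x correctly.
\[
  \max_{\theta} \; \mathbb{E}_{x}\Big[\, h\big(\Pr(\text{correct} \mid x;\, \theta)\big) \,\Big]
\]
% Different popular methods correspond to different choices of h;
% per the note, the transformation recovered for GRPO is
\[
  h_{\mathrm{GRPO}}(t) \;=\; \arcsin\!\big(\sqrt{t}\big).
\]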
Simulating user–AI conversations helps us understand how LMs work in multi-turn settings. Prompting LMs like GPT-4o to simulate users is common, but their assistant nature makes it hard to replicate user behavior. We introduce User LMs - trained to be users, not assistants.
🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data? Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡
> RL on existing datasets saturates very quickly
> Reasoning over
New paper we're excited to get online! Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking. A totally new framework based on ~backtracking~ for using process verifiers to guide inference, w/ connections to approximate counting/sampling in theoretical CS.
For more details (including how to train an RNN-based policy using PPO inside the "imagination" of the learned CWM), please see our paper: https://t.co/L9N7CeAB9s Joint work with @WLehrach, Daniel Hennes, @lazarox8, @heroesneverdie, Carter Wendelken, Zun Li, @antoine_dedieu,
🌟 Excited to share that our paper, “From Self-Check to Consensus: Bayesian Strategic Decoding in Large Language Models”, has been accepted by #NeurIPS2025! Huge thanks to my coauthors @BernhardKainz1 and Weitong Zhang on this wonderful work!
Multi-agent AI is a $50B lie. 99% of "multi-agent" systems are just single agents with fancy marketing. I just read the paper that exposes what real multi-agent intelligence actually looks like. Most people think multi-agent AI is just "multiple ChatGPTs in a room." That's
I was surprised by how many didn't know that (1) per-token MLE is whole-sequence MLE, and (2) PG at the token level is the same as PG at the sequence level (optimizing one big combinatorial action). The story is different if you introduce a fitted critic/Q-values or intermediate resets.
Most RL for LLMs involves only 1 step of RL. It’s a contextual bandit problem and there’s no covariate shift because the state (question, instruction) is given. This has many implications, eg DAgger becomes SFT, and it is trivial to design Expectation Maximisation (EM) maximum
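To make the two identities above concrete, here is a minimal LaTeX sketch; the notation is assumed here rather than taken from the thread (x a prompt, y = (y_1, …, y_T) a sampled response, R(x, y) a terminal sequence-level reward):

% (1) Per-token MLE is whole-sequence MLE: the sequence log-likelihood
% factorizes into a sum of per-token log-likelihoods.
\[
  \log \pi_\theta(y \mid x) \;=\; \sum_{t=1}^{T} \log \pi_\theta\big(y_t \mid x, y_{<t}\big)
\]
% (2) With only a terminal reward and no fitted critic/Q-values or
% intermediate resets, the token-level policy gradient coincides with
% the sequence-level one (the whole response is one combinatorial action):
\[
  \nabla_\theta\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big]
  \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \Big]
  \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[ R(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(y_t \mid x, y_{<t}\big) \Big]
\]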
Cool work laying rigorous foundations with practical insights for building well-aligned AI systems: Should we trust that individual orgs will produce well-aligned AI rather than self-serving AI? No. Let them compete, don't assume their altruism, and design mechanisms accordingly.
Aligning an AI with human preferences might be hard. But there is more than one AI out there, and users can choose which to use. Can we get the benefits of a fully aligned AI without solving the alignment problem? In a new paper we study a setting in which the answer is yes.
🚀 Excited to share our new survey paper on RL for Large Reasoning Models (LRMs)! Since early this year, our team has released several RL+LLMs works (PRIME, TTRL, SimpleVLA, MARTI, SSRL, HPT), covering dense rewards, self-evolution, embodied AI, multi-agent, tool learning, and
Thanks to all my wonderful collaborators and the awesome @gensynai community members who have contributed to our testnet! Your continued support makes it possible for us to iterate and experiment at unprecedented scales!
In our open source demo, thousands of @gensynai community members trained a range of models on a range of devices. After approximately 175 training rounds, models in the swarm significantly outperform models trained in silo. Below, red == adjusted p-value > 0.05.
In controlled experiments, models trained with SAPO show ~94% improvement in cumulative reward over models trained in silo. We compared models trained with batches of X local vs Y swarm samples; there was a clear trend.
🐸 SAPO is a meta-algorithm that wraps around your preferred policy gradient algorithm → Generate rollouts on a local batch of data, share with + sample from the swarm, update your policy, repeat.
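A rough Python-style sketch of the loop described above — generate local rollouts, share with and sample from the swarm, update, repeat. All names here (generate, share, sample, update, n_local, n_swarm) are hypothetical placeholders, not Gensyn's actual API, and the inner update can be any preferred policy-gradient method:

# Minimal sketch of a SAPO-style round; object/method names are placeholders.
def sapo_round(policy, local_data, swarm, n_local, n_swarm):
    # 1. Generate rollouts on a local batch of data.
    local_batch = local_data.sample(n_local)
    local_rollouts = [policy.generate(prompt) for prompt in local_batch]

    # 2. Share local rollouts with the swarm, then sample rollouts
    #    shared by other (heterogeneous) nodes.
    swarm.share(local_rollouts)
    swarm_rollouts = swarm.sample(n_swarm)

    # 3. Update the policy on the combined experience using the
    #    wrapped policy-gradient algorithm.
    policy.update(local_rollouts + swarm_rollouts)
    return policy


def train(policy, local_data, swarm, rounds, n_local, n_swarm):
    # 4. Repeat for the desired number of rounds.
    for _ in range(rounds):
        policy = sapo_round(policy, local_data, swarm, n_local, n_swarm)
    return policy

The n_local / n_swarm split mirrors the "X local vs Y swarm samples" comparison mentioned later in the thread.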
Problem: Scaling RL is non-trivial — it’s expensive, latency-sensitive, memory-intensive, and failure-sensitive
🐸 SAPO sidesteps these hurdles → fully decentralized, asynchronous, and designed for heterogeneous devices + heterogeneous models
🐸 SAPO is highly customizable →
TLDR
• Heterogeneous devices & heterogeneous models collectively train
• Models generate rollouts locally and share, then sample rollouts shared by others
• With SAPO, models can train faster with less compute per node
Is the whole greater than the sum of its parts? In decentralized RL post-training the answer is YES. 🐸🐸🐸 Swarm sAmpling Policy Optimization (SAPO) 🐸🐸🐸