
Gabriel P. Andrade
@gab_p_andrade
Followers: 250 · Following: 84 · Media: 4 · Statuses: 38
Researcher @GensynAI. Working on multi-agent RL, game theory, alg econ, and decentralized learning.
Joined April 2025
I've always been a sucker for reductions from one learning setting to another. Just feels elegant to know insights in one will trickle over to the other "for free"
In this note w/ @beenwrekt we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x: max_θ 𝔼ₓ h(Prob(correct ∣ x; θ)) for certain functions h. Weirdly, h is arcsin(√t) in GRPO.
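A minimal LaTeX rendering of the objective from the note above; the notation Prob(correct | x; θ) is taken from the tweet, and everything else (including the compact form of h for GRPO) just restates what the tweet claims:

% Objective family described in the note: maximize the average
% h-transformed probability of answering a prompt x correctly.
\[
  \max_{\theta} \; \mathbb{E}_{x}\Big[\, h\big(\Pr(\text{correct} \mid x;\, \theta)\big) \,\Big]
\]
% Different popular methods correspond to different choices of h;
% per the note, the transformation recovered for GRPO is
\[
  h_{\mathrm{GRPO}}(t) \;=\; \arcsin\!\big(\sqrt{t}\big).
\]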
Simulating user–AI conversations helps us understand how LMs work in multi-turn settings. Prompting LMs like GPT-4o to simulate users is common, but their assistant nature makes it hard to replicate user behavior. We introduce User LMs - trained to be users, not assistants.
🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data? Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡
> RL on existing datasets saturates very quickly
> Reasoning over
New paper we're excited to get online! Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking. A totally new framework based on ~backtracking~ for using process verifiers to guide inference, w/ connections to approximate counting/sampling in theoretical CS.
For more details (including how to train an RNN-based policy using PPO inside the "imagination" of the learned CWM), please see our paper: https://t.co/L9N7CeAB9s Joint work with @WLehrach, Daniel Hennes, @lazarox8, @heroesneverdie, Carter Wendelken, Zun Li, @antoine_dedieu,
🌟 Excited to share that our paper, “From Self-Check to Consensus: Bayesian Strategic Decoding in Large Language Models”, has been accepted by #NeurIPS2025! Huge thanks to my coauthors @BernhardKainz1 and Weitong Zhang on this wonderful work!
Multi-agent AI is a $50B lie. 99% of "multi-agent" systems are just single agents with fancy marketing. I just read the paper that exposes what real multi-agent intelligence actually looks like. Most people think multi-agent AI is just "multiple ChatGPTs in a room." That's
I was surprised by how many didn't know that (1) per-token MLE is whole-sequence MLE, and (2) PG at the token level is the same as PG at the sequence level (optimizing one big combinatorial action). The story is different if you introduce a fitted critic/Q-values or intermediate resets.
Most RL for LLMs involves only 1 step of RL. It’s a contextual bandit problem and there’s no covariate shift because the state (question, instruction) is given. This has many implications, eg DAgger becomes SFT, and it is trivial to design Expectation Maximisation (EM) maximum
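To make the two identities above concrete, here is a minimal LaTeX sketch; the notation is assumed here rather than taken from the thread (x a prompt, y = (y_1, …, y_T) a sampled response, R(x, y) a terminal sequence-level reward):

% (1) Per-token MLE is whole-sequence MLE: the sequence log-likelihood
% factorizes into a sum of per-token log-likelihoods.
\[
  \log \pi_\theta(y \mid x) \;=\; \sum_{t=1}^{T} \log \pi_\theta\big(y_t \mid x, y_{<t}\big)
\]
% (2) With only a terminal reward and no fitted critic/Q-values or
% intermediate resets, the token-level policy gradient coincides with
% the sequence-level one (the whole response is one combinatorial action):
\[
  \nabla_\theta\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big]
  \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \Big]
  \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[ R(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(y_t \mid x, y_{<t}\big) \Big]
\]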
Cool work laying rigorous foundations with practical insights for building well-aligned AI systems: Should we trust that individual orgs will produce well-aligned AI rather than self-serving AI? No. Let them compete, don't assume their altruism, and design mechanisms accordingly.
Aligning an AI with human preferences might be hard. But there is more than one AI out there, and users can choose which to use. Can we get the benefits of a fully aligned AI without solving the alignment problem? In a new paper we study a setting in which the answer is yes.
🚀 Excited to share our new survey paper on RL for Large Reasoning Models (LRMs)! Since early this year, our team has released several RL+LLMs works (PRIME, TTRL, SimpleVLA, MARTI, SSRL, HPT), covering dense rewards, self-evolution, embodied AI, multi-agent, tool learning, and
Thanks to all my wonderful collaborators and the awesome @gensynai community members who have contributed to our testnet! Your continued support makes it possible for us to iterate and experiment at unprecedented scales!
In our open source demo, thousands of @gensynai community members trained a range of models on a range of devices. After approximately 175 training rounds, models in the swarm significantly outperform models trained in silo. Below, red == adjusted p-value > 0.05.
In controlled experiments, models trained with SAPO show ~94% improvement in cumulative reward over models trained in silo. We compared models trained with batches of X local vs Y swarm samples; there was a clear trend.
🐸 SAPO is a meta-algorithm that wraps around your preferred policy gradient algorithm → Generate rollouts on a local batch of data, share with + sample from the swarm, update your policy, repeat.
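A rough Python-style sketch of the loop described above — generate local rollouts, share with and sample from the swarm, update, repeat. All names here (generate, share, sample, update, n_local, n_swarm) are hypothetical placeholders, not Gensyn's actual API, and the inner update can be any preferred policy-gradient method:

# Minimal sketch of a SAPO-style round; object/method names are placeholders.
def sapo_round(policy, local_data, swarm, n_local, n_swarm):
    # 1. Generate rollouts on a local batch of data.
    local_batch = local_data.sample(n_local)
    local_rollouts = [policy.generate(prompt) for prompt in local_batch]

    # 2. Share local rollouts with the swarm, then sample rollouts
    #    shared by other (heterogeneous) nodes.
    swarm.share(local_rollouts)
    swarm_rollouts = swarm.sample(n_swarm)

    # 3. Update the policy on the combined experience using the
    #    wrapped policy-gradient algorithm.
    policy.update(local_rollouts + swarm_rollouts)
    return policy


def train(policy, local_data, swarm, rounds, n_local, n_swarm):
    # 4. Repeat for the desired number of rounds.
    for _ in range(rounds):
        policy = sapo_round(policy, local_data, swarm, n_local, n_swarm)
    return policy

The n_local / n_swarm split mirrors the "X local vs Y swarm samples" comparison mentioned later in the thread.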
Problem: Scaling RL is non-trivial — it’s expensive, latency-sensitive, memory-intensive, and failure-sensitive
🐸 SAPO sidesteps these hurdles → fully decentralized, asynchronous, and designed for heterogeneous devices + heterogeneous models
🐸 SAPO is highly customizable →
TLDR
• Heterogeneous devices & heterogeneous models collectively train
• Models generate rollouts locally and share, then sample rollouts shared by others
• With SAPO, models can train faster with less compute per node
Is the whole greater than the sum of its parts? In decentralized RL post-training the answer is YES. 🐸🐸🐸 Swarm sAmpling Policy Optimization (SAPO) 🐸🐸🐸