Gowtham Ramesh
@gowtham_ramesh1
Followers
417
Following
7K
Media
0
Statuses
165
Applied RS GenAI @AMD | Ex-Student Researcher @GoogleAI NYC, @WisconsinCS, @ai4bharat, @iitmadras
San Jose, CA
Joined October 2016
We've just open-sourced our 1B language model with comprehensive details to make it easy to reproduce. Check out the model card and other details here:
huggingface.co
0
4
30
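(For readers who want to try the release: a minimal, hedged sketch of loading such a checkpoint with transformers. The repo id "org/model-1b" is a placeholder, not the actual model card linked above.)

```python
# Hypothetical repo id; substitute the one from the linked model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("org/model-1b")
model = AutoModelForCausalLM.from_pretrained("org/model-1b")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```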
We present Olmo 3, our next family of fully open, leading language models. This family of 7B and 32B models represents: 1. The best 32B base model. 2. The best 7B Western thinking & instruct models. 3. The first 32B (or larger) fully open reasoning model. This is a big
94
357
2K
We are excited to release AgentEvolver, an open-source, self-evolving agent system from Tongyi Lab. AgentEvolver integrates three synergistic mechanisms (Self-Questioning, Self-Navigating, and Self-Attributing) to systematically address critical bottlenecks in Agent RL
15
120
829
Defeating the Training-Inference Mismatch via FP16. Quick summary: a big problem in RL training of LLMs is that typical policy gradient methods expect the model generating the rollouts and the model being trained to be exactly the same... but when you have a separate inference server
11
32
218
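(A minimal sketch of the mismatch being described, assuming per-token log-probs are available from both the inference server and the trainer; the function and variable names are illustrative, not from the post.)

```python
import torch

def mismatch_stats(logp_infer: torch.Tensor, logp_train: torch.Tensor):
    """Per-token log-probs of the SAME sampled tokens: one tensor from the
    inference server, one from the trainer's forward pass."""
    # Policy-gradient updates implicitly assume pi_train / pi_rollout == 1;
    # backend and precision differences make this ratio drift.
    ratio = (logp_train - logp_infer).exp()
    return ratio.mean().item(), (ratio - 1).abs().max().item()

# The FP16 remedy discussed in the post amounts to running both passes in the
# same float16 numerics (e.g. model.half() on the trainer side) so they agree.
```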
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels. We at @SFResearch build an automated pipeline that converts raw web text into verifiable QA pairs, filtered and verified by LLMs, then use Group Relative Policy Optimization (GRPO) to train
1
13
84
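(For context, a minimal sketch of the group-relative advantage at the heart of GRPO; the rewards are made-up verifier scores for one prompt's group of completions.)

```python
import torch

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])  # illustrative verifier scores
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
# Each completion's advantage is its reward normalized against its own group,
# which is what makes the optimization "group relative".
```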
Several of my team members and I are impacted by this layoff today. Feel free to connect :)
472
278
6K
🚨 Update!! A few more on Rollout-Training Mismatch. 🧵👇 (1/3) In Part I, we discuss token-level and sequence-level policy gradients. While identical for the classic REINFORCE algorithm, they diverge for trust
Failing on large-scale RL with VeRL? ⚠️ Mixing inference backends (vLLM/SGLang) with training backends (FSDP/Megatron) secretly turns your RL into off-policy, even if they share the same weights! 🔗 Blog:
5
35
224
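(A hedged sketch of the token-level vs sequence-level distinction from the thread: the two losses differ only in how per-token terms are averaged, but that difference matters once per-token trust-region ratios enter. Names and normalizations are illustrative.)

```python
import torch

def token_level_loss(logp, adv, mask):
    # mask: 0/1 floats marking valid completion tokens, shape [batch, seq];
    # normalize by the total number of valid tokens in the batch
    return -(logp * adv * mask).sum() / mask.sum()

def sequence_level_loss(logp, adv, mask):
    # normalize each sequence by its own length first, then average sequences
    per_seq = (logp * adv * mask).sum(dim=-1) / mask.sum(dim=-1)
    return -per_seq.mean()
```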
We've always assumed stale and off-policy data hurts RL a lot, but our latest work shows the opposite. 🧠 M2PO (Second-Moment Trust Policy Optimization) reveals that even data stale by 256 model updates can train LLMs as effectively as on-policy RL, unlocking scalable and
m2po.notion.site
Haizhong Zheng, Jiawei Zhao, Beidi Chen
🤔 Can we train RL on LLMs with extremely stale data? 🚀 Our latest study says YES! Stale data can be as informative as on-policy data, unlocking more scalable, efficient asynchronous RL for LLMs. We introduce M2PO, an off-policy RL algorithm that keeps training stable and
3
23
132
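(A loose sketch of the second-moment idea as summarized above, not the paper's exact algorithm: instead of clipping each importance ratio, constrain the second moment of the log-ratios and mask the extreme tokens that dominate it. The threshold is made up.)

```python
import torch

def m2_mask(logp_new: torch.Tensor, logp_old: torch.Tensor,
            max_second_moment: float = 0.04) -> torch.Tensor:
    delta = logp_new - logp_old          # per-token log importance ratio (1D)
    mask = torch.ones_like(delta, dtype=torch.bool)
    # greedily drop the most extreme tokens until the constraint holds
    while mask.any() and (delta[mask] ** 2).mean() > max_second_moment:
        mask[(delta.abs() * mask).argmax()] = False
    return mask                          # tokens kept for the update
```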
We now know that LoRA can match full-parameter RL training (from https://t.co/pGxoMLFIGv and our Tina paper https://t.co/dkXdxV3eNj), but what about DoRA, QLoRA, and more? We are releasing a clean LoRA-for-RL repo to explore them all. https://t.co/AsWWG1kmKt
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
13
70
565
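(If you want to reproduce the baseline setup, a minimal sketch with the peft library; the model id and hyperparameters are placeholders. Newer peft releases also accept use_dora=True in LoraConfig, which is one way to run the DoRA comparison mentioned above.)

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("org/base-model")  # placeholder id
config = LoraConfig(r=16, lora_alpha=32,
                    target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction of weights train
```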
Today, we're announcing a multi-year, multi-generation strategic partnership with @OpenAI that puts AMD compute at the center of the global AI infrastructure buildout.
✅ 6GW of AI infrastructure
✅ Initial 1GW deployment of AMD Instinct MI450 series GPU capacity beginning 2H
287
1K
7K
As we mentioned back in April, AMD is in war mode and has developed a new sense of urgency. With this urgency, their software engineers are working harder and smarter than ever to fix bugs and improve the ROCm user experience, all with the goal of matching the CUDA experience.
34
39
452
🚨 Update!! A few additional findings for Rollout-Training Mismatch: ① Parallelism difference is a huge driver of the gap, with Sequence Parallelism (SP) causing most of the high max mismatch. ② Longer sequences
Failing on large-scale RL with VeRL? ⚠️ Mixing inference backends (vLLM/SGLang) with training backends (FSDP/Megatron) secretly turns your RL into off-policy, even if they share the same weights! 🔗 Blog:
2
28
173
(1/x) Ever had your #LLM-#RL training mysteriously collapse? 📉 You're not alone. We saw #agentic RL runs fail with exploding #gradients, and found the culprit: a fundamental "training-inference mismatch." Our new #blog post demystifies this vicious cycle.
11
52
313
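(One common remedy in this line of work is a truncated importance-sampling correction; a hedged sketch below, with an illustrative cap, not the blog's specific recipe.)

```python
import torch

def tis_weight(logp_train: torch.Tensor, logp_rollout: torch.Tensor,
               cap: float = 2.0) -> torch.Tensor:
    # ratio between the trainer's policy and the rollout policy, truncated so
    # that a large train/inference mismatch cannot blow up the gradient
    return (logp_train - logp_rollout).exp().clamp(max=cap).detach()

# usage sketch: loss = -(tis_weight(lp_t, lp_r) * adv * lp_t).mean()
```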
[LG] APRIL: Active Partial Rollouts in Reinforcement Learning to tame long-tail generation Y Zhou, J Li, Y Su, G Ramesh... [Advanced Micro Devices, Inc. (AMD) & CMU] (2025) https://t.co/gucf2SL6Pb
0
3
8
We trained the whole stack. Pretrain. SFT. RL. Open weights. Open methods. Open science. From tokens to traces. From guesses to grounded. Code World Model is here. 50+ page report. Ckpts. Code. https://t.co/e5dLMMJy3I
(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. https://t.co/BJSUCh2vtg
21
70
650
APRIL: Active Partial Rollouts in Reinforcement Learning to tame long-tail generation "we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency." "Experiments show that APRIL improves rollout throughput by at most 44% across
4
19
140
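(A toy sketch of the idea named in the abstract: rather than letting a batch wait on its longest generations, preempt unfinished rollouts at a step budget and resume them in a later batch. All names are illustrative, not the paper's implementation.)

```python
from collections import deque

def rollout_step(pending: deque, batch_size: int, step_budget: int) -> list:
    batch = [pending.popleft() for _ in range(min(batch_size, len(pending)))]
    finished = []
    for seq in batch:
        seq["done"] += min(step_budget, seq["total"] - seq["done"])
        if seq["done"] >= seq["total"]:
            finished.append(seq)   # completed within this step
        else:
            pending.append(seq)    # partial rollout, resumed in a later batch
    return finished

# e.g. pending = deque({"done": 0, "total": n} for n in [50, 60, 4000])
```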
@zephyr_z9 Although AMD is now working pretty well for small-to-medium-sized models
120
188
2K
We're live! 🚀 This is the official account for slime, an open-source, SGLang-native post-training framework for RL scaling. Kicking things off with our first milestone: the v0.1.0 release 🧪 Blog: https://t.co/ORH3J6UTYL Follow us to run RL faster ⚡️
16
8
30
New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in-depth explanation of how LLM inference engines, and vLLM in particular, work! It took me a while to reach this level of understanding of the codebase and then to write up
63
405
3K
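(To go with the post on vLLM internals, a minimal offline-inference example using vLLM's public API; the model id is a placeholder.)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="org/model-7b")  # placeholder model id
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain KV-cache paging in one sentence."], params)
print(outputs[0].outputs[0].text)
```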