Sumanth Hegde
@sumanthrh
Followers
943
Following
4K
Media
67
Statuses
403
Post-training @anyscalecompute. Prev - @UCSanDiego, @C3_AI, @iitmadras. Machine Learning and Systems. Intensity is all you need.
San Francisco
Joined February 2016
Some of our interesting observations from working on multi-turn text2SQL:
- Data-efficient RL works pretty well: we used a fairly typical GRPO setup; just make sure to use "hard-enough" samples and no KL. KL can stabilize learning early on but will always bring down rewards
1/N Introducing SkyRL-SQL, a simple, data-efficient RL pipeline for Text-to-SQL that trains LLMs to interactively probe, refine, and verify SQL queries with a real database. Early Result: trained on just ~600 samples, SkyRL-SQL-7B outperforms GPT-4o, o4-mini, and SFT model
0
5
18
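To make the recipe in the tweet above concrete, here is a minimal, self-contained sketch of group-relative advantages and a clipped policy loss with the KL penalty simply dropped. This is my own illustration, not SkyRL-SQL code; all names are hypothetical.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # group_rewards: (num_prompts, group_size) scalar rewards for each sampled answer.
    # Group-relative baseline: normalize by the per-prompt mean and std.
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + eps)

def grpo_loss(logprobs, old_logprobs, advantages, clip_eps: float = 0.2):
    # logprobs / old_logprobs / advantages: per-token tensors of the same shape
    # (the caller broadcasts the per-sample advantages to token level).
    # Note: no KL(policy || ref) term is added, per the observation above that
    # KL steadies early training but caps the achievable reward.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```

"Hard-enough" samples matter because a prompt where every rollout in the group succeeds (or fails) yields zero advantage and therefore no gradient.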
Wide-EP and prefill/decode disaggregation APIs for vLLM are now available in Ray 2.52. Validated at 2.4k tokens/H200 on Anyscale Runtime, these patterns maximize sparse MoE model inference efficiency, but often require non-trivial orchestration logic. Here's how they
1
15
27
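A toy sketch of the prefill/decode disaggregation pattern using plain Ray actors, to show why the orchestration is non-trivial. This is not the Ray 2.52 or vLLM API referenced above; the worker classes and the KV "handle" are stand-ins.

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class PrefillWorker:
    """Hypothetical prefill worker: processes the prompt and returns an opaque
    handle standing in for the transferred KV cache. In practice you would
    reserve GPUs via num_gpus=... in the decorator."""
    def prefill(self, prompt: str) -> str:
        return f"kv::{abs(hash(prompt))}"

@ray.remote
class DecodeWorker:
    """Hypothetical decode worker: consumes the KV cache and streams tokens."""
    def decode(self, kv_handle: str, max_tokens: int = 16) -> str:
        return f"<generated {max_tokens} tokens from {kv_handle}>"

# Separate pools let prefill (compute-bound) and decode (memory-bandwidth-bound)
# scale independently -- the core idea of prefill/decode disaggregation.
prefill_pool = [PrefillWorker.remote() for _ in range(2)]
decode_pool = [DecodeWorker.remote() for _ in range(4)]

kv = prefill_pool[0].prefill.remote("SELECT * FROM users;")
print(ray.get(decode_pool[0].decode.remote(kv)))
```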
1/n Introducing SkyRL-Agent, a framework for efficient RL agent training.
- 1.55× faster async rollout dispatch
- Lightweight tool + task integration
- Backend-agnostic (SkyRL-train / VeRL / Tinker)
- Used to train SA-SWE-32B, improving Qwen3-32B from 24.4% → 39.4%
5
60
274
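A minimal sketch of asynchronous rollout dispatch in general, assuming nothing about SkyRL-Agent internals: rollouts are awaited concurrently under a concurrency cap instead of stepping all environments in lockstep, which is typically where speedups like the quoted 1.55x come from.

```python
import asyncio
import random

async def rollout(task_id: int) -> dict:
    # Stand-in for an agent episode: tool calls and generation overlap via await,
    # so slow environments do not block fast ones.
    await asyncio.sleep(random.uniform(0.1, 0.5))
    return {"task_id": task_id, "reward": random.random()}

async def dispatch(num_tasks: int, max_concurrency: int = 8) -> list:
    # Keep up to `max_concurrency` rollouts in flight at all times.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(i: int) -> dict:
        async with sem:
            return await rollout(i)

    return await asyncio.gather(*(bounded(i) for i in range(num_tasks)))

print(asyncio.run(dispatch(32)))
```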
On-Policy Distillation is available as an example on SkyRL! The implementation required no library code changes, and we were able to reproduce AIME math reasoning experiments from the @thinkymachines blogpost. Check out our detailed guide to see how! https://t.co/TqDS649oFQ
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other
4
5
27
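A hedged sketch of the core on-policy distillation loss as commonly described: per-token reverse KL between student and teacher, estimated at the tokens the student itself sampled. The function below is illustrative and not taken from the SkyRL example or the Thinking Machines code.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits, sampled_ids):
    # student_logits / teacher_logits: (batch, seq, vocab) scores on the SAME
    # sequence, which the student itself sampled; sampled_ids: (batch, seq).
    s = F.log_softmax(student_logits, dim=-1)
    t = F.log_softmax(teacher_logits, dim=-1)
    s_tok = s.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    t_tok = t.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    # Per-token reverse KL at the sampled token: dense supervision (every token
    # gets a signal, like SFT) on on-policy data (the student's own mistakes, like RL).
    return (s_tok - t_tok).mean()
```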
New Blog: "Disaggregated Inference: 18 Months Later". 18 months in LLM inference feels like a new Moore's Law cycle, but this time not just 2x per year:
- Serving cost ↓ 10-100x
- Throughput ↑ 10x
- Latency ↓ 5x
A big reason? Disaggregated Inference. From DistServe, our
hao-ai-lab.github.io
Eighteen months ago, our lab introduced DistServe with a simple bet: split LLM inference into prefill and decode, and scale them independently on separate compute pools. Today, almost every product...
7
48
175
SkyRL now runs seamlessly with SkyPilot! Let @skypilot_org handle GPU provisioning and cluster setup, so you can focus on RL training with SkyRL.
- Launch distributed RL jobs effortlessly
- Auto-provision GPUs across clouds
- Train your LLM agents at scale
Get started
0
10
25
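A rough sketch of launching a training job with SkyPilot's Python API. The package name, entrypoint, and cluster name below are assumptions for illustration, not SkyRL's documented launch command.

```python
import sky

# Hypothetical setup/run commands; substitute your actual training entrypoint.
task = sky.Task(
    setup="pip install skyrl-train",  # assumed package name
    run="python -m skyrl_train.main --config my_rl_config.yaml",
)
task.set_resources(sky.Resources(accelerators="H100:8"))

# SkyPilot provisions matching GPUs on a configured cloud, then runs the task
# on the newly created cluster.
sky.launch(task, cluster_name="skyrl-train")
```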
SkyRL has day-zero support for OpenEnv!! This initial integration with OpenEnv highlights how easily new environments plug into SkyRL. Train your own LLM agents across containerized environments with simple, Gym-style APIs. Check it out:
Excited to share OpenEnv: frontier-grade RL environments for the open-source community! https://t.co/KVeBMsxohL
- Modular interfaces: a clean Gymnasium-style API (reset(), step(), state()) that plugs into any RL framework
- Built for scale: run environments in containers
0
10
22
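To illustrate the Gym-style surface mentioned above (reset(), step(), state()), here is a toy environment and the loop a trainer needs; the class is invented for illustration and is not an OpenEnv environment.

```python
from dataclasses import dataclass, field

@dataclass
class CountdownEnv:
    """Toy environment exposing reset(), step(), and state()."""
    target: int = 3
    turns: int = field(default=0, init=False)

    def reset(self) -> str:
        self.turns = 0
        return f"Say 'done' after {self.target} turns."

    def step(self, action: str):
        self.turns += 1
        done = action.strip().lower() == "done" and self.turns >= self.target
        reward = 1.0 if done else 0.0
        return self.state(), reward, done

    def state(self) -> str:
        return f"turn={self.turns}"

# A trainer only needs this loop shape, which is what makes new envs easy to plug in.
env = CountdownEnv()
obs, done = env.reset(), False
while not done:
    action = "done"  # stand-in for an LLM agent's reply / tool call
    obs, reward, done = env.step(action)
print(obs, reward)
```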
Engineering for scale and speed. Next-gen RL libraries are redefining how agents learn and interact. Open Agent Summit Speakers:
- @sumanthrh (Anyscale) – "SkyRL: A Modular RL Library for LLM Agents"
- @tyler_griggs_ (UC Berkeley) – "Lessons in Agentic RL Modeling and
0
8
57
Excited to release our new paper: "Barbarians at the Gate: How AI is Upending Systems Research". We show how AI-Driven Research for Systems (ADRS) can rediscover or outperform human-designed algorithms across cloud scheduling, MoE expert load balancing, LLM-SQL optimization,
8
35
153
SkyRL now supports Megatron! Training massive MoE models demands more than just ZeRO-3/FSDP sharding. The Megatron backend for SkyRL unlocks high-throughput training with:
- 5D parallelism (tensor + pipeline + context + expert + data)
- Efficient training for 30B+ MoEs
2
8
21
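A small illustrative config showing how the five parallelism axes compose. The field names and the GPU-count arithmetic are my own simplification, not Megatron's or SkyRL's configuration schema.

```python
from dataclasses import dataclass

@dataclass
class ParallelismConfig:
    """Illustrative 5D parallelism layout (not an actual config schema)."""
    tensor: int = 4     # TP: shard each layer's matmuls across GPUs
    pipeline: int = 2   # PP: split layers into pipeline stages
    context: int = 2    # CP: shard the sequence dimension for long contexts
    expert: int = 8     # EP: shard MoE experts across ranks
    data: int = 4       # DP: replicate the model and shard optimizer state

    def gpus_required(self) -> int:
        # TP x PP x CP x DP ranks hold the dense layers; expert parallelism is
        # typically laid out over the data/context ranks for the MoE layers,
        # so it does not multiply the GPU count on top.
        return self.tensor * self.pipeline * self.context * self.data

cfg = ParallelismConfig()
print(cfg, "->", cfg.gpus_required(), "GPUs")
```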
Training our advisors was too hard, so we tried to train black-box models like GPT-5 instead. Check out our work: Advisor Models, a training framework that adapts frontier models behind an API to your specific environment, users, or tasks using a smaller advisor model (1/n)!
16
43
244
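My reading of the advisor-model setup, as a rough Python sketch: a small trainable advisor writes guidance that is prepended to the frozen frontier model's prompt, and only the advisor is updated from task reward. All callables below (`small_advisor`, `frontier_api`, `task_reward`, `update`) are hypothetical.

```python
def advised_answer(task: str, small_advisor, frontier_api) -> str:
    # The advisor injects environment/user-specific guidance; the frontier model
    # behind the API stays frozen.
    advice = small_advisor(f"Give task-specific advice for: {task}")
    prompt = f"Advice: {advice}\n\nTask: {task}"
    return frontier_api(prompt)

def advisor_training_step(tasks, small_advisor, frontier_api, task_reward, update):
    # RL-style loop: the reward earned by the frontier model's answer is credited
    # to the advice, and only the advisor's parameters are updated.
    rewards = []
    for task in tasks:
        answer = advised_answer(task, small_advisor, frontier_api)
        rewards.append(task_reward(task, answer))
    update(small_advisor, rewards)
```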
when bro suddenly starts submitting PRs with detailed descriptions, correct grammar and casing, and testing steps
164
677
17K
RT @NovaSkyAI: Scaling agentic simulations is hard, so in collaboration with @anyscalecompute we wrote up our experience using Ray for agen…
anyscale.com
Powered by Ray, Anyscale empowers AI builders to run and scale all ML and AI workloads on any cloud and on-prem.
0
1
0
Do you find it challenging to run RL / agent simulations at a large scale (e.g. dealing with docker and remote execution)? Check out our blog post https://t.co/iNPivIzbc2 where we show how to do it with Ray and mini-swe-agent (kudos to @KLieret)
0
7
17
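A minimal sketch of the fan-out pattern the post describes, using only core Ray primitives; the episode body is a placeholder rather than an actual mini-swe-agent invocation.

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def run_agent_episode(task_id: int) -> dict:
    # Placeholder rollout: in the blog's setting this would start a sandboxed
    # container, run an agent (e.g. mini-swe-agent) against a repo task, and
    # return the trajectory plus its outcome.
    return {"task_id": task_id, "resolved": task_id % 2 == 0}

# Ray fans the episodes out across a cluster; each remote task can request its
# own CPUs/GPUs, and failed episodes can simply be resubmitted.
futures = [run_agent_episode.remote(i) for i in range(100)]
results = ray.get(futures)
print(sum(r["resolved"] for r in results), "episodes resolved")
```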
SkyRL x Environments Hub is live!! Train on any of 100+ environments natively on SkyRL today: https://t.co/Aj287ZKOsw This was super fun to work on; the @PrimeIntellect team crushed it, go OSS!
0
0
4
An amazing collaboration with the Biomni team! We introduce Biomni-R0, a reasoning-enabled, multi-task, multi-tool biomedical research agent trained through end-to-end RL. The results are impressive: 2x stronger than the base model and >10% better than GPT-5 and Claude 4 Sonnet!
Thrilled to share a preview of Biomni-R0: we trained the first RL agent end-to-end for biomedical research.
- nearly 2× stronger than its open-source base
- >10% better than frontier closed-source models
- Scalable path to hill-climb to expert-level performance
4
16
61
2. Online quantization: Policy model runs in BF16, while rollouts run in FP8/Int8. FlashRL patches vLLM's weight loading APIs to ensure compatibility. We've ported over these patches for v0.9.2 and further optimized weight syncing! Try it out:
0
0
1
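A toy version of the "train in BF16, roll out quantized" idea: re-quantize the policy weights at each sync and hand them to the rollout engine's (patched) weight loader. The per-tensor INT8 scheme and the `load_into_rollout` callable are illustrative assumptions, not FlashRL's or vLLM's code.

```python
import torch

def sync_policy_to_rollout(policy_state: dict, load_into_rollout, scheme: str = "int8"):
    # Trainer keeps BF16 weights; each sync step re-quantizes them before handing
    # them to the inference engine, so rollouts run in INT8/FP8 while training
    # numerics stay in BF16.
    quantized = {}
    for name, w in policy_state.items():
        if scheme == "int8" and w.dtype == torch.bfloat16:
            scale = w.abs().amax().clamp(min=1e-8) / 127.0  # per-tensor scale
            quantized[name] = (torch.round(w / scale).to(torch.int8), scale)
        else:
            quantized[name] = w
    load_into_rollout(quantized)

# Usage with a dummy "engine" that just lists the synced tensors:
state = {"layer.weight": torch.randn(4, 4, dtype=torch.bfloat16)}
sync_policy_to_rollout(state, load_into_rollout=lambda q: print(list(q)))
```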
We've implemented both ingredients in FlashRL:
1. Truncated Importance Sampling: ensures that the difference in rollout and policy logprobs doesn't hurt performance. This is a simple token-level correction factor for your policy loss. Can help stabilize training even without
1
0
2
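A hedged sketch of token-level truncated importance sampling as described above: weight each token's policy-gradient term by the ratio of trainer to rollout probabilities, truncated at a cap. This is my own minimal formulation, not FlashRL's implementation.

```python
import torch

def tis_policy_loss(policy_logprobs, rollout_logprobs, advantages, clip_c: float = 2.0):
    # policy_logprobs: per-token logprobs under the BF16 training policy.
    # rollout_logprobs: per-token logprobs recorded by the (quantized) rollout engine.
    log_ratio = policy_logprobs - rollout_logprobs.detach()
    ratio = torch.exp(log_ratio).clamp(max=clip_c)  # truncation = upper clamp only
    # Importance-weighted policy gradient; the same factor can wrap a PPO/GRPO
    # surrogate instead of this plain REINFORCE-style term.
    return -(ratio.detach() * advantages * policy_logprobs).mean()
```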
SkyRL now has native support for FlashRL! We now support Int8 and FP8 rollouts, enabling blazing-fast inference - up to 1.7x faster for the DAPO recipe - without compromising performance!
Quantization makes RL faster, but at the cost of performance. We present FlashRL, the first open-source & working RL recipe that applies INT8/FP8 for rollouts without losing performance compared to BF16! Blog:
3
21
118
I'm confused by ppl saying GPT-5 is bad. At least with thinking mode, I find it way more focused, more reasonable, and less rambling. Sometimes I even have to ask it to go into more detail! IMO best model from OAI so far.
102
26
894