kourosh hakhamaneshi
@CyrusHakha
Followers 1K · Following 2K · Media 42 · Statuses 817
LLMs + Ray @anyscalecompute | prev PhD, EECS, @UCBerkeley
California, USA
Joined September 2010
Exploring Llama-2's Quality: Can we replace generalist GPT-4 endpoints with specialized OSS models? Dive deep with our technical blog post to understand the nuances and insights of fine-tuning OSS models. https://t.co/zVStDCoG4y Thread 1/N
anyscale.com
We examine the Llama-2 models under 3 real-world use cases and show that fine-tuning yields significant accuracy improvements.
It has always been insightful to talk with Ray developers about how they are solving their infrastructure problems with Ray. The last meetup of the year is happening today:
Join us for the final Ray Meetup of the year, where we will deep dive with technical talks on core advancements in Ray, as well as discuss what's coming in 2026. Ray Meetup: A Year of Distributed Systems Innovation (End-of-Year Celebration) · December 18 · 5:30 - 7:30 PM
vLLM delivers even more inference performance with the same GPU platform. In just 1 month, we've worked with NVIDIA to increase @nvidia Blackwell maximum throughput per GPU by up to 33% -- significantly reducing cost per token -- while also enabling even higher peak speed for
Watch @richliaw (@anyscalecompute) explain why Ray joined the PyTorch Foundation, citing the ecosystem forming around PyTorch, DeepSpeed, and vLLM, and what this move signals about Ray's role in the AI infrastructure stack. https://t.co/1cKtUtDmvm
#PyTorch #Ray #AIInfrastructure
vLLM + Ray Serve LLM APIs! It was an honor to collaborate with the vLLM team to put this together.
Scaling MoE inference is often communication + KV-cache bound: once you push expert parallelism, decode can become dominated by collectives and imbalance, and prefill stragglers can stall an entire EP group. New community benchmark results for vLLM wide-EP on multi-node H200
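As context for the quoted benchmark, here is a minimal sketch of what turning on expert parallelism looks like in vLLM's offline Python API. The model name is only an example, and exact flag names can shift between vLLM versions; wide-EP deployments typically add data parallelism across nodes on top of this.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: shard a sparse MoE model's experts across GPUs.
# `enable_expert_parallel` distributes expert weights across the parallel
# group instead of replicating them, which is what makes decode sensitive
# to collectives and load imbalance at scale.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example MoE model
    tensor_parallel_size=8,
    enable_expert_parallel=True,
)

out = llm.generate(
    ["Explain expert parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```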
We are happy to announce SkyRL tx 0.2; see our blog post: https://t.co/bwn5kBtCf8. It comes with lots of performance improvements: all parts of the execution are now jax-jitted, so there is very little overhead. Now is probably the best time to try it out if you haven't already.
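To illustrate the claim about jit overhead (a toy pattern, not SkyRL tx's actual code): once an entire training step is wrapped in jax.jit, Python-side overhead is paid once at trace time and later calls dispatch a single compiled XLA program.

```python
import jax
import jax.numpy as jnp

# Toy illustration only: the whole update step is one jitted function,
# so repeated calls avoid per-op Python dispatch entirely.
@jax.jit
def train_step(params, x, y):
    def loss_fn(p):
        pred = x @ p["w"] + p["b"]
        return jnp.mean((pred - y) ** 2)
    loss, grads = jax.value_and_grad(loss_fn)(params)
    # Plain SGD update applied across the whole parameter pytree.
    params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
    return params, loss

key = jax.random.PRNGKey(0)
params = {"w": jax.random.normal(key, (4, 1)), "b": jnp.zeros((1,))}
x, y = jnp.ones((8, 4)), jnp.ones((8, 1))
params, loss = train_step(params, x, y)  # compiled on first call, cached after
```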
We recently released SkyRL-Train v0.3.0! Highlights include:
- Experimental support for Pipeline-RL style Async-RL
- Updated E2E Recipes page with Math, Search, SQL runs
- Migration from mbridge -> Megatron-Bridge
- 14 new OSS contributors! (1/n)
The team cooks! Iteration velocity on RL is key to achieving good results. SkyRL is built to modularize RL on LLMs so that researchers can focus on improving model quality.
Announcing OpenThoughts-Agent with an incredible team: a data-centric effort on TerminalBench-style tasks, built with SkyRL+Harbor. Co-leading the RL team over the past month has been a blast, and we're just getting started! (1/n)
How can we make a better TerminalBench agent? Today, we are announcing the OpenThoughts-Agent project. OpenThoughts-Agent v1 is the first TerminalBench agent trained on fully open curated SFT and RL environments. OpenThinker-Agent-v1 is the strongest model of its size on
We are starting a recurring office-hours session for our LLM APIs on Ray + vLLM. This week's agenda: a wide-EP demo for online inference, and distributing batched embedding computation using Ray Data. Stop by if you are curious about these topics.
Ray Serve/Data LLM office hours tomorrow 12/2, 9:30-10:30a PT. Come through to chat distributed LLM inference. @nikhil_r_ghosh giving away free alpha on batch embeddings workloads; I'll demo the new wide-EP and disaggregated serving APIs for Ray Serve
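For readers curious about the batch-embeddings topic, here is a hedged sketch of how such a workload is typically expressed with Ray Data's map_batches. The embedding model, batch size, and output path are placeholders, not the office-hours demo code.

```python
import ray

# Hedged sketch of distributed batch embeddings with Ray Data.
class Embedder:
    def __init__(self):
        # Heavy model load happens once per replica actor, not per batch.
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")  # example model

    def __call__(self, batch: dict) -> dict:
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

ds = ray.data.from_items([{"text": f"document {i}"} for i in range(1_000)])
ds = ds.map_batches(
    Embedder,
    concurrency=4,   # four replica actors pulling batches in parallel
    num_gpus=1,      # one GPU per replica; drop for CPU-only runs
    batch_size=64,
)
ds.write_parquet("/tmp/embeddings")  # placeholder output location
```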
Join Anyscale at #NeurIPS2025 in San Diego. We'll be gathering a group of researchers, founders, & engineers over food and drinks. We'll be discussing Ray and the frontier of large-scale RL, multimodal model training, and multi-node LLM inference.
Thursday, December 4
luma.com
You're invited to the Anyscale Happy Hour at NeurIPS! Join us for an evening hosted by Anyscale co-founder Robert Nishihara: a relaxed, high-energy gathering…
Wise words. Just adding to this: I also think the training infra cost will still be severely dominated by inference cost (rather than pure training), due to 1) data curation and synthesis and 2) RL rollouts. So it's still inference infrastructure that is dominating the foundation.
Is there an AI bubble? With the massive number of dollars going into AI infrastructure such as OpenAI's $1.4 trillion plan and Nvidia briefly reaching a $5 trillion market cap, many have asked if speculation and hype have driven the values of AI investments above sustainable
Google TPU v6e vs AMD MI300X vs NVIDIA H100/B200: Artificial Analysis' Hardware Benchmarking shows NVIDIA achieving a ~5x tokens-per-dollar advantage over TPU v6e (Trillium), and a ~2x advantage over MI300X, in our key inference cost metric
1/n Introducing SkyRL-Agent, a framework for efficient RL agent training.
- 1.55x faster async rollout dispatch
- Lightweight tool + task integration
- Backend-agnostic (SkyRL-train / VeRL / Tinker)
- Used to train SA-SWE-32B, improving Qwen3-32B from 24.4% -> 39.4%
We've constantly been asked how to do DeepSeek-style deployments with Ray Serve. Ideas like prefill/decode disaggregation, wide-EP, custom request routing for prefill/decode, etc. require a fair amount of work in the orchestration layer that can be non-trivial. In Ray 2.52
Wide-EP and prefill/decode disaggregation APIs for vLLM are now available in Ray 2.52. Validated at 2.4k tokens/H200 on Anyscale Runtime, these patterns maximize sparse MoE model inference efficiency, but often require non-trivial orchestration logic. Here's how they
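For orientation, here is a minimal sketch of a Ray Serve LLM deployment using the ray.serve.llm APIs that these patterns build on; the wide-EP and prefill/decode disaggregation builders layer further configuration on top, and exact arguments may differ by Ray version. The model name is an example.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Hedged sketch of a basic Ray Serve LLM deployment; PD-disaggregation and
# wide-EP setups in Ray 2.52 extend configs like this one.
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",                       # name exposed to clients
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # example HF model
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
    engine_kwargs=dict(tensor_parallel_size=1),  # vLLM engine arguments
)

# Builds an OpenAI-compatible app (/v1/chat/completions) and deploys it.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```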
We're open-sourcing a set of high-quality speculator models for Llamas, Qwens, and gpt-oss on Hugging Face. In real workloads, you can expect 1.5 to 2.5x speedups, and sometimes more than 4x. Here's how this fits into the bigger story for speculative decoding. A thread:
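A hedged sketch of how a draft speculator is typically paired with a target model in vLLM: the small draft model proposes several tokens per step and the target model verifies them in one forward pass. The speculator path below is a placeholder rather than one of the released models, and the speculative_config schema has changed across vLLM releases.

```python
from vllm import LLM, SamplingParams

# Hedged sketch, assuming a recent vLLM that accepts `speculative_config`;
# older releases used separate `speculative_model` / `num_speculative_tokens`
# arguments instead.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example target model
    speculative_config={
        "model": "path/to/open-speculator",    # hypothetical draft model path
        "num_speculative_tokens": 4,           # draft tokens verified per step
    },
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```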
Need to customize vLLM? Don't fork it. vLLM's plugin system lets you inject surgical modifications without maintaining a fork or monkey-patching entire modules. Blog by Dhruvil Bhatt from AWS SageMaker. Why plugins > forks:
- vLLM releases every 2 weeks with 100s of PRs
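A minimal sketch of the entry-point mechanism the tweet refers to, modeled on vLLM's documented dummy-model plugin example; the package and model class names here are hypothetical.

```python
# vLLM discovers plugins through the "vllm.general_plugins" entry-point
# group and calls each registered function at startup, so custom models
# ship as a normal pip package instead of a fork. In the plugin package's
# pyproject.toml:
#
# [project.entry-points."vllm.general_plugins"]
# register_dummy_model = "vllm_add_dummy_model:register"

def register():
    from vllm import ModelRegistry

    # Lazy import keeps plugin load cheap; this model class is hypothetical.
    from vllm_add_dummy_model.my_llava import MyLlavaForConditionalGeneration

    # Guard against double registration when the plugin loads twice.
    if "MyLlavaForConditionalGeneration" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "MyLlavaForConditionalGeneration",
            MyLlavaForConditionalGeneration,
        )
```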
New Anyscale releases announced at Ray Summit, from Developer Central to Anyscale Runtime to Cluster Controller. Read the roll-up blog: https://t.co/iMdQWtp5w5