Praveen Gorthy
@praveengorthy
Followers
129
Following
2K
Media
9
Statuses
344
https://t.co/GQkA4KVcRA Engineer by profession. Here for Tech, Programming, Sports, Finance. Strong beliefs, weakly held. In pursuit of going to bed smarter every day.
Joined May 2010
Bumrah is the larger grey dot. And he's managed this in an era where T20 hitting has undergone a revolution...
16
220
1K
Announcing native LLM APIs in @raydistributed Ray Data and Ray Serve libraries. These are experimental APIs that abstract two things: 1. Serve LLM: simplifies the deployment of LLM engines (e.g. vLLM) through Ray Serve APIs. Enables things like
anyscale.com
Try the new LLM APIs available on Ray Data and Ray Serve. It's now easier than ever to use Ray for offline LLM batch inference and online LLM inference.
0
7
36
Scaling AI is hard. Anyscale and Google Cloud make it easier. Read how @anyscalecompute, built by the creators of @raydistributed, runs on @Google Compute Engine to help teams scale any AI workload, from LLMs to classic ML.
cloud.google.com
Without a unified and optimized infrastructure, complexity quickly spirals into excessive cloud spending, resource inefficiencies, and productivity bottlenecks. Enter Ray, the AI compute engine.
0
3
13
Python dependency management has been a longstanding challenge facing AI teams. The uv package manager, built by the team at @astral_sh, goes a long long way toward putting that problem to rest, at least for code running on a single machine. The challenge is even bigger in the
2
7
43
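The single-machine story credited to uv above centers on resolving a project's declared dependencies into a lockfile, so every machine installs identical versions. A minimal, hypothetical `pyproject.toml` for such a project might look like:

```toml
[project]
name = "my-ml-app"          # hypothetical project name
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "numpy>=1.26",
    "ray[serve]>=2.9",
]
```

Running `uv sync` resolves these constraints into a `uv.lock` committed alongside the code, which is what puts single-machine dependency drift "to rest"; the multi-machine case the tweet trails off into is harder because every worker in a cluster must end up with the same environment.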
Understanding GPU bottlenecks is easy with a visualisation
25
387
3K
Anyscale is expanding to India! We're opening our first international office. Come work with us to get this office off the ground (DM @jaikumarharikoa).
0
1
17
Just sat down to read the DeepSeek-R1 paper. We're entering an era where compute isn't primarily for training. It's for creating better data. I expect to see the money & compute spent on data processing (generation / annotation / curation) grow to match and exceed the money &
31
157
978
New to Ray Train? The @Anyscalecompute team just shared an amazing presentation at #RaySummit2024, unveiling the fully built Ray Train Dashboard. With detailed insights into resource utilization, training throughput, and even profiling tools to debug bottlenecks, it's built to
0
3
4
OpenAI, Uber, and Netflix all use Ray to scale their AI workflows. From distributed data preprocessing to LLM serving, Ray does it all. I wrote about what Ray is and why it matters in the age of LLMs. Link in the comments.
1
8
40
Here is the chain of thought:
1. Many companies have a lot of data.
2. The point of having this data is to use it to get insights and make decisions.
3. Today, the primary way that companies do that is through data analytics: running SQL queries and simple analytics.
4. In the
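Point 3 above, SQL-style analytics, can be sketched with Python's built-in sqlite3; the orders table and revenue query here are hypothetical stand-ins for a company's real data.

```python
import sqlite3

# Hypothetical orders table: the kind of data companies query for insights.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("NA", 120.0), ("NA", 80.0), ("EU", 200.0), ("EU", 50.0)],
)

# "Simple analytics": revenue per region, largest first.
rows = conn.execute(
    "SELECT region, SUM(amount) AS revenue FROM orders "
    "GROUP BY region ORDER BY revenue DESC"
).fetchall()
print(rows)  # [('EU', 250.0), ('NA', 200.0)]
```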
Most (all?) LLM performance benchmarks like @ArtificialAnlys go in depth on *online* inference. *Batch* inference seems simpler since almost all companies run some form of embarrassingly parallel workloads. But batch inference is different from other map-reduce style workloads.
2
7
29
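The "embarrassingly parallel" shape of batch inference can be sketched with the standard library; `fake_llm` is a made-up stand-in for a real model call, and the point where real LLM batch inference diverges from plain map-reduce (GPU memory, engine startup, dynamic batching) is deliberately elided.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_llm(prompt: str) -> str:
    # Stand-in for an expensive model call. Real batch inference also has
    # to manage GPU memory and batching, which is where it stops being a
    # plain map-reduce workload.
    return prompt.upper()

prompts = [f"prompt {i}" for i in range(8)]

# Each prompt is independent, so a plain pool.map fans the work out.
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(fake_llm, prompts))

print(outputs[:2])  # ['PROMPT 0', 'PROMPT 1']
```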
With today's release, vLLM 0.6.0 gives users a huge performance boost compared to 0.5.0. Anyscale is happy to have contributed batch scheduling to vLLM this release, which improved request throughput on Llama3-8b by 70%. Shout out to other contributors (@neuralmagic,
A month ago, we announced our performance roadmap. Today, we are happy to share that the latest release achieves 2.7x higher throughput and 5x faster output latency on Llama 8B, and 1.8x higher throughput and 2x faster output latency on Llama 70B, on H100s. https://t.co/QWTT5cyvKw
0
10
28
In 5 of 8 recent conversations, ML platform leaders told me that their top priority over the next 6 months is to enable training on more data (e.g., an order of magnitude more). Why? Scaling laws. The idea that larger models + data + compute can lead to better results (not just
1
14
32
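The scaling-law intuition behind that priority, more data means lower loss but with diminishing returns toward an irreducible floor, can be sketched as a power law; all constants here are made up purely for illustration.

```python
def toy_scaling_loss(n_tokens: float, a: float = 400.0,
                     b: float = 0.3, floor: float = 1.7) -> float:
    # Toy power-law: loss falls with dataset size but approaches an
    # irreducible floor. Constants are illustrative, not fitted.
    return floor + a * n_tokens ** -b

# An order of magnitude more data keeps helping, but each 10x helps less.
losses = [toy_scaling_loss(n) for n in (1e9, 1e10, 1e11)]
print([round(l, 2) for l in losses])  # [2.5, 2.1, 1.9]
```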
Meta-Llama-3.1-405B is now available on Anyscale! Get started here: https://t.co/8dJcU4aU9M Video:
3
11
23
Huge release from Meta! You can spin up Llama 405B on @anyscalecompute in minutes with @raydistributed and @vllm_project.
Starting today, open source is leading the way. Introducing Llama 3.1: Our most capable models yet. Today we're releasing a collection of new Llama 3.1 models, including our long-awaited 405B. These models deliver improved reasoning capabilities, a larger 128K token context
0
5
19
We've recently contributed FP8 support to the @vllm_project in collaboration with @neuralmagic. With this feature, you can see up to a 1.8x reduction in inter-token latency, with >99% accuracy preservation! 1/n
2
32
104
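A toy sketch of the idea behind low-precision inference (not the actual vLLM FP8 implementation, which uses an 8-bit floating-point format rather than the integer grid used here): map weights onto a small grid with a per-tensor scale and check how much accuracy survives.

```python
def quantize(values, n_bits=8):
    # Symmetric per-tensor quantization to a signed n-bit grid.
    qmax = 2 ** (n_bits - 1) - 1            # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.913, -0.427, 0.051, -1.27, 0.333]   # made-up weights
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(max_err)  # bounded by half a quantization step (scale / 2)
```

The ">99% accuracy preservation" claim in the tweet is about end-task accuracy of the real FP8 kernels; this sketch only shows why coarse grids can round-trip weights with small error.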
2014 - Man of the Tournament
2016 - Man of the Tournament
2022 - Played the greatest knock in T20 WC history
2024 - WC winner & Man of the Match in the final
@deeputalks decodes Virat Kohli's stunning T20I career in numbers here - https://t.co/pGXwllf270
6
596
4K
Tomorrow I'll present a Hacker's Guide to Speculative Decoding in @vllm_project with a focus on enabling external contributors. Topics include proposer/scorer/verifier framework, proposal methods, lookahead scheduling, dynamic speculative decoding, and future contribution ideas.
3
13
110
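A toy version of the proposer/verifier split named in the talk (greedy, deterministic, and nothing like production vLLM): a cheap draft model proposes k tokens, and the target model accepts the longest matching prefix, then emits its own token at the first mismatch (or a bonus token if everything matched). Both "models" below are hypothetical functions over integer tokens.

```python
def speculative_step(target, draft, prefix, k=4):
    # Proposer: the cheap draft model speculates k tokens ahead.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    # Verifier: the target accepts the longest agreeing prefix, then
    # substitutes its own token at the first mismatch.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        expected = target(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)
            break
    else:
        accepted.append(target(ctx))  # bonus token: all k were accepted
    return accepted

# Hypothetical models: target counts by 2; draft agrees until the context
# grows past 5 tokens, then drifts.
target = lambda ctx: ctx[-1] + 2
draft = lambda ctx: ctx[-1] + 2 if len(ctx) < 6 else ctx[-1] + 1

print(speculative_step(target, draft, [0], k=4))  # [2, 4, 6, 8, 10]
```

When the draft agrees, one verification pass yields k+1 tokens instead of 1, which is the entire speedup; when it drifts, output quality is unchanged because the target's token wins.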
Chunked prefill expands the Pareto frontier for fast & cheap online continuous batching. Great work in @vllm_project by engineers at @anyscalecompute.
Recently, we've contributed chunked prefill to @vllm_project, leading to up to 2x speedup for higher QPS regimes! In vLLM, prefilling, which fills the KV cache, and decoding, which outputs new tokens, can interfere with each other, resulting in latency degradation. 1/n
1
3
21
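The interference described in the quoted thread can be sketched as a toy scheduler (not vLLM's actual one): each iteration has a token budget; an unchunked prefill consumes a whole iteration and stalls every decode behind it, while chunked prefill lets prefill chunks and decode tokens share each iteration.

```python
def schedule(prefill_tokens, n_decode_tokens, chunk=None, budget=8):
    """Toy batch scheduler. Decode work costs 1 token per request per
    iteration; prefill needs all its tokens at once, or one `chunk` at a
    time when chunked prefill is enabled. Assumes prefill_tokens <= budget
    when chunk is None."""
    steps, remaining, decodes_done = [], prefill_tokens, 0
    while remaining > 0 or decodes_done < n_decode_tokens:
        step = []
        if remaining > 0:
            take = remaining if chunk is None else min(chunk, remaining)
            if take <= budget:
                step.append(("prefill", take))
                remaining -= take
        # Fill leftover budget with decode work.
        used = sum(n for _, n in step)
        while used < budget and decodes_done < n_decode_tokens:
            step.append(("decode", 1))
            used += 1
            decodes_done += 1
        steps.append(step)
    return steps

# Unchunked: the 8-token prefill monopolizes iteration 1, decodes wait.
print(schedule(8, 4, chunk=None))
# Chunked: every iteration mixes prefill and decode, cutting decode latency.
print(schedule(8, 4, chunk=4))
```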
Recently, we've contributed chunked prefill to @vllm_project, leading to up to 2x speedup for higher QPS regimes! In vLLM, prefilling, which fills the KV cache, and decoding, which outputs new tokens, can interfere with each other, resulting in latency degradation. 1/n
4
22
94