Zhuohan Li

@zhuohan123

Followers
3,271
Following
706
Media
9
Statuses
98

CS PhD 👨🏻‍💻 @ UC Berkeley 🌁 🤖️ Machine Learning Systems Building @vllm_project

Berkeley, CA
Joined January 2011
Pinned Tweet
@zhuohan123
Zhuohan Li
1 year
🌟 Thrilled to introduce vLLM with @woosuk_k ! 🚀 vLLM is an open-source LLM inference and serving library that accelerates HuggingFace Transformers by 24x and powers @lmsysorg Vicuna and Chatbot Arena. GitHub: Blog:
20
264
1K
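For context, here is a minimal sketch of vLLM's offline inference API; the model id and sampling settings are illustrative placeholders, not from the tweet:

```python
# Minimal vLLM offline inference sketch. Assumes `pip install vllm`;
# the model id and sampling values are illustrative, not prescriptive.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any HuggingFace causal LM id
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The future of LLM serving is"], params)
for out in outputs:
    print(out.outputs[0].text)
```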
@zhuohan123
Zhuohan Li
1 year
Unlock the full potential of model parallelism with AlpaServe 🚀: Besides scaling models beyond one GPU, our new paper shows that model parallelism can process neural network serving requests 10x faster even when the models fit on a single GPU! Paper: 👇 [1/8]
3
20
150
@zhuohan123
Zhuohan Li
9 months
We are excited to announce the first vLLM Bay Area meetup at 6pm on 10/5 (Thu)! Please find the event details and RSVP at: . The vLLM team will give a deep dive into vLLM and show the future roadmap. We will also have vLLM users and contributors share their experiences.
3
14
117
@zhuohan123
Zhuohan Li
7 months
Excited to see vLLM become the default inference engine for the Microsoft Azure AI model catalog! > Our default choice for serving models is vLLM, which provides high throughput and efficient memory management with continuous batching and Paged Attention. Learn more in the blog
0
5
92
@zhuohan123
Zhuohan Li
9 months
Deeply honored to be in the first cohort of the program, and a big shout-out to @a16z for setting up the grant and recognizing vLLM! Let's go, open source!
@BornsteinMatt
Matt Bornstein
9 months
[New program] a16z Open Source AI Grants. Hackers & independent devs are massively important to the AI ecosystem. We're starting a grant funding program so they can continue their work without pressure to generate financial returns.
71
269
1K
8
3
89
@zhuohan123
Zhuohan Li
1 year
🔥 The core of vLLM is PagedAttention, a novel attention algorithm that brings the classic idea of paging in OS’s virtual memory to LLM serving. Without modifying the model, PagedAttention can batch 5x more sequences together, increasing GPU utilization and thus the throughput.
4
10
82
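To make the paging analogy concrete, here is a toy sketch of the idea; all names and sizes are made up for exposition and are not vLLM's internals. Each sequence's KV cache lives in fixed-size blocks, and a per-sequence block table maps logical block indices to physical blocks, so memory is claimed on demand instead of reserved contiguously up front:

```python
# Toy illustration of the paging idea behind PagedAttention.
BLOCK_SIZE = 16  # tokens stored per physical KV-cache block

class SequenceKVCache:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is claimed only when the last one fills up,
        # so no contiguous reservation is needed ahead of time.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.free_blocks.pop())
        self.num_tokens += 1

pool = list(range(1024))   # shared pool of physical KV blocks on the GPU
seq = SequenceKVCache(pool)
for _ in range(40):        # 40 tokens occupy ceil(40/16) = 3 blocks
    seq.append_token()
print(seq.block_table)     # three physical block ids, e.g. [1023, 1022, 1021]
```

Because unused blocks stay in the shared pool, many more sequences fit in the same GPU memory, which is where the 5x batching gain comes from.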
@zhuohan123
Zhuohan Li
6 months
Excited to have first-hand official support for the Mixtral MoE model in vLLM from @MistralAI ! Get started with Mixtral on the latest vLLM now: . Be sure to check out their announcement blog: Joint work with @woosuk_k @PierreStock
@GuillaumeLample
Guillaume Lample @ ICLR 2024
6 months
Very excited to release our second model, Mixtral 8x7B, an open weight mixture of experts model. Mixtral matches or outperforms Llama 2 70B and GPT3.5 on most benchmarks, and has the inference speed of a 12B dense model. It supports a context length of 32k tokens. (1/n)
85
603
4K
0
5
73
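A minimal sketch of what loading Mixtral in vLLM might look like; the HuggingFace model id is the public one, and the tensor-parallel setting assumes a multi-GPU node and is illustrative:

```python
# Hedged sketch: run Mixtral 8x7B with vLLM, sharding it over 2 GPUs.
# Adjust tensor_parallel_size to the number of GPUs you actually have.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
)
result = llm.generate(["Explain mixture-of-experts in one line:"],
                      SamplingParams(max_tokens=48))
print(result[0].outputs[0].text)
```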
@zhuohan123
Zhuohan Li
9 months
PagedAttention's paper is out! Check it out to learn more!
@woosuk_k
Woosuk Kwon
9 months
Exciting news! 🎉Our PagedAttention paper is now up on arXiv! Dive in to learn why it's an indispensable technique for all major LLM serving frameworks. @zhuohan123 and I will present it at @sospconf next month. Blog post: Paper:
2
34
188
2
4
49
@zhuohan123
Zhuohan Li
7 months
We've published a detailed blog post comparing vLLM with DeepSpeed-FastGen. Proud to highlight the unique strengths of vLLM, demonstrating better performance in various scenarios. Blog:
@woosuk_k
Woosuk Kwon
7 months
We’ve just released a new blog post comparing vLLM with DeepSpeed-FastGen. While we are happy to see the open-source technology advancements from the DeepSpeed team, we’ve got different results with more extensive performance benchmarks. vLLM is actually faster than DeepSpeed in
3
30
209
0
2
39
@zhuohan123
Zhuohan Li
6 months
AMD + vLLM = 🚀🚀🚀
@AMD
AMD
6 months
Update: Let's look at some new inference performance data on AMD Instinct MI300X
8
50
250
0
3
38
@zhuohan123
Zhuohan Li
1 year
🦸 vLLM has been the unsung hero behind @lmsysorg Chatbot Arena and Vicuna Demo since April, handling peak traffic & serving popular models with high efficiency. It has cut the number of GPUs used at LMSYS by half while serving an average of 30K conversations daily.
1
3
36
@zhuohan123
Zhuohan Li
3 months
Come and join the third vLLM Bay Area meetup!
@simon_mo_
Simon Mo
3 months
The vLLM Team is excited to announce our Third vLLM Meetup in San Carlos on April 2nd (Tuesday). We will be discussing feature updates and hearing from you! We thank @Roblox for hosting the event!
0
7
26
0
4
29
@zhuohan123
Zhuohan Li
2 years
Excited to present our work on Alpa at #OSDI22 today at 5:25pm with @lm_zheng ! With a single Python decorator, Alpa automatically figures out the best way to parallelize neural networks among all kinds of model-parallel strategies. Check out Alpa now:
1
2
29
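For context, Alpa's documented usage is roughly a one-decorator change on a JAX training step. A hedged sketch with a toy loss and update (only the decorator name follows Alpa's docs; everything else is illustrative):

```python
# Sketch of Alpa's decorator API on a JAX train step (toy model).
# A real setup may also need alpa.init() on a Ray cluster.
import alpa
import jax
import jax.numpy as jnp

@alpa.parallelize  # Alpa searches over model-parallel strategies here
def train_step(params, batch):
    def loss_fn(p):
        pred = batch["x"] @ p["w"]
        return jnp.mean((pred - batch["y"]) ** 2)
    grads = jax.grad(loss_fn)(params)
    # Plain SGD update; Alpa decides how params and compute are sharded.
    return jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
```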
@zhuohan123
Zhuohan Li
1 year
Check out this great blog post from Anyscale that shows off vLLM's performance!
@cdnamz
Cade Daniel 🇺🇸
1 year
I wrote about a 23x improvement (!) in LLM live-inference throughput, measured on OPT-13B on A100. There are 2 new innovations which make this possible: Continuous batching & PagedAttention. Short thread below; see writeup, experiments, and results at
2
51
245
0
5
28
@zhuohan123
Zhuohan Li
1 year
This is a joint work of @woosuk_k , @zhuohan123 , @zsy9509 , @ying11231 , @lm_zheng , @CodyHaoYu , @profjoeyg , @haozhangml , Ion Stoica. Check out our blog post and GitHub repo to start using vLLM now! Paper coming soon.
1
3
26
@zhuohan123
Zhuohan Li
2 years
Enjoyed #ICML2022 and met lots of new and old friends! Gave a tutorial on large models with @haozhangml @lm_zheng and Ion on Monday (Learn more: ). Will still be around tomorrow and happy to chat!
0
2
25
@zhuohan123
Zhuohan Li
1 year
Check out this great blog post from @skypilot_org : SkyPilot + vLLM = the fastest and cheapest LLM serving on any cloud!
@skypilot_org
SkyPilot
1 year
UC Berkeley's vLLM + SkyPilot speeds up LLM serving by 24x 🤩 Our user blog post on how SkyPilot combated GPU availability issues for #vLLM , allowing them to focus on AI and not infra. (Also includes a 1-click guide to run it on your own cloud account!)
1
18
80
1
4
22
@zhuohan123
Zhuohan Li
8 months
Excited to see Lepton @jiayq go beta! Lepton AI has first-class support for vLLM. Launch vLLM with one line on Lepton:
@LeptonAI
Lepton AI
8 months
We transparently support PyTorch, HuggingFace Transformers, and other common AI libraries at the base. We also work closely with awesome open-source libraries like vLLM - in fact, launching vLLM has never been easier.
1
1
17
0
2
21
@zhuohan123
Zhuohan Li
2 months
It has been a super exciting and rewarding experience to see the community grow, and we will keep growing it. Come and join the project to make LLMs available to everyone!
@vllm_project
vLLM
2 months
We are doubling our committer base for vLLM to ensure it is best-in-class and a truly community effort. This is just a start. Let's welcome @KaichaoYou , @pcmoritz , @nickhill33 , @rogerw0108 , @cdnamz , @robertshaw21 as committers and thank you for your great work! 👏
2
4
32
0
0
20
@zhuohan123
Zhuohan Li
2 years
Check out our new work Alpa: a one-line code change for model-parallel deep learning!
@GoogleAI
Google AI
2 years
Alpa is a framework that uses just one line of code to easily automate the complex model parallelism process for large #DeepLearning models. Learn more and check out the code.
6
99
372
0
0
19
@zhuohan123
Zhuohan Li
12 days
Great work from @anyscalecompute !
@anyscalecompute
Anyscale
12 days
Recently, we’ve contributed chunked prefill to @vllm_project , leading to up to 2x speedup for higher QPS regimes! In vLLM, prefilling, which fills the KV cache, and decoding, which outputs new tokens, can interfere with each other, resulting in latency degradation. 1/n
4
23
96
1
1
17
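If you want to try it, recent vLLM releases expose engine arguments for this; a hedged sketch (the flag names should be checked against your installed version):

```python
# Sketch: enable chunked prefill so long prompt prefills are split into
# chunks and batched together with decode steps, reducing interference.
# Argument names assume a recent vLLM release; verify for your version.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # per-step token budget shared by prefill and decode
)
```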
@zhuohan123
Zhuohan Li
1 year
Check out our latest Vicuna! Try the demo to see how well it works!
@lmsysorg
lmsys.org
1 year
Introducing Vicuna, an open-source chatbot impressing GPT-4! 🚀 Vicuna reaches 90%* quality of ChatGPT/Bard while significantly outperforming other baselines, according to GPT-4's assessment. Blog: Demo:
58
548
2K
1
0
16
@zhuohan123
Zhuohan Li
27 days
@haozhangml @tianle_cai Probably because most data cleaning pipelines are designed for English?
0
0
17
@zhuohan123
Zhuohan Li
26 days
Join us in SF on June 11!
@vllm_project
vLLM
26 days
We are holding the 4th vLLM meetup at @Cloudflare with @bentomlai on June 11. Join us to discuss what's next in production LLM serving! Register at
0
8
23
0
0
16
@zhuohan123
Zhuohan Li
4 years
Check out our latest work! We show that you can accelerate BERT and MT training & inference by _increasing_ model size and stopping early! Blog: Paper: w/ @Eric_Wallace_ , @shengs1123 , @nlpkevinl , Kurt Keutzer, Dan Klein, @mejoeyg
@Eric_Wallace_
Eric Wallace
4 years
Not everyone can afford to train huge neural models. So, we typically *reduce* model size to train/test faster. However, you should actually *increase* model size to speed up training and inference for transformers. Why? [1/6] 👇
16
369
1K
0
1
14
@zhuohan123
Zhuohan Li
5 years
Our new work! The paper can be found at . Code and pretrained models can be found at
@2prime_PKU
Yiping Lu
5 years
Happy to introduce my new work, joint with @zhuohan123 . Understanding the neural network as an ODE, we interpret the Transformer in NLP as a multi-particle system: every word in the sentence is a particle, and a numerical scheme that splits the convection and diffusion terms is used
0
0
12
0
3
13
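For readers who want the math, the splitting idea can be sketched as follows; this is a hedged reconstruction of the standard convection-diffusion splitting in my own notation, not pulled verbatim from the paper:

```latex
% Each word/particle $x_i$ evolves under a convection-diffusion ODE:
\[
  \frac{d x_i}{d t} \;=\; F\big(x_i, \{x_j\}_{j \ne i}\big) \;+\; G(x_i),
\]
% where the interaction (diffusion-like) term $F$ plays the role of
% self-attention and the convection term $G$ the position-wise FFN.
% One Lie--Trotter splitting step of size $\Delta t$ alternates the two:
\[
  \tilde{x}_i = x_i + \Delta t \, F\big(x_i, \{x_j\}_{j \ne i}\big), \qquad
  x_i^{\text{new}} = \tilde{x}_i + \Delta t \, G(\tilde{x}_i),
\]
% mirroring the attention-then-FFN residual structure of a Transformer layer.
```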
@zhuohan123
Zhuohan Li
9 months
@woosuk_k and I will give a talk at the Ray Summit this year about vLLM. Come and talk to us!
@robertnishihara
Robert Nishihara
9 months
Ray Summit this month will be 🔥🔥 🤯 ChatGPT creator @johnschulman2 🧙‍♀️ @bhorowitz on the AI landscape 🦹‍♂️ @hwchase17 on LangChain 🧑‍🚀 @jerryjliu0 on LlamaIndex 👨‍🎤 @zhuohan123 and @woosuk_k on vLLM 🧜 @zongheng_yang on SkyPilot 🧑‍🔧 @MetaAI on Llama-2 🧚‍♂️ @Adobe on Generative AI in
8
45
207
0
1
11
@zhuohan123
Zhuohan Li
2 years
We are also organizing an #ICML22 tutorial on how to train huge neural networks next Monday in Baltimore. Come and learn more about how to scale your fancy neural networks!
2
2
9
@zhuohan123
Zhuohan Li
5 months
Please come and join us in SF to talk about the exciting future of vLLM!
@simon_mo_
Simon Mo
5 months
We are hosting The Second vLLM Meetup in downtown SF on Jan 31st (Wed). Come to chat with vLLM maintainers about LLMs in production and inference optimizations! Thanks @IBM for hosting us.
3
9
37
0
0
11
@zhuohan123
Zhuohan Li
1 year
@k3nnethfrancis @woosuk_k @lmsysorg It works for both remote and local models! Will add this to the docs.
2
1
9
@zhuohan123
Zhuohan Li
1 year
Check out our paper for more! We will release the code in the Alpa project repo soon. Paper: GitHub: With @lm_zheng , Yinmin Zhong, Vincent Liu, @ying11231 , Xin Jin, @bignamehyp , Zhifeng Chen, @haozhangml , @profjoeyg , Ion Stoica [8/8]
0
2
7
@zhuohan123
Zhuohan Li
10 months
@HamelHusain Great results! Have you considered running throughput benchmarks as well? The current benchmarks focus on latency, which is not what most of vLLM's optimizations target :)
0
0
6
@zhuohan123
Zhuohan Li
1 year
With model parallelism, both GPUs can hold parts of both models. Bursty requests to one model can be processed by both GPUs together. [5/8]
1
0
5
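A toy back-of-the-envelope sketch of this 2-model, 2-GPU example (all numbers are illustrative): if each request needs 1 GPU-second of work, a dedicated placement serializes a burst on one GPU while the other idles, whereas a model-parallel placement lets both GPUs share every request:

```python
# Toy illustration of the 2-model/2-GPU example; numbers are made up.
# A burst of B requests arrives for model A while model B is idle.
def burst_finish_dedicated(B):
    # Model A is pinned to GPU 0, so the burst runs serially there
    # while GPU 1 (holding model B) sits idle.
    return B * 1.0  # seconds

def burst_finish_model_parallel(B):
    # Both models are partitioned over both GPUs, so each request
    # gets 2 GPUs' worth of compute (~0.5 s), draining the backlog faster.
    return B * 0.5  # seconds

for B in (2, 8, 32):
    print(B, burst_finish_dedicated(B), burst_finish_model_parallel(B))
```

The paper's larger gains come from statistical multiplexing across many models and bursts, not just this two-GPU case.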
@zhuohan123
Zhuohan Li
1 year
A serving system often needs to serve multiple models at once. Models often receive bursty requests up to 50x the average. [2/8]
1
0
3
@zhuohan123
Zhuohan Li
1 year
In AlpaServe, we use model parallelism to handle these bursty requests. In the following example of serving 2 models on 2 GPUs, with naive placement, each GPU can only fit 1 model. Bursty requests to one model can only be handled by 1 GPU. The other GPU will be idle. [4/8]
1
0
3
@zhuohan123
Zhuohan Li
1 year
Model parallelism partitions a single deep learning model into multiple parts and executes them on distributed devices. It was originally developed to scale large models beyond the memory limits of a single device. [3/8]
1
0
3
@zhuohan123
Zhuohan Li
1 year
We test AlpaServe with production workloads on a 64-GPU cluster. AlpaServe can increase the request processing rate of smaller models by 10x, and of larger, ChatGPT-scale models by 8x. [7/8]
2
0
3
@zhuohan123
Zhuohan Li
1 year
Introducing model parallelism in serving leads to a complex design trade-off space. In AlpaServe, we thoroughly study the space and design novel algorithms to generate efficient model-parallel schedules. [6/8]
1
0
2
@zhuohan123
Zhuohan Li
4 years
Congrats to @wu_kewen , an undergrad from @PKU1898 and an incoming PhD student at @Berkeley_EECS !
@thegautamkamath
Gautam Kamath
4 years
The proceedings for #STOC2020 are now online ()! With them come the best paper awards. The best paper is "Improved bounds for the sunflower lemma," by Ryan Alweiss, Shachar Lovett, Kewen Wu ( @wu_kewen ), and Jiapeng Zhang. (1/2)
1
8
67
0
1
3
@zhuohan123
Zhuohan Li
4 years
@Ozan__Caglayan @Eric_Wallace_ Check out this repo if you want to crawl and do some basic preprocessing for BookCorpus:
0
0
2
@zhuohan123
Zhuohan Li
1 month
@winglian There should be nothing preventing you from doing this. What issue do you run into when you try to run such a long sequence?
1
0
1
@zhuohan123
Zhuohan Li
3 years
@pschafhalter Congrats!!
0
0
1
@zhuohan123
Zhuohan Li
1 year
@Francis_YAO_ Glad you like the project, Yao!
0
0
1