Yifan Qiao
@yifandotqiao
Followers 208 · Following 42 · Media 1 · Statuses 36
Postdoc @UCBerkeley | PhD @UCLA | GPU systems + LLM serving | A student
Joined January 2023
End the GPU Cost Crisis Today!!! Tired of LLMs that lock a whole GPU but leave capacity idle? Frustrated by your cluster's low utilization? We are launching kvcached, the first library for elastic GPU sharing across LLMs. https://t.co/3BC7B6s2EX Why it matters:
9 · 53 · 191
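To make the "elastic sharing" idea concrete, here is a minimal sketch of co-locating two vLLM servers on one GPU. The `ENABLE_KVCACHED` environment variable is an assumption for illustration, not kvcached's confirmed interface, and the model names are placeholders.

```python
# Hypothetical sketch: two OpenAI-compatible vLLM servers on one GPU.
# ENABLE_KVCACHED is an assumed toggle; check the kvcached repo for
# the real activation mechanism.
import os
import subprocess

def launch(model: str, port: int) -> subprocess.Popen:
    env = dict(os.environ, ENABLE_KVCACHED="1", CUDA_VISIBLE_DEVICES="0")
    return subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", model, "--port", str(port)],
        env=env,
    )

# Both models live on GPU 0; instead of each pre-reserving the whole
# KV-cache pool, their caches grow and shrink with actual load.
servers = [
    launch("meta-llama/Llama-3.1-8B-Instruct", 8000),
    launch("Qwen/Qwen2.5-7B-Instruct", 8001),
]
for s in servers:
    s.wait()
```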
The Sky's Fun Committee, representing the ppl of sky, just dropped the new lab theme: Black Pink x Halloween. We have: - Gru & the minions - kpop ???
8 · 8 · 52
Introducing UCCL-EP: a portable, efficient Expert Parallelism framework that brings DeepEP-level GPU-driven communication, with the same APIs, to any cloud or hardware: AWS EFA, AMD GPUs, Broadcom NICs, and beyond. Blog: https://t.co/d3oBVlWezZ Code: https://t.co/0UbCUYz9N9
1 · 5 · 6
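For readers unfamiliar with Expert Parallelism, the communication pattern being accelerated here is a token all-to-all. A rough sketch with stock PyTorch collectives; this shows the generic pattern, not UCCL-EP's API, and the naive expert-to-rank placement is assumed for illustration.

```python
# Illustrative EP token dispatch with stock PyTorch collectives.
# UCCL-EP replaces this CPU-driven path with GPU-driven communication;
# this only shows the all-to-all pattern itself.
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, expert_ids: torch.Tensor,
                    num_ranks: int) -> torch.Tensor:
    """Send each token to the rank hosting its expert (naive placement)."""
    dest = expert_ids % num_ranks            # assumed expert -> rank map
    order = torch.argsort(dest)
    tokens = tokens[order]                   # group tokens by destination
    send = torch.bincount(dest, minlength=num_ranks)
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)       # exchange per-rank counts
    out = tokens.new_empty((int(recv.sum()), tokens.shape[1]))
    dist.all_to_all_single(out, tokens,
                           output_split_sizes=recv.tolist(),
                           input_split_sizes=send.tolist())
    return out
```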
AI-found algorithm beats the NSDI'24 Best Paper [ADRS Blog #2] We use AI to find new spot-instance scheduling algorithms. The result beats the original paper's algorithm, cutting cloud costs by up to 48% (27% on average) while still meeting job deadlines! Read the blog:
0 · 7 · 33
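The AI-found policy itself lives in the blog; purely as a toy illustration of the problem setting, here is a deadline-aware rule that rides spot capacity while slack remains. All names, prices, and the overhead figure are made up.

```python
# Toy model of the spot-vs-on-demand decision (not the ADRS-found
# policy): stay on cheap spot capacity while there is enough deadline
# slack to absorb one more preemption, then fall back to on-demand.
from dataclasses import dataclass

@dataclass
class Job:
    work_left: float  # hours of compute remaining
    deadline: float   # hours from now

SPOT_PRICE, ONDEMAND_PRICE = 0.4, 1.0  # illustrative $/hour

def choose_capacity(job: Job, restart_overhead: float = 0.5) -> str:
    slack = job.deadline - job.work_left
    return "spot" if slack > restart_overhead else "on-demand"

print(choose_capacity(Job(work_left=10, deadline=14)))    # -> spot
print(choose_capacity(Job(work_left=10, deadline=10.2)))  # -> on-demand
```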
SkyRL now runs seamlessly with SkyPilot! Let @skypilot_org handle GPU provisioning and cluster setup, so you can focus on RL training with SkyRL.
- Launch distributed RL jobs effortlessly
- Auto-provision GPUs across clouds
- Train your LLM agents at scale
Get started
0 · 10 · 23
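A sketch of what that looks like with SkyPilot's Python API. `sky.Task`, `sky.Resources`, and `sky.launch` are real SkyPilot calls; the SkyRL package name and entrypoint in `setup`/`run` are placeholders to adapt from the SkyRL docs.

```python
# Sketch: provisioning GPUs for an RL run with SkyPilot's Python API.
# The SkyRL package/entrypoint below are placeholders, not confirmed
# names; adapt them from the SkyRL documentation.
import sky

task = sky.Task(
    setup="pip install skyrl",  # placeholder package name
    run="python -m skyrl.train --config my_rl_config.yaml",  # placeholder
)
task.set_resources(sky.Resources(accelerators="A100:8"))

# SkyPilot picks whichever cloud has capacity and provisions the cluster.
sky.launch(task, cluster_name="skyrl-train")
```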
We are launching a new ADRS blog series to showcase how AI can help systems research. First up: MoE load balancing, where AI found both algorithmic and engineering optimizations! Check out the details in the blog. This case study was led and done by @abmfy_. Bowen is a great undergrad
We used AI to discover a new algorithm for LLM inference, achieving a 5.0x speedup in MoE load balancing over expert-written code. Read the details in our blog post: https://t.co/sHVRqX6wDR Full paper: https://t.co/ex6AidUuwK Code: https://t.co/o2EVHmFMCl
1 · 7 · 40
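As background for the result, not the AI-discovered algorithm itself, the classic baseline for MoE load balancing is greedy longest-processing-time placement of experts onto GPUs:

```python
# Classic greedy baseline for MoE expert placement (context only, not
# the AI-discovered algorithm): assign the heaviest experts first, each
# to the currently least-loaded GPU.
import heapq

def balance(expert_loads: list[int], num_gpus: int) -> list[list[int]]:
    heap = [(0, g) for g in range(num_gpus)]      # (token load, gpu id)
    placement = [[] for _ in range(num_gpus)]
    for eid, load in sorted(enumerate(expert_loads), key=lambda x: -x[1]):
        total, gpu = heapq.heappop(heap)
        placement[gpu].append(eid)
        heapq.heappush(heap, (total + load, gpu))
    return placement

print(balance([90, 70, 40, 40, 30, 10, 5, 5], num_gpus=4))
```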
Fortunate to be part of two (!) foundation projects (@vllm_project and @raydistributed) that have great synergy with each other. The Ray + vLLM + PyTorch stack is coming together. Congratulations, Ray!
We're excited to welcome Ray to the PyTorch Foundation. @raydistributed is an open source distributed computing framework for #AI workloads, including data processing, model training and inference at scale. By contributing Ray to the @PyTorch Foundation, @anyscalecompute
0 · 11 · 91
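One common shape of that stack, sketched with real Ray and vLLM APIs (the model name and prompts are placeholders): vLLM engines wrapped as Ray actors for data-parallel inference.

```python
# Sketch: vLLM engines as Ray actors, one GPU each, for data-parallel
# inference. Ray schedules the actors; vLLM serves each shard.
import ray
from vllm import LLM, SamplingParams

@ray.remote(num_gpus=1)
class Engine:
    def __init__(self, model: str):
        self.llm = LLM(model=model)

    def generate(self, prompts: list[str]) -> list[str]:
        params = SamplingParams(max_tokens=64)
        return [o.outputs[0].text for o in self.llm.generate(prompts, params)]

ray.init()
engines = [Engine.remote("Qwen/Qwen2.5-7B-Instruct") for _ in range(2)]
shards = [["What is Ray?"], ["What is vLLM?"]]
print(ray.get([e.generate.remote(s) for e, s in zip(engines, shards)]))
```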
Many thanks! Would love to hear how it goes when you try it
@yifandotqiao love it!! will give it a shot
0 · 0 · 0
Thank you! Would love to hear your thoughts once you try it. Feel free to open an issue or share feedback anytime.
@yifandotqiao congratulations! nice work. I love that you can use it by just pip installing it without any code change. will definitely try it
0 · 0 · 4
Many thanks for sharing @lmsysorg. We are working hard to bring more features and hardware support to kvcached. KV cache sharing enables many possibilities beyond just the cross-model case, and we would love to see the community try it out.
kvcached enables elastic GPU sharing, and it works out of the box with SGLang: higher utilization, faster serving, zero code change. Come try it
0 · 2 · 9
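For SGLang the launch is analogous to the vLLM sketch earlier; again, the `ENABLE_KVCACHED` toggle is an assumption rather than kvcached's documented switch, and the model is a placeholder.

```python
# Hypothetical sketch: an SGLang server co-located via kvcached.
# ENABLE_KVCACHED is assumed; see the kvcached README for the real switch.
import os
import subprocess

env = dict(os.environ, ENABLE_KVCACHED="1", CUDA_VISIBLE_DEVICES="0")
subprocess.run(
    ["python", "-m", "sglang.launch_server",
     "--model-path", "Qwen/Qwen2.5-7B-Instruct",  # placeholder model
     "--port", "30000"],
    env=env,
)
```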
Excited to share our new research: vAttention - Verified Sparse Attention. Sparse attention with provable quality guarantees for LLMs. Full paper: https://t.co/pvOSEI8E7J GitHub: xAlg-ai/sparse-attention-hub A thread:
arxiv.org
State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based...
1 · 9 · 15
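The "approximate top-k" family the abstract refers to is easy to picture. A minimal top-k decode step follows; this is the generic technique, not vAttention's verified method.

```python
# Generic top-k sparse attention for one decode step (illustration of
# the family vAttention analyzes, not the paper's verified method).
import torch

def topk_attention(q: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
                   k: int = 64) -> torch.Tensor:
    """q: (d,), K: (n, d), V: (n, d). Attend only to the k best keys."""
    scores = K @ q / K.shape[-1] ** 0.5          # (n,) dot-product scores
    idx = torch.topk(scores, min(k, K.shape[0])).indices
    w = torch.softmax(scores[idx], dim=-1)       # renormalize over kept keys
    return w @ V[idx]                            # (d,) sparse output

q = torch.randn(128)
K, V = torch.randn(4096, 128), torch.randn(4096, 128)
print(topk_attention(q, K, V).shape)  # torch.Size([128])
```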
I am incredibly excited to introduce rLLM v0.2. Zooming back to a year ago: @OpenAI's o1-preview had just dropped, and RL + test-time scaling suddenly became the hype. But no one knew how they did it. @kylepmont and I had this idea - what if we built a solver-critique loop for
Introducing rLLM v0.2 - train arbitrary agentic programs with RL, with minimal code changes. Most RL training systems adopt the agent-environment abstraction. But what about complex workflows? Think solver-critique pairs collaborating, or planner agents orchestrating multiple
8 · 33 · 304
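The solver-critique loop pattern, sketched generically: any prompt-in, text-out callable can be plugged in, and the prompts and stop token here are illustrative, not rLLM's API.

```python
# Generic solver-critique loop; `llm` is any prompt-in, text-out callable.
from typing import Callable

def solve_with_critique(problem: str, llm: Callable[[str], str],
                        max_rounds: int = 3) -> str:
    answer = llm(f"Solve:\n{problem}")
    for _ in range(max_rounds):
        critique = llm(f"Problem:\n{problem}\nAnswer:\n{answer}\n"
                       "Point out flaws, or reply exactly LGTM.")
        if critique.strip() == "LGTM":
            break  # critic is satisfied; stop revising
        answer = llm(f"Problem:\n{problem}\nPrevious answer:\n{answer}\n"
                     f"Critique:\n{critique}\nWrite a corrected answer.")
    return answer
```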
LLMs solving math benchmarks with verifiable answers like AIME? Solved.
LLMs solving math proofs? Still an open problem. RL works great for final-answer problems, but proofs are different:
- Often no single checkable answer
- Correct answers can hide flawed reasoning
The key
9 · 37 · 188
I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes, data collection etc., but anyway it doesn't matter. The more interesting part for me (esp. as a computer vision person at heart who is temporarily masquerading as a natural language
DeepSeek-OCR, the new frontier of OCR from @deepseek_ai, exploring optical context compression for LLMs, is running blazingly fast on vLLM (~2500 tokens/s on A100-40G), powered by vllm==0.8.5 for day-0 model support. Compresses visual contexts up to 20× while keeping
568 · 2K · 13K
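A hedged sketch of running the model offline with vLLM's multimodal generate API. The prompt template varies by model, so treat the prompt string and file name as placeholders and follow the model card.

```python
# Sketch: offline OCR with vLLM's multimodal generate API.
# Prompt template and image path are placeholders; follow the model card.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
image = Image.open("page.png")  # placeholder input image
outputs = llm.generate(
    {"prompt": "<image>\nTranscribe this page.",
     "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```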
New service for GPU sharing across LLMs. Introducing kvcached: elastic GPU sharing for LLMs. 1 GPU = multiple LLMs hosted
End the GPU Cost Crisis Today!!! Tired of LLMs that lock a whole GPU but leave capacity idle? Frustrated by your cluster's low utilization? We are launching kvcached, the first library for elastic GPU sharing across LLMs. https://t.co/3BC7B6s2EX Why it matters:
0 · 1 · 3
Thanks a lot for sharing @vllm_project! Thrilled to see our effort on out-of-the-box support paying off. We will keep pushing kvcached forward together with the community, with better performance, richer features, and broader GPU platform support.
kvcached works directly with vLLM and you can use it to serve multiple models on the same GPU. They will share unused KV cache blocks. Check it out!
1 · 0 · 4
Humans handle dynamic situations easily; what about models? Turns out, they break in three distinct ways:
- Force stop → reasoning leakage (won't stop)
- Speedup → panic (rushed answers)
- Info updates → self-doubt (rejects updates)
Check out https://t.co/wKrnsMkiFY
5 · 20 · 66
(7/N) Incredibly grateful to the team: @jiarong_Xing, @yifandotqiao, @shanyu_ucla, Xingqi, Mingyuan, Yangmin, @profjoeyg, @istoica05 and other contributors. We're also warmly inviting collaborators to join us in building the foundations of elastic GPU infrastructure.
0 · 0 · 8
(6/N) Our vision: At Berkeley's Sky Computing Lab @BerkeleySky, we are working towards a GPU "operating system", where compute and memory are dynamically and flexibly shared across models, workloads, and even users.
1 · 0 · 9
(5/N) Please check out our blog for the full story, technical details, and results:
yifanqiao.notion.site
A library to enable virtualized, elastic KV cache for LLM serving on shared GPUs
1 · 2 · 15