Yifan Qiao
@yifandotqiao
Followers 208 · Following 42 · Media 1 · Statuses 36
Postdoc @UCBerkeley | PhD @UCLA | GPU systems + LLM serving | A student
Joined January 2023
End the GPU Cost Crisis Today!!! Tired of LLMs that lock a whole GPU but leave capacity idle? Frustrated by your cluster's low utilization? We are launching kvcached, the first library for elastic GPU sharing across LLMs. https://t.co/3BC7B6s2EX Why it matters:
9 · 53 · 191
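To make the "elastic sharing" idea concrete, here is a minimal sketch of co-locating two vLLM servers on one GPU. The `ENABLE_KVCACHED` environment variable is an assumption for illustration, not kvcached's confirmed interface, and the model names are placeholders.

```python
# Hypothetical sketch: two OpenAI-compatible vLLM servers on one GPU.
# ENABLE_KVCACHED is an assumed toggle; check the kvcached repo for
# the real activation mechanism.
import os
import subprocess

def launch(model: str, port: int) -> subprocess.Popen:
    env = dict(os.environ, ENABLE_KVCACHED="1", CUDA_VISIBLE_DEVICES="0")
    return subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", model, "--port", str(port)],
        env=env,
    )

# Both models live on GPU 0; instead of each pre-reserving the whole
# KV-cache pool, their caches grow and shrink with actual load.
servers = [
    launch("meta-llama/Llama-3.1-8B-Instruct", 8000),
    launch("Qwen/Qwen2.5-7B-Instruct", 8001),
]
for s in servers:
    s.wait()
```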
The Sky's Fun Committee, representing the ppl of sky, just dropped the new lab theme: Black Pink x Halloween. We have: - Gru & the minions - kpop ???
8 · 8 · 52
Introducing UCCL-EP: a portable, efficient Expert Parallelism framework that brings DeepEP-level GPU-driven communication, with the same APIs, to any cloud or hardware: AWS EFA, AMD GPUs, Broadcom NICs, and beyond. Blog: https://t.co/d3oBVlWezZ Code: https://t.co/0UbCUYz9N9
1 · 5 · 6
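For readers unfamiliar with Expert Parallelism, the communication pattern being accelerated here is a token all-to-all. A rough sketch with stock PyTorch collectives; this shows the generic pattern, not UCCL-EP's API, and the naive expert-to-rank placement is assumed for illustration.

```python
# Illustrative EP token dispatch with stock PyTorch collectives.
# UCCL-EP replaces this CPU-driven path with GPU-driven communication;
# this only shows the all-to-all pattern itself.
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, expert_ids: torch.Tensor,
                    num_ranks: int) -> torch.Tensor:
    """Send each token to the rank hosting its expert (naive placement)."""
    dest = expert_ids % num_ranks            # assumed expert -> rank map
    order = torch.argsort(dest)
    tokens = tokens[order]                   # group tokens by destination
    send = torch.bincount(dest, minlength=num_ranks)
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)       # exchange per-rank counts
    out = tokens.new_empty((int(recv.sum()), tokens.shape[1]))
    dist.all_to_all_single(out, tokens,
                           output_split_sizes=recv.tolist(),
                           input_split_sizes=send.tolist())
    return out
```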
AI-found algorithm beats the NSDI'24 Best Paper [ADRS Blog #2] We use AI to find new spot-instance scheduling algorithms. The result beats the original paper's algorithm, cutting cloud costs by up to 48% (27% on average) while still meeting job deadlines! Read the blog:
0 · 7 · 33
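The AI-found policy itself lives in the blog; purely as a toy illustration of the problem setting, here is a deadline-aware rule that rides spot capacity while slack remains. All names, prices, and the overhead figure are made up.

```python
# Toy model of the spot-vs-on-demand decision (not the ADRS-found
# policy): stay on cheap spot capacity while there is enough deadline
# slack to absorb one more preemption, then fall back to on-demand.
from dataclasses import dataclass

@dataclass
class Job:
    work_left: float  # hours of compute remaining
    deadline: float   # hours from now

SPOT_PRICE, ONDEMAND_PRICE = 0.4, 1.0  # illustrative $/hour

def choose_capacity(job: Job, restart_overhead: float = 0.5) -> str:
    slack = job.deadline - job.work_left
    return "spot" if slack > restart_overhead else "on-demand"

print(choose_capacity(Job(work_left=10, deadline=14)))    # -> spot
print(choose_capacity(Job(work_left=10, deadline=10.2)))  # -> on-demand
```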
SkyRL now runs seamlessly with SkyPilot! Let @skypilot_org handle GPU provisioning and cluster setup, so you can focus on RL training with SkyRL.
- Launch distributed RL jobs effortlessly
- Auto-provision GPUs across clouds
- Train your LLM agents at scale
Get started
0 · 10 · 23
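A sketch of what that looks like with SkyPilot's Python API. `sky.Task`, `sky.Resources`, and `sky.launch` are real SkyPilot calls; the SkyRL package name and entrypoint in `setup`/`run` are placeholders to adapt from the SkyRL docs.

```python
# Sketch: provisioning GPUs for an RL run with SkyPilot's Python API.
# The SkyRL package/entrypoint below are placeholders, not confirmed
# names; adapt them from the SkyRL documentation.
import sky

task = sky.Task(
    setup="pip install skyrl",  # placeholder package name
    run="python -m skyrl.train --config my_rl_config.yaml",  # placeholder
)
task.set_resources(sky.Resources(accelerators="A100:8"))

# SkyPilot picks whichever cloud has capacity and provisions the cluster.
sky.launch(task, cluster_name="skyrl-train")
```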
We are launching a new ADRS blog series to showcase how AI can help systems research. First up: MoE load balancing, where AI found both algorithmic and engineering optimizations! Check out the details in the blog. This case study was led and done by @abmfy_. Bowen is a great undergrad
We used AI to discover a new algorithm for LLM inference, achieving a 5.0x speedup in MoE load balancing over expert-written code. Read the details in our blog post: https://t.co/sHVRqX6wDR Full paper: https://t.co/ex6AidUuwK Code: https://t.co/o2EVHmFMCl
1 · 7 · 40
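As background for the result, not the AI-discovered algorithm itself, the classic baseline for MoE load balancing is greedy longest-processing-time placement of experts onto GPUs:

```python
# Classic greedy baseline for MoE expert placement (context only, not
# the AI-discovered algorithm): assign the heaviest experts first, each
# to the currently least-loaded GPU.
import heapq

def balance(expert_loads: list[int], num_gpus: int) -> list[list[int]]:
    heap = [(0, g) for g in range(num_gpus)]      # (token load, gpu id)
    placement = [[] for _ in range(num_gpus)]
    for eid, load in sorted(enumerate(expert_loads), key=lambda x: -x[1]):
        total, gpu = heapq.heappop(heap)
        placement[gpu].append(eid)
        heapq.heappush(heap, (total + load, gpu))
    return placement

print(balance([90, 70, 40, 40, 30, 10, 5, 5], num_gpus=4))
```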
Fortunate to be part of two (!) foundation projects (@vllm_project and @raydistributed) that have great synergy with each other. The Ray + vLLM + PyTorch stack is coming together. Congratulations, Ray!
We're excited to welcome Ray to the PyTorch Foundation. @raydistributed is an open source distributed computing framework for #AI workloads, including data processing, model training and inference at scale. By contributing Ray to the @PyTorch Foundation, @anyscalecompute
0 · 11 · 91
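One common shape of that stack, sketched with real Ray and vLLM APIs (the model name and prompts are placeholders): vLLM engines wrapped as Ray actors for data-parallel inference.

```python
# Sketch: vLLM engines as Ray actors, one GPU each, for data-parallel
# inference. Ray schedules the actors; vLLM serves each shard.
import ray
from vllm import LLM, SamplingParams

@ray.remote(num_gpus=1)
class Engine:
    def __init__(self, model: str):
        self.llm = LLM(model=model)

    def generate(self, prompts: list[str]) -> list[str]:
        params = SamplingParams(max_tokens=64)
        return [o.outputs[0].text for o in self.llm.generate(prompts, params)]

ray.init()
engines = [Engine.remote("Qwen/Qwen2.5-7B-Instruct") for _ in range(2)]
shards = [["What is Ray?"], ["What is vLLM?"]]
print(ray.get([e.generate.remote(s) for e, s in zip(engines, shards)]))
```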
Many thanks! Would love to hear how it goes when you try it
@yifandotqiao love it!! will give it a shot
0 · 0 · 0
Thank you! Would love to hear your thoughts once you try it. Feel free to open an issue or share feedback anytime.
@yifandotqiao congratulations! nice work. I love that you can use it by just pip installing it without any code change. will definitely try it
0 · 0 · 4
Many thanks for sharing @lmsysorg. We are working hard to bring more features and hardware support to kvcached. KV cache sharing enables many possibilities beyond just the cross-model case, and we would love to see the community try it out.
kvcached enables elastic GPU sharing, and it works out of the box with SGLang: higher utilization, faster serving, zero code change. Come try it
0 · 2 · 9
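For SGLang the launch is analogous to the vLLM sketch earlier; again, the `ENABLE_KVCACHED` toggle is an assumption rather than kvcached's documented switch, and the model is a placeholder.

```python
# Hypothetical sketch: an SGLang server co-located via kvcached.
# ENABLE_KVCACHED is assumed; see the kvcached README for the real switch.
import os
import subprocess

env = dict(os.environ, ENABLE_KVCACHED="1", CUDA_VISIBLE_DEVICES="0")
subprocess.run(
    ["python", "-m", "sglang.launch_server",
     "--model-path", "Qwen/Qwen2.5-7B-Instruct",  # placeholder model
     "--port", "30000"],
    env=env,
)
```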
Excited to share our new research: vAttention - Verified Sparse Attention. Sparse attention with provable quality guarantees for LLMs. Full paper: https://t.co/pvOSEI8E7J GitHub: xAlg-ai/sparse-attention-hub A thread:
arxiv.org
State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based...
1 · 9 · 15
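The "approximate top-k" family the abstract refers to is easy to picture. A minimal top-k decode step follows; this is the generic technique, not vAttention's verified method.

```python
# Generic top-k sparse attention for one decode step (illustration of
# the family vAttention analyzes, not the paper's verified method).
import torch

def topk_attention(q: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
                   k: int = 64) -> torch.Tensor:
    """q: (d,), K: (n, d), V: (n, d). Attend only to the k best keys."""
    scores = K @ q / K.shape[-1] ** 0.5          # (n,) dot-product scores
    idx = torch.topk(scores, min(k, K.shape[0])).indices
    w = torch.softmax(scores[idx], dim=-1)       # renormalize over kept keys
    return w @ V[idx]                            # (d,) sparse output

q = torch.randn(128)
K, V = torch.randn(4096, 128), torch.randn(4096, 128)
print(topk_attention(q, K, V).shape)  # torch.Size([128])
```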
I am incredibly excited to introduce rLLM v0.2. Zooming back to a year ago: @OpenAI's o1-preview had just dropped, and RL + test-time scaling suddenly became the hype. But no one knew how they did it. @kylepmont and I had this idea - what if we built a solver-critique loop for
Introducing rLLM v0.2 - train arbitrary agentic programs with RL, with minimal code changes. Most RL training systems adopt the agent-environment abstraction. But what about complex workflows? Think solver-critique pairs collaborating, or planner agents orchestrating multiple
8 · 33 · 304
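The solver-critique loop pattern, sketched generically: any prompt-in, text-out callable can be plugged in, and the prompts and stop token here are illustrative, not rLLM's API.

```python
# Generic solver-critique loop; `llm` is any prompt-in, text-out callable.
from typing import Callable

def solve_with_critique(problem: str, llm: Callable[[str], str],
                        max_rounds: int = 3) -> str:
    answer = llm(f"Solve:\n{problem}")
    for _ in range(max_rounds):
        critique = llm(f"Problem:\n{problem}\nAnswer:\n{answer}\n"
                       "Point out flaws, or reply exactly LGTM.")
        if critique.strip() == "LGTM":
            break  # critic is satisfied; stop revising
        answer = llm(f"Problem:\n{problem}\nPrevious answer:\n{answer}\n"
                     f"Critique:\n{critique}\nWrite a corrected answer.")
    return answer
```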
LLMs solving math benchmarks with verifiable answers like AIME? Solved.
LLMs solving math proofs? Still an open problem. RL works great for final-answer problems, but proofs are different:
- Often no single checkable answer
- Correct answers can hide flawed reasoning
The key
9 · 37 · 188
I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes, data collection etc., but anyway it doesn't matter. The more interesting part for me (esp. as a computer vision person at heart who is temporarily masquerading as a natural language
DeepSeek-OCR, the new frontier of OCR from @deepseek_ai, exploring optical context compression for LLMs, is running blazingly fast on vLLM (~2500 tokens/s on A100-40G), powered by vllm==0.8.5 for day-0 model support. Compresses visual contexts up to 20× while keeping
568 · 2K · 13K
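A hedged sketch of running the model offline with vLLM's multimodal generate API. The prompt template varies by model, so treat the prompt string and file name as placeholders and follow the model card.

```python
# Sketch: offline OCR with vLLM's multimodal generate API.
# Prompt template and image path are placeholders; follow the model card.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
image = Image.open("page.png")  # placeholder input image
outputs = llm.generate(
    {"prompt": "<image>\nTranscribe this page.",
     "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```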
New service for GPU sharing across LLMs. Introducing kvcached: elastic GPU sharing for LLMs. 1 GPU = multiple LLMs hosted
End the GPU Cost Crisis Today!!! Tired of LLMs that lock a whole GPU but leave capacity idle? Frustrated by your cluster's low utilization? We are launching kvcached, the first library for elastic GPU sharing across LLMs. https://t.co/3BC7B6s2EX Why it matters:
0 · 1 · 3
Thanks a lot for sharing @vllm_project! Thrilled to see our effort on out-of-the-box support paying off. We will keep pushing kvcached forward together with the community, with better performance, richer features, and broader GPU platform support.
kvcached works directly with vLLM and you can use it to serve multiple models on the same GPU. They will share unused KV cache blocks. Check it out!
1 · 0 · 4
Humans handle dynamic situations easily; what about models? Turns out, they break in three distinct ways:
- Force stop → reasoning leakage (won't stop)
- Speedup → panic (rushed answers)
- Info updates → self-doubt (rejects updates)
Check out https://t.co/wKrnsMkiFY
5 · 20 · 66
(7/N) Incredibly grateful to the team: @jiarong_Xing, @yifandotqiao, @shanyu_ucla, Xingqi, Mingyuan, Yangmin, @profjoeyg, @istoica05 and other contributors. We're also warmly inviting collaborators to join us in building the foundations of elastic GPU infrastructure.
0 · 0 · 8
(6/N) Our vision: At Berkeley's Sky Computing Lab @BerkeleySky, we are working towards a GPU "operating system", where compute and memory are dynamically and flexibly shared across models, workloads, and even users.
1 · 0 · 9
(5/N) Please check out our blog for the full story, technical details, and results:
yifanqiao.notion.site
A library to enable virtualized, elastic KV cache for LLM serving on shared GPUs
1 · 2 · 15