
Lequn Chen
@abcdabcd987
914 Followers · 254 Following · 6 Media · 28 Statuses
Faster and cheaper LLM inference.
Seattle, WA
Joined January 2012
RT @NVIDIAAIDev: 🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine…
Lower latency and higher throughput -- get both with multi-node deployment for MoE models like DeepSeek-V3/R1.
@perplexity_ai DeepSeek-V3/R1 contains 671B total parameters but activates only 37B per token. Testing shows EP128 configurations deliver up to 5x higher throughput at equivalent output speeds compared to single-node deployments. Higher EP values assign fewer experts per GPU, reducing memory
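A rough back-of-the-envelope sketch of the expert-parallelism point above. The 256-routed-experts-per-layer figure is an assumption based on the public DeepSeek-V3 spec, and the script only counts routed experts per GPU, ignoring shared experts, attention weights, and activations:

```python
# Back-of-the-envelope: how expert parallelism (EP) spreads routed experts.
# 256 routed experts per MoE layer is assumed from the DeepSeek-V3 spec.

ROUTED_EXPERTS_PER_LAYER = 256

for ep in (8, 16, 32, 64, 128):
    per_gpu = ROUTED_EXPERTS_PER_LAYER / ep
    print(f"EP{ep:<3} -> {per_gpu:g} routed experts per GPU per layer")

# EP128 leaves only 2 routed experts per GPU per layer, which is why the
# per-GPU expert-weight footprint shrinks as the EP degree grows.
```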
10x faster than PyTorch All-to-All. 2x faster than DeepEP on a single node. Although 2x slower than DeepEP on 128 GPUs, our impl is less picky about hardware and software. Make your MoE go brrr
We've built custom NVSHMEM-based kernels for Mixture-of-Experts (MoE) models that deliver up to 10x faster communication than standard all-to-all operations. Our approach balances performance with adaptability across different hardware configurations.
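For context on the baseline being compared against, here is a minimal sketch of MoE token dispatch using stock torch.distributed.all_to_all_single. This is the vanilla all-to-all path, not the custom NVSHMEM kernels; the shapes, the count-exchange step, and the helper name are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, send_counts: torch.Tensor) -> torch.Tensor:
    """Baseline MoE token dispatch via NCCL all-to-all.

    tokens:      [num_tokens, hidden], rows pre-sorted by destination rank.
    send_counts: int64 tensor of length world_size, tokens per destination.
    (With the NCCL backend both tensors must live on the GPU.)
    """
    # 1) Exchange per-rank counts so each rank can size its receive buffer.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # 2) Exchange the token payloads with variable split sizes.
    out = tokens.new_empty(int(recv_counts.sum().item()), tokens.shape[1])
    dist.all_to_all_single(
        out, tokens,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return out
```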
We are building our in-house LLM inference stack. Join us if this excites you! And I have a more in-depth tutorial about achieving 3200 Gbps here:
Using a custom RDMA-based networking library, we've been able to achieve 3200 Gbps GPU memory transfers, bypassing NCCL limits and reaching 97.1% of the theoretical bandwidth. Our latest blog shares our journey of building a custom high-performance networking solution on AWS.
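Quick arithmetic behind the bandwidth numbers, assuming the instance's aggregate NIC line rate is 3200 Gbit/s and reading the 97.1% figure as goodput over line rate:

```python
# Convert the quoted figures into bytes-per-second terms.
LINE_RATE_GBPS = 3200   # aggregate NIC line rate in Gbit/s (assumed)
EFFICIENCY = 0.971      # fraction of line rate actually achieved

goodput_gbps = LINE_RATE_GBPS * EFFICIENCY
goodput_gib_s = goodput_gbps * 1e9 / 8 / 2**30   # Gbit/s -> GiB/s

print(f"goodput ≈ {goodput_gbps:.0f} Gbit/s ≈ {goodput_gib_s:.0f} GiB/s")
# -> goodput ≈ 3107 Gbit/s ≈ 362 GiB/s
```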
Start a new year's work with coffee in a Perplexity mug!
RT @perplexity_ai: We’re excited to announce an updated version of our Pro Search that can perform deeper research on more complex queries…
RT @luisceze: Go @abcdabcd987 (Lequn Chen)! Great work on making lots of LoRAs cheap to serve. Nice collaboration with @ye_combinator @arvin…
RT @ye_combinator: (1/3) Memory Bandwidth Efficient Shared Prefix Batch Decoding, brought to you by FlashInfer. Blog:
🚀 FlashInfer: highly optimized attention kernels for {single, batch} x {prefill, decode, append} x {ragged tensor, paging} x {FP16, FP8, INT4} x {4090, Ada6000, A100, H100}. 🔥 Python wheels available! Check it out!
(1/4) Announcing FlashInfer, a kernel library that provides state-of-the-art kernel implementations for LLM inference/serving. FlashInfer's unique features include: - Comprehensive Attention Kernels: covering prefill/decode/append attention for various KV-Cache formats (Page
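A minimal usage sketch of the single-request decode path. The entry point flashinfer.single_decode_with_kv_cache and the [kv_len, num_kv_heads, head_dim] (NHD) layout are taken from memory of the FlashInfer docs, so treat the exact signature as an assumption:

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096

q = torch.randn(num_qo_heads, head_dim, dtype=torch.half, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")

# Grouped-query decode attention over the full KV cache of one request.
o = flashinfer.single_decode_with_kv_cache(q, k, v)
print(o.shape)  # [num_qo_heads, head_dim]
```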
RT @bariskasikci: We just released the source code for Atom, an efficient and accurate quantization algorithm for Large Language Model serv…
Really good observation from @tianle_cai and @junrushao. I did a quick sanity check: the delta between Mixtral 8x7B MoE and Mistral 7B is NOT low-rank. SGMV is not applicable here. We need new research :)
Workload-wise, the dynamic routing operator per se in MoE is pretty similar to the SGMV kernel in multi-LoRA serving, as introduced in Punica by @abcdabcd987 @ye_combinator :))
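A sketch of that sanity check: estimate how many singular directions a weight delta needs. Random matrices stand in for actual checkpoint weights here, and the dimensions are shrunken placeholders (real Mixtral/Mistral MLP projections are 4096x14336):

```python
import torch

def effective_rank(delta: torch.Tensor, energy: float = 0.90) -> int:
    """Smallest k such that the top-k singular values carry `energy`
    of the squared spectral mass of `delta`."""
    s = torch.linalg.svdvals(delta.float())
    cum = torch.cumsum(s * s, dim=0) / (s * s).sum()
    return int((cum < energy).sum().item()) + 1

w_moe = torch.randn(1024, 4096)    # placeholder for an expert MLP weight
w_base = torch.randn(1024, 4096)   # placeholder for the base model weight

k = effective_rank(w_moe - w_base)
print(f"~{k} singular directions carry 90% of the delta's energy")
# A LoRA-style delta needs only a handful of directions; a dense delta needs
# hundreds, which is why SGMV-style multi-LoRA serving does not apply here.
```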
Just made a demo: use Punica to serve multiple LoRA finetuned LLMs at the cost of one! Previously:
🤔 Assuming a large language model application takes 5 GPUs to serve, does it require 50 GPUs to serve 10 different LLM apps? 🌟 Our latest research project, Punica, enables serving multiple LoRA finetuned LLMs at the cost of one!
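To make "many LoRAs at the cost of one" concrete, here is a sketch of the computation Punica batches: one shared base GEMM plus per-request low-rank updates grouped by adapter. The grouped Python loop only expresses the semantics that an SGMV-style kernel fuses on the GPU; names and shapes are illustrative, not Punica's actual API.

```python
import torch

def multi_lora_forward(x, w_base, lora_a, lora_b, lora_idx):
    """x: [batch, d_in]; w_base: [d_in, d_out];
    lora_a: [num_loras, d_in, r]; lora_b: [num_loras, r, d_out];
    lora_idx: [batch] adapter index of each request."""
    y = x @ w_base                                    # shared base GEMM
    for i in lora_idx.unique():                       # group rows by adapter
        rows = (lora_idx == i).nonzero(as_tuple=True)[0]
        y[rows] += (x[rows] @ lora_a[i]) @ lora_b[i]  # low-rank update per adapter
    return y

x = torch.randn(8, 4096)
w = torch.randn(4096, 4096)
a = torch.randn(4, 4096, 16)        # 4 adapters, rank 16
b = torch.randn(4, 16, 4096)
idx = torch.randint(0, 4, (8,))
print(multi_lora_forward(x, w, a, b, idx).shape)      # torch.Size([8, 4096])
```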