Lequn Chen

@abcdabcd987

Followers: 914 · Following: 254 · Media: 6 · Statuses: 28

Faster and cheaper LLM inference.

Seattle, WA
Joined January 2012
@abcdabcd987
Lequn Chen
1 month
RT @tqchenml: Check out the technical deep dive on FlashInfer.
0 replies · 4 retweets · 0 likes
@abcdabcd987
Lequn Chen
2 months
I prefer this UI (Win2003 even better) to today's UI. Today's UI feels inconsistent, whitespace is too big, and info is hidden in nested menus. Screens and resolutions get bigger and bigger, but information density gets lower and lower.
@PERFECTL00P
░ perfectloop ░
2 months
🇴 🇻 🇪 🇷 🇱 🇴 🇦 🇩 🪟🪟🪟🫨
0 replies · 0 retweets · 3 likes
@abcdabcd987
Lequn Chen
2 months
RT @NVIDIAAIDev: 🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine….
0 replies · 47 retweets · 0 likes
@abcdabcd987
Lequn Chen
3 months
It has been such a wonderful year at @perplexity_ai. Keep building 😆
[2 images attached]
13 replies · 3 retweets · 905 likes
@abcdabcd987
Lequn Chen
3 months
Lower latency and higher throughput -- get both with multi-node deployment for MoE models like DeepSeek-V3/R1.
@PPLXDevs
Perplexity Developers
3 months
@perplexity_ai DeepSeek-V3/R1 contains 671B total parameters but activates only 37B per token. Testing shows EP128 configurations deliver up to 5x higher throughput at equivalent output speeds compared to single-node deployments. Higher EP values assign fewer experts per GPU, reducing memory
Tweet media one
0 replies · 8 retweets · 31 likes
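A rough sketch of the memory argument in the quoted thread: higher EP spreads the routed experts across more GPUs, so each GPU holds fewer expert weights. The expert count and per-expert size below are illustrative assumptions, not numbers from the thread.

```python
# Back-of-the-envelope only. NUM_ROUTED_EXPERTS and EXPERT_PARAMS are assumed
# values for illustration, not figures quoted in the thread.
NUM_ROUTED_EXPERTS = 256        # assumed routed experts per MoE layer
EXPERT_PARAMS = 44_000_000      # assumed parameters per expert (order of magnitude)
BYTES_PER_PARAM = 1             # FP8 weights

for ep in (8, 32, 128):
    experts_per_gpu = NUM_ROUTED_EXPERTS // ep
    gb = experts_per_gpu * EXPERT_PARAMS * BYTES_PER_PARAM / 1e9
    print(f"EP{ep:>3}: {experts_per_gpu:3d} experts/GPU, ~{gb:.2f} GB expert weights per layer")
```

The HBM freed by holding fewer experts can go to KV cache and larger batches, which is presumably where the throughput gain at equivalent output speed comes from.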
@abcdabcd987
Lequn Chen
4 months
10x faster than PyTorch All-to-All. 2x faster than DeepEP on single node. Although 2x slower than DeepEP on 128 GPUs, our impl is less picky about hardware and software. Make your MoE go brrr
@PPLXDevs
Perplexity Developers
4 months
We've built custom NVSHMEM-based kernels for Mixture-of-Experts (MoE) models that deliver up to 10x faster communication than standard all-to-all operations. Our approach balances performance with adaptability across different hardware configurations.
Tweet media one
0 replies · 17 retweets · 153 likes
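For context on the baseline being compared against, here is a minimal sketch of the standard all-to-all token dispatch that MoE layers commonly build on PyTorch collectives. This is the generic pattern, not Perplexity's NVSHMEM kernel; it assumes a process group initialized with the NCCL backend (e.g. under torchrun).

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, send_counts: list[int]) -> torch.Tensor:
    """Baseline MoE dispatch: ship each rank the tokens routed to its experts.

    tokens:      [sum(send_counts), hidden], rows pre-grouped by destination rank
    send_counts: number of rows destined for each rank
    """
    assert len(send_counts) == dist.get_world_size()

    # Exchange row counts first so every rank knows how many rows it will receive.
    send = torch.tensor(send_counts, device=tokens.device)
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)

    # Variable-sized all-to-all of the actual token rows.
    out = torch.empty(int(recv.sum()), tokens.shape[1],
                      dtype=tokens.dtype, device=tokens.device)
    dist.all_to_all_single(out, tokens,
                           output_split_sizes=recv.tolist(),
                           input_split_sizes=send_counts)
    return out
```

The custom NVSHMEM-based kernels target exactly this communication pattern; the sketch is only meant to show the baseline they are measured against.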
@abcdabcd987
Lequn Chen
5 months
We are building our in-house LLM inference stack. Join us if this excites you! And I have a more in-depth tutorial about achieving 3200 Gbps here:
@perplexity_ai
Perplexity
5 months
Using a custom RDMA-based networking library, we've been able to achieve 3200 Gbps GPU memory transfers, bypassing NCCL limits for 97.1% theoretical bandwidth efficiency. Our latest blog shares our journey of building a custom high-performance networking solution on AWS.
Tweet media one
3 replies · 23 retweets · 255 likes
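As a rough sanity check on the bandwidth numbers, a few lines of arithmetic. The NIC layout (eight 400 Gbps adapters per 8-GPU node) is my assumption about the AWS instance type, not something stated in the tweets.

```python
# Back-of-the-envelope: what 3200 Gbps aggregate bandwidth means per GPU.
# The 8 x 400 Gbps NIC layout is an assumption, not a detail from the tweets.
nics_per_node = 8
gbps_per_nic = 400
total_gbps = nics_per_node * gbps_per_nic      # 3200 Gbps aggregate per node
gb_per_sec_per_gpu = gbps_per_nic / 8          # 400 Gbit/s = 50 GB/s per GPU

transfer_gb = 16                               # e.g. a hypothetical 16 GB shard
print(f"{total_gbps} Gbps total, {gb_per_sec_per_gpu:.0f} GB/s per GPU, "
      f"{transfer_gb} GB moves in ~{transfer_gb / gb_per_sec_per_gpu:.2f} s")
```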
@abcdabcd987
Lequn Chen
7 months
Start a new year's work with coffee in a Perplexity mug!
Tweet media one
@jeremiahjw
Jeremiah Warren ◡̈
7 months
A few photos I took of the @perplexity_ai mugs and the coffee bag designed by @RypeArts ☕️
[4 images attached]
1 reply · 0 retweets · 43 likes
@abcdabcd987
Lequn Chen
10 months
RT @perplexity_ai: Ask Jim Harbaugh anything.
0 replies · 184 retweets · 0 likes
@abcdabcd987
Lequn Chen
1 year
RT @perplexity_ai: We’re excited to announce an updated version of our Pro Search that can perform deeper research on more complex queries….
0 replies · 191 retweets · 0 likes
@abcdabcd987
Lequn Chen
1 year
0 replies · 61 retweets · 0 likes
@abcdabcd987
Lequn Chen
1 year
RT @luisceze: Go @abcdabcd987 (Lequn Chen)! Great work on making lots of LoRAs cheap to serve. Nice collaboration with @ye_combinator @arvin….
0 replies · 2 retweets · 0 likes
@abcdabcd987
Lequn Chen
1 year
RT @ye_combinator: (1/3) Memory Bandwidth Efficient Shared Prefix Batch Decoding, brought to you by FlashInfer. Blog:
0 replies · 21 retweets · 0 likes
@abcdabcd987
Lequn Chen
1 year
🚀FlashInfer: highly optimized attention kernels for {single, batch} x {prefill, decode, append} x {ragged tensor, paging} x {FP16, FP8, INT4} x {4090, Ada6000, A100, H100}. 🔥Python wheels available! Check it out!
@ye_combinator
Zihao Ye
1 year
(1/4) Announcing FlashInfer, a kernel library that provides state-of-the-art kernel implementations for LLM Inference/Serving. FlashInfer's unique features include:
- Comprehensive Attention Kernels: covering prefill/decode/append attention for various KV-Cache formats (Page…
Tweet media one
0 replies · 2 retweets · 16 likes
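To make the {decode} x {paging} corner of that feature matrix concrete, here is a tiny pure-PyTorch reference of decode attention over a paged KV cache for a single request. This is an illustration of what such kernels compute, not FlashInfer's code or API; the tensor layout and names are assumptions.

```python
import torch

def paged_decode_attention(q, kv_pages, page_table, kv_len):
    """Reference decode attention over a paged KV cache (illustrative only).

    q:          [num_heads, head_dim]                          query of the one new token
    kv_pages:   [num_pages, 2, page_size, num_heads, head_dim] shared K/V page pool
    page_table: [pages_used]                                   page ids owned by this request
    kv_len:     number of valid tokens in this request's cache
    """
    pages = kv_pages[page_table]                       # [P, 2, page_size, H, D]
    k = pages[:, 0].reshape(-1, *q.shape)[:kv_len]     # [kv_len, H, D]
    v = pages[:, 1].reshape(-1, *q.shape)[:kv_len]

    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("hd,lhd->hl", q, k) * scale  # [H, kv_len]
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hl,lhd->hd", probs, v)        # [H, D]
```

An optimized kernel fuses the page gather with the attention math and never materializes the contiguous K/V tensors; the gather-then-attend version here is only for readability.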
@abcdabcd987
Lequn Chen
2 years
RT @bariskasikci: We just released the source code for Atom, an efficient and accurate quantization algorithm for Large Language Model serv….
0 replies · 43 retweets · 0 likes
@abcdabcd987
Lequn Chen
2 years
Really good observation from @tianle_cai and @junrushao. I did a quick sanity check: the delta between Mixtral 8x7B MoE and Mistral 7B is NOT low-rank. SGMV is not applicable here. We need new research :)
Tweet media one
@junrushao
Junru Shao
2 years
Workload-wise, the dynamic routing operator per se in MoE is pretty similar to the SGMV kernel in multi-LoRA serving as introduced in Punica @abcdabcd987 @ye_combinator :)).
1 reply · 0 retweets · 7 likes
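A sketch of the kind of sanity check mentioned above: measure how much of the weight delta's energy its top-r singular values capture. The tensors below are random stand-ins (hypothetical names), not the actual Mixtral/Mistral checkpoints.

```python
import torch

def low_rank_fraction(w_finetuned: torch.Tensor, w_base: torch.Tensor, r: int = 16) -> float:
    """Share of the delta's squared Frobenius norm captured by its top-r singular
    values. Close to 1.0 means the delta is roughly rank-r (LoRA-friendly)."""
    delta = (w_finetuned - w_base).float()
    s = torch.linalg.svdvals(delta)
    return (s[:r].square().sum() / s.square().sum()).item()

# Hypothetical usage with random stand-ins; real checkpoints would be loaded instead.
w_base = torch.randn(1024, 1024)
w_expert = w_base + 0.02 * torch.randn(1024, 1024)   # dense, full-rank perturbation
print(low_rank_fraction(w_expert, w_base, r=16))     # small fraction => not low-rank
```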
@abcdabcd987
Lequn Chen
2 years
Just made a demo: use Punica to serve multiple LoRA finetuned LLMs at the cost of one! Previously:
@abcdabcd987
Lequn Chen
2 years
🤔Assuming a large language model application takes 5 GPUs to serve, does it require 50 GPUs to serve 10 different LLM apps? 🌟Our latest research project, Punica, enables serving multiple LoRA finetuned LLMs at the cost of one!
Tweet media one
2 replies · 3 retweets · 35 likes
@abcdabcd987
Lequn Chen
2 years
🚀Punica is able to deliver 12x throughput compared to state-of-the-art LLM serving systems. 📄Dive into our paper: 💻Explore our code: 🗨️Join the HackerNews discussion: #LLM #LoRA
2 replies · 3 retweets · 16 likes
@abcdabcd987
Lequn Chen
2 years
How? We developed a CUDA kernel, called SGMV, that efficiently runs different LoRA models in a batch. SGMV enables a strong batching effect. 🔥Increasing batch size does not increase latency significantly. You can run multiple LoRA models at the cost of one.
[2 images attached]
1 reply · 1 retweet · 14 likes
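For intuition about what SGMV computes, here is a plain-PyTorch reference of the same semantics: every request in the batch is multiplied through its own LoRA A/B pair. The Python loop is only the specification; the actual CUDA kernel gathers the per-request adapter weights and runs this as one fused, batched launch, which is where the strong batching effect described above comes from.

```python
import torch

def sgmv_reference(x, lora_a, lora_b, adapter_idx):
    """Reference semantics of SGMV (not the CUDA kernel itself).

    x:           [batch, d_in]           one token's activations per request
    lora_a:      [n_adapters, d_in, r]   stacked LoRA A matrices
    lora_b:      [n_adapters, r, d_out]  stacked LoRA B matrices
    adapter_idx: [batch]                 which adapter each request uses
    """
    out = x.new_zeros(x.shape[0], lora_b.shape[-1])
    for i, a in enumerate(adapter_idx.tolist()):
        out[i] = x[i] @ lora_a[a] @ lora_b[a]   # per-request low-rank update
    return out

# Hypothetical shapes: 4 adapters, rank 16, batch of 8 requests.
x = torch.randn(8, 4096)
A = torch.randn(4, 4096, 16)
B = torch.randn(4, 16, 11008)
y = sgmv_reference(x, A, B, torch.randint(0, 4, (8,)))   # [8, 11008]
```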
@abcdabcd987
Lequn Chen
2 years
🤔Assuming a large language model application takes 5 GPUs to serve, does it require 50 GPUs to serve 10 different LLM apps? 🌟Our latest research project, Punica, enables serving multiple LoRA finetuned LLMs at the cost of one!
Tweet media one
8 replies · 47 retweets · 311 likes