Lequn Chen
@abcdabcd987
Followers
1K
Following
310
Media
13
Statuses
45
Faster and cheaper LLM inference.
Seattle, WA
Joined January 2012
Wrote a blog post on why collective communication feels awkward for newer LLM workloads (disaggregated inference, RL weight update, MoE), why people don't just use raw RDMA, how we approached it, and some behind-the-scenes stories.
le.qun.ch
Last week, our team summarized some recent progress we made on point-to-point communication for LLM systems and posted a paper on arXiv. We also open-sourced the code on GitHub. We built an RDMA co...
2
13
84
Faster than DeepEP for Decode on ConnectX-7. First viable kernel on EFA. SM-free RDMA transfer. Supports prefill. (Maybe portable to other hardware as well.)
Perplexity is the first to develop custom Mixture-of-Experts (MoE) kernels that make trillion-parameter models available with cloud platform portability. Our team has published this work on arXiv as Perplexity's first research paper. Read more: https://t.co/SNdgWTeO8F
1
7
33
We divide the weight transfer process into pipeline stages to enable overlapped execution over different hardware resources (CPU->GPU memcpy, GPU computation, RDMA, Ethernet).
1
0
2
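Not the production implementation, but a minimal sketch of the staged-overlap idea from the post above: plain Python threads and bounded queues stand in for CUDA streams and RDMA queue pairs, and the stage bodies (h2d_copy, gpu_repack, rdma_send) are hypothetical placeholders.

```python
import queue
import threading

NUM_CHUNKS = 8

def h2d_copy(chunk_id):
    # Placeholder for a cudaMemcpyAsync on a dedicated copy stream.
    return f"gpu_buf[{chunk_id}]"

def gpu_repack(buf):
    # Placeholder for a layout/dtype conversion kernel on the GPU.
    return f"packed({buf})"

def rdma_send(packed):
    # Placeholder for a one-sided RDMA WRITE toward the inference GPUs.
    print("sent", packed)

def run_stage(fn, inq, outq):
    # Pull work, process it, pass it downstream; None is the shutdown signal.
    while (item := inq.get()) is not None:
        result = fn(item)
        if outq is not None:
            outq.put(result)
    if outq is not None:
        outq.put(None)

# Bounded queues keep only a couple of chunks in flight per stage, which bounds
# the extra memory the pipeline needs while keeping PCIe, the SMs, and the NIC
# busy on different chunks at the same time.
q_in, q_mid, q_out = (queue.Queue(maxsize=2) for _ in range(3))
stages = [
    threading.Thread(target=run_stage, args=(h2d_copy, q_in, q_mid)),
    threading.Thread(target=run_stage, args=(gpu_repack, q_mid, q_out)),
    threading.Thread(target=run_stage, args=(rdma_send, q_out, None)),
]
for t in stages:
    t.start()
for chunk_id in range(NUM_CHUNKS):
    q_in.put(chunk_id)
q_in.put(None)
for t in stages:
    t.join()
```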
We use the one-sided RDMA WRITE primitive. It avoids any RPC overheads, temp buffers, or control logic. The inference GPUs won't even know that the weights were updated.
1
0
1
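A conceptual sketch of what that looks like on the sender side, assuming a hypothetical thin Python wrapper over ibverbs (qp.post_write and qp.poll_completion are made-up names, not a real API); it only illustrates the one-sided shape of the protocol.

```python
# Conceptual sketch only. The receiver registers its GPU weight buffer once and
# shares (addr, rkey); after that, every update is a single RDMA WRITE posted
# by the sender, with no code running on the receiver side.

from dataclasses import dataclass

@dataclass
class RemoteBuffer:
    addr: int      # GPU virtual address of the remote weight tensor
    rkey: int      # remote key from the receiver's one-time memory registration
    nbytes: int

def push_weights(qp, local_mr, remote: RemoteBuffer, offset: int, nbytes: int):
    """Write local_mr[offset:offset+nbytes] straight into the remote GPU buffer."""
    assert offset + nbytes <= remote.nbytes
    qp.post_write(                       # hypothetical wrapper over ibv_post_send
        local_addr=local_mr.addr + offset,
        lkey=local_mr.lkey,
        remote_addr=remote.addr + offset,
        rkey=remote.rkey,
        length=nbytes,
    )
    qp.poll_completion()                 # only the sender observes completion
```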
The common approach is to gather all weights to training rank-0, then broadcast to inference workers. Rank-0 becomes the bottleneck (up to 400Gbps = 50GB/s). In contrast, we use a P2P pattern to ship weights from all training GPUs to all inference GPUs, leveraging the full network fabric.
1
0
1
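To make the contrast concrete, here is a toy version of the P2P shipping pattern (the sharding and names are invented for the example): each training GPU sends the shard it already owns directly to every inference GPU, so no single NIC carries the whole model.

```python
def p2p_schedule(num_train_gpus, num_infer_gpus):
    """Toy P2P plan: shard i lives on training GPU i and is written directly
    to every inference GPU, instead of funnelling through rank-0."""
    transfers = []
    for shard_id in range(num_train_gpus):
        for dst in range(num_infer_gpus):
            transfers.append((shard_id, dst, shard_id))   # (src, dst, shard)
    return transfers

# 4 training GPUs x 2 inference GPUs -> 8 transfers spread across 4 source NICs,
# versus every byte leaving rank-0's single NIC in the gather+broadcast plan.
for src, dst, shard in p2p_schedule(4, 2):
    print(f"train_gpu{src} --RDMA WRITE shard {shard}--> infer_gpu{dst}")
```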
We recently achieved 1.3-second cross-machine parameter update for Kimi-K2 (1T parameters), as opposed to a few minutes in popular frameworks.
1
2
4
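Rough feasibility arithmetic, not from the post: assuming roughly one byte per parameter (FP8 weights), a 1T-parameter model is on the order of 1 TB, and moving that in 1.3 seconds needs far more aggregate bandwidth than any single NIC can provide.

```python
params = 1e12                 # Kimi-K2, ~1T parameters
bytes_per_param = 1.0         # assumption: FP8 weights
total_gb = params * bytes_per_param / 1e9        # ~1000 GB

target_s = 1.3
needed_gb_per_s = total_gb / target_s            # ~770 GB/s aggregate
one_nic_gb_per_s = 400 / 8                       # a 400 Gbps NIC moves 50 GB/s

print(f"need ~{needed_gb_per_s:.0f} GB/s aggregate")
print(f"that is ~{needed_gb_per_s / one_nic_gb_per_s:.0f} NICs' worth of bandwidth,")
print("hence the P2P sharding and pipelining described in the posts above")
```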
How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache (128 dims per token vs. 512 for MLA) and scores each incoming query against it. The top-2048 tokens are then passed to Sparse MLA.
Introducing DeepSeek-V3.2-Exp -- our latest experimental model! Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. Now live on App, Web, and API. API prices cut by 50%+! 1/n
11
108
716
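A toy NumPy version of that two-stage selection (single head, no batching, random weights; only the 128-vs-512 sizes follow the post above, everything else is simplified away):

```python
import numpy as np

def lightning_indexer_topk(q_idx, k_idx_cache, k_top=2048):
    """Score the incoming query against the small indexer key cache and
    return the indices of the top-k_top context tokens."""
    # q_idx: (d_idx,) small indexer query; k_idx_cache: (seq_len, d_idx)
    scores = k_idx_cache @ q_idx                     # (seq_len,)
    k_top = min(k_top, scores.shape[0])
    return np.argpartition(-scores, k_top - 1)[:k_top]

def sparse_attention(q, kv_cache, selected):
    """Full-size attention, but only over the selected tokens.
    (Keys stand in for values here; real Sparse MLA works on latent KV.)"""
    k = kv_cache[selected]                           # (k_top, d)
    logits = k @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ k

seq_len, d_idx, d = 8192, 128, 512
rng = np.random.default_rng(0)
q_idx, k_idx = rng.standard_normal(d_idx), rng.standard_normal((seq_len, d_idx))
q, kv = rng.standard_normal(d), rng.standard_normal((seq_len, d))
selected = lightning_indexer_topk(q_idx, k_idx)      # 2048 of 8192 tokens kept
out = sparse_attention(q, kv, selected)
```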
Introducing Perplexity Search API. We've built a search index of billions of webpages to provide real-time, quality information from the web. Now developers have access to the full power of our index, providing the most accurate results in milliseconds. https://t.co/TDOT8vnWxA
101
259
2K
Just got a sneak peek of the breakout sessions lineup for #RaySummit2025 -- and it's 🔥. Sessions from: @character_ai on Scaling LLM Post-Training; The State of @vllm_project in 2025; @Roblox on Training 3D Foundation Models with Ray; @xai on Scaling Image + Video
1
3
12
1.5 seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): https://t.co/6zaFAeNICT
8
111
482
GPT-OSS is fast. What's even faster? The development velocity that our inference stack enables! Read More: https://t.co/Jg2BjhrPIv
1
5
37
Check out the technical deep dive on FlashInfer.
Our deep dive blog covering our winning MLSys paper on FlashInfer is now live: https://t.co/nKHWvRSchK Accelerate LLM inference with FlashInfer -- NVIDIA's high-performance, JIT-compiled library built for ultra-efficient transformer inference on GPUs. Go under the hood with
0
5
29
I prefer this UI (win2003 is even better) to today's UIs. Today's UIs feel inconsistent, whitespace is too big, and info is hidden in nested menus. Screens and resolutions get bigger and bigger, but information density gets lower and lower.
0
0
5
Congratulations to the FlashInfer team -- their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. We are excited to share that we are now backing FlashInfer -- a supporter and
github.com
FlashInfer: Kernel Library for LLM Serving. Contribute to flashinfer-ai/flashinfer development by creating an account on GitHub.
4
45
199
Lower latency and Higher throughput -- Get both with multi-node deployment for MoE models like DeepSeek-V3/R1.
@perplexity_ai DeepSeek-V3/R1 contains 671B total parameters but activates only 37B per token. Testing shows EP128 configurations deliver up to 5x higher throughput at equivalent output speeds compared to single-node deployments. Higher EP values assign fewer experts per GPU, reducing memory
0
8
30
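For scale (the expert counts below are from the DeepSeek-V3 paper; the rest is plain arithmetic): DeepSeek-V3/R1 has 256 routed experts per MoE layer, so higher expert-parallel degrees leave only a handful of experts, and far less expert weight memory, on each GPU.

```python
routed_experts_per_layer = 256     # DeepSeek-V3/R1 MoE layers
active_experts_per_token = 8       # top-8 routing

for ep in (8, 32, 128):
    experts_per_gpu = routed_experts_per_layer // ep
    print(f"EP{ep:>3}: {experts_per_gpu} routed experts per GPU")

# Fewer experts per GPU means less HBM spent on expert weights, which leaves
# room for larger batches and KV cache -- the source of the throughput gain
# at equivalent per-token output speed.
```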
10x faster than PyTorch All-to-All. 2x faster than DeepEP on single node. Although 2x slower than DeepEP on 128 GPUs, our impl is less picky about hardware and software. Make your MoE go brrr
github.com
Perplexity GPU Kernels. Contribute to perplexityai/pplx-kernels development by creating an account on GitHub.
We've built custom NVSHMEM-based kernels for Mixture-of-Experts (MoE) models that deliver up to 10x faster communication than standard all-to-all operations. Our approach balances performance with adaptability across different hardware configurations.
0
17
154
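A pure-Python stand-in for the dispatch/combine pattern those kernels implement (rank counts, shapes, and the identity "experts" are invented; the real kernels write each group straight into remote buffers over NVLink/RDMA instead of returning lists):

```python
import numpy as np

num_ranks, experts_per_rank, top_k, d = 4, 2, 2, 8
num_experts = num_ranks * experts_per_rank
rng = np.random.default_rng(0)

tokens = rng.standard_normal((16, d))
expert_ids = rng.integers(0, num_experts, size=(16, top_k))   # router output

def dispatch(tokens, expert_ids):
    """Group each (token, expert) pair by the rank hosting that expert."""
    per_rank = [[] for _ in range(num_ranks)]
    for t, choices in enumerate(expert_ids):
        for e in choices:
            per_rank[e // experts_per_rank].append((t, int(e), tokens[t]))
    return per_rank

def combine(tokens, expert_ids, expert_out):
    """Average each token's expert outputs back together (the real combine
    weights them by the router's gating scores)."""
    out = np.zeros_like(tokens)
    for t, choices in enumerate(expert_ids):
        for e in choices:
            out[t] += expert_out[(t, int(e))] / len(choices)
    return out

per_rank = dispatch(tokens, expert_ids)
# Identity "experts": each rank just echoes the tokens it received.
expert_out = {(t, e): x for group in per_rank for (t, e, x) in group}
print(np.allclose(combine(tokens, expert_ids, expert_out), tokens))   # True
```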
We are building our in-house LLM inference stack. Join us if this excites you! Also, I have a more in-depth tutorial on achieving 3200 Gbps here:
le.qun.ch
Earlier this year, I had the fortune of joining Perplexity AI, where I finally got to use servers with the most powerful configuration -- AWS p5 instances equipped with 8 NVIDIA H100 GPUs interconnect...
Using a custom RDMA-based networking library, we've been able to achieve 3200 Gbps GPU memory transfers, bypassing NCCL limits and reaching 97.1% of theoretical bandwidth. Our latest blog shares our journey of building a custom high-performance networking solution on AWS.
3
23
254
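The headline numbers are mostly unit conversion (the 3200 Gbps aggregate figure matches the AWS p5 network spec; reading 97.1% as a fraction of that line rate is my own interpretation):

```python
line_rate_gbps = 3200                  # aggregate EFA bandwidth of a p5 instance
line_rate_gbytes = line_rate_gbps / 8  # = 400 GB/s of GPU memory per second

achieved_gbps = 0.971 * line_rate_gbps
print(f"{line_rate_gbytes:.0f} GB/s at line rate, ~{achieved_gbps:.0f} Gbps achieved")
# ~3107 Gbps sustained: at that point the NICs, not the software, are the limit.
```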