Lequn Chen
@abcdabcd987
Followers
1K
Following
310
Media
13
Statuses
45
Faster and cheaper LLM inference.
Seattle, WA
Joined January 2012
Wrote a blog post on why collective communication feels awkward for newer LLM workloads (disaggregated inference, RL weight update, MoE), why people don't just use raw RDMA, how we approached it, and some behind-the-scenes stories.
le.qun.ch
Last week, our team summarized some recent progress we made on point-to-point communication for LLM systems and posted a paper on arXiv. We also open-sourced the code on GitHub. We built an RDMA co...
2
13
84
Faster than DeepEP for Decode on ConnectX-7. First viable kernel on EFA. SM-free RDMA transfer. Supports prefill. (Maybe portable to other hardware as well.)
Perplexity is the first to develop custom Mixture-of-Experts (MoE) kernels that make trillion-parameter models available with cloud platform portability. Our team has published this work on arXiv as Perplexity's first research paper. Read more: https://t.co/SNdgWTeO8F
1
7
33
We divide the weight transfer process into pipeline stages to enable overlapped execution over different hardware resources (CPU->GPU memcpy, GPU computation, RDMA, Ethernet).
1
0
2
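Not the production implementation, but a minimal sketch of the staged-overlap idea from the post above: plain Python threads and bounded queues stand in for CUDA streams and RDMA queue pairs, and the stage bodies (h2d_copy, gpu_repack, rdma_send) are hypothetical placeholders.

```python
import queue
import threading

NUM_CHUNKS = 8

def h2d_copy(chunk_id):
    # Placeholder for a cudaMemcpyAsync on a dedicated copy stream.
    return f"gpu_buf[{chunk_id}]"

def gpu_repack(buf):
    # Placeholder for a layout/dtype conversion kernel on the GPU.
    return f"packed({buf})"

def rdma_send(packed):
    # Placeholder for a one-sided RDMA WRITE toward the inference GPUs.
    print("sent", packed)

def run_stage(fn, inq, outq):
    # Pull work, process it, pass it downstream; None is the shutdown signal.
    while (item := inq.get()) is not None:
        result = fn(item)
        if outq is not None:
            outq.put(result)
    if outq is not None:
        outq.put(None)

# Bounded queues keep only a couple of chunks in flight per stage, which bounds
# the extra memory the pipeline needs while keeping PCIe, the SMs, and the NIC
# busy on different chunks at the same time.
q_in, q_mid, q_out = (queue.Queue(maxsize=2) for _ in range(3))
stages = [
    threading.Thread(target=run_stage, args=(h2d_copy, q_in, q_mid)),
    threading.Thread(target=run_stage, args=(gpu_repack, q_mid, q_out)),
    threading.Thread(target=run_stage, args=(rdma_send, q_out, None)),
]
for t in stages:
    t.start()
for chunk_id in range(NUM_CHUNKS):
    q_in.put(chunk_id)
q_in.put(None)
for t in stages:
    t.join()
```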
We use the one-sided RDMA WRITE primitive. It avoids any RPC overheads, temp buffers, or control logic. The inference GPUs won't even know that the weights were updated.
1
0
1
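A conceptual sketch of what that looks like on the sender side, assuming a hypothetical thin Python wrapper over ibverbs (qp.post_write and qp.poll_completion are made-up names, not a real API); it only illustrates the one-sided shape of the protocol.

```python
# Conceptual sketch only. The receiver registers its GPU weight buffer once and
# shares (addr, rkey); after that, every update is a single RDMA WRITE posted
# by the sender, with no code running on the receiver side.

from dataclasses import dataclass

@dataclass
class RemoteBuffer:
    addr: int      # GPU virtual address of the remote weight tensor
    rkey: int      # remote key from the receiver's one-time memory registration
    nbytes: int

def push_weights(qp, local_mr, remote: RemoteBuffer, offset: int, nbytes: int):
    """Write local_mr[offset:offset+nbytes] straight into the remote GPU buffer."""
    assert offset + nbytes <= remote.nbytes
    qp.post_write(                       # hypothetical wrapper over ibv_post_send
        local_addr=local_mr.addr + offset,
        lkey=local_mr.lkey,
        remote_addr=remote.addr + offset,
        rkey=remote.rkey,
        length=nbytes,
    )
    qp.poll_completion()                 # only the sender observes completion
```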
The common approach is to gather all weights to training rank-0, then broadcast to inference workers. Rank-0 becomes the bottleneck (up to 400Gbps = 50GB/s). In contrast, we use a P2P pattern to ship weights from all training GPUs to all inference GPUs, leveraging the full network fabric.
1
0
1
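To make the contrast concrete, here is a toy version of the P2P shipping pattern (the sharding and names are invented for the example): each training GPU sends the shard it already owns directly to every inference GPU, so no single NIC carries the whole model.

```python
def p2p_schedule(num_train_gpus, num_infer_gpus):
    """Toy P2P plan: shard i lives on training GPU i and is written directly
    to every inference GPU, instead of funnelling through rank-0."""
    transfers = []
    for shard_id in range(num_train_gpus):
        for dst in range(num_infer_gpus):
            transfers.append((shard_id, dst, shard_id))   # (src, dst, shard)
    return transfers

# 4 training GPUs x 2 inference GPUs -> 8 transfers spread across 4 source NICs,
# versus every byte leaving rank-0's single NIC in the gather+broadcast plan.
for src, dst, shard in p2p_schedule(4, 2):
    print(f"train_gpu{src} --RDMA WRITE shard {shard}--> infer_gpu{dst}")
```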
We recently achieved 1.3-second cross-machine parameter update for Kimi-K2 (1T parameters), as opposed to a few minutes in popular frameworks.
1
2
4
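Rough feasibility arithmetic, not from the post: assuming roughly one byte per parameter (FP8 weights), a 1T-parameter model is on the order of 1 TB, and moving that in 1.3 seconds needs far more aggregate bandwidth than any single NIC can provide.

```python
params = 1e12                 # Kimi-K2, ~1T parameters
bytes_per_param = 1.0         # assumption: FP8 weights
total_gb = params * bytes_per_param / 1e9        # ~1000 GB

target_s = 1.3
needed_gb_per_s = total_gb / target_s            # ~770 GB/s aggregate
one_nic_gb_per_s = 400 / 8                       # a 400 Gbps NIC moves 50 GB/s

print(f"need ~{needed_gb_per_s:.0f} GB/s aggregate")
print(f"that is ~{needed_gb_per_s / one_nic_gb_per_s:.0f} NICs' worth of bandwidth,")
print("hence the P2P sharding and pipelining described in the posts above")
```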
How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache (128 dims per token vs. 512 for MLA) and scores each incoming query against it. The top-2048 tokens are then passed to Sparse MLA.
Introducing DeepSeek-V3.2-Exp -- our latest experimental model! Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. Now live on App, Web, and API. API prices cut by 50%+! 1/n
11
108
716
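A toy NumPy version of that two-stage selection (single head, no batching, random weights; only the 128-vs-512 sizes follow the post above, everything else is simplified away):

```python
import numpy as np

def lightning_indexer_topk(q_idx, k_idx_cache, k_top=2048):
    """Score the incoming query against the small indexer key cache and
    return the indices of the top-k_top context tokens."""
    # q_idx: (d_idx,) small indexer query; k_idx_cache: (seq_len, d_idx)
    scores = k_idx_cache @ q_idx                     # (seq_len,)
    k_top = min(k_top, scores.shape[0])
    return np.argpartition(-scores, k_top - 1)[:k_top]

def sparse_attention(q, kv_cache, selected):
    """Full-size attention, but only over the selected tokens.
    (Keys stand in for values here; real Sparse MLA works on latent KV.)"""
    k = kv_cache[selected]                           # (k_top, d)
    logits = k @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ k

seq_len, d_idx, d = 8192, 128, 512
rng = np.random.default_rng(0)
q_idx, k_idx = rng.standard_normal(d_idx), rng.standard_normal((seq_len, d_idx))
q, kv = rng.standard_normal(d), rng.standard_normal((seq_len, d))
selected = lightning_indexer_topk(q_idx, k_idx)      # 2048 of 8192 tokens kept
out = sparse_attention(q, kv, selected)
```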
Introducing Perplexity Search API. We've built a search index of billions of webpages to provide real-time, quality information from the web. Now developers have access to the full power of our index, providing the most accurate results in milliseconds. https://t.co/TDOT8vnWxA
101
259
2K
Just got a sneak peek of the breakout sessions lineup for #RaySummit2025 -- and it's 🔥. Sessions from: @character_ai on Scaling LLM Post-Training; The State of @vllm_project in 2025; @Roblox on Training 3D Foundation Models with Ray; @xai on Scaling Image + Video
1
3
12
1.5 seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): https://t.co/6zaFAeNICT
8
111
482
GPT-OSS is fast. What's even faster? The development velocity that our inference stack enables! Read More: https://t.co/Jg2BjhrPIv
1
5
37
Check out the technical deep dive on FlashInfer.
Our deep dive blog covering our winning MLSys paper on FlashInfer is now live: https://t.co/nKHWvRSchK Accelerate LLM inference with FlashInfer -- NVIDIA's high-performance, JIT-compiled library built for ultra-efficient transformer inference on GPUs. Go under the hood with
0
5
29
I prefer this UI (win2003 is even better) to today's UIs. Today's UIs feel inconsistent, whitespace is too big, and info is hidden in nested menus. Screens and resolutions get bigger and bigger, but information density gets lower and lower.
0
0
5
Congratulations to the FlashInfer team -- their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. We are excited to share that we are now backing FlashInfer -- a supporter and
github.com
FlashInfer: Kernel Library for LLM Serving. Contribute to flashinfer-ai/flashinfer development by creating an account on GitHub.
4
45
199
Lower latency and Higher throughput -- Get both with multi-node deployment for MoE models like DeepSeek-V3/R1.
@perplexity_ai DeepSeek-V3/R1 contains 671B total parameters but activates only 37B per token. Testing shows EP128 configurations deliver up to 5x higher throughput at equivalent output speeds compared to single-node deployments. Higher EP values assign fewer experts per GPU, reducing memory
0
8
30
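For scale (the expert counts below are from the DeepSeek-V3 paper; the rest is plain arithmetic): DeepSeek-V3/R1 has 256 routed experts per MoE layer, so higher expert-parallel degrees leave only a handful of experts, and far less expert weight memory, on each GPU.

```python
routed_experts_per_layer = 256     # DeepSeek-V3/R1 MoE layers
active_experts_per_token = 8       # top-8 routing

for ep in (8, 32, 128):
    experts_per_gpu = routed_experts_per_layer // ep
    print(f"EP{ep:>3}: {experts_per_gpu} routed experts per GPU")

# Fewer experts per GPU means less HBM spent on expert weights, which leaves
# room for larger batches and KV cache -- the source of the throughput gain
# at equivalent per-token output speed.
```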
10x faster than PyTorch All-to-All. 2x faster than DeepEP on single node. Although 2x slower than DeepEP on 128 GPUs, our impl is less picky about hardware and software. Make your MoE go brrr
github.com
Perplexity GPU Kernels. Contribute to perplexityai/pplx-kernels development by creating an account on GitHub.
We've built custom NVSHMEM-based kernels for Mixture-of-Experts (MoE) models that deliver up to 10x faster communication than standard all-to-all operations. Our approach balances performance with adaptability across different hardware configurations.
0
17
154
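A pure-Python stand-in for the dispatch/combine pattern those kernels implement (rank counts, shapes, and the identity "experts" are invented; the real kernels write each group straight into remote buffers over NVLink/RDMA instead of returning lists):

```python
import numpy as np

num_ranks, experts_per_rank, top_k, d = 4, 2, 2, 8
num_experts = num_ranks * experts_per_rank
rng = np.random.default_rng(0)

tokens = rng.standard_normal((16, d))
expert_ids = rng.integers(0, num_experts, size=(16, top_k))   # router output

def dispatch(tokens, expert_ids):
    """Group each (token, expert) pair by the rank hosting that expert."""
    per_rank = [[] for _ in range(num_ranks)]
    for t, choices in enumerate(expert_ids):
        for e in choices:
            per_rank[e // experts_per_rank].append((t, int(e), tokens[t]))
    return per_rank

def combine(tokens, expert_ids, expert_out):
    """Average each token's expert outputs back together (the real combine
    weights them by the router's gating scores)."""
    out = np.zeros_like(tokens)
    for t, choices in enumerate(expert_ids):
        for e in choices:
            out[t] += expert_out[(t, int(e))] / len(choices)
    return out

per_rank = dispatch(tokens, expert_ids)
# Identity "experts": each rank just echoes the tokens it received.
expert_out = {(t, e): x for group in per_rank for (t, e, x) in group}
print(np.allclose(combine(tokens, expert_ids, expert_out), tokens))   # True
```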
We are building our in-house LLM inference stack. Join us if this excites you! Also, I have a more in-depth tutorial on achieving 3200 Gbps here:
le.qun.ch
Earlier this year, I had the fortune of joining Perplexity AI, where I finally got to use servers with the most powerful configuration -- AWS p5 instances equipped with 8 NVIDIA H100 GPUs interconnect...
Using a custom RDMA-based networking library, we've been able to achieve 3200 Gbps GPU memory transfers, bypassing NCCL limits and reaching 97.1% of theoretical bandwidth. Our latest blog shares our journey of building a custom high-performance networking solution on AWS.
3
23
254
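The headline numbers are mostly unit conversion (the 3200 Gbps aggregate figure matches the AWS p5 network spec; reading 97.1% as a fraction of that line rate is my own interpretation):

```python
line_rate_gbps = 3200                  # aggregate EFA bandwidth of a p5 instance
line_rate_gbytes = line_rate_gbps / 8  # = 400 GB/s of GPU memory per second

achieved_gbps = 0.971 * line_rate_gbps
print(f"{line_rate_gbytes:.0f} GB/s at line rate, ~{achieved_gbps:.0f} Gbps achieved")
# ~3107 Gbps sustained: at that point the NICs, not the software, are the limit.
```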