Lequn Chen

@abcdabcd987

Followers: 1K · Following: 310 · Media: 13 · Statuses: 45

Faster and cheaper LLM inference.

Seattle, WA
Joined January 2012
@abcdabcd987
Lequn Chen
19 hours
zhihu:
0 · 0 · 1
@abcdabcd987
Lequn Chen
19 hours
Wrote a blog post on why collective communication feels awkward for newer LLM workloads (disaggregated inference, RL weight update, MoE), why people don't just use raw RDMA, how we approached it, and some behind-the-scenes stories.
le.qun.ch
Last week, our team summarized some recent progress we made on point-to-point communication for LLM systems and posted a paper on arXiv. We also open-sourced the code on GitHub. We built an RDMA co...
2 · 13 · 84
@abcdabcd987
Lequn Chen
6 days
Faster than DeepEP for decode on ConnectX-7. First viable kernel on EFA. SM-free RDMA transfer. Supports prefill. (Maybe portable to other hardware as well.)
@perplexity_ai
Perplexity
6 days
Perplexity is the first to develop custom Mixture-of-Experts (MoE) kernels that make trillion-parameter models available with cloud platform portability. Our team has published this work on arXiv as Perplexity's first research paper. Read more: https://t.co/SNdgWTeO8F
1 · 7 · 33
@abcdabcd987
Lequn Chen
1 month
Read more in the blog post!
research.perplexity.ai
Ultra-fast cross-GPU model sync
0 · 0 · 1
@abcdabcd987
Lequn Chen
1 month
We divide the weight transfer process into pipeline stages to enable overlapped execution over different hardware resources (CPU->GPU memcpy, GPU computation, RDMA, Ethernet).
1 · 0 · 2
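
The staging idea above can be pictured with a small sketch: each hardware resource gets its own worker, and chunks flow through bounded queues so the copy, cast, and network steps overlap. This is only an illustration with placeholder stage functions, not the actual implementation.

```python
import queue
import threading

def stage_worker(fn, inbox, outbox):
    """Pull chunks from `inbox`, process with `fn`, forward results to `outbox`."""
    while True:
        chunk = inbox.get()
        if chunk is None:              # sentinel: shut down and notify the next stage
            if outbox is not None:
                outbox.put(None)
            return
        result = fn(chunk)
        if outbox is not None:
            outbox.put(result)

# Placeholder stage functions: real ones would issue an async CPU->GPU memcpy,
# a dtype-cast kernel, and an RDMA/Ethernet send respectively (hypothetical here).
def host_to_device(chunk):
    return chunk                       # CPU -> GPU memcpy stage

def cast_weights(chunk):
    return chunk                       # GPU computation stage (e.g. fp32 -> bf16)

def rdma_send(chunk):
    return chunk                       # RDMA / Ethernet transfer stage

def transfer(weight_chunks, depth=4):
    # Bounded queues provide backpressure so no stage races far ahead of the others.
    q_copy, q_cast, q_send = (queue.Queue(maxsize=depth) for _ in range(3))
    workers = [
        threading.Thread(target=stage_worker, args=(host_to_device, q_copy, q_cast)),
        threading.Thread(target=stage_worker, args=(cast_weights, q_cast, q_send)),
        threading.Thread(target=stage_worker, args=(rdma_send, q_send, None)),
    ]
    for w in workers:
        w.start()
    for chunk in weight_chunks:        # feed chunks; stages overlap from here on
        q_copy.put(chunk)
    q_copy.put(None)                   # sentinel propagates through the pipeline
    for w in workers:
        w.join()

transfer(list(range(16)))              # 16 dummy chunks
```
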
@abcdabcd987
Lequn Chen
1 month
We use the one-sided RDMA WRITE primitive. It avoids any RPC overhead, temp buffers, or control logic. Inference GPUs won't even know that the weights have been updated.
1 · 0 · 1
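
A conceptual sketch of the one-sided pattern described above. `RdmaEndpoint` and `RemoteTensor` are hypothetical stand-ins for an ibverbs-style RDMA WRITE; the point they illustrate is that the receiver never executes any code.

```python
from dataclasses import dataclass

@dataclass
class RemoteTensor:
    """Where a weight tensor lives on the inference side (exchanged once at startup)."""
    addr: int      # remote virtual address of the registered buffer
    rkey: int      # remote memory-region key
    nbytes: int

class RdmaEndpoint:
    """Hypothetical wrapper around an RDMA queue pair."""
    def write(self, local_buf: bytes, remote: RemoteTensor) -> None:
        # Real code would post a work request with the RDMA WRITE opcode; the NIC
        # places the bytes directly into the remote GPU's memory. No RPC handler,
        # no temporary buffer, and no code runs on the inference side.
        assert len(local_buf) <= remote.nbytes

def push_weights(ep: RdmaEndpoint, new_weights: dict, remote_layout: dict) -> None:
    """Write every updated tensor straight into the inference workers' buffers."""
    for name, buf in new_weights.items():
        ep.write(buf, remote_layout[name])

# Illustration only: the address and rkey are made up.
layout = {"w": RemoteTensor(addr=0x7F0000000000, rkey=42, nbytes=8)}
push_weights(RdmaEndpoint(), {"w": b"\x00" * 8}, layout)
```
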
@abcdabcd987
Lequn Chen
1 month
The common approach is to gather all weights to training rank-0, then broadcast them to inference workers. Rank-0 becomes the bottleneck (capped at 400 Gbps = 50 GB/s). In contrast, we use a P2P pattern to ship weights from all training GPUs to all inference GPUs, leveraging the full network fabric.
1 · 0 · 1
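
Rough wire-time arithmetic behind the bottleneck claim. The parameter size and the sender-node count in the P2P case are illustrative assumptions, not figures from the thread.

```python
# Wire-time arithmetic only; the 32-sender-node figure is an illustrative assumption.
PARAM_BYTES = 1e12 * 2                     # ~1T parameters in bf16 ~= 2 TB
NIC_BYTES_PER_S = 400 / 8 * 1e9            # 400 Gbps NIC = 50 GB/s

t_rank0 = PARAM_BYTES / NIC_BYTES_PER_S    # every byte funnels through rank-0's NIC
NUM_SENDER_NODES = 32                      # assumed for the P2P case
t_p2p = PARAM_BYTES / (NIC_BYTES_PER_S * NUM_SENDER_NODES)

print(f"rank-0 broadcast wire time: ~{t_rank0:.0f} s")              # ~40 s
print(f"P2P over {NUM_SENDER_NODES} sender nodes: ~{t_p2p:.2f} s")  # ~1.25 s
```
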
@abcdabcd987
Lequn Chen
1 month
We recently achieved 1.3-second cross-machine parameter update for Kimi-K2 (1T parameters), as opposed to a few minutes in popular frameworks.
1 · 2 · 4
@vllm_project
vLLM
1 month
How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-head Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA) and scores every cached token against incoming queries. The top-2048 tokens are passed to Sparse MLA.
@deepseek_ai
DeepSeek
1 month
🚀 Introducing DeepSeek-V3.2-Exp: our latest experimental model! ✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. 👉 Now live on App, Web, and API. 💰 API prices cut by 50%+! 1/n
11 · 108 · 716
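
A toy sketch of the select-then-attend flow described in the tweet above: the indexer scores every cached token for the incoming query, the top-2048 survive, and attention runs only over those. Shapes and the scoring function are simplified stand-ins, not DeepSeek's or vLLM's actual kernels.

```python
import torch

def lightning_indexer_scores(q_idx: torch.Tensor, k_idx_cache: torch.Tensor) -> torch.Tensor:
    # q_idx: [idx_heads, d_idx]   query projected into the small indexer space
    # k_idx_cache: [seq, d_idx]   compact per-token indexer keys (the small cache)
    # Score every cached token for this query; sum over indexer heads.
    return torch.einsum("hd,sd->s", q_idx, k_idx_cache)

def sparse_attend(q: torch.Tensor, kv_cache: torch.Tensor, scores: torch.Tensor,
                  topk: int = 2048) -> torch.Tensor:
    # Keep only the highest-scoring tokens and attend over just those.
    k = min(topk, kv_cache.shape[0])
    top_idx = scores.topk(k).indices           # token positions chosen by the indexer
    selected = kv_cache[top_idx]               # [k, d] gathered KV entries
    attn = torch.softmax(q @ selected.T / selected.shape[-1] ** 0.5, dim=-1)
    return attn @ selected                     # attention output for this query

seq_len, d_idx, d_kv = 8192, 128, 512
scores = lightning_indexer_scores(torch.randn(4, d_idx), torch.randn(seq_len, d_idx))
print(sparse_attend(torch.randn(d_kv), torch.randn(seq_len, d_kv), scores).shape)
```
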
@perplexity_ai
Perplexity
2 months
Introducing Perplexity Search API. We've built a search index of billions of webpages to provide real-time, quality information from the web. Now developers have access to the full power of our index, providing the most accurate results in milliseconds. https://t.co/TDOT8vnWxA
101 · 259 · 2K
@anyscalecompute
Anyscale
2 months
Just got a sneak peek of the breakout sessions lineup for #RaySummit2025 – and it's 🔥 Sessions from: 🔹 @character_ai on Scaling LLM Post-Training 🔹 The State of @vllm_project in 2025 🔹 @Roblox on Training 3D Foundation Models with Ray 🔹 @xai on Scaling Image + Video
1 · 3 · 12
@abcdabcd987
Lequn Chen
2 months
1.5 seconds is long enough to transfer model weights from training nodes to RL rollout nodes (as opposed to 100s). Here's the full story of how I made it (not just presenting the solution): https://t.co/6zaFAeNICT
8 · 111 · 482
@abcdabcd987
Lequn Chen
3 months
GPT-OSS is fast. What's even faster? The development velocity that our inference stack enables! Read More: https://t.co/Jg2BjhrPIv
1 · 5 · 37
@tqchenml
Tianqi Chen
5 months
Check out the technical deep dive on FlashInfer
@NVIDIAAIDev
NVIDIA AI Developer
5 months
๐Ÿ” Our Deep Dive Blog Covering our Winning MLSys Paper on FlashInfer Is now live โžก๏ธ https://t.co/nKHWvRSchK Accelerate LLM inference with FlashInferโ€”NVIDIAโ€™s high-performance, JIT-compiled library built for ultra-efficient transformer inference on GPUs. Go under the hood with
0 · 5 · 29
@abcdabcd987
Lequn Chen
6 months
I prefer this UI (Win2003 is even better) to today's UI. Today's UI feels inconsistent, the whitespace is too big, and info is hidden in nested menus. Screens and resolutions get bigger and bigger, but information density gets lower and lower.
@PERFECTL00P
░ perfectloop ░
6 months
🇴 🇻 🇪 🇷 🇱 🇴 🇦 🇩 🪟🪟🪟🫨
0 · 0 · 5
@NVIDIAAIDev
NVIDIA AI Developer
6 months
🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and
github.com
FlashInfer: Kernel Library for LLM Serving. Contribute to flashinfer-ai/flashinfer development by creating an account on GitHub.
4 · 45 · 199
@abcdabcd987
Lequn Chen
7 months
It has been such a wonderful year at @perplexity_ai. Keep building 😆
13 · 3 · 896
@abcdabcd987
Lequn Chen
7 months
Lower latency and higher throughput – get both with multi-node deployment for MoE models like DeepSeek-V3/R1.
@PPLXDevs
Perplexity Developers
7 months
@perplexity_ai DeepSeek-V3/R1 contains 671B total parameters but activates only 37B per token. Testing shows EP128 configurations deliver up to 5x higher throughput at equivalent output speeds compared to single-node deployments. Higher EP values assign fewer experts per GPU, reducing memory
0 · 8 · 30
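
Rough arithmetic on why higher expert parallelism (EP) frees memory, assuming DeepSeek-V3's 256 routed experts per MoE layer and treating a single 8-GPU node as EP8; shared experts are ignored for simplicity.

```python
# Experts per GPU at different expert-parallel (EP) degrees,
# assuming 256 routed experts per MoE layer (shared experts ignored).
ROUTED_EXPERTS = 256

for ep in (8, 16, 32, 64, 128):
    print(f"EP{ep:>3}: {ROUTED_EXPERTS // ep:>2} routed experts per GPU per layer")
# EP8 (one 8-GPU node) holds 32 experts per GPU per layer; EP128 holds only 2,
# leaving much more HBM for KV cache and larger batches.
```
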
@abcdabcd987
Lequn Chen
7 months
10x faster than PyTorch all-to-all. 2x faster than DeepEP on a single node. Although 2x slower than DeepEP on 128 GPUs, our implementation is less picky about hardware and software. Make your MoE go brrr
github.com
Perplexity GPU Kernels. Contribute to perplexityai/pplx-kernels development by creating an account on GitHub.
@PPLXDevs
Perplexity Developers
7 months
We've built custom NVSHMEM-based kernels for Mixture-of-Experts (MoE) models that deliver up to 10x faster communication than standard all-to-all operations. Our approach balances performance with adaptability across different hardware configurations.
0 · 17 · 154
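
For reference, the baseline being compared against is roughly the variable-size all-to-all token dispatch below (plain torch.distributed, launched with torchrun). Token counts and hidden size are illustrative, and this is not the pplx-kernels API.

```python
import torch
import torch.distributed as dist

def moe_dispatch(tokens: torch.Tensor, send_counts: torch.Tensor) -> torch.Tensor:
    """tokens: [sum(send_counts), hidden], grouped by destination rank.
    send_counts[r] = number of tokens this rank routes to experts living on rank r."""
    world = dist.get_world_size()
    # 1) Exchange per-rank counts so each rank knows how much it will receive.
    recv_counts = torch.empty(world, dtype=torch.int64, device=tokens.device)
    dist.all_to_all_single(recv_counts, send_counts)
    # 2) Exchange the tokens themselves with variable split sizes.
    out = torch.empty(int(recv_counts.sum()), tokens.shape[1],
                      dtype=tokens.dtype, device=tokens.device)
    dist.all_to_all_single(out, tokens,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())
    return out   # grouped by source rank, ready for the local expert MLPs

if __name__ == "__main__":          # launch with: torchrun --nproc-per-node=8 this_file.py
    dist.init_process_group("nccl")
    dev = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
    world, hidden = dist.get_world_size(), 7168
    send_counts = torch.full((world,), 4, dtype=torch.int64, device=dev)  # 4 tokens per peer
    tokens = torch.randn(int(send_counts.sum()), hidden, device=dev)
    print(moe_dispatch(tokens, send_counts).shape)
    dist.destroy_process_group()
```
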
@abcdabcd987
Lequn Chen
9 months
We are building our in-house LLM inference stack. Join us if this excites you! And I have a more in-depth tutorial on achieving 3200 Gbps here:
le.qun.ch
Earlier this year, I had the fortune of joining Perplexity AI, where I finally got to use servers with the most powerful configuration: AWS p5 instances equipped with 8 NVIDIA H100 GPUs interconnect...
@perplexity_ai
Perplexity
9 months
Using a custom RDMA-based networking library, we've been able to achieve 3200 Gbps GPU memory transfers, bypassing NCCL limits for 97.1% theoretical bandwidth efficiency. Our latest blog shares our journey of building a custom high-performance networking solution on AWS.
3 · 23 · 254
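
Back-of-envelope numbers behind the 3200 Gbps / 97.1% claim; the 32 × 100 Gbps EFA split is the commonly cited p5 configuration and is assumed here rather than taken from the tweet.

```python
# Aggregate goodput implied by the numbers in the tweet.
LINE_RATE_GBPS = 3200                        # p5 node: assumed 32 x 100 Gbps EFA devices
EFFICIENCY = 0.971                           # claimed fraction of theoretical bandwidth
NUM_GPUS = 8

goodput_gbps = LINE_RATE_GBPS * EFFICIENCY
goodput_gb_s = goodput_gbps / 8              # Gbps -> GB/s
print(f"{goodput_gbps:.0f} Gbps ~= {goodput_gb_s:.0f} GB/s per node, "
      f"~{goodput_gb_s / NUM_GPUS:.1f} GB/s for each GPU's 400 Gbps share")
# 3107 Gbps ~= 388 GB/s per node, ~48.6 GB/s per GPU
```
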