Jian Chen
@jianchen1799
Followers: 130 · Following: 56 · Media: 1 · Statuses: 19
Ph.D. Student @HDSIUCSD; Efficient ML systems and algorithms
San Diego, CA
Joined September 2024
Thrilled to share my first work as a PhD student! We propose a new home for Diffusion LLMs: not as competitors to AR models, but as ultra-fast drafters. DFlash is lightweight, cheap to run, and very effective (up to 6x speedup). It's super easy to set up, so give it a try!
Holiday cooking finally ready to serve! Introducing DFlash: speculative decoding with block diffusion. 6.2× lossless speedup on Qwen3-8B, 2.5× faster than EAGLE-3. Diffusion vs AR doesn't have to be a fight. At today's stage: • dLLMs = fast, highly parallel, but lossy
3 replies · 5 reposts · 19 likes
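For readers curious how a diffusion drafter plugs into speculative decoding, here is a minimal sketch of the general recipe the thread describes: a cheap drafter proposes a block of tokens and the target model verifies them in one forward pass. The names (draft_block, the HF-style .logits access) are illustrative placeholders, not DFlash's actual API.

```python
# Minimal sketch of block-wise speculative decoding with a fast drafter.
# All names below are hypothetical placeholders, not DFlash's real interface.
import torch

def speculative_step(target_model, drafter, prefix_ids, block_size=8):
    """Propose a block with the cheap drafter, verify it with one target pass."""
    # 1) Drafter proposes `block_size` tokens in parallel (e.g. via block diffusion).
    draft_ids = drafter.draft_block(prefix_ids, block_size)          # (block_size,)

    # 2) Target model scores prefix + draft in a single forward pass.
    candidate = torch.cat([prefix_ids, draft_ids])
    logits = target_model(candidate.unsqueeze(0)).logits[0]          # (len, vocab)

    # 3) Greedy verification: accept the longest prefix of the draft that the
    #    target itself would have produced, which keeps decoding lossless.
    accepted = []
    for i, tok in enumerate(draft_ids):
        target_tok = logits[len(prefix_ids) + i - 1].argmax()
        if target_tok.item() != tok.item():
            accepted.append(target_tok)       # take the target's correction and stop
            break
        accepted.append(tok)
    return torch.cat([prefix_ids, torch.stack(accepted)])
```

Because decoding is typically memory-bound, scoring the whole drafted block costs about as much as one ordinary decode step, which is where a lossless speedup of this kind comes from.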
Think Wider. Think Shorter. Introducing Multiplex Thinking: token-wise branch-and-merge reasoning for LLMs. Discrete CoT is costly. Existing continuous tokens often clash with on-policy RL
22 replies · 91 reposts · 672 likes
Exciting news: DFlash is now on SGLang! We are unlocking new possibilities to accelerate LLM inference. Stay tuned: more draft models and optimizations are dropping soon that will seriously speed up your workflows!
Speed of flash. Just 2 days after launch, DFlash is already running in SGLang (@sgl_project). With serving-engine support, we can now unlock speedup at higher concurrency, and we've quickly built a new demo on top of it. More releases coming in the next few weeks. We're
0 replies · 1 repost · 3 likes
One of the best ways to reduce LLM latency is by fusing all computation and communication into a single GPU megakernel. But writing megakernels by hand is extremely hard. Introducing Mirage Persistent Kernel (MPK), a compiler that automatically transforms LLMs into optimized
14 replies · 125 reposts · 774 likes
We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46%. Website: https://t.co/J9osByhWUf 1/n
6 replies · 84 reposts · 222 likes
Happy to share our new work, Kinetics: Rethinking Test-Time Scaling Laws. How to effectively build a powerful reasoning agent? Existing compute-optimal scaling laws suggest 64K thinking tokens + a 1.7B model > a 32B model. But that only shows half of the picture! The O(N²)
7 replies · 69 reposts · 247 likes
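A rough back-of-the-envelope for the O(N²) point above: with very long chains of thought, attention over the growing KV cache can rival the cost of the weights themselves. The layer and width numbers below are assumed placeholders for a roughly 1.7B-parameter model, not the paper's actual accounting.

```python
def decode_flops(n_params, n_layers, d_model, n_tokens):
    param_term = 2 * n_params * n_tokens                # O(N): weight multiplies
    attn_term = sum(4 * n_layers * d_model * t          # O(N^2): QK^T and attn*V per step
                    for t in range(n_tokens))
    return param_term, attn_term

# Assumed shape for a ~1.7B model: 28 layers, hidden size 2048 (placeholders).
param_term, attn_term = decode_flops(1.7e9, 28, 2048, 64_000)
print(f"parameter FLOPs ~{param_term:.1e}, attention FLOPs ~{attn_term:.1e}")
```

For these placeholder numbers, at 64K generated tokens the quadratic attention term already exceeds the parameter term, which is the "other half of the picture" the tweet refers to.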
Why Do Multi-Agent LLM Systems Fail? Introducing MAST: the first multi-agent failure taxonomy. It consists of 14 failure modes in 3 categories and generalizes across diverse multi-agent systems and tasks! Paper: https://t.co/BC5YHS8ZRZ Code: https://t.co/Ea1FvGcaLs 1/n
7 replies · 59 reposts · 217 likes
Thrilled to present our Spotlight at #ICLR2025: "MagicPIG: LSH Sampling for Efficient LLM Generation" by @RJ_Sadhukhan. MagicPIG enables KV compression for long-context LLMs: where top-k falls short, sampling shines. Introduces CPU-GPU heterogeneous serving to boost
1 reply · 5 reposts · 10 likes
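A toy illustration of the sampling idea: hash keys with random hyperplanes (SimHash) and attend only over keys whose signatures collide with the query's, instead of exact top-k. This is a simplified candidate-selection sketch, not MagicPIG's actual estimator or its CPU-GPU serving design.

```python
# Toy LSH-sampled attention over a KV cache (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, n_bits = 64, 10_000, 12
keys = rng.standard_normal((n_keys, d)).astype(np.float32)
values = rng.standard_normal((n_keys, d)).astype(np.float32)
planes = rng.standard_normal((n_bits, d)).astype(np.float32)   # SimHash hyperplanes

def simhash(x):
    return x @ planes.T > 0          # boolean signature, one bit per hyperplane

key_sig = simhash(keys)              # (n_keys, n_bits)

def sampled_attention(q, max_candidates=256):
    # Keys whose signatures agree with the query on many bits are likely to
    # have large q·k, so restrict attention to those candidates.
    matches = (simhash(q[None, :]) == key_sig).sum(axis=1)
    cand = np.argsort(-matches)[:max_candidates]
    scores = keys[cand] @ q / np.sqrt(d)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ values[cand]      # approximate attention output

out = sampled_attention(rng.standard_normal(d).astype(np.float32))
```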
Low latency, high throughput, and lossless LLM inference, all at once. Who says you can't have your cake and eat it too? #MagicDec proves you can. MagicDec first shows that speculative decoding can boost throughput for moderate to long contexts, and the speedup grows
Promised blogpost + tweet about MagicDec-1.0 (2.0 coming soon): How can we achieve lossless, high-throughput, and low-latency LLM inference all at once? Seems too good to be true? Introducing MagicDec-1.0, a Speculative Decoding (SD) based technique that can improve
1 reply · 3 reposts · 7 likes
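One way to see the throughput claim: at long context, each decode step is dominated by reading the KV cache, and a single verification pass over several drafted tokens pays that read once while yielding multiple accepted tokens. The numbers below are made-up placeholders under a simple bandwidth-bound cost model, not measurements from the paper.

```python
def sec_per_accepted_token(weight_bytes, kv_bytes_per_ctx_token, ctx,
                           bw=2e12, accepted_per_step=1.0, draft_bytes_per_step=0.0):
    # Memory-bound model: time per step ~ bytes read / bandwidth.
    target_read = weight_bytes + kv_bytes_per_ctx_token * ctx   # weights + KV cache
    return (target_read + draft_bytes_per_step) / bw / accepted_per_step

for ctx in (8_000, 32_000, 128_000):
    ar = sec_per_accepted_token(16e9, 1.0e5, ctx)
    sd = sec_per_accepted_token(16e9, 1.0e5, ctx,
                                accepted_per_step=3.0,      # avg tokens accepted per verify
                                draft_bytes_per_step=4e9)   # drafter's own memory traffic
    print(f"context {ctx}: ~{ar / sd:.1f}x throughput from speculation")
```

In this toy model the ratio climbs toward the average acceptance length as the context grows, consistent with the tweet's point that the speedup grows with sequence length.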
RAG vs. Long-Context LLMs: The Real Battle. Turns out, simple-to-build RAG can match million-dollar long-context LLMs (LC LLMs) on most existing benchmarks. So, do we even need long-context models? YES. Because today's benchmarks are flawed: Too Simple
6 replies · 39 reposts · 188 likes