Luis Ceze

@luisceze

Followers
4K
Following
5K
Media
156
Statuses
1K

computer architect. marveled by biology. professor @uwcse. ceo @OctoAICloud. venture partner @madronaventures.

Joined May 2010
@ye_combinator
Zihao Ye
4 months
We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to @lmsysorg’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to
@NVIDIAAIDev
NVIDIA AI Developer
4 months
🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and
15
37
233
@ying11231
Ying Sheng
4 months
Congrats to @ye_combinator @tqchenml @luisceze! FlashInfer has been the real power behind various inference frameworks! Hope to see more people join the community and build their own inference engines on top of it!
@ye_combinator
Zihao Ye
4 months
We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to @lmsysorg’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to
1
4
54
@luisceze
Luis Ceze
4 months
🚀🎉
@NVIDIAAIDev
NVIDIA AI Developer
4 months
🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and
1
3
11
@mjsMLP
Mahmoud Soliman
5 months
@0xA95 @seanprime7 @vinodg’s work is finally out. Kick the tires and let them know what you think!
@cgarciae88
Cristian Garcia
5 months
new JAX MPMD library from Nvidia
Tweet media one
1
1
6
@ye_combinator
Zihao Ye
6 months
LLMs are not all about tensor cores: categorical sampling under filters (top-p/top-k/min-p) is a critical operator in LLMs as vocabulary sizes grow. FlashInfer uses a sorting-free rejection sampling algorithm for efficient sampling. Check out this great blog post written by @0xsling0
@shanli_xing
Shanli Xing
6 months
🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: https://t.co/R780Rth03x
Tweet media one
0
9
39
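The sorting-free idea in the tweets above can be sketched in plain NumPy: instead of sorting the vocabulary to build the top-p nucleus, draw from the full distribution and accept a token only if the mass of strictly more probable tokens is below p. This is a minimal single-token illustration under assumed names (`top_p_sample` is made up), not FlashInfer's actual batched GPU kernel.

```python
import numpy as np

def top_p_sample(probs, p, rng, max_iters=1000):
    """Sorting-free top-p (nucleus) sampling via rejection.

    Sample from the full categorical distribution, then accept the
    token only if it lies inside the nucleus: the cumulative mass of
    strictly more probable tokens must be < p. Rejected draws are
    simply retried, so no sort or renormalization is ever needed.
    """
    for _ in range(max_iters):
        i = rng.choice(len(probs), p=probs)          # one categorical draw
        mass_above = probs[probs > probs[i]].sum()   # a reduction, not a sort
        if mass_above < p:                           # token i is in the nucleus
            return int(i)
    raise RuntimeError("rejection sampling did not converge")

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.1, 0.07, 0.03])
samples = [top_p_sample(probs, p=0.75, rng=rng) for _ in range(2000)]
# with p=0.75 the nucleus is {0, 1}; tokens 2-4 are always rejected
assert set(samples) <= {0, 1}
```

Rejection from the full distribution restricted by an acceptance test yields exactly the renormalized nucleus distribution, which is why no explicit renormalization step appears.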
@shanli_xing
Shanli Xing
6 months
🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: https://t.co/R780Rth03x
Tweet media one
1
33
181
@tqchenml
Tianqi Chen
6 months
Learn more about the latest advances in AI and systems, including LLM serving, efficient attentions, structured outputs, scaling up training, and more topics. Check out #MLSys2025. Accepted papers at https://t.co/sTsbrxWHlw and register today at https://t.co/2iRbuiDirc
Tweet media one
4
25
103
@ye_combinator
Zihao Ye
6 months
Check out the intra-kernel profiler in flashinfer to visualize the timeline of each SM/warpgroup in the lifecycle of a CUDA persistent kernel: https://t.co/aA8Mbe7nyq You can clearly see how tensor/CUDA core overlapping, variable-length load balancing, and fusion work.
Tweet media one
2
31
146
@luisceze
Luis Ceze
9 months
Amazing to see Flashinfer’s traction in the short 8mo since it was first introduced. Try out the latest release.
@ye_combinator
Zihao Ye
9 months
We are excited to announce FlashInfer v0.2! Core contributions of this release include: - Block/Vector Sparse (Paged) Attention on FlashAttention-3 - JIT compilation for customized attention variants - Fused Multi-head Latent Attention (MLA) decoding kernel - Lots of bugfixes and
Tweet media one
0
2
19
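As a rough illustration of what "paged" attention in the release notes above means: the KV cache lives in fixed-size pages inside a shared pool, and a per-request page table maps logical blocks to physical pages, so sequences of any length share one contiguous allocation. A toy single-head NumPy sketch with hypothetical names and layout — not FlashInfer's API:

```python
import numpy as np

PAGE_SIZE, NUM_PAGES, HEAD_DIM = 4, 8, 16
rng = np.random.default_rng(0)

# physical KV pool: NUM_PAGES pages of PAGE_SIZE tokens each (toy layout)
k_pool = rng.standard_normal((NUM_PAGES, PAGE_SIZE, HEAD_DIM))
v_pool = rng.standard_normal((NUM_PAGES, PAGE_SIZE, HEAD_DIM))

def paged_decode_attention(q, page_table, seq_len):
    """Single-head decode-step attention over a paged KV cache.

    page_table lists this request's physical page ids in logical order;
    the last page may be partially filled, so we trim to seq_len.
    """
    k = k_pool[page_table].reshape(-1, HEAD_DIM)[:seq_len]  # gather pages
    v = v_pool[page_table].reshape(-1, HEAD_DIM)[:seq_len]
    scores = k @ q / np.sqrt(HEAD_DIM)        # (seq_len,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # softmax over the sequence
    return w @ v                              # (HEAD_DIM,) attention output

q = rng.standard_normal(HEAD_DIM)
out = paged_decode_attention(q, page_table=np.array([3, 0, 5]), seq_len=10)
```

The pages a request uses need not be contiguous or ordered in the pool (here pages 3, 0, 5), which is what lets a serving engine allocate cache block-by-block as sequences grow.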
@luisceze
Luis Ceze
11 months
Fascinating to read this analysis of how telenovelas have such a deep impact on real-world culture — I’m Brazilian :). As a computer scientist, reading TRIBAL by @MichaelMorrisCU makes me wonder about culture’s impact on AI and its co-evolution with human culture.
@MichaelMorrisCU
Michael Morris, Professor at Columbia University
11 months
📺Day 7: Fictional Characters and Real Change 📺 From Will & Grace to Brazilian telenovelas, widely watched dramas can precipitate dramatic cultural shifts. NGOs promoting public health changes have employed serial dramas to shift cultural ideals and personal decisions. But
Tweet media one
0
0
8
@luisceze
Luis Ceze
1 year
Great to see @OctoAICloud second only to @GroqInc -- impressive given our service runs on off-the-shelf cloud @nvidia hardware. It is all about carefully balancing speed, quality, and cost from a whole-system, cross-stack perspective.
@altryne
Alex Volkov (Thursd/AI)
1 year
Wanna know whether different LLM providers serve the same LLama 3.1 70B? I sure did! So I ran a quick eval to get some surprising results + open sourced my code 👇 Check out my comparison between @GroqInc @FireworksAI_HQ @OctoAICloud @DeepInfra and @togethercompute
1
2
11
@luisceze
Luis Ceze
1 year
Huge achievement by the @AIatMeta team on launching the Llama 3.1 models!  The quality benchmarks look incredible, our customers are going to be really excited for the whole Llama 3.1 herd. Learn more and try them on @OctoAICloud here: https://t.co/BB1lZZpKsT. 🙏🚀🐙
@AIatMeta
AI at Meta
1 year
Starting today, open source is leading the way. Introducing Llama 3.1: Our most capable models yet. Today we’re releasing a collection of new Llama 3.1 models including our long awaited 405B. These models deliver improved reasoning capabilities, a larger 128K token context
0
0
9
@TiernanRayTech
Tiernan Ray
1 year
More political deepfakes exist than you think, according to this AI expert With so many elections happening globally this year, TrueMedia founder Oren Etzioni hopes the company's deepfake detection tool can help reduce disinformation. Here's how. https://t.co/FxPGZqKsGo
Tweet media one
1
2
8
@luisceze
Luis Ceze
1 year
Go @abcdabcd987 (Lequn Chen)! Great work on making lots of LoRAs cheap to serve. Nice collaboration with @ye_combinator @arvind_uw and others! #mlsys24 https://t.co/6TuHxC7R4C
Tweet media one
0
2
20
@luisceze
Luis Ceze
1 year
Great work Yilong, @cylinbao @ye_combinator @bariskasikci and team!
@tqchenml
Tianqi Chen
1 year
Atom: low-bit quantization for efficient and accurate LLM serving. #MLSys2024 bringing efficient and accurate 4bit inference for serving scenarios.
Tweet media one
0
0
4
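To make the "4bit inference" claim above concrete, here is a toy symmetric per-group quantizer in NumPy. This is a generic sketch of group-wise low-bit quantization, not Atom's actual scheme (which additionally handles outliers and mixes precisions); all names here are made up.

```python
import numpy as np

def quantize_4bit(w, group_size=8):
    """Symmetric per-group 4-bit quantization (toy sketch).

    Each group of weights shares one floating-point scale; values are
    rounded to integers in [-7, 7], which fit in 4 bits (one code spare).
    """
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0   # per-group scale
    q = np.clip(np.round(g / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate weights from 4-bit codes and scales."""
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
# per-element reconstruction error is at most half a quantization step
assert np.all(np.abs(w - w_hat) <= s.repeat(8) / 2 + 1e-12)
```

Smaller groups give tighter scales (lower error) at the cost of more scale metadata to store and fetch, which is the central trade-off in group-wise quantization schemes.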
@tqchenml
Tianqi Chen
1 year
#Llama3 🦙🦙 running fully locally on iPad without internet connection. credits to @ruihanglai and the team
0
15
73
@tqchenml
Tianqi Chen
1 year
It is amazing how cheap we can go when it comes to running #Llama3 models from @AIatMeta , running on a $100 Orange Pi
@mengshyu
Mengshiun
1 year
Deploy #Llama3 on $100 Orange Pi with GPU acceleration through MLC LLM. Try it out on your Orange Pi 👉 https://t.co/zSJDE3GwUV
Tweet media one
Tweet media two
1
13
69
@alliekmiller
Allie K. Miller
1 year
Fine-tuned open-sourced models are giving the AI giants a run for their money. @mattshumer_, CEO of HyperWrite, and I sat down with @OctoAICloud to talk about the major trends impacting fast-growing AI startups across open source, cost savings, and flexibility. ⏩️ This is
1
11
42
@luisceze
Luis Ceze
1 year
Our SaaS customers love our full-stack approach to generative AI inference that is reliable, customizable, and efficient. OctoStack offers all these benefits directly in your environment - ultra-fast inference, model orchestration, and optimized up/down the stack. 🚀🐙
0
0
3
@luisceze
Luis Ceze
1 year
Same applies to AI-assisted scientific discovery - it fundamentally needs new external inputs to absorb new observations of the universe. So until we are comfortable with automatic experimentation in the real world, I suspect true breakthroughs would be latent.
0
0
0