Luis Ceze

@luisceze

Followers
4K
Following
5K
Media
156
Statuses
1K

computer architect. marveled by biology. professor @uwcse. ceo @OctoAICloud. venture partner @madronaventures.

Joined May 2010
@ye_combinator
Zihao Ye
4 months
We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to @lmsysorg’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to
@NVIDIAAIDev
NVIDIA AI Developer
4 months
🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and
15
37
233
@ying11231
Ying Sheng
4 months
Congrats to @ye_combinator @tqchenml @luisceze! FlashInfer has been the real power behind various inference frameworks! Hope to see more people join the community and build their own inference engines on top of it!
@ye_combinator
Zihao Ye
4 months
We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community — huge thanks to @lmsysorg’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to
1
4
54
@luisceze
Luis Ceze
4 months
🚀🎉
@NVIDIAAIDev
NVIDIA AI Developer
4 months
🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and
1
3
11
@mjsMLP
Mahmoud Soliman
5 months
@0xA95 @seanprime7 @vinodg’s work is finally out. Kick the tires and let them know what you think!
@cgarciae88
Cristian Garcia
5 months
new JAX MPMD library from Nvidia
Tweet media one
1
1
6
@ye_combinator
Zihao Ye
6 months
LLMs are not all about tensor cores: categorical sampling under filters (top-p/top-k/min-p) is a critical operator in LLMs as vocabulary sizes grow. FlashInfer uses a sorting-free rejection sampling algorithm for efficient sampling. Check out this great blog post written by @0xsling0
@shanli_xing
Shanli Xing
6 months
🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: https://t.co/R780Rth03x
Tweet media one
0
9
39
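The sorting-free idea in the tweets above can be sketched in plain NumPy: instead of sorting the vocabulary to build the top-p nucleus, draw from the full distribution and accept a token only if the mass of strictly more probable tokens is below p. This is a minimal single-token illustration under assumed names (`top_p_sample` is made up), not FlashInfer's actual batched GPU kernel.

```python
import numpy as np

def top_p_sample(probs, p, rng, max_iters=1000):
    """Sorting-free top-p (nucleus) sampling via rejection.

    Sample from the full categorical distribution, then accept the
    token only if it lies inside the nucleus: the cumulative mass of
    strictly more probable tokens must be < p. Rejected draws are
    simply retried, so no sort or renormalization is ever needed.
    """
    for _ in range(max_iters):
        i = rng.choice(len(probs), p=probs)          # one categorical draw
        mass_above = probs[probs > probs[i]].sum()   # a reduction, not a sort
        if mass_above < p:                           # token i is in the nucleus
            return int(i)
    raise RuntimeError("rejection sampling did not converge")

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.1, 0.07, 0.03])
samples = [top_p_sample(probs, p=0.75, rng=rng) for _ in range(2000)]
# with p=0.75 the nucleus is {0, 1}; tokens 2-4 are always rejected
assert set(samples) <= {0, 1}
```

Rejection from the full distribution restricted by an acceptance test yields exactly the renormalized nucleus distribution, which is why no explicit renormalization step appears.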
@shanli_xing
Shanli Xing
6 months
🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: https://t.co/R780Rth03x
Tweet media one
1
33
181
@tqchenml
Tianqi Chen
6 months
Learn more about the latest advances in AI and systems, including LLM serving, efficient attentions, structured outputs, scaling up training, and more topics. Check out #MLSys2025. Accepted papers at https://t.co/sTsbrxWHlw and register today at https://t.co/2iRbuiDirc
Tweet media one
4
25
103
@ye_combinator
Zihao Ye
6 months
Check out the intra-kernel profiler in flashinfer to visualize the timeline of each SM/warpgroup in the lifecycle of a CUDA persistent kernel: https://t.co/aA8Mbe7nyq You can clearly see how tensor/CUDA core overlapping, variable-length load balancing, and fusion work.
Tweet media one
2
31
146
@luisceze
Luis Ceze
9 months
Amazing to see Flashinfer’s traction in the short 8mo since it was first introduced. Try out the latest release.
@ye_combinator
Zihao Ye
9 months
We are excited to announce FlashInfer v0.2! Core contributions of this release include: - Block/Vector Sparse (Paged) Attention on FlashAttention-3 - JIT compilation for customized attention variants - Fused Multi-head Latent Attention (MLA) decoding kernel - Lots of bugfixes and
Tweet media one
0
2
19
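As a rough illustration of what "paged" attention in the release notes above means: the KV cache lives in fixed-size pages inside a shared pool, and a per-request page table maps logical blocks to physical pages, so sequences of any length share one contiguous allocation. A toy single-head NumPy sketch with hypothetical names and layout — not FlashInfer's API:

```python
import numpy as np

PAGE_SIZE, NUM_PAGES, HEAD_DIM = 4, 8, 16
rng = np.random.default_rng(0)

# physical KV pool: NUM_PAGES pages of PAGE_SIZE tokens each (toy layout)
k_pool = rng.standard_normal((NUM_PAGES, PAGE_SIZE, HEAD_DIM))
v_pool = rng.standard_normal((NUM_PAGES, PAGE_SIZE, HEAD_DIM))

def paged_decode_attention(q, page_table, seq_len):
    """Single-head decode-step attention over a paged KV cache.

    page_table lists this request's physical page ids in logical order;
    the last page may be partially filled, so we trim to seq_len.
    """
    k = k_pool[page_table].reshape(-1, HEAD_DIM)[:seq_len]  # gather pages
    v = v_pool[page_table].reshape(-1, HEAD_DIM)[:seq_len]
    scores = k @ q / np.sqrt(HEAD_DIM)        # (seq_len,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # softmax over the sequence
    return w @ v                              # (HEAD_DIM,) attention output

q = rng.standard_normal(HEAD_DIM)
out = paged_decode_attention(q, page_table=np.array([3, 0, 5]), seq_len=10)
```

The pages a request uses need not be contiguous or ordered in the pool (here pages 3, 0, 5), which is what lets a serving engine allocate cache block-by-block as sequences grow.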
@luisceze
Luis Ceze
11 months
Fascinating to read this analysis of how telenovelas have such a deep impact on real-world culture — I’m Brazilian :). As a computer scientist, reading TRIBAL by @MichaelMorrisCU makes me wonder about culture’s impact on AI and its co-evolution with human culture.
@MichaelMorrisCU
Michael Morris, Professor at Columbia University
11 months
📺Day 7: Fictional Characters and Real Change 📺 From Will & Grace to Brazilian telenovelas, widely watched dramas can precipitate dramatic cultural shifts. NGOs promoting public health changes have employed serial dramas to shift cultural ideals and personal decisions. But
Tweet media one
0
0
8
@luisceze
Luis Ceze
1 year
Great to see @OctoAICloud second only to @GroqInc -- impressive given our service runs on off-the-shelf cloud @nvidia hardware. It is all about carefully balancing speed, quality, and cost from a whole-system, cross-stack perspective.
@altryne
Alex Volkov (Thursd/AI)
1 year
Wanna know whether different LLM providers serve the same LLama 3.1 70B? I sure did! So I ran a quick eval to get some surprising results + open sourced my code 👇 Check out my comparison between @GroqInc @FireworksAI_HQ @OctoAICloud @DeepInfra and @togethercompute
1
2
11
@luisceze
Luis Ceze
1 year
Huge achievement by the @AIatMeta team on launching the Llama 3.1 models!  The quality benchmarks look incredible, our customers are going to be really excited for the whole Llama 3.1 herd. Learn more and try them on @OctoAICloud here: https://t.co/BB1lZZpKsT. 🙏🚀🐙
@AIatMeta
AI at Meta
1 year
Starting today, open source is leading the way. Introducing Llama 3.1: Our most capable models yet. Today we’re releasing a collection of new Llama 3.1 models including our long awaited 405B. These models deliver improved reasoning capabilities, a larger 128K token context
0
0
9
@TiernanRayTech
Tiernan Ray
1 year
More political deepfakes exist than you think, according to this AI expert With so many elections happening globally this year, TrueMedia founder Oren Etzioni hopes the company's deepfake detection tool can help reduce disinformation. Here's how. https://t.co/FxPGZqKsGo
Tweet media one
1
2
8
@luisceze
Luis Ceze
1 year
Go @abcdabcd987 (Lequn Chen)! Great work on making lots of LoRAs cheap to serve. Nice collaboration with @ye_combinator @arvind_uw and others! #mlsys24 https://t.co/6TuHxC7R4C
Tweet media one
0
2
20
@luisceze
Luis Ceze
1 year
Great work Yilong, @cylinbao @ye_combinator @bariskasikci and team!
@tqchenml
Tianqi Chen
1 year
Atom: low-bit quantization for efficient and accurate LLM serving. #MLSys2024 bringing efficient and accurate 4bit inference for serving scenarios.
Tweet media one
0
0
4
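To make the "4bit inference" claim above concrete, here is a toy symmetric per-group quantizer in NumPy. This is a generic sketch of group-wise low-bit quantization, not Atom's actual scheme (which additionally handles outliers and mixes precisions); all names here are made up.

```python
import numpy as np

def quantize_4bit(w, group_size=8):
    """Symmetric per-group 4-bit quantization (toy sketch).

    Each group of weights shares one floating-point scale; values are
    rounded to integers in [-7, 7], which fit in 4 bits (one code spare).
    """
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0   # per-group scale
    q = np.clip(np.round(g / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate weights from 4-bit codes and scales."""
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
# per-element reconstruction error is at most half a quantization step
assert np.all(np.abs(w - w_hat) <= s.repeat(8) / 2 + 1e-12)
```

Smaller groups give tighter scales (lower error) at the cost of more scale metadata to store and fetch, which is the central trade-off in group-wise quantization schemes.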
@tqchenml
Tianqi Chen
1 year
#Llama3 🦙🦙 running fully locally on iPad without internet connection. credits to @ruihanglai and the team
0
15
73
@tqchenml
Tianqi Chen
1 year
It is amazing how cheap we can go when it comes to running #Llama3 models from @AIatMeta , running on a $100 Orange Pi
@mengshyu
Mengshiun
1 year
Deploy #Llama3 on $100 Orange Pi with GPU acceleration through MLC LLM. Try it out on your Orange Pi 👉 https://t.co/zSJDE3GwUV
Tweet media one
Tweet media two
1
13
69
@alliekmiller
Allie K. Miller
1 year
Fine-tuned open-sourced models are giving the AI giants a run for their money. @mattshumer_, CEO of HyperWrite, and I sat down with @OctoAICloud to talk about the major trends impacting fast-growing AI startups across open source, cost savings, and flexibility. ⏩️ This is
1
11
42
@luisceze
Luis Ceze
1 year
Our SaaS customers love our full-stack approach to generative AI inference that is reliable, customizable, and efficient. OctoStack offers all these benefits directly in your environment - ultra-fast inference, model orchestration, and optimized up/down the stack. 🚀🐙
0
0
3
@luisceze
Luis Ceze
1 year
Same applies to AI-assisted scientific discovery - it fundamentally needs new external inputs to absorb new observations of the universe. So until we are comfortable with automatic experimentation in the real world, I suspect true breakthroughs would be latent.
0
0
0