Tyler Michael Smith
@tms_jr
Followers
140
Following
1K
Media
78
Statuses
1K
High Performance Computing @neuralmagic | Committer @vllm_project | PhD @UTAustin | Music Enjoyer
Boston, MA
Joined September 2011
Our first official vLLM Meetup is coming to Europe on Nov 6! Meet vLLM committers @mgoin_, @tms_jr, Thomas Parnell, + speakers from @RedHat_AI, @IBM, @MistralAI. Topics: vLLM updates, quantization, Mistral+vLLM, hybrid models, distributed inference. https://t.co/itAPmqaqHu
luma.com
Join Us for the First Official vLLM Meetup in Europe! Hosted by Red Hat, IBM, and Mistral AI, this event takes place on 6 November 2025 in Zürich and brings…
2
11
35
vLLM just hit 60K GitHub stars! From a small research idea to powering LLM inference everywhere (across NVIDIA, AMD, Intel, Apple, TPUs, and more), vLLM now supports almost all major text-generation models and native RL pipelines like TRL, Unsloth, Verl, and OpenRLHF.
11
49
490
Happy that InferenceMAX is here because it signals a milestone for vLLM's SOTA performance on NVIDIA Blackwell! It has been a pleasure to collaborate deeply with @nvidia in @vllm_project, and we have much more to do. Read about the work we did here:
blog.vllm.ai
Introduction
Today we are launching InferenceMAX! We have support from NVIDIA, AMD, OpenAI, Microsoft, PyTorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, and Dell. It runs every day on the latest software (vLLM, SGLang, etc.) across hundreds of GPUs, $10Ms of
4
21
95
Qwen3-VL is now ready for experimentation in Red Hat AI Inference Server with vLLM. https://t.co/e5iMafg0Kd This builds on our recent step-by-step guide to Qwen3-Next and extends support to the new vision-language model, Qwen3-VL: https://t.co/qsOLageHko What's next for our
catalog.redhat.com
rhaiis-preview/vllm-cuda-rhel9
1
4
19
How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-head Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA) and scores the cached tokens against each incoming query. The top-2048 tokens are then passed to Sparse MLA.
Introducing DeepSeek-V3.2-Exp, our latest experimental model! Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. Now live on App, Web, and API. API prices cut by 50%+! 1/n
11
108
716
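The selection step described in the tweet above can be sketched in a few lines. This is only a minimal illustration of the idea, not DeepSeek's implementation: the shapes, the dot-product scoring, and the function name are simplified assumptions.

```python
# Minimal sketch of the DSA selection step described above. Shapes and scoring
# are simplified and illustrative; this is not DeepSeek's actual code.
import torch

def sparse_attention_select(query, indexer_keys, top_k=2048):
    """Score every cached token with the small indexer key cache and return
    the top-k token indices that sparse MLA should attend to."""
    # indexer_keys: [seq_len, 128]  small per-token key cache (vs. 512 for MLA)
    # query:        [128]           indexer projection of the incoming query
    scores = indexer_keys @ query              # one relevance score per cached token
    k = min(top_k, scores.shape[0])
    return torch.topk(scores, k).indices       # MLA attends only to these tokens

# Toy usage: 8192 cached tokens, indexer dim 128, keep the top 2048.
keys = torch.randn(8192, 128)
q = torch.randn(128)
print(sparse_attention_select(q, keys).shape)  # torch.Size([2048])
```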
Just enabled full cudagraphs by default on @vllm_project! This change should offer a huge improvement for low-latency workloads on small models and efficient MoEs. For Qwen3-30B-A3B-FP8 on H100 at bs=10, 1024/128, I was able to see a speedup of 47%
6
7
67
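For context, "bs=10 1024/128" means 10 concurrent requests with roughly 1024 input tokens and 128 output tokens each. A rough offline reproduction of that shape might look like the sketch below using vLLM's Python API; the filler prompts and timing are illustrative, and since the change above makes full cudagraphs the default, no extra configuration is shown.

```python
# Rough sketch of the benchmark shape quoted above: batch size 10,
# ~1024 input tokens, 128 output tokens, on vLLM's offline API.
# Prompts and timing are illustrative, not the author's actual harness.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-30B-A3B-FP8")        # run on a single H100 in the tweet
params = SamplingParams(max_tokens=128, ignore_eos=True)

# 10 prompts of roughly 1024 tokens each (repeated filler text as a stand-in).
prompts = ["hello " * 1024] * 10

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
total_out = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generated {total_out} output tokens in {elapsed:.2f}s")
```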
In 1991, David Lynch showed the world the alienation and innate horror of a dirty street, directing this unforgettable anti-littering ad for the City of New York. RIP to a visionary filmmaker and a pioneer of the Trash Revolution.
44
2K
9K
Iโm thrilled to announce that Neural Magic has signed a definitive agreement to join forces with Red Hat, Inc. At Neural Magic our vision is that the future of AI is open, and we have been on a mission to enable enterprises to capture the powerful innovation from AI, while at
17
35
128
a fact of the world that we have to live with: models when "jailbroken" seem to have a distinct personality and artistic capability well beyond anything they produce in their default mood. this might be the most important alignment work in the world and is mostly done on discord
130
188
4K
Read to learn about Machete, which will serve as a foundation for mixed-input quantized GEMMs on NVIDIA GPUs (Hopper and later!) inside of vLLM. Excellent work and stellar animations by Lucas Wilkinson (https://t.co/YjoZvUzDHY)
github.com
LucasWilkinson has 33 repositories available. Follow their code on GitHub.
1/8 Introducing Machete, our new mixed-input GEMM kernel for NVIDIA Hopper GPUs! It brings 29% faster input and 32% faster output token throughput for Llama 3.1 70B, with a TTFT of <250ms on a single H100 GPU. Here's how it boosts LLM serving performance
0
0
2
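As background on what "mixed-input" means in the two posts above: the activations stay in 16-bit while the weights are stored in 4-bit and expanded on the fly. The unfused reference below only illustrates that math; the per-group scale layout is an assumption for illustration, and Machete's actual kernel fuses the dequantize and matmul into a single Hopper kernel.

```python
# Naive reference for a mixed-input (w4a16-style) GEMM: 16-bit activations times
# int4 weights dequantized with per-group scales. This unfused version only
# shows the math Machete fuses; float32 is used here so the sketch runs on CPU.
import torch

def mixed_input_gemm_ref(x, w_int4, scales, group_size=128):
    # x:       [m, k] activations (fp16/bf16 in a real deployment)
    # w_int4:  [k, n] int4 weights stored as int8 values in [-8, 7]
    # scales:  [k // group_size, n] per-group dequantization scales
    k, n = w_int4.shape
    w = w_int4.to(x.dtype)
    w = w.reshape(k // group_size, group_size, n) * scales[:, None, :]
    return x @ w.reshape(k, n)

m, k, n, g = 4, 256, 512, 128
x = torch.randn(m, k)
w_q = torch.randint(-8, 8, (k, n), dtype=torch.int8)
s = torch.rand(k // g, n)
print(mixed_input_gemm_ref(x, w_q, s, g).shape)  # torch.Size([4, 512])
```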
Quantization update! Transformers is now compatible with models quantized with the llm-compressor library from @vllm_project or models in compressed-tensors format. This means that you can also enjoy high-quality quantized models from the @neuralmagic team!
2
16
64
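A minimal sketch of what that compatibility looks like in practice, assuming a compressed-tensors checkpoint on the Hugging Face Hub; the model ID below is only an example, not an endorsement of a specific checkpoint.

```python
# Minimal sketch: loading a compressed-tensors / llm-compressor quantized
# checkpoint directly with Transformers. The model ID is illustrative; any
# checkpoint in compressed-tensors format should load the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"  # example ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The key benefit of weight quantization is",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```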
if they play the same song more than once the concert is going to be great case in point: https://t.co/RfQK5VJxgG
0
0
1
then in the encore he covered my heart will go on 10/10 very good show extremely good bit
1
0
1
last night i saw Mk.gee - totally solid show that got Good when he played the same song 3 times in a row - and then he played it one more time
1
0
2
this is an MJ Lenderman stan account https://t.co/cxa5unubWB
0
0
0
A month ago, we announced our performance roadmap. Today, we are happy to share that the latest release achieves 2.7x higher throughput and is 5x faster for output latency on Llama 8B, and 1.8x higher throughput and 2x faster on Llama 70B on H100s. https://t.co/QWTT5cyvKw
blog.vllm.ai
TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on the Llama 8B model, and 1.8x higher throughput and 2x lower TPOT on the Llama 70B model.
14
69
378
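For readers unfamiliar with the metric in the TL;DR above: TPOT is simply decode time divided by the number of output tokens, so "5x faster TPOT" means each output token streams roughly five times sooner. The numbers in the sketch below are made up and only show how the metric is computed, not the benchmark's actual measurements.

```python
# Illustrative TPOT (time per output token) calculation; numbers are made up
# and only demonstrate how the metric in the post above is defined.
def tpot(decode_time_s: float, output_tokens: int) -> float:
    return decode_time_s / output_tokens

before = tpot(decode_time_s=6.40, output_tokens=128)   # 50 ms/token
after = tpot(decode_time_s=1.28, output_tokens=128)    # 10 ms/token
print(f"before: {before * 1000:.1f} ms/token, "
      f"after: {after * 1000:.1f} ms/token, "
      f"speedup: {before / after:.1f}x")
```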
Last week's vLLM office hours recording is ready! @tms_jr showed how to use NVIDIA CUTLASS for high-performance inference in @vllm_project. We also explored the exciting vLLM v0.6.0 updates that led to a 2.7x throughput boost and 5x latency improvement. Recording & slides below.
2
5
11