Tyler Michael Smith

@tms_jr

140 Followers · 1K Following · 78 Media · 1K Statuses

High Performance Computing @neuralmagic | Committer @vllm_project | PhD @UTAustin | Music Enjoyer

Boston, MA
Joined September 2011
@vllm_project
vLLM
19 days
Our first official vLLM Meetup is coming to Europe on Nov 6! 🇨🇭 Meet vLLM committers @mgoin_, @tms_jr, Thomas Parnell, + speakers from @RedHat_AI, @IBM, @MistralAI. Topics: vLLM updates, quantization, Mistral+vLLM, hybrid models, distributed inference https://t.co/itAPmqaqHu
luma.com
Join Us for the First Official vLLM Meetup in Europe! Hosted by Red Hat, IBM, and Mistral AI, this event takes place on 6 November 2025 in Zürich and brings…
2 replies · 11 reposts · 35 likes
@vllm_project
vLLM
1 month
🚀 vLLM just hit 60K GitHub stars! 🎉 From a small research idea to powering LLM inference everywhere, across NVIDIA, AMD, Intel, Apple, TPUs, and more, vLLM now supports almost all major text-generation models and native RL pipelines like TRL, Unsloth, Verl, and OpenRLHF.
11 replies · 49 reposts · 490 likes
@mgoin_
Michael Goin
1 month
Happy that InferenceMAX is here because it signals a milestone for vLLM's SOTA performance on NVIDIA Blackwell! 🥳 It has been a pleasure to deeply collaborate with @nvidia in @vllm_project, and we have much more to do. Read about the work we did here:
blog.vllm.ai
Introduction
@dylan522p
Dylan Patel
1 month
Today we are launching InferenceMAX! We have support from Nvidia, AMD, OpenAI, Microsoft, PyTorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, and Dell. It runs every day on the latest software (vLLM, SGLang, etc.) across hundreds of GPUs, $10Ms of…
4 replies · 21 reposts · 95 likes
@RedHat_AI
Red Hat AI
2 months
Qwen3-VL is now ready for experimentation in Red Hat AI Inference Server with vLLM. 👉 https://t.co/e5iMafg0Kd This builds on our recent step-by-step guide to Qwen3-Next and extends support to the new vision-language model, Qwen3-VL: https://t.co/qsOLageHko What's next for our…
catalog.redhat.com
rhaiis-preview/vllm-cuda-rhel9
1 reply · 4 reposts · 19 likes
@vllm_project
vLLM
2 months
How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA) and scores incoming queries against it; the top-2048 tokens are then passed to Sparse MLA. (A sketch of this selection step follows below.)
@deepseek_ai
DeepSeek
2 months
🚀 Introducing DeepSeek-V3.2-Exp, our latest experimental model! ✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. 👉 Now live on App, Web, and API. 💰 API prices cut by 50%+! 1/n
11 replies · 108 reposts · 716 likes
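A minimal PyTorch sketch of the selection step described in the DSA tweet above, assuming single-query decode and treating Sparse MLA as ordinary attention over the selected latent cache entries. All names, shapes, and the scoring function are illustrative assumptions, not DeepSeek's or vLLM's actual implementation.

```python
# Hypothetical sketch of DSA's token selection: a cheap indexer scores every
# cached token, and only the top-k survivors reach the expensive attention.
import torch

def sparse_attention_step(q_idx, k_idx_cache, q_mla, kv_mla_cache, top_k=2048):
    # q_idx:        [d_idx]           indexer query (small dim, e.g. 128)
    # k_idx_cache:  [seq_len, d_idx]  indexer key cache (128 per token)
    # q_mla:        [d_mla]           attention query (larger dim, e.g. 512)
    # kv_mla_cache: [seq_len, d_mla]  latent KV cache
    seq_len = k_idx_cache.shape[0]
    # 1. Lightning Indexer: cheap relevance score for every cached token.
    scores = k_idx_cache @ q_idx                       # [seq_len]
    # 2. Keep only the top-k highest-scoring tokens.
    k = min(top_k, seq_len)
    top_idx = torch.topk(scores, k).indices            # [k]
    selected = kv_mla_cache[top_idx]                   # [k, d_mla]
    # 3. Run ordinary attention over just the selected tokens.
    attn = torch.softmax(selected @ q_mla / selected.shape[-1] ** 0.5, dim=0)
    return attn @ selected                             # [d_mla]
```

The point of the design is that the indexer's 128-wide keys make the full-sequence scoring pass cheap, so the expensive attention only ever sees at most 2048 tokens regardless of context length.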
@mgoin_
Michael Goin
2 months
Just enabled full cudagraphs by default on @vllm_project! This change should offer a huge improvement for low-latency workloads on small models and efficient MoEs. For Qwen3-30B-A3B-FP8 on H100 at bs=10, 1024/128, I was able to see a speedup of 47% 🔥
6 replies · 7 reposts · 67 likes
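For context, a minimal sketch of opting into this explicitly through vLLM's offline API. The `full_cuda_graph` compilation flag is an assumption about the knob's name, and on releases after this change it should already be on by default.

```python
# Hypothetical sketch: explicitly requesting full CUDA graph capture in vLLM.
# The "full_cuda_graph" key is an assumed config name; per the post above,
# recent vLLM releases enable this behavior by default.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-30B-A3B-FP8",                # the model benchmarked above
    compilation_config={"full_cuda_graph": True},  # capture the whole decode step
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```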
@cloud11665
cloud
5 months
1 reply · 2 reposts · 23 likes
@NYCSanitation
NYC Sanitation
10 months
In 1991, David Lynch showed the world the alienation and innate horror of a dirty street, directing this unforgettable anti-littering ad for the City of New York. RIP to a visionary filmmaker and a pioneer of the Trash Revolution.
44 replies · 2K reposts · 9K likes
@addvin
brian stevens
1 year
I'm thrilled to announce that Neural Magic has signed a definitive agreement to join forces with Red Hat, Inc. At Neural Magic our vision is that the future of AI is open, and we have been on a mission to enable enterprises to capture the powerful innovation from AI, while at…
17 replies · 35 reposts · 128 likes
@tszzl
roon
1 year
a fact of the world that we have to live with: models, when "jailbroken", seem to have a distinct personality and artistic capability well beyond anything they produce in their default mood. this might be the most important alignment work in the world, and is mostly done on discord
130 replies · 188 reposts · 4K likes
@tms_jr
Tyler Michael Smith
1 year
Read to learn about Machete, which will serve as a foundation for mixed-input quantized GEMMs on NVIDIA GPUs (Hopper and later!) inside of vLLM. Excellent work and stellar animations by Lucas Wilkinson (https://t.co/YjoZvUzDHY). A sketch of what a mixed-input GEMM computes follows below.
github.com
LucasWilkinson has 33 repositories available. Follow their code on GitHub.
@RedHat_AI
Red Hat AI
1 year
1/8 🎉 Introducing Machete, our new mixed-input GEMM kernel for NVIDIA Hopper GPUs! 🚀 It brings 29% faster input and 32% faster output token throughput for Llama 3.1 70B, with a TTFT of <250ms on a single H100 GPU. Here's how it boosts LLM serving performance 👇
0 replies · 0 reposts · 2 likes
@_marcsun
Marc Sun
1 year
Quantization update! Transformers is now compatible with models quantized with the llm-compressor library from @vllm_project, or with models in compressed-tensors format. This means that you can also enjoy high-quality quantized models from the @neuralmagic team!
2 replies · 16 reposts · 64 likes
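A minimal sketch of what that compatibility looks like in practice. The checkpoint name is an assumed example (any llm-compressor / compressed-tensors model on the Hub should load the same way), and the `compressed-tensors` package must be installed alongside transformers.

```python
# Loading a compressed-tensors checkpoint straight through transformers.
# The repo id below is an assumed example; substitute any llm-compressor model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tok = AutoTokenizer.from_pretrained(model_id)

inputs = tok("Hello, my name is", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```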
@tms_jr
Tyler Michael Smith
1 year
if they play the same song more than once, the concert is going to be great. case in point: https://t.co/RfQK5VJxgG
0 replies · 0 reposts · 1 like
@tms_jr
Tyler Michael Smith
1 year
then in the encore he covered My Heart Will Go On. 10/10, very good show, extremely good bit
1 reply · 0 reposts · 1 like
@tms_jr
Tyler Michael Smith
1 year
1 reply · 0 reposts · 0 likes
@tms_jr
Tyler Michael Smith
1 year
last night i saw Mk.gee - totally solid show that got Good when he played the same song 3 times in a row - and then he played it one more time
1 reply · 0 reposts · 2 likes
@tms_jr
Tyler Michael Smith
1 year
this is an MJ Lenderman stan account https://t.co/cxa5unubWB
0 replies · 0 reposts · 0 likes
@vllm_project
vLLM
1 year
A month ago, we announced our performance roadmap. Today, we are happy to share that the latest release achieves 🚀 2.7x higher throughput and is 5x faster for output latency on Llama 8B, and 1.8x higher throughput and 2x faster on Llama 70B for H100s. https://t.co/QWTT5cyvKw
blog.vllm.ai
TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on the Llama 8B model, and 1.8x higher throughput and 2x lower TPOT on the Llama 70B model.
14 replies · 69 reposts · 378 likes
@RedHat_AI
Red Hat AI
1 year
Last week's vLLM office hours recording is ready! 🎥 @tms_jr showed how to use NVIDIA CUTLASS for high-performance inference in @vllm_project. We also explored the exciting vLLM v0.6.0 updates that led to a 2.7x throughput boost and 5x latency improvement. Recording & slides 👇
2 replies · 5 reposts · 11 likes