Tyler Michael Smith
@tms_jr
Followers
140
Following
1K
Media
78
Statuses
1K
High Performance Computing @neuralmagic | Committer @vllm_project | PhD @UTAustin | Music Enjoyer
Boston, MA
Joined September 2011
Our first official vLLM Meetup is coming to Europe on Nov 6! Meet vLLM committers @mgoin_, @tms_jr, Thomas Parnell, + speakers from @RedHat_AI, @IBM, @MistralAI. Topics: vLLM updates, quantization, Mistral+vLLM, hybrid models, distributed inference. https://t.co/itAPmqaqHu
luma.com
Join Us for the First Official vLLM Meetup in Europe! Hosted by Red Hat, IBM, and Mistral AI, this event takes place on 6 November 2025 in Zürich and brings…
2
11
35
vLLM just hit 60K GitHub stars! From a small research idea to powering LLM inference everywhere (across NVIDIA, AMD, Intel, Apple, TPUs, and more), vLLM now supports almost all major text-generation models and native RL pipelines like TRL, Unsloth, Verl, and OpenRLHF.
11
49
490
Happy that InferenceMAX is here because it signals a milestone for vLLM's SOTA performance on NVIDIA Blackwell! It has been a pleasure to collaborate deeply with @nvidia in @vllm_project, and we have much more to do. Read about the work we did here:
blog.vllm.ai
Introduction
Today we are launching InferenceMAX! We have support from NVIDIA, AMD, OpenAI, Microsoft, PyTorch, SGLang, vLLM, Oracle, CoreWeave, TogetherAI, Nebius, Crusoe, HPE, SuperMicro, and Dell. It runs every day on the latest software (vLLM, SGLang, etc.) across hundreds of GPUs, $10Ms of
4
21
95
Qwen3-VL is now ready for experimentation in Red Hat AI Inference Server with vLLM. https://t.co/e5iMafg0Kd This builds on our recent step-by-step guide to Qwen3-Next and extends support to the new vision-language model, Qwen3-VL: https://t.co/qsOLageHko What's next for our
catalog.redhat.com
rhaiis-preview/vllm-cuda-rhel9
1
4
19
How does @deepseek_ai Sparse Attention (DSA) work? It has 2 components: the Lightning Indexer and Sparse Multi-head Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA) and scores the cached tokens against each incoming query. The top-2048 tokens are then passed to Sparse MLA.
Introducing DeepSeek-V3.2-Exp, our latest experimental model! Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. Now live on App, Web, and API. API prices cut by 50%+! 1/n
11
108
716
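The selection step described in the tweet above can be sketched in a few lines. This is only a minimal illustration of the idea, not DeepSeek's implementation: the shapes, the dot-product scoring, and the function name are simplified assumptions.

```python
# Minimal sketch of the DSA selection step described above. Shapes and scoring
# are simplified and illustrative; this is not DeepSeek's actual code.
import torch

def sparse_attention_select(query, indexer_keys, top_k=2048):
    """Score every cached token with the small indexer key cache and return
    the top-k token indices that sparse MLA should attend to."""
    # indexer_keys: [seq_len, 128]  small per-token key cache (vs. 512 for MLA)
    # query:        [128]           indexer projection of the incoming query
    scores = indexer_keys @ query              # one relevance score per cached token
    k = min(top_k, scores.shape[0])
    return torch.topk(scores, k).indices       # MLA attends only to these tokens

# Toy usage: 8192 cached tokens, indexer dim 128, keep the top 2048.
keys = torch.randn(8192, 128)
q = torch.randn(128)
print(sparse_attention_select(q, keys).shape)  # torch.Size([2048])
```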
Just enabled full cudagraphs by default on @vllm_project! This change should offer a huge improvement for low-latency workloads on small models and efficient MoEs. For Qwen3-30B-A3B-FP8 on H100 at bs=10, 1024/128, I was able to see a speedup of 47%
6
7
67
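For context, "bs=10 1024/128" means 10 concurrent requests with roughly 1024 input tokens and 128 output tokens each. A rough offline reproduction of that shape might look like the sketch below using vLLM's Python API; the filler prompts and timing are illustrative, and since the change above makes full cudagraphs the default, no extra configuration is shown.

```python
# Rough sketch of the benchmark shape quoted above: batch size 10,
# ~1024 input tokens, 128 output tokens, on vLLM's offline API.
# Prompts and timing are illustrative, not the author's actual harness.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-30B-A3B-FP8")        # run on a single H100 in the tweet
params = SamplingParams(max_tokens=128, ignore_eos=True)

# 10 prompts of roughly 1024 tokens each (repeated filler text as a stand-in).
prompts = ["hello " * 1024] * 10

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
total_out = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generated {total_out} output tokens in {elapsed:.2f}s")
```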
In 1991, David Lynch showed the world the alienation and innate horror of a dirty street, directing this unforgettable anti-littering ad for the City of New York. RIP to a visionary filmmaker and a pioneer of the Trash Revolution.
44
2K
9K
Iโm thrilled to announce that Neural Magic has signed a definitive agreement to join forces with Red Hat, Inc. At Neural Magic our vision is that the future of AI is open, and we have been on a mission to enable enterprises to capture the powerful innovation from AI, while at
17
35
128
a fact of the world that we have to live with: models when "jailbroken" seem to have a distinct personality and artistic capability well beyond anything they produce in their default mood. this might be the most important alignment work in the world and is mostly done on discord
130
188
4K
Read to learn about Machete, which will serve as a foundation for mixed-input quantized GEMMs on NVIDIA GPUs (Hopper and later!) inside of vLLM. Excellent work and stellar animations by Lucas Wilkinson (https://t.co/YjoZvUzDHY)
github.com
LucasWilkinson has 33 repositories available. Follow their code on GitHub.
1/8 Introducing Machete, our new mixed-input GEMM kernel for NVIDIA Hopper GPUs! It brings 29% faster input and 32% faster output token throughput for Llama 3.1 70B, with a TTFT of <250ms on a single H100 GPU. Here's how it boosts LLM serving performance
0
0
2
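As background on what "mixed-input" means in the two posts above: the activations stay in 16-bit while the weights are stored in 4-bit and expanded on the fly. The unfused reference below only illustrates that math; the per-group scale layout is an assumption for illustration, and Machete's actual kernel fuses the dequantize and matmul into a single Hopper kernel.

```python
# Naive reference for a mixed-input (w4a16-style) GEMM: 16-bit activations times
# int4 weights dequantized with per-group scales. This unfused version only
# shows the math Machete fuses; float32 is used here so the sketch runs on CPU.
import torch

def mixed_input_gemm_ref(x, w_int4, scales, group_size=128):
    # x:       [m, k] activations (fp16/bf16 in a real deployment)
    # w_int4:  [k, n] int4 weights stored as int8 values in [-8, 7]
    # scales:  [k // group_size, n] per-group dequantization scales
    k, n = w_int4.shape
    w = w_int4.to(x.dtype)
    w = w.reshape(k // group_size, group_size, n) * scales[:, None, :]
    return x @ w.reshape(k, n)

m, k, n, g = 4, 256, 512, 128
x = torch.randn(m, k)
w_q = torch.randint(-8, 8, (k, n), dtype=torch.int8)
s = torch.rand(k // g, n)
print(mixed_input_gemm_ref(x, w_q, s, g).shape)  # torch.Size([4, 512])
```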
Quantization update! Transformers is now compatible with models quantized with the llm-compressor library from @vllm_project or models in compressed-tensors format. This means that you can also enjoy high-quality quantized models from the @neuralmagic team!
2
16
64
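A minimal sketch of what that compatibility looks like in practice, assuming a compressed-tensors checkpoint on the Hugging Face Hub; the model ID below is only an example, not an endorsement of a specific checkpoint.

```python
# Minimal sketch: loading a compressed-tensors / llm-compressor quantized
# checkpoint directly with Transformers. The model ID is illustrative; any
# checkpoint in compressed-tensors format should load the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"  # example ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The key benefit of weight quantization is",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```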
if they play the same song more than once the concert is going to be great case in point: https://t.co/RfQK5VJxgG
0
0
1
then in the encore he covered my heart will go on 10/10 very good show extremely good bit
1
0
1
last night i saw Mk.gee - totally solid show that got Good when he played the same song 3 times in a row - and then he played it one more time
1
0
2
this is an MJ Lenderman stan account https://t.co/cxa5unubWB
0
0
0
A month ago, we announced our performance roadmap. Today, we are happy to share that the latest release achieves 2.7x higher throughput and is 5x faster for output latency on Llama 8B, and 1.8x higher throughput and 2x faster on Llama 70B on H100s. https://t.co/QWTT5cyvKw
blog.vllm.ai
TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on the Llama 8B model, and 1.8x higher throughput and 2x lower TPOT on the Llama 70B model.
14
69
378
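For readers unfamiliar with the metric in the TL;DR above: TPOT is simply decode time divided by the number of output tokens, so "5x faster TPOT" means each output token streams roughly five times sooner. The numbers in the sketch below are made up and only show how the metric is computed, not the benchmark's actual measurements.

```python
# Illustrative TPOT (time per output token) calculation; numbers are made up
# and only demonstrate how the metric in the post above is defined.
def tpot(decode_time_s: float, output_tokens: int) -> float:
    return decode_time_s / output_tokens

before = tpot(decode_time_s=6.40, output_tokens=128)   # 50 ms/token
after = tpot(decode_time_s=1.28, output_tokens=128)    # 10 ms/token
print(f"before: {before * 1000:.1f} ms/token, "
      f"after: {after * 1000:.1f} ms/token, "
      f"speedup: {before / after:.1f}x")
```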
Last week's vLLM office hours recording is ready! @tms_jr showed how to use NVIDIA CUTLASS for high-performance inference in @vllm_project. We also explored the exciting vLLM v0.6.0 updates that led to a 2.7x throughput boost and 5x latency improvement. Recording & slides below.
2
5
11