vLLM
@vllm_project
Followers 27K · Following 494 · Media 165 · Statuses 709
A high-throughput and memory-efficient inference and serving engine for LLMs. Join https://t.co/lxJ0SfX5pJ to discuss with the community!
Joined March 2024
Low-bit LLM quantization doesn’t have to mean painful accuracy trade-offs or massive tuning runs. Intel's AutoRound PTQ algorithm is now integrated into LLM Compressor, producing W4A16 compressed-tensors checkpoints you can serve directly with vLLM across Intel Xeon, Gaudi, Arc …
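A minimal serving sketch for such a checkpoint, assuming you have already exported a W4A16 compressed-tensors model with LLM Compressor; the local path and context length below are placeholders:

  # Serve a W4A16 compressed-tensors checkpoint produced by AutoRound + LLM Compressor.
  # ./my-model-w4a16 stands in for your quantized output directory or Hub repo.
  vllm serve ./my-model-w4a16 --max-model-len 8192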
Congrats to the @MistralAI team on the launch of Devstral 2! 🚀 vLLM now delivers Day-0 support for the Devstral 2 Instruct models — optimized for agentic coding, deep codebase exploration, and multi-file editing at scale. Feel free to reach out 👇
Introducing the Devstral 2 coding model family. Two sizes, both open source. Also, meet Mistral Vibe, a native CLI, enabling end-to-end automation. 🧵
In this webinar, you’ll learn how model distillation can cut inference costs by up to 70% while maintaining enterprise-level performance.
🎉Congrats to the @Zai_org team on the launch of GLM-4.6V and GLM-4.6V-Flash — with day-0 serving support in vLLM Recipes for teams who want to run them on their own GPUs. GLM-4.6V focuses on high-quality multimodal reasoning with long context and native tool/function calling, …
GLM-4.6V Series is here🚀
- GLM-4.6V (106B): flagship vision-language model with 128K context
- GLM-4.6V-Flash (9B): ultra-fast, lightweight version for local and low-latency workloads
First-ever native Function Calling in the GLM vision model family
Weights: …
Big news for AI builders. Ministral 3, DeepSeek-V3.2, and vLLM v0.12.0 are now available on Docker Model Runner! Run frontier-class, open-weights models with one command. Read the announcement blog here:
docker.com: Run Ministral 3 and DeepSeek-V3.2 on Docker Model Runner with vLLM 0.12. Test-drive the latest open-weights models as soon as they’re released.
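A hedged sketch of the one-command flow on Docker Model Runner; the model reference below is a guess, so check the linked Docker blog for the published names:

  # Pull and run an open-weights model with Docker Model Runner (model ref is hypothetical).
  docker model pull ai/ministral-3
  docker model run ai/ministral-3 "Explain paged attention in one sentence."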
Beyond the engine, v0.12.0 ships EAGLE speculative decoding improvements, new model families, NVFP4 / W4A8 / AWQ quantization options, and tuned kernels across NVIDIA, AMD ROCm, and CPU. We recommend building new images with PyTorch 2.9.0 + CUDA 12.9, validating on staging …
github.com: vLLM v0.12.0 Release Notes. Highlights: This release features 474 commits from 213 contributors (57 new)! Breaking Changes: PyTorch 2.9.0 upgrade (CUDA 12.9), V0 depr...
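As a rough sketch of how those quantization paths are picked at serve time (model paths are placeholders; compressed-tensors formats such as NVFP4 and W4A8 are normally detected from the checkpoint config):

  # AWQ checkpoint: request the AWQ path explicitly.
  vllm serve ./my-model-awq --quantization awq
  # NVFP4 / W4A8 compressed-tensors checkpoints: the quantization scheme is read
  # from the checkpoint's config, so a plain serve command is usually enough.
  vllm serve ./my-model-nvfp4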
Two engine paths are now available for early adopters: GPU Model Runner V2 (refactored GPU execution with GPU-persistent block tables + a Triton-native sampler) and prefill context parallel (PCP) groundwork for long-context prefill. Both are experimental, disabled by default, and …
📢vLLM v0.12.0 is now available. For inference teams running vLLM at the center of their stack, this release refreshes the engine, extends long-context and speculative decoding capabilities, and moves us to a PyTorch 2.9.0 / CUDA 12.9 baseline for future work.
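Upgrading to the new baseline is the usual pip flow; a minimal sketch:

  # Upgrade to the latest release (v0.12.0 moves to a PyTorch 2.9.0 / CUDA 12.9 baseline).
  pip install -U vllm
  python -c "import vllm; print(vllm.__version__)"   # expect 0.12.0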
🚀 vLLM now offers an optimized inference recipe for DeepSeek-V3.2.
⚙️ Startup details. Run vLLM with DeepSeek-specific components:
--tokenizer-mode deepseek_v32 \
--tool-call-parser deepseek_v32
🧰 Usage tips. Enable thinking mode in vLLM: –
🚀 Launching DeepSeek-V3.2 & DeepSeek-V3.2-Speciale — Reasoning-first models built for agents!
🔹 DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API.
🔹 DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now.
📄 Tech …
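Putting the recipe's flags into one full command; the model ID and parallelism below are assumptions, so substitute the released repo and your own GPU count:

  # Serve DeepSeek-V3.2 with the DeepSeek-specific tokenizer mode and tool-call parser.
  vllm serve deepseek-ai/DeepSeek-V3.2 \
    --tokenizer-mode deepseek_v32 \
    --tool-call-parser deepseek_v32 \
    --enable-auto-tool-choice \
    --tensor-parallel-size 8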
We’re taking CUDA debugging to the next level. 🚀 Building on our previous work with CUDA Core Dumps, we are releasing a new guide on tracing hanging and complicated kernels down to the source code. As kernels get more complex (deep inlining, async memory access), standard …
Have you ever been developing CUDA kernels where your tests keep hitting illegal memory access (IMA for short) and you have no idea how to debug it? We collaborated with the @nvidia team to investigate how CUDA core dumps can help; check out the blog post to learn more!
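For reference, the core-dump workflow that guide builds on looks roughly like this; the environment variables and cuda-gdb commands are standard CUDA tooling, while the repro script name is a placeholder:

  # Enable CUDA core dumps, reproduce the illegal memory access, then inspect the dump.
  export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
  export CUDA_COREDUMP_FILE=/tmp/cuda_core_%h_%p    # %h = hostname, %p = pid
  python repro_ima.py                               # placeholder for the failing workload
  cuda-gdb python
  # inside cuda-gdb:
  #   target cudacore /tmp/cuda_core_<host>_<pid>
  #   info cuda kernels    # shows the faulting kernel and its source location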
🤝 Proud to share the first production-ready vLLM plugin for Gaudi, developed in close collaboration with the Intel team and fully aligned with upstream vLLM. 🔧 This release is validated and ready for deployment, with support for the latest vLLM version coming soon. 📘 The …
LLM agents are powerful but can be slow at scale. @Snowflake's model-free SuffixDecoding from Arctic Inference now runs natively in vLLM, beating tuned N-gram speculation across concurrency levels while keeping CPU and memory overhead in check. Quick Start in vLLM:
Suffix Decoding is at #NeurIPS2025 as a 🏅spotlight! It accelerates LLM inference for coding, agents, and RL. We also optimized its speculation speed by 7.4x and merged it into vLLM (incoming to SGLang). Talk to @GabrieleOliaro or me at poster #816 Friday 11am! Links in🧵
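A hedged quick-start sketch: recent vLLM selects speculation methods through --speculative-config, but the "suffix" method name here is an assumption, so check the vLLM and Arctic Inference docs for the exact spelling:

  # Enable model-free, suffix-style speculation (method name is an assumption).
  vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --speculative-config '{"method": "suffix"}'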
🎉 Congratulations to the Mistral team on launching the Mistral 3 family! We’re proud to share that @MistralAI, @NVIDIAAIDev, @RedHat_AI, and vLLM worked closely together to deliver full Day-0 support for the entire Mistral 3 lineup. This collaboration enabled: • NVFP4 …
Introducing the Mistral 3 family of models: Frontier intelligence at all sizes. Apache 2.0. Details in 🧵
vLLM-Omni currently supports only Qwen-Omni and Qwen-Image, and this is just the beginning. More models are coming. https://t.co/gKKY0uhtNv
https://t.co/cmkFjvCN3y
More inference workloads now mix autoregressive and diffusion models in a single pipeline to process and generate multiple modalities - text, image, audio, and video. Today we’re releasing vLLM-Omni: an open-source framework that extends vLLM’s easy, fast, and cost-efficient …
blog.vllm.ai: We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.
Transformers v5's first release candidate is out 🔥 The biggest release of my life. It's been five years since the last major (v4). From 20 architectures to 400, 20k daily downloads to 3 million. The release is huge, w/ tokenization (no slow tokenizers!), modeling & processing.
Love this: a community contributor built vLLM Playground to make inferencing visible, interactive, and experiment-friendly. From visual config toggles to automatic command generation, from GPU/M-chip support to GuideLLM benchmarking + LLMCompressor integration — it brings the …
vLLM is proud to support @PrimeIntellect's post-training of the INTELLECT-3 model🥰
Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack. Achieving state-of-the-art performance for its size across math, code and reasoning. Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more.
Interested in how NVIDIA Nemotron-H is being optimized for high-performance inference in @vllm_project? Join @RedHat and @NVIDIAAI next week as we cover the Nemotron-H architecture, vLLM support, optimized MoE kernels, async scheduling, and new nsys profiles. Join links below 👇