brian stevens
@addvin

Followers 5K · Following 168 · Media 59 · Statuses 743

CEO, Neural Magic. Ex VP & CTO of Google Cloud and EVP & CTO of Red Hat. RPI and UNH alum, marathoner, Ironman, ADK MT 46er.

Joined January 2009
brian stevens @addvin · 2 months
Introducing llm-d on stage at Red Hat Summit was truly a privilege.
Red Hat AI @RedHat_AI · 2 months
LLM inference is too slow, too expensive, and too hard to scale. 🚨 Introducing llm-d, a Kubernetes-native distributed inference framework built to change that using vLLM (@vllm_project), smart scheduling, and disaggregated compute. Here’s how it works, and how you can use it today:
1 · 8 · 30
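For readers who want to poke at the stack today: llm-d uses vLLM as its per-node inference engine. A minimal offline vLLM run, a sketch rather than llm-d itself (the model ID is illustrative; llm-d adds the Kubernetes-native scheduling and disaggregation layers on top of vLLM workers), looks roughly like this:

    # Sketch: the vLLM engine that llm-d builds on. This is NOT llm-d itself;
    # llm-d layers Kubernetes-native scheduling and prefill/decode
    # disaggregation on top of vLLM workers like this one.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model ID
    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(["Explain disaggregated LLM inference."], params)
    print(outputs[0].outputs[0].text)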
brian stevens @addvin · 2 months
RT @RedHat_AI: Thanks to the @lmcache team for joining forces with Red Hat on llm-d! llm-d is a new open source project for scalable, effi…
0 · 8 · 0
brian stevens @addvin · 2 months
RT @sparkycollier: Really excited to see the emergence of llm-d @addvin! Inference is the biggest workload in human history and the open s…
0 · 2 · 0
brian stevens @addvin · 2 months
RT @NVIDIAAIDev: The llm-d project is a major step forward for the #opensource AI ecosystem, and we are proud to be one of the founding con…
0 · 17 · 0
brian stevens @addvin · 3 months
And it was great to see the Red Hat and Google effort announced by my friend, the brilliant Amin Vahdat.
Woosuk Kwon @woosuk_k · 3 months
Huge congrats to all the @googlecloud and @RedHat_AI team members who drove this effort!
0 · 0 · 6
brian stevens @addvin · 5 months
RT @neuralmagic: DeepSeek’s Open Source Week drops A LOT of exciting goodies! We’re hosting vLLM Office Hours tomorrow. Learn what they are,…
0 · 2 · 0
brian stevens @addvin · 6 months
RT @matthicksj: At @RedHat, we believe the future of AI is open. That's why I'm incredibly excited about our acquisition of @NeuralMagic. T…
0 · 28 · 0
brian stevens @addvin · 6 months
Today it became official: Neural Magic is now part of Red Hat.
Red Hat @RedHat · 6 months
Today, Red Hat completed the acquisition of @NeuralMagic, a pioneer in software and algorithms that accelerate #GenAI inference workloads. Read how we are accelerating our vision for #AI’s future:
2 · 8 · 38
brian stevens @addvin · 7 months
RT @neuralmagic: If you are at #NeurIPS2024 this week, stop by the Neural Magic booth #307 and talk to us about the @vllm_project! vLLM cor…
0 · 1 · 0
brian stevens @addvin · 8 months
RT @scaleml: For our last seminar of the year we will end with Lucas Wilkinson from @neuralmagic presenting! Machete: a cutting-edge mixe…
0 · 4 · 0
brian stevens @addvin · 8 months
I’m thrilled to announce that Neural Magic has signed a definitive agreement to join forces with Red Hat, Inc. At Neural Magic our vision is that the future of AI is open, and we have been on a mission to enable enterprises to capture the powerful innovation from AI, while at…
[image]
17 · 35 · 128
brian stevens @addvin · 9 months
Quantization of LLMs is critical for efficient deployment. But how do you avoid any negative impact of quantization on model capability? Our latest research across Llama variants will serve as a great guide.
Eldar Kurtić @_EldarKurtic · 9 months
vLLM + Quantization: We investigated the impact of quantization across all Llama sizes to come up with a set of practical guidelines for deployment across various use cases and GPU architectures in @vllm_project. There are some interesting findings relative to "well-known" things: 👇
[images]
0 · 1 · 4
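On the deployment side, a pre-quantized checkpoint loads in vLLM exactly like the base model, since the quantization scheme is read from the checkpoint config. A minimal sketch, assuming one of Neural Magic's published FP8 Llama checkpoints (treat the exact model ID as an assumption):

    # Sketch: serving a pre-quantized FP8 checkpoint with vLLM.
    # vLLM picks up the quantization scheme from the checkpoint config,
    # so the loading code is identical to the unquantized case.
    from vllm import LLM, SamplingParams

    # Assumed model ID; substitute whichever quantized checkpoint you use.
    llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8")
    out = llm.generate(["Summarize the trade-offs of FP8 inference."],
                       SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)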
brian stevens @addvin · 10 months
Quantized versions of Llama-3.2 are now available.
Eldar Kurtić @_EldarKurtic · 10 months
@AIatMeta just released new Llama-3.2 models (~3h ago), and as usual, our team at @neuralmagic was quick to quantize them to FP8 with llm-compressor for even more efficient inference with vLLM!
0 · 0 · 0
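The FP8 flow mentioned here is, roughly, a one-shot llm-compressor recipe. A minimal sketch based on llm-compressor's documented FP8-dynamic path (the import paths have shifted across versions, and the model ID is illustrative):

    # Sketch: FP8 dynamic quantization with llm-compressor.
    # FP8_DYNAMIC quantizes weights ahead of time and activations at
    # runtime, so no calibration dataset is required.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor.transformers import oneshot  # import path varies by version
    from llmcompressor.modifiers.quantization import QuantizationModifier

    MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative model ID
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC",
                                  ignore=["lm_head"])
    oneshot(model=model, recipe=recipe)

    save_dir = MODEL_ID.split("/")[-1] + "-FP8-dynamic"
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)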
brian stevens @addvin · 11 months
RT @neuralmagic: Sparse-Marlin is here and integrated into @vllm_project! This GPU-optimized kernel accelerates matrix multiplication with…
0 · 20 · 0
brian stevens @addvin · 11 months
Very cool!
Charles 🎉 Frye @charles_irl · 11 months
Will share more technical details in the coming days. Initial estimate is we got @neuralmagic's FP8 of LLaMA 3.1 405B on @vllm_project to process 2,500,000 tokens per hour per 8xA100-80 node on @modal_labs, simulating 1995 at a rate of 10 days per hour (240x real time).
[image]
0 · 1 · 11
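For scale, 2,500,000 tokens per hour works out to roughly 694 tokens per second per 8xA100 node, or about 87 tokens per second per GPU, for a 405B-parameter model.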
brian stevens @addvin · 1 year
RT @neuralmagic: vLLM is the leading open-source inference server with 24k GitHub stars. Join us for bi-weekly vLLM Office Hours to learn a…
0 · 3 · 0
brian stevens @addvin · 1 year
RT @neuralmagic: Our team has been busy releasing quantized Llama 3.1 models, thoroughly evaluated to ensure optimal performance in #vLLM…
0 · 33 · 0
brian stevens @addvin · 1 year
RT @rsumbaly: Great to see the community moving fast to adapt Llama 3.1 to their needs. This is the beauty of open-source and key part of w…
0 · 5 · 0
brian stevens @addvin · 1 year
RT @markurtz_: 🧵1/4 Our Llama 3.1 compression project is underway, aiming for cost-effective and sustainable deployments without compromis…
0 · 2 · 0
brian stevens @addvin · 1 year
RT @vllm_project: ⚡The Llama 3.1 series is uniquely challenging due to long context and large size. We want to thank @neuralmagic for their co…
0 · 3 · 0