
brian stevens
@addvin
Followers 5K · Following 168 · Media 59 · Statuses 743
CEO, Neural Magic. Ex VP, CTO of Google Cloud and EVP, CTO of Red Hat. RPI and UNH alum, marathoner, ironman, ADK MT 46er.
Joined January 2009
Introducing llm-d on stage at Red Hat Summit was truly a privilege.
LLM inference is too slow, too expensive, and too hard to scale. 🚨 Introducing llm-d, a Kubernetes-native distributed inference framework, to change that—using vLLM (@vllm_project), smart scheduling, and disaggregated compute. Here’s how it works—and how you can use it today:
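For anyone who wants to try the building blocks right away: llm-d orchestrates vLLM engines on Kubernetes, so the quickest way to get a feel for the layer underneath is vLLM's offline Python API. Below is a minimal sketch of that engine layer only, not the llm-d control plane; the model ID and sampling settings are illustrative assumptions.

```python
# Minimal sketch of the vLLM engine layer that llm-d orchestrates (not the llm-d API itself).
# The model ID and sampling settings are assumptions chosen for illustration.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")      # single-node engine
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain disaggregated LLM inference in one sentence."], params)
print(outputs[0].outputs[0].text)
```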
RT @RedHat_AI: Thanks to the @lmcache team for joining forces with Red Hat on llm-d! llm-d is a new open source project for scalable, effi…
RT @sparkycollier: Really excited to see the emergence of llm-d @addvin! Inference is the biggest workload in human history and the open s…
RT @NVIDIAAIDev: The llm-d project is a major step forward for the #opensource AI ecosystem, and we are proud to be one of the founding con…
And it was great to see the Red Hat and Google effort announced by my friend, the brilliant Amin Vahdat.
Huge congrats to all the @googlecloud and @RedHat_AI team members who drove this effort!
RT @neuralmagic: DeepSeek’s Open Source Week drops A LOT of exciting goodies! We’re hosting vLLM Office Hours tomorrow—learn what they are,…
RT @matthicksj: At @RedHat, we believe the future of AI is open. That's why I'm incredibly excited about our acquisition of @NeuralMagic. T…
Today it became official: Neural Magic is now a part of Red Hat.
Today, Red Hat completed the acquisition of @NeuralMagic, a pioneer in software and algorithms that accelerate #GenAI inference workloads. Read how we are accelerating our vision for #AI’s future:
RT @neuralmagic: If you are at #NeurIPS2024 this week, stop by the Neural Magic booth #307 and talk to us about the @vllm_project! vLLM cor…
RT @scaleml: For our last seminar of the year we will end with Lucas Wilkinson from @neuralmagic presenting! Machete: a cutting-edge mixe…
Quantization of LLMs is critical for efficient deployments. But how do you avoid any negative impact of quantization on model capability? Our latest research across Llama variants will serve as a great guide.
vLLM + Quantization: We investigated the impact of quantization across all Llama sizes to come up with a set of practical guidelines for deployment across various use cases and GPU architectures in @vllm_project. There are some interesting findings relative to "well-known" things: 👇
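As a concrete companion to those guidelines, here is a hedged sketch of spot-checking an FP8 checkpoint against its BF16 baseline in vLLM with greedy decoding. The Hugging Face repo IDs are assumptions, and a real comparison would rely on the accuracy evaluations described above, run one model per process, rather than on a single prompt.

```python
# Hedged sketch: spot-check an FP8-quantized Llama checkpoint against a BF16 baseline in vLLM.
# Repo IDs are assumptions; in practice run each model in its own process and use full evaluations.
from vllm import LLM, SamplingParams

params = SamplingParams(temperature=0.0, max_tokens=64)   # greedy decoding for a stable comparison
prompt = ["List two trade-offs of quantizing LLM weights."]

for model_id in (
    "meta-llama/Meta-Llama-3.1-8B-Instruct",       # baseline (assumed repo ID)
    "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",  # FP8 checkpoint (assumed repo ID)
):
    llm = LLM(model=model_id)   # vLLM reads the quantization scheme from the checkpoint config
    text = llm.generate(prompt, params)[0].outputs[0].text
    print(f"--- {model_id} ---\n{text}\n")
```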
Quantized versions of Llama-3.2 are now available.
@AIatMeta just released new Llama-3.2 models (~3h ago), and as usual, our team at @neuralmagic was quick to quantize them to FP8 with llm-compressor for even more efficient inference with vLLM!
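For those who want to produce this kind of FP8 checkpoint themselves, the sketch below follows the llm-compressor dynamic-FP8 recipe as I understand it; the model ID, output directory, and exact import paths are assumptions worth checking against the project's current examples.

```python
# Hedged sketch of FP8 (dynamic) quantization with llm-compressor.
# Model ID, output path, and import locations are assumptions; verify against current llm-compressor docs.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"     # assumed model
OUT_DIR = "Llama-3.2-1B-Instruct-FP8-Dynamic"     # assumed output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize Linear layers to FP8 weights with dynamic FP8 activations, keeping lm_head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained(OUT_DIR)
tokenizer.save_pretrained(OUT_DIR)
# The saved checkpoint can then be served directly with vLLM.
```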
RT @neuralmagic: Sparse-Marlin is here and integrated into @vllm_project! This GPU-optimized kernel accelerates matrix multiplication with…
Very cool!
Will share more technical details in the coming days. Initial estimate is we got @neuralmagic's FP8 of LLaMA 3.1 405B on @vllm_project to process 2,500,000 tokens per hour per 8xA100-80 node on @modal_labs, simulating 1995 at a rate of 10 days per hour (240x real time).
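For anyone checking the quoted figures, the 240x number follows directly from the rate of 10 simulated days per wall-clock hour:

```python
# Back-of-the-envelope check of the throughput figures quoted above.
tokens_per_hour = 2_500_000                               # per 8xA100-80GB node, as reported
print(f"{tokens_per_hour / 3600:.0f} tokens/s sustained") # ~694 tokens/s

simulated_days_per_wall_hour = 10
speedup_vs_real_time = simulated_days_per_wall_hour * 24  # 240 simulated hours per wall-clock hour
print(f"{speedup_vs_real_time}x real time")               # 240x, matching the quoted rate
```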
RT @neuralmagic: vLLM is the leading open-source inference server with 24k GitHub stars. Join us for bi-weekly vLLM Office Hours to learn a…
RT @neuralmagic: Our team has been busy releasing quantized Llama 3.1 models, thoroughly evaluated to ensure optimal performance in #vLLM…
RT @markurtz_: 🧵1/4 Our Llama 3.1 compression project is underway, aiming for cost-effective and sustainable deployments without compromis…
RT @vllm_project: ⚡Llama 3.1 series are uniquely challenging due to long context and large size. We want to thank @neuralmagic for their co…