
Bento - Run Inference at Scale
@bentomlai
Followers: 2K · Following: 499 · Media: 166 · Statuses: 660
Inference Platform built for speed and control. Join our community: https://t.co/qiBMpvzX9O
San Francisco, CA
Joined May 2017
We're growing the BentoML team! Come join the journey! Forward Deployed Engineer: Work directly with customers to design, build, and deploy real-world LLM apps. You'll own projects end-to-end. Perfect for engineers who thrive at the intersection of product, engineering,
0
0
3
Introducing llm-optimizer: an open-source tool for smart LLM benchmarking and inference performance tuning. Tuning LLM performance is notoriously tricky. For most AI teams, this process means endless trial and error. And even then, it's hard to know which setup is actually
1
5
10
#NVIDIA may lead the #GPU market today, but AMD's MI-series (MI325X, MI300X, MI350X) has become an important part of the conversation. https://t.co/DSpUMv8y9B With high-bandwidth memory and growing ecosystem support, #AMD GPUs are becoming a serious option for modern #AI
bentoml.com
Explore AMD GPUs for AI inference. Learn about MI250X, MI300X, MI350X, pricing, performance, and how they compare to NVIDIA for AI and HPC.
0
0
2
We benchmarked speculative decoding on H100 GPUs with two tensor parallelism setups (TP=1 vs TP=2). Key takeaways:
- Throughput: Speculative decoding consistently outperformed baseline when concurrency <35. At higher loads, TP=2 scaled much better.
- TTFT: TP=1 saw a sharp
0
1
8
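For context on the setup being compared above, here is a minimal offline sketch of speculative decoding in vLLM at TP=1 vs TP=2. The model names are placeholders, and the speculative-decoding arguments differ between vLLM releases, so treat this as an assumption-laden illustration rather than the actual benchmark harness.

```python
from vllm import LLM, SamplingParams

# Placeholder target/draft pair sized for a single H100 (assumption, not the benchmarked models).
TARGET = "meta-llama/Llama-3.1-8B-Instruct"
DRAFT = "meta-llama/Llama-3.2-1B-Instruct"

def build_engine(tp_size: int) -> LLM:
    # NOTE: speculative-decoding kwargs vary across vLLM versions; older releases take
    # speculative_model=/num_speculative_tokens= directly, newer ones a speculative_config
    # dict as shown here. Check the docs for your installed version.
    return LLM(
        model=TARGET,
        tensor_parallel_size=tp_size,  # the TP=1 vs TP=2 axis from the benchmark above
        speculative_config={"model": DRAFT, "num_speculative_tokens": 5},
    )

if __name__ == "__main__":
    llm = build_engine(tp_size=2)
    params = SamplingParams(temperature=0.0, max_tokens=256)
    out = llm.generate(["Explain speculative decoding in two sentences."], params)
    print(out[0].outputs[0].text)
```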
What are NVIDIA Data Center GPUs? https://t.co/Qt3wHtBCIx Most people know NVIDIA from gaming, like the popular GeForce series, but those aren't built for enterprise-scale AI. For #AIInference, what you really need are NVIDIA Data Center GPUs. In our latest blog, we cover:
bentoml.com
Understand NVIDIA data center GPUs for AI inference. Compare T4, L4, A100, H100, H200, and B200 on use cases, memory, and pricing to choose the right GPU.
1
0
3
Check out this great post by Titus Lim Hsien Yong https://t.co/0NPO03N1N9 Skip the ops-heavy stack: Kubernetes, Terraform, Ansible, etc. With #BentoCloud, you can deploy any open and custom #LLM in minutes, with fast autoscaling, comprehensive observability, and tailored
medium.com
A step-by-step guide to serving vLLM models with replicas in pure Python using BentoML and Bento Cloud for scalable GenAI apps.
0
0
2
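For readers who want the shape of the "pure Python" approach in the linked post, here is a minimal BentoML service sketch that wraps a vLLM engine. The model name is a placeholder and this is not the exact code from the article.

```python
import bentoml

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model

@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class VLLMService:
    def __init__(self) -> None:
        # Import inside __init__ so the heavy dependency loads only on the serving worker.
        from vllm import LLM, SamplingParams
        self.llm = LLM(model=MODEL_ID)
        self.default_params = SamplingParams(max_tokens=256)

    @bentoml.api
    def generate(self, prompt: str) -> str:
        outputs = self.llm.generate([prompt], self.default_params)
        return outputs[0].outputs[0].text
```

Locally this runs with `bentoml serve`; the post walks through pushing the same service to BentoCloud with replicas and autoscaling.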
DeepSeek just dropped V3.1, a major update that fuses the strengths of V3 and R1 into a single hybrid model. Takeaways:
- 671B parameters (37B activated), 128K context length
- Hybrid thinking mode: One model, two modes (thinking + non-thinking)
- Smarter tool use:
0
0
5
LLM leaderboards rank the top LLMs, but a high score doesn't guarantee the best model for your use case. To find the optimal setup for inference, you need custom benchmarks tailored to your hardware, framework, and workload. This often means balancing trade-offs in throughput,
0
0
4
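A custom benchmark doesn't have to be elaborate. The sketch below sweeps concurrency against any OpenAI-compatible endpoint and reports output-token throughput; the endpoint URL, model name, and request counts are placeholder assumptions, and a real run should also track latency percentiles and TTFT.

```python
import asyncio
import time

from openai import AsyncOpenAI

MODEL = "my-deployed-model"  # placeholder: whatever your endpoint serves
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Summarize request {i} in one paragraph."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def sweep(concurrency: int, n_requests: int = 64) -> None:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(i: int) -> int:
        async with sem:
            return await one_request(i)

    start = time.perf_counter()
    tokens = await asyncio.gather(*(bounded(i) for i in range(n_requests)))
    elapsed = time.perf_counter() - start
    print(f"concurrency={concurrency:3d}  output tok/s={sum(tokens) / elapsed:8.1f}")

async def main() -> None:
    for c in (1, 8, 16, 32, 64):  # the trade-offs usually show up as concurrency rises
        await sweep(c)

asyncio.run(main())
```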
Everyone's talking about speculative decoding for faster #LLMinference. But do you know: the actual speedup depends heavily on the DRAFT model you use. Three metrics can be used to evaluate the performance of the draft model:
- Acceptance rate
0
0
3
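To make the acceptance-rate point concrete, here is a back-of-the-envelope calculation using the standard speculative-sampling analysis (Leviathan et al., 2023), assuming a constant per-token acceptance rate; the draft-cost term is a simplifying assumption, not a measured number.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target forward pass, with draft length gamma
    and per-token acceptance rate alpha."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def rough_speedup(alpha: float, gamma: int, draft_cost: float = 0.05) -> float:
    """Crude wall-clock speedup if one draft pass costs `draft_cost` of a target pass."""
    return expected_tokens_per_step(alpha, gamma) / (1.0 + gamma * draft_cost)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha:.1f}  tokens/step={expected_tokens_per_step(alpha, 4):.2f}  "
          f"speedup~{rough_speedup(alpha, 4):.2f}x")
```

With a draft length of 4, an acceptance rate of 0.9 yields roughly 3x in this toy model, while 0.5 gives closer to 1.6x, which is why draft-model quality dominates the outcome.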
Thank you, #Ai4 2025! It was amazing to meet so many of you and connect in person! Special thanks to everyone who joined our CEO @chaoyu_'s session on #InferenceOps! If you missed it or want to dive deeper, check out our blog post: https://t.co/0UebOv64If Highlights
0
0
2
Catch our CEO @chaoyu_ at #Ai4 2025 today!
- Aug 13, 12:20 PM
- Room 317 (Level 3)
- CAIO, CIO & IT Leaders track
He'll share why #InferenceOps is the strategic foundation for scaling enterprise #AI and how organizations can run mission-critical AI workloads with speed,
1
0
1
Live from #Ai4 2025 in Las Vegas! The Bento team is here at Booth #346, showing how we help enterprises run, scale, and optimize inference for mission-critical AI workloads. Come say hi! If you're here, tag us in your photos! @Ai4Conferences #AIInference #BentoML
0
0
2
You can get up to 3× faster LLM inference with speculative decoding, but only if you're using the RIGHT draft model. https://t.co/pmVziNL550 Why speculative decoding? LLM inference is slow by design. Because of their autoregressive nature, every new token requires a full forward
bentoml.com
Learn what speculative decoding is, how it speeds up LLM inference, why draft model choice matters, and when training your own delivers up to 3× performance gains.
0
0
2
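The "full forward pass per token" point is easy to see in a bare greedy-decoding loop. Below is a toy sketch with Hugging Face transformers and a small placeholder model; real engines batch requests and reuse the KV cache, but the sequential dependency is the same.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # tiny placeholder model, just to show the loop structure
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

input_ids = tok("Speculative decoding helps because", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                          # one forward pass per generated token
        logits = model(input_ids).logits
        next_id = logits[0, -1].argmax()         # greedy choice of the next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tok.decode(input_ids[0]))
```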
What is KV cache offloading and why does it matter for LLM inference? #LLMs use the KV cache to speed up inference by avoiding recomputation of the keys and values produced during the prefill phase; however, the cache grows linearly with requests and has to be evicted due to GPU memory limits. KV
1
2
14
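To see why the cache hits GPU memory limits so quickly, here is a back-of-the-envelope size estimate per request; the layer and head counts are illustrative for a Llama-3.1-8B-like configuration, not tied to any specific deployment.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Per-request KV cache size: keys + values, for every layer and every cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-3.1-8B-like config (32 layers, 8 KV heads, head_dim 128) at a 32K-token context in fp16.
gib = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768) / 2**30
print(f"~{gib:.1f} GiB of KV cache for one 32K-token request")
```

That is roughly 4 GiB for a single long request, which is why eviction or offloading to CPU/storage becomes necessary as concurrency grows.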
We're heading to @Ai4Conferences 2025 in Las Vegas and proud to be a sponsor! Join us August 11–13 at North America's leading #AI industry event. Don't forget to visit Booth #346 to meet the Bento team! See how we help enterprises around the world run, scale, and optimize
0
1
2
Many enterprises are moving LLM workloads on-prem for security, performance, and cost control. But most overlook one critical piece of the stack. They need more than just GPUs and Kubernetes. They also need a purpose-built inference platform layer. Without it, they'll likely run
0
3
7