Bento - Run Inference at Scale

@bentomlai

Followers: 2K · Following: 499 · Media: 166 · Statuses: 660

🍱 Inference Platform built for speed and control. Join our community 👉 https://t.co/qiBMpvzX9O

San Francisco, CA
Joined May 2017
Bento - Run Inference at Scale (@bentomlai) · 6 days
🙌 We're growing the BentoML team! Come join the journey! 🔧 Forward Deployed Engineer: Work directly with customers to design, build, and deploy real-world LLM apps. You'll own projects end-to-end. Perfect for engineers who thrive at the intersection of product, engineering, …
Bento - Run Inference at Scale (@bentomlai) · 4 days
🚀 Introducing llm-optimizer: an open-source tool for smart LLM benchmarking and inference performance tuning. Tuning LLM performance is notoriously tricky. For most AI teams, this process means endless trial and error. And even then, it's hard to know which setup is actually …
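The tweet is cut off before the details, but the idea behind a tool like llm-optimizer is to replace manual trial and error with a systematic sweep over serving configurations. Below is a minimal, hypothetical sketch of that idea in Python; the Config fields and the run_benchmark helper are illustrative placeholders, not llm-optimizer's actual API.

```python
import itertools
from dataclasses import dataclass

@dataclass
class Config:
    tensor_parallel: int   # how many GPUs the model is sharded across
    max_batch_size: int    # maximum number of requests batched together

def run_benchmark(cfg: Config) -> dict:
    """Placeholder for a real measurement: launch a server with `cfg`, replay a
    fixed request trace, and record throughput and latency. The numbers below
    are made up purely so the sketch runs end to end."""
    fake = {
        (1, 8):  {"throughput_tok_s": 1800.0, "p95_latency_s": 0.9},
        (1, 16): {"throughput_tok_s": 2600.0, "p95_latency_s": 1.4},
        (1, 32): {"throughput_tok_s": 3100.0, "p95_latency_s": 2.6},
        (2, 8):  {"throughput_tok_s": 2400.0, "p95_latency_s": 0.7},
        (2, 16): {"throughput_tok_s": 3500.0, "p95_latency_s": 1.1},
        (2, 32): {"throughput_tok_s": 4300.0, "p95_latency_s": 1.8},
    }
    return fake[(cfg.tensor_parallel, cfg.max_batch_size)]

# Sweep a small grid of configurations and keep the fastest one that still
# meets a p95 latency budget -- the systematic version of trial and error.
grid = [Config(tp, bs) for tp, bs in itertools.product([1, 2], [8, 16, 32])]
results = [(cfg, run_benchmark(cfg)) for cfg in grid]
eligible = [(c, m) for c, m in results if m["p95_latency_s"] <= 2.0]
best = max(eligible, key=lambda x: x[1]["throughput_tok_s"])
print("best config:", best[0], best[1])
```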
Bento - Run Inference at Scale (@bentomlai) · 10 days
#NVIDIA may lead the #GPU market today, but AMD's MI-series (MI325X, MI300X, MI350X) has become an important part of the conversation. https://t.co/DSpUMv8y9B With high-bandwidth memory and growing ecosystem support, #AMD GPUs are becoming a serious option for modern #AI …
Link card: bentoml.com – Explore AMD GPUs for AI inference. Learn about MI250X, MI300X, MI350X, pricing, performance, and how they compare to NVIDIA for AI and HPC.
Bento - Run Inference at Scale (@bentomlai) · 13 days
We benchmarked speculative decoding on H100 GPUs with two tensor parallelism setups (TP=1 vs TP=2). Key takeaways:
- Throughput: Speculative decoding consistently outperformed baseline when concurrency < 35. At higher loads, TP=2 scaled much better.
- TTFT: TP=1 saw a sharp …
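The thread does not include the benchmark harness itself. As a rough sketch of how TTFT and throughput can be measured at a given concurrency against an OpenAI-compatible streaming endpoint, assuming the URL, model name, and payload below are placeholders:

```python
import asyncio, time, httpx

URL = "http://localhost:8000/v1/completions"   # placeholder OpenAI-compatible endpoint
PAYLOAD = {"model": "my-model", "prompt": "Hello", "max_tokens": 128, "stream": True}

async def one_request(client: httpx.AsyncClient) -> tuple[float, int]:
    """Return (time to first token, tokens received) for one streamed request."""
    start = time.perf_counter()
    ttft, tokens = None, 0
    async with client.stream("POST", URL, json=PAYLOAD, timeout=120) as resp:
        async for line in resp.aiter_lines():
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue                       # skip keep-alives and the end marker
            if ttft is None:
                ttft = time.perf_counter() - start
            tokens += 1                        # roughly one token per SSE chunk
    return (ttft if ttft is not None else float("nan")), tokens

async def run(concurrency: int) -> None:
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        results = await asyncio.gather(*[one_request(client) for _ in range(concurrency)])
        elapsed = time.perf_counter() - start
    mean_ttft = sum(t for t, _ in results) / len(results)
    tok_per_s = sum(n for _, n in results) / elapsed
    print(f"concurrency={concurrency}  mean TTFT={mean_ttft:.3f}s  throughput={tok_per_s:.1f} tok/s")

asyncio.run(run(concurrency=32))
```

Repeating the run at increasing concurrency levels is what produces the throughput and TTFT curves referenced in the tweet.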
Bento - Run Inference at Scale (@bentomlai) · 18 days
What are NVIDIA Data Center GPUs? https://t.co/Qt3wHtBCIx Most people know NVIDIA from gaming, like the popular GeForce series, but those aren't built for enterprise-scale AI. For #AIInference, what you really need are NVIDIA Data Center GPUs. In our latest blog, we cover: ✅ …
Link card: bentoml.com – Understand NVIDIA data center GPUs for AI inference. Compare T4, L4, A100, H100, H200, and B200 on use cases, memory, and pricing to choose the right GPU.
Bento - Run Inference at Scale (@bentomlai) · 21 days
🚀 Check out this great post by Titus Lim Hsien Yong https://t.co/0NPO03N1N9 👋 Skip the ops-heavy stack (Kubernetes, Terraform, Ansible, etc.). With #BentoCloud, you can deploy any open and custom #LLM in minutes, with fast autoscaling, comprehensive observability, and tailored …
Link card: medium.com – A step-by-step guide to serving vLLM models with replicas in pure Python using BentoML and Bento Cloud for scalable GenAI apps.
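The linked guide has the full walkthrough; as a minimal sketch of the "pure Python" idea, this is roughly what a BentoML service wrapping a vLLM model looks like. The model ID, GPU count, and sampling settings are illustrative, and the exact decorator options may vary by BentoML version:

```python
import bentoml
from vllm import LLM, SamplingParams

@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class VLLMService:
    def __init__(self) -> None:
        # Load the model once per replica; vLLM handles GPU memory and batching.
        self.llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

    @bentoml.api
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        params = SamplingParams(max_tokens=max_tokens, temperature=0.7)
        outputs = self.llm.generate([prompt], params)
        return outputs[0].outputs[0].text
```

Locally a service like this is typically run with `bentoml serve`; the linked post covers deploying the same service to Bento Cloud with replicas and autoscaling.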
Bento - Run Inference at Scale (@bentomlai) · 25 days
🚀 DeepSeek just dropped V3.1, a major update that fuses the strengths of V3 and R1 into a single hybrid model. Takeaways:
⚙️ 671B parameters (37B activated), 128K context length
🧠 Hybrid thinking mode: One model, two modes (thinking + non-thinking)
🛠️ Smarter tool use: …
Bento - Run Inference at Scale (@bentomlai) · 26 days
LLM leaderboards rank the top LLMs, but a high score doesn't guarantee the best model for your use case. To find the optimal setup for inference, you need custom benchmarks tailored to your hardware, framework, and workload. This often means balancing trade-offs in throughput, …
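A tiny worked example of what "balancing trade-offs" ends up looking like in practice, with purely illustrative numbers: the configuration with the highest raw throughput is not necessarily the one you can ship once a latency SLO enters the picture.

```python
# Illustrative numbers only: pick the config that maximizes throughput
# while still meeting a p95 end-to-end latency SLO of 2 seconds.
candidates = [
    {"name": "TP=1, batch=16", "throughput_rps": 42.0, "p95_latency_s": 1.6},
    {"name": "TP=1, batch=32", "throughput_rps": 55.0, "p95_latency_s": 2.8},
    {"name": "TP=2, batch=32", "throughput_rps": 51.0, "p95_latency_s": 1.9},
]
SLO = 2.0
eligible = [c for c in candidates if c["p95_latency_s"] <= SLO]
best = max(eligible, key=lambda c: c["throughput_rps"])
print(best["name"])  # the raw-throughput winner is disqualified by the SLO
```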
Bento - Run Inference at Scale (@bentomlai) · 27 days
Everyone's talking about speculative decoding for faster #LLMinference. But did you know the actual speedup depends heavily on the DRAFT model you use? Three metrics can be used to evaluate the performance of the draft model:
- Acceptance rate …
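The list is truncated, but acceptance rate is the simplest of the three to pin down: the fraction of draft-proposed tokens that the target model actually accepts during verification. A minimal sketch, with made-up counts:

```python
def acceptance_rate(proposed: int, accepted: int) -> float:
    """Fraction of draft-proposed tokens that the target model accepted."""
    return accepted / proposed if proposed else 0.0

# Illustrative: across a run the draft model proposed 4 tokens per step
# for 1,000 steps, and the target model accepted 2,600 of them.
print(f"{acceptance_rate(4 * 1000, 2600):.2%}")  # 65.00%
```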
Bento - Run Inference at Scale (@bentomlai) · 1 month
🙏 Thank you, #Ai4 2025! It was amazing to meet so many of you and connect in person! Special thanks to everyone who joined our CEO @chaoyu_'s session on #InferenceOps! If you missed it or want to dive deeper, check out our blog post 👉 https://t.co/0UebOv64If 👀 Highlights …
Bento - Run Inference at Scale (@bentomlai) · 1 month
🎤 Catch our CEO @chaoyu_ at #Ai4 2025 today!
🗓 Aug 13, 12:20 PM
📍 Room 317 (Level 3) – CAIO, CIO & IT Leaders track
He'll share why #InferenceOps is the strategic foundation for scaling enterprise #AI and how organizations can run mission-critical AI workloads with speed, …
Bento - Run Inference at Scale (@bentomlai) · 1 month
📸 Live from #Ai4 2025 in Las Vegas! 🍱 The Bento team is here at Booth #346, showing how we help enterprises run, scale, and optimize inference for mission-critical AI workloads. Come say hi 👋 If you're here, tag us in your photos! @Ai4Conferences #AIInference #BentoML
Bento - Run Inference at Scale (@bentomlai) · 1 month
You can get up to 3× faster LLM inference with speculative decoding, but only if you're using the RIGHT draft model. https://t.co/pmVziNL550 Why speculative decoding? LLM inference is slow by design. Because of their autoregressive nature, every new token requires a full forward …
Link card: bentoml.com – Learn what speculative decoding is, how it speeds up LLM inference, why draft model choice matters, and when training your own delivers up to 3× performance gains.
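Conceptually, speculative decoding breaks the one-forward-pass-per-token pattern by letting a cheap draft model propose several tokens and having the target model verify them. A toy sketch of that propose-and-verify loop follows; draft_next and target_next are stand-ins for real models, acceptance is a simple greedy check, and a real target model would score all k proposals in a single batched forward pass rather than one call per token.

```python
from typing import Callable, List

def speculate(prefix: List[int],
              draft_next: Callable[[List[int]], int],
              target_next: Callable[[List[int]], int],
              k: int = 4) -> List[int]:
    """One round: the draft model proposes k tokens, the target model checks
    them; the accepted prefix plus one corrected (or bonus) token is appended,
    so several tokens can be produced per target verification step."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):                      # cheap draft model proposes k tokens
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    accepted, ctx = [], list(prefix)
    for t in proposal:
        expected = target_next(ctx)         # what the target would have generated
        if expected == t:
            accepted.append(t)              # proposal matches: accept it for free
            ctx.append(t)
        else:
            accepted.append(expected)       # first mismatch: take the target's token and stop
            break
    else:
        accepted.append(target_next(ctx))   # everything accepted: one bonus token
    return prefix + accepted

# Toy models: the draft guesses "previous token + 1"; the target agrees except
# at every 5th position, so some proposals get rejected.
draft = lambda ctx: ctx[-1] + 1
target = lambda ctx: ctx[-1] + 1 if len(ctx) % 5 else ctx[-1] + 2
print(speculate([0], draft, target))
```

The more often the draft's guesses match the target (the acceptance rate discussed above), the more tokens each verification step yields, which is why the choice of draft model dominates the real-world speedup.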
Bento - Run Inference at Scale (@bentomlai) · 1 month
🤔 What is KV cache offloading and why does it matter for LLM inference? #LLMs use the KV cache to speed up inference by avoiding recomputing keys and values from the prefill phase. However, the cache grows linearly with requests and has to be evicted due to GPU memory limits. KV …
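The text is cut off, but the memory pressure is easy to see with a back-of-the-envelope estimate of per-request KV cache size. The shape below is Llama-3-70B-like (80 layers, 8 KV heads with GQA, head dim 128) and purely illustrative:

```python
# Back-of-the-envelope KV cache size for one request, FP16 precision.
num_layers   = 80
num_kv_heads = 8
head_dim     = 128
bytes_per_el = 2          # FP16
seq_len      = 8192       # tokens kept in the cache for this request

# Factor of 2 accounts for storing both keys and values.
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_el
print(f"{kv_bytes / 1024**3:.2f} GiB per request")   # ~2.5 GiB
```

At roughly 2.5 GiB of cache per 8K-token request, whatever GPU memory is left after loading the weights fills up after a modest number of concurrent long-context requests, which is exactly when evicting cache entries or offloading them to CPU or remote memory becomes necessary.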
Bento - Run Inference at Scale (@bentomlai) · 1 month
@OpenAI 20b:
Bento - Run Inference at Scale (@bentomlai) · 1 month
🚀 @OpenAI just released two powerful open-weight reasoning models: gpt-oss-120b and gpt-oss-20b. #BentoML supports them from day zero! Key takeaways:
✅ 120b matches OpenAI o4-mini on core benchmarks
✅ 20b rivals o3-mini, ideal for local & on-device inference
✅ …
Bento - Run Inference at Scale (@bentomlai) · 2 months
🚀 We're heading to @Ai4Conferences 2025 in Las Vegas and proud to be a sponsor! Join us August 11–13 at North America's leading #AI industry event. 🍱 Don't forget to visit Booth #346 to meet the Bento team! See how we help enterprises around the world run, scale, and optimize …
Bento - Run Inference at Scale (@bentomlai) · 2 months
Many enterprises are moving LLM workloads on-prem for security, performance, and cost control. But most overlook one critical piece of the stack. They need more than just GPUs and Kubernetes. They also need a purpose-built inference platform layer. Without it, they'll likely run …