
Bento - Run Inference at Scale
@bentomlai
Followers: 2K · Following: 499 · Media: 166 · Statuses: 660
Inference Platform built for speed and control. Join our community: https://t.co/qiBMpvzX9O
San Francisco, CA
Joined May 2017
We're growing the BentoML team! Come join the journey! Forward Deployed Engineer: Work directly with customers to design, build, and deploy real-world LLM apps. You'll own projects end-to-end. Perfect for engineers who thrive at the intersection of product, engineering,
0
0
3
Introducing llm-optimizer: an open-source tool for smart LLM benchmarking and inference performance tuning. Tuning LLM performance is notoriously tricky. For most AI teams, this process means endless trial and error. And even then, it's hard to know which setup is actually
1
5
10
#NVIDIA may lead the #GPU market today, but AMD's MI-series (MI325X, MI300X, MI350X) has become an important part of the conversation. https://t.co/DSpUMv8y9B With high-bandwidth memory and growing ecosystem support, #AMD GPUs are becoming a serious option for modern #AI
bentoml.com
Explore AMD GPUs for AI inference. Learn about MI250X, MI300X, MI350X, pricing, performance, and how they compare to NVIDIA for AI and HPC.
0
0
2
We benchmarked speculative decoding on H100 GPUs with two tensor parallelism setups (TP=1 vs TP=2). Key takeaways:
- Throughput: Speculative decoding consistently outperformed baseline when concurrency <35. At higher loads, TP=2 scaled much better.
- TTFT: TP=1 saw a sharp
0
1
8
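For context on the setup being compared above, here is a minimal offline sketch of speculative decoding in vLLM at TP=1 vs TP=2. The model names are placeholders, and the speculative-decoding arguments differ between vLLM releases, so treat this as an assumption-laden illustration rather than the actual benchmark harness.

```python
from vllm import LLM, SamplingParams

# Placeholder target/draft pair sized for a single H100 (assumption, not the benchmarked models).
TARGET = "meta-llama/Llama-3.1-8B-Instruct"
DRAFT = "meta-llama/Llama-3.2-1B-Instruct"

def build_engine(tp_size: int) -> LLM:
    # NOTE: speculative-decoding kwargs vary across vLLM versions; older releases take
    # speculative_model=/num_speculative_tokens= directly, newer ones a speculative_config
    # dict as shown here. Check the docs for your installed version.
    return LLM(
        model=TARGET,
        tensor_parallel_size=tp_size,  # the TP=1 vs TP=2 axis from the benchmark above
        speculative_config={"model": DRAFT, "num_speculative_tokens": 5},
    )

if __name__ == "__main__":
    llm = build_engine(tp_size=2)
    params = SamplingParams(temperature=0.0, max_tokens=256)
    out = llm.generate(["Explain speculative decoding in two sentences."], params)
    print(out[0].outputs[0].text)
```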
What are NVIDIA Data Center GPUs? https://t.co/Qt3wHtBCIx Most people know NVIDIA from gaming, like the popular GeForce series, but those aren't built for enterprise-scale AI. For #AIInference, what you really need are NVIDIA Data Center GPUs. In our latest blog, we cover:
bentoml.com
Understand NVIDIA data center GPUs for AI inference. Compare T4, L4, A100, H100, H200, and B200 on use cases, memory, and pricing to choose the right GPU.
1
0
3
Check out this great post by Titus Lim Hsien Yong https://t.co/0NPO03N1N9 Skip the ops-heavy stack: Kubernetes, Terraform, Ansible, etc. With #BentoCloud, you can deploy any open and custom #LLM in minutes, with fast autoscaling, comprehensive observability, and tailored
medium.com
A step-by-step guide to serving vLLM models with replicas in pure Python using BentoML and Bento Cloud for scalable GenAI apps.
0
0
2
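For readers who want the shape of the "pure Python" approach in the linked post, here is a minimal BentoML service sketch that wraps a vLLM engine. The model name is a placeholder and this is not the exact code from the article.

```python
import bentoml

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model

@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class VLLMService:
    def __init__(self) -> None:
        # Import inside __init__ so the heavy dependency loads only on the serving worker.
        from vllm import LLM, SamplingParams
        self.llm = LLM(model=MODEL_ID)
        self.default_params = SamplingParams(max_tokens=256)

    @bentoml.api
    def generate(self, prompt: str) -> str:
        outputs = self.llm.generate([prompt], self.default_params)
        return outputs[0].outputs[0].text
```

Locally this runs with `bentoml serve`; the post walks through pushing the same service to BentoCloud with replicas and autoscaling.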
DeepSeek just dropped V3.1, a major update that fuses the strengths of V3 and R1 into a single hybrid model. Takeaways:
- 671B parameters (37B activated), 128K context length
- Hybrid thinking mode: One model, two modes (thinking + non-thinking)
- Smarter tool use:
0
0
5
LLM leaderboards rank the top LLMs, but a high score doesn't guarantee the best model for your use case. To find the optimal setup for inference, you need custom benchmarks tailored to your hardware, framework, and workload. This often means balancing trade-offs in throughput,
0
0
4
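A custom benchmark doesn't have to be elaborate. The sketch below sweeps concurrency against any OpenAI-compatible endpoint and reports output-token throughput; the endpoint URL, model name, and request counts are placeholder assumptions, and a real run should also track latency percentiles and TTFT.

```python
import asyncio
import time

from openai import AsyncOpenAI

MODEL = "my-deployed-model"  # placeholder: whatever your endpoint serves
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Summarize request {i} in one paragraph."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def sweep(concurrency: int, n_requests: int = 64) -> None:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(i: int) -> int:
        async with sem:
            return await one_request(i)

    start = time.perf_counter()
    tokens = await asyncio.gather(*(bounded(i) for i in range(n_requests)))
    elapsed = time.perf_counter() - start
    print(f"concurrency={concurrency:3d}  output tok/s={sum(tokens) / elapsed:8.1f}")

async def main() -> None:
    for c in (1, 8, 16, 32, 64):  # the trade-offs usually show up as concurrency rises
        await sweep(c)

asyncio.run(main())
```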
Everyone's talking about speculative decoding for faster #LLMinference. But do you know: the actual speedup depends heavily on the DRAFT model you use. Three metrics can be used to evaluate the performance of the draft model:
- Acceptance rate
0
0
3
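To make the acceptance-rate point concrete, here is a back-of-the-envelope calculation using the standard speculative-sampling analysis (Leviathan et al., 2023), assuming a constant per-token acceptance rate; the draft-cost term is a simplifying assumption, not a measured number.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target forward pass, with draft length gamma
    and per-token acceptance rate alpha."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def rough_speedup(alpha: float, gamma: int, draft_cost: float = 0.05) -> float:
    """Crude wall-clock speedup if one draft pass costs `draft_cost` of a target pass."""
    return expected_tokens_per_step(alpha, gamma) / (1.0 + gamma * draft_cost)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha:.1f}  tokens/step={expected_tokens_per_step(alpha, 4):.2f}  "
          f"speedup~{rough_speedup(alpha, 4):.2f}x")
```

With a draft length of 4, an acceptance rate of 0.9 yields roughly 3x in this toy model, while 0.5 gives closer to 1.6x, which is why draft-model quality dominates the outcome.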
Thank you, #Ai4 2025! It was amazing to meet so many of you and connect in person! Special thanks to everyone who joined our CEO @chaoyu_'s session on #InferenceOps! If you missed it or want to dive deeper, check out our blog post: https://t.co/0UebOv64If Highlights
0
0
2
Catch our CEO @chaoyu_ at #Ai4 2025 today!
- Aug 13, 12:20 PM
- Room 317 (Level 3)
- CAIO, CIO & IT Leaders track
He'll share why #InferenceOps is the strategic foundation for scaling enterprise #AI and how organizations can run mission-critical AI workloads with speed,
1
0
1
Live from #Ai4 2025 in Las Vegas! The Bento team is here at Booth #346, showing how we help enterprises run, scale, and optimize inference for mission-critical AI workloads. Come say hi! If you're here, tag us in your photos! @Ai4Conferences #AIInference #BentoML
0
0
2
You can get up to 3× faster LLM inference with speculative decoding, but only if you're using the RIGHT draft model. https://t.co/pmVziNL550 Why speculative decoding? LLM inference is slow by design. Because of their autoregressive nature, every new token requires a full forward
bentoml.com
Learn what speculative decoding is, how it speeds up LLM inference, why draft model choice matters, and when training your own delivers up to 3× performance gains.
0
0
2
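The "full forward pass per token" point is easy to see in a bare greedy-decoding loop. Below is a toy sketch with Hugging Face transformers and a small placeholder model; real engines batch requests and reuse the KV cache, but the sequential dependency is the same.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # tiny placeholder model, just to show the loop structure
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

input_ids = tok("Speculative decoding helps because", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                          # one forward pass per generated token
        logits = model(input_ids).logits
        next_id = logits[0, -1].argmax()         # greedy choice of the next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tok.decode(input_ids[0]))
```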
What is KV cache offloading and why does it matter for LLM inference? #LLMs use the KV cache to speed up inference by avoiding recomputation of the keys and values produced during the prefill phase; however, the cache grows linearly with requests and has to be evicted due to GPU memory limits. KV
1
2
14
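To see why the cache hits GPU memory limits so quickly, here is a back-of-the-envelope size estimate per request; the layer and head counts are illustrative for a Llama-3.1-8B-like configuration, not tied to any specific deployment.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Per-request KV cache size: keys + values, for every layer and every cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-3.1-8B-like config (32 layers, 8 KV heads, head_dim 128) at a 32K-token context in fp16.
gib = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768) / 2**30
print(f"~{gib:.1f} GiB of KV cache for one 32K-token request")
```

That is roughly 4 GiB for a single long request, which is why eviction or offloading to CPU/storage becomes necessary as concurrency grows.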
We're heading to @Ai4Conferences 2025 in Las Vegas and proud to be a sponsor! Join us August 11–13 at North America's leading #AI industry event. Don't forget to visit Booth #346 to meet the Bento team! See how we help enterprises around the world run, scale, and optimize
0
1
2
Many enterprises are moving LLM workloads on-prem for security, performance, and cost control. But most overlook one critical piece of the stack. They need more than just GPUs and Kubernetes. They also need a purpose-built inference platform layer. Without it, they'll likely run
0
3
7