Bento - Run Inference at Scale

@bentomlai

Followers: 2K · Following: 476 · Media: 155 · Statuses: 631

๐Ÿฑ Inference platform for self-hosting LLMs with tailored performance optimization and up to 6ร— cost savings. Join Community ๐Ÿ‘‰ https://t.co/qiBMpvzX9O

San Francisco, CA
Joined May 2017
@bentomlai
Bento - Run Inference at Scale
2 days
🚀 Our LLM Inference Handbook is out! If you’ve worked with LLMs, you’ve probably felt this: the knowledge you need is often scattered across papers, GitHub issues, vendor blogs, and Discord threads. Everyone talks about inference, but no one ties it all together. You end up…
[Image]
@bentomlai
Bento - Run Inference at Scale
6 days
Modern #AI agents don’t just generate text. They write and run code. But what happens when:
⚠️ A prompt injection triggers harmful logic?
⚠️ An agent runs unknown scripts from a Git repo?
⚠️ It connects to untrusted APIs?
Letting agents run arbitrary code without guardrails is…
[Image]
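As a minimal illustration of the guardrail idea in the post above (not BentoML’s actual sandboxing; the helper below is hypothetical), agent-generated code can at least be confined to a separate interpreter process with a timeout and an empty environment:

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: float = 5.0) -> str:
    """Run agent-generated Python in a separate, isolated interpreter process.

    Sketch only: a production sandbox would also restrict filesystem,
    network, memory, and CPU (e.g. containers, gVisor, or microVMs).
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores PYTHON* env vars and user site-packages
            capture_output=True,
            text=True,
            timeout=timeout_s,
            env={},  # hand the child an empty environment
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        return "error: execution timed out"

print(run_untrusted("print(2 + 2)"))  # -> 4
```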
@bentomlai
Bento - Run Inference at Scale
11 days
🚀 Try Canary Deployments on our #BentoInferencePlatform! Don’t risk regressions or broken user experiences when rolling out new model versions.
✅ Deploy multiple Bento versions at once
🎯 Smart traffic routing strategies
📊 Real-time monitoring for different versions
⏱️ …
[Image]
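To make the traffic-splitting idea concrete, here is a generic weighted-routing sketch in Python. Illustrative only: on the platform itself, routing is configured on the deployment rather than hand-rolled, and both URLs below are placeholders.

```python
import random

# Hypothetical upstreams for two deployed versions of the same service.
ROUTES = [
    ("https://my-service-v1.example.com", 0.9),  # stable version: 90% of traffic
    ("https://my-service-v2.example.com", 0.1),  # canary version: 10% of traffic
]

def pick_upstream() -> str:
    """Pick an upstream by weighted random choice."""
    r = random.random()
    cumulative = 0.0
    for url, weight in ROUTES:
        cumulative += weight
        if r < cumulative:
            return url
    return ROUTES[-1][0]  # guard against floating-point rounding

print(pick_upstream())
```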
@bentomlai
Bento - Run Inference at Scale
12 days
Your LLM demo wowed the execs. It’s time to ship it. Now comes the hard part: scaling inference efficiently, optimizing performance, and managing heterogeneous LLM workflows and compute environments. This is no easy feat; it requires a set of best practices…
[Image]
@bentomlai
Bento - Run Inference at Scale
16 days
#BentoFriday ๐Ÿฑ โ€” Lightning-Fast Model Loading. When deploying #LLM services, slow model loading can cripple your cold starts. ๐Ÿšจ This leads to delayed autoscaling, missed requests during traffic spikes, and a poor user experience. #BentoML supercharges model loading with speed
[Image]
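Whatever loading accelerations the platform adds on top, the baseline pattern is to load weights once at service startup rather than per request. A minimal sketch using BentoML’s service/api decorators (the model name is a placeholder; check the decorators against your BentoML version):

```python
import bentoml
from transformers import pipeline

@bentoml.service
class Summarizer:
    def __init__(self) -> None:
        # Pay the loading cost once, at cold start, not on every request.
        self.pipe = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

    @bentoml.api
    def summarize(self, text: str) -> str:
        return self.pipe(text)[0]["summary_text"]
```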
@bentomlai
Bento - Run Inference at Scale
23 days
#BentoFriday ๐Ÿฑ โ€” WebSocket Endpoints. Real-time #AI apps like voice assistants and live chatbots need more than just REST APIs. They need persistent, low-latency connections. โšก. But spinning up a separate #WebSocket server just for that?.โŒ Duplicated infra.โŒ Complex routing
[Image]
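One way this works in practice is mounting an ASGI app (here FastAPI) next to the model service, so the WebSocket endpoint ships in the same deployment. The `bentoml.mount_asgi_app` decorator below follows the pattern in BentoML’s ASGI docs, but treat the exact name as an assumption to verify:

```python
import bentoml
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws")
async def echo(ws: WebSocket) -> None:
    # A persistent, low-latency connection: accept once, then keep exchanging messages.
    await ws.accept()
    try:
        while True:
            msg = await ws.receive_text()
            await ws.send_text(f"echo: {msg}")  # swap in real model output here
    except WebSocketDisconnect:
        pass

@bentoml.mount_asgi_app(app, path="/chat")  # assumed helper name; see BentoML's ASGI integration docs
@bentoml.service
class ChatService:
    @bentoml.api
    def ping(self) -> str:
        return "pong"
```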
@bentomlai
Bento - Run Inference at Scale
27 days
๐ŸฑSpreading the word on #BentoML at #KubeCon + #CloudNativeCon ๐Ÿ™Œ Big shoutout to @fog_glutamine.
@fog_glutamine
FogDong
27 days
Just wrapped up my talks at KubeCon + CloudNativeCon Hong Kong and KubeCon Japan! Great to connect with the cloud native community and share some ideas. Thanks to everyone who joined! 🙌 #KubeCon #CNCF #CloudNative #Kubernetes
[Images]
@bentomlai
Bento - Run Inference at Scale
30 days
#BentoFriday ๐Ÿฑ โ€” Add a Web UI with @Gradio . Real-world #AI apps donโ€™t just need a model. They need interfaces users can interact with. But building a custom frontend is time-consuming and managing it separately from your backend adds unnecessary complexity. ๐Ÿ˜ตโ€๐Ÿ’ซ. With #BentoML,
[Image]
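For a sense of the UI side, here is a plain Gradio interface; BentoML’s Gradio integration mounts an interface like this onto the same service process (the exact mounting helper is in their docs). The `generate` function is a stand-in for a real model call:

```python
import gradio as gr

def generate(prompt: str) -> str:
    # Stand-in for a real inference call into your service.
    return f"model output for: {prompt!r}"

demo = gr.Interface(fn=generate, inputs="text", outputs="text", title="Demo UI")

if __name__ == "__main__":
    demo.launch()  # standalone here; the BentoML integration serves it next to the API instead
```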
@bentomlai
Bento - Run Inference at Scale
1 month
We’re entering a new era for #LLMInference: moving beyond single-node optimizations to distributed serving strategies that unlock better performance, smarter resource utilization, and real cost savings. In this blog post, we break down the latest techniques for distributed LLM…
[Image]
@bentomlai
Bento - Run Inference at Scale
1 month
🚀 #Magistral, @MistralAI’s first reasoning model, is here and now deployable with #BentoML! This release features two variants:
- Magistral Small: 24B-parameter open-source version
- Magistral Medium: enterprise-grade, high-performance version
Highlights of Magistral Small: 🔧 …
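Deployments like this typically expose an OpenAI-compatible endpoint, so a quick smoke test can use the standard OpenAI client. The base URL, token, and model id below are placeholders for your own deployment:

```python
from openai import OpenAI

# Placeholders: point these at your own deployment.
client = OpenAI(base_url="https://my-magistral.example.com/v1", api_key="my-token")

resp = client.chat.completions.create(
    model="magistral-small",  # hypothetical model id; list your deployment's models to confirm
    messages=[{"role": "user", "content": "Think step by step: what is 17 * 24?"}],
)
print(resp.choices[0].message.content)
```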
@bentomlai
Bento - Run Inference at Scale
1 month
#BentoFriday ๐Ÿฑ โ€” Runtime Specs in Pure Python. Deploying #AI services isnโ€™t just about your model code; itโ€™s also about getting the right runtime and making sure it is reproducible across environments. That might include:. ๐Ÿ Python version.๐Ÿ–ฅ๏ธ OS & system packages.๐Ÿ“ฆ Python
[Image]
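The pattern looks roughly like the sketch below. The `bentoml.images.Image` builder and its method names follow BentoML’s runtime-image docs as I understand them; verify them against your installed version:

```python
import bentoml

# Declare the runtime in plain Python instead of a Dockerfile.
# Builder and method names are assumptions to verify against BentoML's docs.
image = (
    bentoml.images.Image(python_version="3.11")
    .system_packages("curl", "git")
    .python_packages("torch", "transformers")
)

@bentoml.service(image=image)
class MyService:
    @bentoml.api
    def predict(self, text: str) -> str:
        return text.upper()
```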
@bentomlai
Bento - Run Inference at Scale
1 month
Choosing the right #AI deployment platform? Check out our detailed comparison of #BentoML vs #VertexAI to help you make informed decisions. 🔍 Here’s what we cover:
✅ Cloud infrastructure flexibility
✅ Scaling and performance
✅ Developer experience and …
@bentomlai
Bento - Run Inference at Scale
1 month
👀 Update on DeepSeek-R1-0528:
🧠 Built on V3 Base
📈 Major reasoning improvements
🛡️ Reduced hallucination
⚙️ Function calling + JSON output
📦 Distilled Qwen3-8B beats much larger models
📄 Still MIT
See our updated blog ⬇️ #AI #LLM #BentoML #OpenSource
@bentomlai
Bento - Run Inference at Scale
2 months
🚀 DeepSeek-R1-0528 just landed! 🔍 Still no official word: no model card, no benchmarks. #DeepSeek being DeepSeek, as always 😅 ✅ Good news: #BentoML already supports it. 👉 Deploy it now with our updated example. 👀 Follow for more updates!
[Image]
@bentomlai
Bento - Run Inference at Scale
2 months
#BentoFriday ๐Ÿฑ โ€” Inference Context with ๐˜ฃ๐˜ฆ๐˜ฏ๐˜ต๐˜ฐ๐˜ฎ๐˜ญ.๐˜Š๐˜ฐ๐˜ฏ๐˜ต๐˜ฆ๐˜น๐˜ต. Building #AI/ML APIs isnโ€™t just about calling a model. You need a clean, reliable way to customize your inference service. ๐˜ฃ๐˜ฆ๐˜ฏ๐˜ต๐˜ฐ๐˜ฎ๐˜ญ.๐˜Š๐˜ฐ๐˜ฏ๐˜ต๐˜ฆ๐˜น๐˜ต is one of those abstractions in #BentoML that gives
[Image]
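A small hedged example of the kind of customization this enables: reading a request header and shaping the response from inside an API method. Attribute names follow BentoML’s Context docs as I recall them; confirm for your version:

```python
import bentoml

@bentoml.service
class Classifier:
    @bentoml.api
    def classify(self, text: str, ctx: bentoml.Context) -> str:
        # Inspect the inbound request and customize the outbound response.
        client = ctx.request.headers.get("x-client-id", "anonymous")
        ctx.response.headers["x-served-by"] = "classifier-demo"
        if not text.strip():
            ctx.response.status_code = 400
            return "error: empty input"
        return f"ok ({client}): {text[:32]}"
```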
@bentomlai
Bento - Run Inference at Scale
2 months
Want to self-host model inference in production? Start with the right model. We’ve put together a series exploring popular open-source models, ready to deploy with #BentoML 🍱:
🗣️ Text-to-Speech
🖼️ Image Generation
🧠 Embedding
[Image]
@bentomlai
Bento - Run Inference at Scale
2 months
🚀 Build CI/CD pipelines for #AI services with #BentoML + #GitHubActions. Automate everything with pipelines that:
✅ Deploy services to #BentoCloud
✅ Trigger on code or deployment config changes
✅ Wait until the service is ready
✅ Run test inference
📘 Step-by-step guide: …
[Image]
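The last two pipeline steps (wait for readiness, then run a test inference) are easy to script in plain Python inside a CI job. A generic sketch: the endpoint URL and route names are placeholders, though BentoML services do expose health routes:

```python
import time
import requests

ENDPOINT = "https://my-service.example.com"  # placeholder deployment URL

def wait_until_ready(timeout_s: int = 300) -> None:
    """Poll the service's health route until it answers, or time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{ENDPOINT}/healthz", timeout=5).ok:  # route name: an assumption
                return
        except requests.RequestException:
            pass  # not up yet; keep polling
        time.sleep(5)
    raise TimeoutError("service did not become ready in time")

wait_until_ready()
resp = requests.post(f"{ENDPOINT}/summarize", json={"text": "hello world"}, timeout=30)
resp.raise_for_status()
print("test inference OK:", resp.json())
```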
@bentomlai
Bento - Run Inference at Scale
2 months
#BentoFriday ๐Ÿฑ โ€” 20x Faster Iteration with BentoML Codespaces. Modern #AI apps like #RAG or voice agents often require multiple powerful GPUs and complex dependencies. This often leads to:.โŒ Painstaking delays with each code change.โŒย Challenging environment setups.โŒ
@bentomlai
Bento - Run Inference at Scale
2 months
🚀 Self-host models with #Triton + #BentoML! @nvidia Triton Inference Server is a powerful open-source tool for serving models from major ML frameworks like ONNX, PyTorch, and TensorFlow. This project wraps Triton with BentoML, making it easy to:
🎯 Package custom models as…
[Image]
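Once Triton is serving, a client call looks like this with NVIDIA’s `tritonclient` package. The model name and the input/output tensor names, shape, and dtype are placeholders for your own model’s config:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a local Triton HTTP endpoint (default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder I/O spec: use your model's actual tensor names, shape, and dtype.
inp = httpclient.InferInput("INPUT__0", [1, 4], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 4).astype(np.float32))

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT__0"))
```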
@bentomlai
Bento - Run Inference at Scale
2 months
Enterprises can’t scale #AI inference without compromises. The GPU CAP Theorem says you can’t have all three at once:
🔒 Control over your models & data and compliance
⚡ Availability to scale on demand when traffic spikes
💰 Price that keeps…
[Image]