
Bento - Run Inference at Scale
@bentomlai
Followers: 2K · Following: 476 · Media: 155 · Statuses: 631
🍱 Inference platform for self-hosting LLMs with tailored performance optimization and up to 6× cost savings. Join Community 👉 https://t.co/qiBMpvzX9O
San Francisco, CA
Joined May 2017
Modern #AI agents don't just generate text. They write and run code. But what happens when:
⚠️ A prompt injection triggers harmful logic?
⚠️ An agent runs unknown scripts from a Git repo?
⚠️ It connects to untrusted APIs?
Letting agents run arbitrary code without guardrails is…
🚀 Try Canary Deployments on our #BentoInferencePlatform! Don't risk regressions or broken user experiences when rolling out new model versions.
✅ Deploy multiple Bento versions at once
🎯 Smart traffic routing strategies
📊 Real-time monitoring for different versions
⏱️ …
#BentoFriday 🍱 - Lightning-Fast Model Loading. When deploying #LLM services, slow model loading can cripple your cold starts. 🚨 This leads to delayed autoscaling, missed requests during traffic spikes, and a poor user experience. #BentoML supercharges model loading with speed…
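A minimal sketch of the pattern, assuming BentoML's `HuggingFaceModel` helper from recent releases; the model ID and pipeline task are illustrative:

```python
import bentoml
from bentoml.models import HuggingFaceModel

@bentoml.service
class Summarizer:
    # Declaring the model at the class level lets BentoML resolve, download,
    # and cache the weights ahead of serving, so a cold start does not spend
    # its first minutes pulling from the Hub inside __init__.
    model_path = HuggingFaceModel("sshleifer/distilbart-cnn-12-6")  # illustrative model ID

    def __init__(self) -> None:
        from transformers import pipeline
        # model_path resolves to a local directory containing the cached weights.
        self.pipe = pipeline("summarization", model=self.model_path)

    @bentoml.api
    def summarize(self, text: str) -> str:
        return self.pipe(text)[0]["summary_text"]
```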
#BentoFriday 🍱 - WebSocket Endpoints. Real-time #AI apps like voice assistants and live chatbots need more than just REST APIs. They need persistent, low-latency connections. ⚡ But spinning up a separate #WebSocket server just for that?
❌ Duplicated infra
❌ Complex routing…
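A minimal sketch of one way to do this, assuming BentoML's `mount_asgi_app` decorator (documented in the 1.2+ SDK) with a FastAPI app carrying the WebSocket route; the echo handler is a hypothetical stand-in for real streaming inference:

```python
import bentoml
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

# FastAPI owns the persistent connection; BentoML serves it on the same
# port and process as the REST API, so no separate WebSocket server.
@app.websocket("/chat")
async def chat(ws: WebSocket) -> None:
    await ws.accept()
    try:
        while True:
            msg = await ws.receive_text()
            await ws.send_text(f"echo: {msg}")  # stand-in for real inference
    except WebSocketDisconnect:
        pass

@bentoml.mount_asgi_app(app, path="/ws")
@bentoml.service
class Assistant:
    @bentoml.api
    def ping(self) -> str:
        return "pong"
```

Because the FastAPI app is mounted into the BentoML service, the WebSocket (here at `/ws/chat`) and the REST API ship, scale, and deploy as one unit.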
🍱 Spreading the word on #BentoML at #KubeCon + #CloudNativeCon 🎉 Big shoutout to @fog_glutamine.
Just wrapped up my talks at KubeCon + CloudNativeCon Hong Kong and KubeCon Japan! Great to connect with the cloud native community and share some ideas. Thanks to everyone who joined! 🙏 #KubeCon #CNCF #CloudNative #Kubernetes
#BentoFriday 🍱 - Add a Web UI with @Gradio. Real-world #AI apps don't just need a model. They need interfaces users can interact with. But building a custom frontend is time-consuming, and managing it separately from your backend adds unnecessary complexity. 😵‍💫 With #BentoML,…
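A minimal sketch, assuming the `bentoml.gradio` integration shipped in recent BentoML releases (exact module path and decorator order may differ by version); `_classify` is a hypothetical stand-in for real inference:

```python
import bentoml
import gradio as gr
from bentoml.gradio import mount_gradio_app  # assumes a recent BentoML with the Gradio extra

def _classify(text: str) -> str:
    # Hypothetical stand-in for real model inference.
    return "positive" if "good" in text.lower() else "negative"

# The same function backs both the REST API and the web UI.
demo = gr.Interface(fn=_classify, inputs="text", outputs="text")

@bentoml.service
@mount_gradio_app(demo, path="/ui")
class Classifier:
    @bentoml.api
    def classify(self, text: str) -> str:
        return _classify(text)
```

One deployment then serves both the `classify` REST endpoint and the interactive UI at `/ui`, so there is no separately managed frontend.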
We're entering a new era for #LLMInference: moving beyond single-node optimizations to distributed serving strategies that unlock better performance, smarter resource utilization, and real cost savings. In this blog post, we break down the latest techniques for distributed LLM…
🚀 #Magistral, @MistralAI's first reasoning model, is here and now deployable with #BentoML! This release features two variants:
- Magistral Small: 24B-parameter open-source version
- Magistral Medium: enterprise-grade, high-performance version
Highlights of Magistral Small: 🧠 …
#BentoFriday 🍱 - Runtime Specs in Pure Python. Deploying #AI services isn't just about your model code; it's also about getting the right runtime and making sure it is reproducible across environments. That might include:
🐍 Python version
🖥️ OS & system packages
📦 Python …
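A minimal sketch of the idea, assuming the `bentoml.images.Image` builder from recent releases (the class name and builder methods may vary slightly across versions); the package choices are illustrative:

```python
import bentoml

# The whole runtime lives in Python, versioned alongside the model code:
# Python version, system packages, and Python dependencies.
image = (
    bentoml.images.Image(python_version="3.11")
    .system_packages("ffmpeg", "git")          # OS-level dependencies
    .python_packages("torch", "transformers")  # Python dependencies
)

@bentoml.service(image=image)
class MyService:
    @bentoml.api
    def predict(self, text: str) -> str:
        return text  # stand-in for real inference
```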
🚀 Update on DeepSeek-R1-0528
🧠 Built on V3 Base
🔥 Major reasoning improvements
🛡️ Reduced hallucination
⚙️ Function calling + JSON output
📦 Distilled Qwen3-8B beats much larger models
📜 Still MIT
See our updated blog ⬇️ #AI #LLM #BentoML #OpenSource
#BentoFriday 🍱 - Inference Context with bentoml.Context. Building #AI/ML APIs isn't just about calling a model. You need a clean, reliable way to customize your inference service. bentoml.Context is one of those abstractions in #BentoML that gives…
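In code, an API method opts in by declaring a `bentoml.Context` parameter, which exposes the incoming request and lets you shape the response; the header names below are illustrative:

```python
import bentoml

@bentoml.service
class Classifier:
    @bentoml.api
    def classify(self, text: str, ctx: bentoml.Context) -> dict:
        # Read incoming request metadata, e.g. a caller-supplied trace header.
        trace_id = ctx.request.headers.get("x-request-id", "unknown")

        result = {"label": "positive", "trace_id": trace_id}  # stand-in for real inference

        # Customize the outgoing response: status code, headers, cookies.
        ctx.response.status_code = 200
        ctx.response.headers.append("x-model-version", "v1")
        return result
```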
Want to self-host model inference in production? Start with the right model. We've put together a series exploring popular open-source models, ready to deploy with #BentoML 🍱
🗣️ Text-to-Speech
🖼️ Image Generation
🧠 Embedding
🚀 Build CI/CD pipelines for #AI services with #BentoML + #GitHubActions. Automate everything with pipelines that:
✅ Deploy services to #BentoCloud
✅ Trigger on code or deployment config changes
✅ Wait until the service is ready
✅ Run test inference (see the sketch below)
📖 Step-by-step guide:
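For the last two steps, a minimal smoke-test script such a pipeline could run, assuming BentoML's `/readyz` health endpoint and `SyncHTTPClient`; the deployment URL and `predict` endpoint name are hypothetical:

```python
import time
import urllib.request

import bentoml

URL = "https://my-service.example.bentoml.ai"  # hypothetical deployment URL

def wait_until_ready(timeout: float = 300.0) -> None:
    """Poll BentoML's /readyz health endpoint until the service is up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{URL}/readyz", timeout=10) as resp:
                if resp.status == 200:
                    return
        except Exception:
            pass  # not up yet; keep polling
        time.sleep(5)
    raise TimeoutError("service never became ready")

wait_until_ready()

# Run one test inference and fail the pipeline on unexpected output.
client = bentoml.SyncHTTPClient(URL)
result = client.predict(text="hello")  # hypothetical endpoint name
assert isinstance(result, str), f"unexpected response: {result!r}"
```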
#BentoFriday 🍱 - 20x Faster Iteration with BentoML Codespaces. Modern #AI apps like #RAG or voice agents often require multiple powerful GPUs and complex dependencies. This often leads to:
❌ Painstaking delays with each code change
❌ Challenging environment setups
❌ …
Enterprises can't scale #AI inference without compromises. The GPU CAP Theorem says you can't have all three at once:
🔒 Control over your models & data and compliance
⚡ Availability to scale on demand when traffic spikes
💰 Price that keeps…