Boyi Li (@Boyiliee)
2K Followers · 746 Following · 19 Media · 106 Statuses
Excited to unveil @nvidia's latest work on #Reasoning Vision–Language–Action (#VLA) models — Alpamayo-R1! Alpamayo-R1 is a new #reasoning VLA architecture featuring a diffusion-based action expert built on top of the #Cosmos-#Reason backbone. It represents one of the core
nvidianews.nvidia.com · NVIDIA today announced it is partnering with Uber to scale the world’s largest level 4-ready mobility network, using the company’s next-generation robotaxi and autonomous delivery fleets, the new...
4 replies · 37 reposts · 219 likes
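As a rough illustration of the "diffusion-based action expert" mentioned above, here is a minimal sketch under my own assumptions (not Alpamayo-R1's published architecture): a DDPM-style denoiser over a short future action trajectory, conditioned on a context vector that stands in for the reasoning backbone's features. The names ActionDenoiser, horizon, and ctx_dim are hypothetical.

```python
# Hedged sketch: a conditional diffusion "action expert" trained with the standard
# DDPM noise-prediction objective. ctx is a stand-in for features from a reasoning
# VLM backbone; nothing here comes from the actual Alpamayo-R1 code.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class ActionDenoiser(nn.Module):
    def __init__(self, horizon=16, act_dim=3, ctx_dim=512, hidden=256):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * act_dim + ctx_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, horizon * act_dim),
        )

    def forward(self, noisy_actions, t, ctx):
        # noisy_actions: (B, horizon, act_dim); t: (B,) timesteps; ctx: (B, ctx_dim)
        x = torch.cat([noisy_actions.flatten(1), ctx, t.float().unsqueeze(1) / T], dim=1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

def diffusion_loss(model, actions, ctx):
    # Predict the noise injected at a random timestep (standard DDPM training objective).
    B = actions.shape[0]
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(actions)
    ab = alphas_bar[t].view(B, 1, 1)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * noise
    return nn.functional.mse_loss(model(noisy, t, ctx), noise)

model = ActionDenoiser()
loss = diffusion_loss(model, torch.randn(8, 16, 3), torch.randn(8, 512))
loss.backward()
```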
🎉 Excited to share RecA: Reconstruction Alignment Improves Unified Multimodal Models 🔥 Post-train w/ RecA: 8k images & 4 hours (8 GPUs) → SOTA UMMs: GenEval 0.73→0.90 | DPGBench 80.93→88.15 | ImgEdit 3.38→3.75 Code: https://t.co/yFEvJ0Algw 1/n
6 replies · 30 reposts · 80 likes
Happy to share that we’ve open-sourced the code and demo of 3DHM! Link: https://t.co/LCkgfizrAh Kudos to @JunmingChenleo @brjathu @YGandelsman #Alyosha & @JitendraMalikCV Your feedback is welcome and valuable to us 📝 Beyond humans, 3DHM can animate a humanoid as well 🤠
I’ve dreamt of creating a tool that could animate anyone with any motion from just ONE image… and now it’s a reality! 🎉 Super excited to introduce updated 3DHM: Synthesizing Moving People with 3D Control. 🕺💃3DHM can generate human videos from a single real or synthetic human
0 replies · 2 reposts · 6 likes
📣 Call for Submissions - X-Sense Workshop #ICCV2025! Let's accelerate Ego-Exo Sensing for a cleaner, safer, and more efficient mobility experience! 🚀✨ 📅 Deadline: August 31, 2025 09:59 AM GMT 📝 Submission Portal: https://t.co/Bh24Bgf2Y3
📢 The first X-Sense Workshop: Ego-Exo Sensing for Smart Mobility at #ICCV2025! 🎤 We’re honored to host an outstanding speaker lineup, featuring Manmohan Chandraker, @BharathHarihar3, @wucathy, Holger Caesar, @zhoubolei, @Boyiliee, Katie Luo https://t.co/FmVGnwv906
2 replies · 7 reposts · 16 likes
We’re now accepting applications for the 2026–2027 NVIDIA Graduate Fellowships! If you’re passionate about advancing cutting-edge reasoning models for Physical AI applications 🚗🤖, apply here: https://t.co/ZAzpxXxsDS — and be sure to select “Autonomous Vehicles.” @NVIDIAAI
1 reply · 27 reposts · 107 likes
We just dropped a few new PS3 models, with SOTA performance compared to existing vision encoders such as SigLIP2, C-RADIOv2, AIMv2, InternViT2.5, and Perception Encoder! Coming along with several new VILA-HD models. Check it out👇 Models: https://t.co/UwjpBWpFBj Code:
4 replies · 16 reposts · 85 likes
@nvidia research will present a few NLP works at @aclmeeting ACL 2025 in Austria 🇦🇹 NEKO: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts LM Industry Session Oral, Hall L 11:00 to 12:30 CET, Monday https://t.co/bYzzdZSroM (@yentinglin56
1 reply · 7 reposts · 30 likes
Can we use simulation to validate Physical AI? Yes, with far fewer real-world tests. We propose a control variates–based estimation framework that pairs sim & real data to dramatically cut validation costs. #AI #Robotics #Sim2Real Paper: https://t.co/x870ZHVQYW
@NVIDIADRIVE
1 reply · 16 reposts · 38 likes
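For readers unfamiliar with control variates, here is a minimal sketch of the idea the tweet names (my own toy version with made-up numbers, not the paper's estimator): cheap simulation scores serve as a control variate that shrinks the variance of a metric estimated from a small set of paired real-world tests.

```python
# Control-variates sketch: combine a small set of paired real-world tests with a large
# pool of cheap simulation rollouts to get a lower-variance estimate of the real metric.
import numpy as np

def cv_estimate(real, sim_paired, sim_large):
    # real:       metric on N paired real-world tests
    # sim_paired: metric on the simulations of those same N scenarios
    # sim_large:  metric on a much larger pool of simulation-only scenarios
    real, sim_paired, sim_large = map(np.asarray, (real, sim_paired, sim_large))
    beta = np.cov(real, sim_paired)[0, 1] / np.var(sim_paired, ddof=1)  # optimal coefficient
    return real.mean() - beta * (sim_paired.mean() - sim_large.mean())

rng = np.random.default_rng(0)
sim_all = rng.normal(0.8, 0.1, 10_000)            # cheap simulated safety scores
idx = rng.choice(10_000, 50, replace=False)
real = sim_all[idx] + rng.normal(0.05, 0.02, 50)  # correlated, costlier real-world runs
print(cv_estimate(real, sim_all[idx], sim_all))   # lower-variance estimate of the real mean
```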
Excited to share that Describe Anything has been accepted at ICCV 2025! 🎉 Describe Anything Model (DAM) is a powerful Multimodal LLM that generates detailed descriptions for user-specified regions in images or videos using points, boxes, scribbles, or masks. Open-source code,
describe-anything.github.io · Describe Anything: Detailed Localized Image and Video Captioning
2 replies · 26 reposts · 118 likes
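Purely as an illustration of the "points, boxes, scribbles, or masks" interface (not DAM's actual API), each kind of region prompt can be rasterized into a binary mask over the image before being handed to a region-captioning model. The helper below and its arguments are hypothetical.

```python
# Illustrative only: rasterize a user-specified region (point / box / scribble / mask)
# into one binary mask that a region-captioning model could consume next to the image.
import numpy as np

def region_to_mask(h, w, point=None, box=None, scribble=None, mask=None, radius=8):
    out = np.zeros((h, w), dtype=bool)
    if mask is not None:                      # already a (h, w) boolean mask
        out |= mask.astype(bool)
    if box is not None:                       # (x0, y0, x1, y1)
        x0, y0, x1, y1 = box
        out[y0:y1, x0:x1] = True
    if point is not None:                     # (x, y) click -> small disk around it
        ys, xs = np.ogrid[:h, :w]
        out |= (xs - point[0]) ** 2 + (ys - point[1]) ** 2 <= radius ** 2
    if scribble is not None:                  # list of (x, y) samples along the stroke
        for x, y in scribble:
            out[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2] = True
    return out

m = region_to_mask(480, 640, box=(100, 120, 300, 360))
print(m.sum())  # number of pixels in the user-specified region
```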
Happy to share our latest work on efficient sensor tokenization for end-to-end driving architectures! https://t.co/nkYUIzfyJT We introduce a novel way to tokenize multi-camera input for AV Transformers that is resolution- and camera-count-agnostic, yet geometry-aware 🧵👇
arxiv.org · Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage...
2 replies · 15 reposts · 30 likes
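A hedged sketch of what "resolution- and camera-count-agnostic, yet geometry-aware" tokenization could look like (my own minimal version, not the paper's tokenizer): patchify each camera independently and add an embedding of each patch's viewing-ray direction, so the token sequence simply grows with cameras × patches. MultiCamPatchTokenizer and its interface are assumptions.

```python
# Illustrative multi-camera tokenizer: arbitrary camera count and resolutions in,
# one flat token sequence out. Geometry awareness comes from embedding each patch's
# viewing-ray direction rather than a fixed positional grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiCamPatchTokenizer(nn.Module):
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.pixel_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        self.ray_proj = nn.Linear(3, dim)  # embed per-patch unit ray direction

    def forward(self, images, rays):
        # images: list of (3, H_i, W_i) tensors, one per camera (resolutions may differ)
        # rays:   list of (H_i/patch, W_i/patch, 3) unit ray directions at patch centers
        tokens = []
        for img, ray in zip(images, rays):
            feat = self.pixel_proj(img.unsqueeze(0))           # (1, dim, h, w)
            feat = feat.flatten(2).transpose(1, 2).squeeze(0)  # (h*w, dim)
            geo = self.ray_proj(ray.reshape(-1, 3))            # (h*w, dim)
            tokens.append(feat + geo)
        return torch.cat(tokens, dim=0)  # (sum_i h_i*w_i, dim): camera-count agnostic

# Example: three cameras at different resolutions -> one token sequence.
tok = MultiCamPatchTokenizer()
imgs = [torch.randn(3, 224, 384), torch.randn(3, 224, 224), torch.randn(3, 128, 256)]
rays = []
for img in imgs:
    _, h, w = img.shape
    rays.append(F.normalize(torch.randn(h // 16, w // 16, 3), dim=-1))
print(tok(imgs, rays).shape)
```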
How do we equip robots with superhuman sensory capabilities? Come join us at the RSS 2025 workshop, June 21, on Multimodal Robotics with Multisensory Capabilities to learn more. Featuring speakers: @JitendraMalikCV, Katherine J. Kuchenbecker, Kristen Grauman, @YunzhuLiYZ, @Boyiliee
1 reply · 3 reposts · 11 likes
Nvidia just dropped Describe Anything on Hugging Face: Detailed Localized Image and Video Captioning
7 replies · 155 reposts · 913 likes
Introducing the Describe Anything Model (DAM), a powerful Multimodal LLM that generates detailed descriptions for user-specified regions in images or videos using points, boxes, scribbles, or masks. Open-source code, models, demo, data, and benchmark at: https://t.co/2p3yM6mSUF
12 replies · 80 reposts · 417 likes
👉🏻 We have released our code and benchmark data at https://t.co/5JRXtfE0iP. At #GTC 2025, we evaluated the safety and comfort of autonomous driving using Wolf: https://t.co/HguJUGI68i.
🚀 Introducing 𝐖𝐨𝐥𝐟 🐺: a mixture-of-experts video captioning framework that outperforms GPT-4V and Gemini-Pro-1.5 in general scenes 🖼️, autonomous driving 🚗, and robotics videos 🤖. 👑: https://t.co/cOEfUvRL0m
1 reply · 6 reposts · 64 likes
Hallucination is a big challenge in video understanding for any single model. To address this, we introduce Wolf 🐺 ( https://t.co/bVgyZRFkCc): a mixture-of-experts framework designed for accurate video understanding by distilling knowledge from various Vision-Language Models.
1 reply · 5 reposts · 24 likes
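Reading the two Wolf tweets above, the core recipe is a mixture of expert captioners plus a summarization/distillation step. Below is a hedged structural sketch of that idea (not the released Wolf code); the expert and summarizer callables are placeholders you would wire to real VLMs and an LLM.

```python
# Structural sketch of a mixture-of-experts captioning pipeline: query several VLM
# "experts" on the same clip, then let a summarizer distill one consistent caption.
from typing import Callable, Sequence

def moe_caption(
    frames: Sequence[bytes],
    experts: Sequence[Callable[[Sequence[bytes]], str]],
    summarizer: Callable[[str], str],
) -> str:
    drafts = [expert(frames) for expert in experts]        # one caption per expert VLM
    merged = "\n".join(f"Expert {i}: {c}" for i, c in enumerate(drafts))
    prompt = (
        "Several models described the same video. Write one accurate caption, "
        "keeping only details the descriptions agree on:\n" + merged
    )
    return summarizer(prompt)                              # e.g. an LLM call

# Toy usage with stand-in experts and a placeholder summarizer:
caption = moe_caption(
    frames=[b"frame0", b"frame1"],
    experts=[lambda f: "a car turns left at an intersection",
             lambda f: "a vehicle makes a left turn in light traffic"],
    summarizer=lambda prompt: prompt.splitlines()[-1],
)
print(caption)
```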
4K Resolution! Vision is a critical part of building powerful multimodal foundation models. Super excited about this work.
Next-gen vision pre-trained models shouldn’t be short-sighted. Humans can easily perceive 10K x 10K resolution. But today’s top vision models—like SigLIP and DINOv2—are still pre-trained at merely hundreds by hundreds of pixels, bottlenecking their real-world usage. Today, we
2 replies · 3 reposts · 49 likes
🚀 New Paper Alert! 🚀 Introducing TULIP 🌷 – a multimodal framework for richer vision-language understanding! A drop-in replacement for CLIP-style models, TULIP learns fine-grained visual details while keeping strong language alignment. 🔗 https://t.co/vm0mSdJ2ul 🧵👇
2 replies · 13 reposts · 31 likes
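For context on "a drop-in replacement for CLIP-style models": such encoders are trained and compared against the standard symmetric contrastive (InfoNCE) objective sketched below. This is the generic CLIP-style loss, not TULIP's specific training recipe.

```python
# Generic CLIP-style symmetric contrastive loss over a batch of paired embeddings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) embeddings of paired images and captions
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(img.shape[0], device=img.device)  # matches on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

print(clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512)))
```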
For the first time ever, @nvidia is hosting an AV Safety Day at GTC - a multi-session workshop on AV safety. We will share our latest work on safe AV platforms, run-time monitoring, safety data flywheels, and more! #AutonomousVehicles #AI at #GTC25 ➡️
0 replies · 15 reposts · 32 likes
Nice to see the progress in interactive task planning. It reminds me of our previous work, ITP, which incorporates both high-level planning and low-level function execution via language.
Can we prompt robots, just like we prompt language models? With a hierarchy of VLA models + LLM-generated data, robots can:
- reason through long-horizon tasks
- respond to a variety of prompts
- handle situated corrections
Blog post & paper: https://t.co/s8xka1yTvy
0 replies · 1 repost · 35 likes