Boyi Li (@Boyiliee)
2K Followers · 746 Following · 19 Media · 106 Statuses
Excited to unveil @nvidia's latest work on #Reasoning Vision–Language–Action (#VLA) models — Alpamayo-R1! Alpamayo-R1 is a new #reasoning VLA architecture featuring a diffusion-based action expert built on top of the #Cosmos-#Reason backbone. It represents one of the core
nvidianews.nvidia.com · NVIDIA today announced it is partnering with Uber to scale the world’s largest level 4-ready mobility network, using the company’s next-generation robotaxi and autonomous delivery fleets, the new...
4 replies · 37 reposts · 219 likes
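As a rough illustration of the "diffusion-based action expert" mentioned above, here is a minimal sketch under my own assumptions (not Alpamayo-R1's published architecture): a DDPM-style denoiser over a short future action trajectory, conditioned on a context vector that stands in for the reasoning backbone's features. The names ActionDenoiser, horizon, and ctx_dim are hypothetical.

```python
# Hedged sketch: a conditional diffusion "action expert" trained with the standard
# DDPM noise-prediction objective. ctx is a stand-in for features from a reasoning
# VLM backbone; nothing here comes from the actual Alpamayo-R1 code.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class ActionDenoiser(nn.Module):
    def __init__(self, horizon=16, act_dim=3, ctx_dim=512, hidden=256):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * act_dim + ctx_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, horizon * act_dim),
        )

    def forward(self, noisy_actions, t, ctx):
        # noisy_actions: (B, horizon, act_dim); t: (B,) timesteps; ctx: (B, ctx_dim)
        x = torch.cat([noisy_actions.flatten(1), ctx, t.float().unsqueeze(1) / T], dim=1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

def diffusion_loss(model, actions, ctx):
    # Predict the noise injected at a random timestep (standard DDPM training objective).
    B = actions.shape[0]
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(actions)
    ab = alphas_bar[t].view(B, 1, 1)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * noise
    return nn.functional.mse_loss(model(noisy, t, ctx), noise)

model = ActionDenoiser()
loss = diffusion_loss(model, torch.randn(8, 16, 3), torch.randn(8, 512))
loss.backward()
```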
🎉 Excited to share RecA: Reconstruction Alignment Improves Unified Multimodal Models 🔥 Post-train w/ RecA: 8k images & 4 hours (8 GPUs) → SOTA UMMs: GenEval 0.73→0.90 | DPGBench 80.93→88.15 | ImgEdit 3.38→3.75 Code: https://t.co/yFEvJ0Algw 1/n
6 replies · 30 reposts · 80 likes
Happy to share that we’ve open-sourced the code and demo of 3DHM! Link: https://t.co/LCkgfizrAh Kudos to @JunmingChenleo @brjathu @YGandelsman #Alyosha & @JitendraMalikCV Your feedback is welcome and valuable to us 📝 Beyond humans, 3DHM can animate a humanoid as well 🤠
I’ve dreamt of creating a tool that could animate anyone with any motion from just ONE image… and now it’s a reality! 🎉 Super excited to introduce updated 3DHM: Synthesizing Moving People with 3D Control. 🕺💃3DHM can generate human videos from a single real or synthetic human
0 replies · 2 reposts · 6 likes
📣 Call for Submissions - X-Sense Workshop #ICCV2025! Let's accelerate Ego-Exo Sensing for a cleaner, safer, and more efficient mobility experience! 🚀✨ 📅 Deadline: August 31, 2025 09:59 AM GMT 📝 Submission Portal: https://t.co/Bh24Bgf2Y3
📢 The first X-Sense Workshop: Ego-Exo Sensing for Smart Mobility at #ICCV2025! 🎤 We’re honored to host an outstanding speaker lineup, featuring Manmohan Chandraker, @BharathHarihar3, @wucathy, Holger Caesar, @zhoubolei, @Boyiliee, Katie Luo https://t.co/FmVGnwv906
2 replies · 7 reposts · 16 likes
We’re now accepting applications for the 2026–2027 NVIDIA Graduate Fellowships! If you’re passionate about advancing cutting-edge reasoning models for Physical AI applications 🚗🤖, apply here: https://t.co/ZAzpxXxsDS — and be sure to select “Autonomous Vehicles.” @NVIDIAAI
1 reply · 27 reposts · 107 likes
We just dropped a few new PS3 models, with SOTA performance compared to existing vision encoders such as SigLIP2, C-RADIOv2, AIMv2, InternViT2.5, and Perception Encoder! Coming along with several new VILA-HD models. Check it out👇 Models: https://t.co/UwjpBWpFBj Code:
4 replies · 16 reposts · 85 likes
@nvidia research will present a few NLP works at @aclmeeting ACL 2025 in Austria 🇦🇹 NEKO: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts LM Industry Session Oral, Hall L 11:00 to 12:30 CET, Monday https://t.co/bYzzdZSroM (@yentinglin56
1 reply · 7 reposts · 30 likes
Can we use simulation to validate Physical AI? Yes, with far fewer real-world tests. We propose a control variates–based estimation framework that pairs sim & real data to dramatically cut validation costs. #AI #Robotics #Sim2Real Paper: https://t.co/x870ZHVQYW
@NVIDIADRIVE
1 reply · 16 reposts · 38 likes
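For readers unfamiliar with control variates, here is a minimal sketch of the idea the tweet names (my own toy version with made-up numbers, not the paper's estimator): cheap simulation scores serve as a control variate that shrinks the variance of a metric estimated from a small set of paired real-world tests.

```python
# Control-variates sketch: combine a small set of paired real-world tests with a large
# pool of cheap simulation rollouts to get a lower-variance estimate of the real metric.
import numpy as np

def cv_estimate(real, sim_paired, sim_large):
    # real:       metric on N paired real-world tests
    # sim_paired: metric on the simulations of those same N scenarios
    # sim_large:  metric on a much larger pool of simulation-only scenarios
    real, sim_paired, sim_large = map(np.asarray, (real, sim_paired, sim_large))
    beta = np.cov(real, sim_paired)[0, 1] / np.var(sim_paired, ddof=1)  # optimal coefficient
    return real.mean() - beta * (sim_paired.mean() - sim_large.mean())

rng = np.random.default_rng(0)
sim_all = rng.normal(0.8, 0.1, 10_000)            # cheap simulated safety scores
idx = rng.choice(10_000, 50, replace=False)
real = sim_all[idx] + rng.normal(0.05, 0.02, 50)  # correlated, costlier real-world runs
print(cv_estimate(real, sim_all[idx], sim_all))   # lower-variance estimate of the real mean
```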
Excited to share that Describe Anything has been accepted at ICCV 2025! 🎉 Describe Anything Model (DAM) is a powerful Multimodal LLM that generates detailed descriptions for user-specified regions in images or videos using points, boxes, scribbles, or masks. Open-source code,
describe-anything.github.io · Describe Anything: Detailed Localized Image and Video Captioning
2 replies · 26 reposts · 118 likes
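Purely as an illustration of the "points, boxes, scribbles, or masks" interface (not DAM's actual API), each kind of region prompt can be rasterized into a binary mask over the image before being handed to a region-captioning model. The helper below and its arguments are hypothetical.

```python
# Illustrative only: rasterize a user-specified region (point / box / scribble / mask)
# into one binary mask that a region-captioning model could consume next to the image.
import numpy as np

def region_to_mask(h, w, point=None, box=None, scribble=None, mask=None, radius=8):
    out = np.zeros((h, w), dtype=bool)
    if mask is not None:                      # already a (h, w) boolean mask
        out |= mask.astype(bool)
    if box is not None:                       # (x0, y0, x1, y1)
        x0, y0, x1, y1 = box
        out[y0:y1, x0:x1] = True
    if point is not None:                     # (x, y) click -> small disk around it
        ys, xs = np.ogrid[:h, :w]
        out |= (xs - point[0]) ** 2 + (ys - point[1]) ** 2 <= radius ** 2
    if scribble is not None:                  # list of (x, y) samples along the stroke
        for x, y in scribble:
            out[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2] = True
    return out

m = region_to_mask(480, 640, box=(100, 120, 300, 360))
print(m.sum())  # number of pixels in the user-specified region
```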
Happy to share our latest work on efficient sensor tokenization for end-to-end driving architectures! https://t.co/nkYUIzfyJT We introduce a novel way to tokenize multi-camera input for AV Transformers that is resolution- and camera-count-agnostic, yet geometry-aware 🧵👇
arxiv.org · Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage...
2 replies · 15 reposts · 30 likes
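A hedged sketch of what "resolution- and camera-count-agnostic, yet geometry-aware" tokenization could look like (my own minimal version, not the paper's tokenizer): patchify each camera independently and add an embedding of each patch's viewing-ray direction, so the token sequence simply grows with cameras × patches. MultiCamPatchTokenizer and its interface are assumptions.

```python
# Illustrative multi-camera tokenizer: arbitrary camera count and resolutions in,
# one flat token sequence out. Geometry awareness comes from embedding each patch's
# viewing-ray direction rather than a fixed positional grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiCamPatchTokenizer(nn.Module):
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.pixel_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        self.ray_proj = nn.Linear(3, dim)  # embed per-patch unit ray direction

    def forward(self, images, rays):
        # images: list of (3, H_i, W_i) tensors, one per camera (resolutions may differ)
        # rays:   list of (H_i/patch, W_i/patch, 3) unit ray directions at patch centers
        tokens = []
        for img, ray in zip(images, rays):
            feat = self.pixel_proj(img.unsqueeze(0))           # (1, dim, h, w)
            feat = feat.flatten(2).transpose(1, 2).squeeze(0)  # (h*w, dim)
            geo = self.ray_proj(ray.reshape(-1, 3))            # (h*w, dim)
            tokens.append(feat + geo)
        return torch.cat(tokens, dim=0)  # (sum_i h_i*w_i, dim): camera-count agnostic

# Example: three cameras at different resolutions -> one token sequence.
tok = MultiCamPatchTokenizer()
imgs = [torch.randn(3, 224, 384), torch.randn(3, 224, 224), torch.randn(3, 128, 256)]
rays = []
for img in imgs:
    _, h, w = img.shape
    rays.append(F.normalize(torch.randn(h // 16, w // 16, 3), dim=-1))
print(tok(imgs, rays).shape)
```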
How do we equip robots with superhuman sensory capabilities? Come join us at the RSS 2025 workshop, June 21, on Multimodal Robotics with Multisensory Capabilities to learn more. Featuring speakers: @JitendraMalikCV, Katherine J. Kuchenbecker, Kristen Grauman, @YunzhuLiYZ, @Boyiliee
1 reply · 3 reposts · 11 likes
Nvidia just dropped Describe Anything on Hugging Face: Detailed Localized Image and Video Captioning
7 replies · 155 reposts · 913 likes
Introducing the Describe Anything Model (DAM), a powerful Multimodal LLM that generates detailed descriptions for user-specified regions in images or videos using points, boxes, scribbles, or masks. Open-source code, models, demo, data, and benchmark at: https://t.co/2p3yM6mSUF
12 replies · 80 reposts · 417 likes
👉🏻 We have released our code and benchmark data at https://t.co/5JRXtfE0iP. At #GTC 2025, we evaluated the safety and comfort of autonomous driving using Wolf: https://t.co/HguJUGI68i.
🚀 Introducing 𝐖𝐨𝐥𝐟 🐺: a mixture-of-experts video captioning framework that outperforms GPT-4V and Gemini-Pro-1.5 in general scenes 🖼️, autonomous driving 🚗, and robotics videos 🤖. 👑: https://t.co/cOEfUvRL0m
1 reply · 6 reposts · 64 likes
Hallucination is a big challenge in video understanding for any single model. To address this, we introduce Wolf 🐺 ( https://t.co/bVgyZRFkCc): a mixture-of-experts framework designed for accurate video understanding by distilling knowledge from various Vision-Language Models.
1 reply · 5 reposts · 24 likes
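Reading the two Wolf tweets above, the core recipe is a mixture of expert captioners plus a summarization/distillation step. Below is a hedged structural sketch of that idea (not the released Wolf code); the expert and summarizer callables are placeholders you would wire to real VLMs and an LLM.

```python
# Structural sketch of a mixture-of-experts captioning pipeline: query several VLM
# "experts" on the same clip, then let a summarizer distill one consistent caption.
from typing import Callable, Sequence

def moe_caption(
    frames: Sequence[bytes],
    experts: Sequence[Callable[[Sequence[bytes]], str]],
    summarizer: Callable[[str], str],
) -> str:
    drafts = [expert(frames) for expert in experts]        # one caption per expert VLM
    merged = "\n".join(f"Expert {i}: {c}" for i, c in enumerate(drafts))
    prompt = (
        "Several models described the same video. Write one accurate caption, "
        "keeping only details the descriptions agree on:\n" + merged
    )
    return summarizer(prompt)                              # e.g. an LLM call

# Toy usage with stand-in experts and a placeholder summarizer:
caption = moe_caption(
    frames=[b"frame0", b"frame1"],
    experts=[lambda f: "a car turns left at an intersection",
             lambda f: "a vehicle makes a left turn in light traffic"],
    summarizer=lambda prompt: prompt.splitlines()[-1],
)
print(caption)
```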
4K Resolution! Vision is a critical part of building powerful multimodal foundation models. Super excited about this work.
Next-gen vision pre-trained models shouldn’t be short-sighted. Humans can easily perceive 10K x 10K resolution. But today’s top vision models—like SigLIP and DINOv2—are still pre-trained at merely hundreds by hundreds of pixels, bottlenecking their real-world usage. Today, we
2 replies · 3 reposts · 49 likes
🚀 New Paper Alert! 🚀 Introducing TULIP 🌷 – a multimodal framework for richer vision-language understanding! A drop-in replacement for CLIP-style models, TULIP learns fine-grained visual details while keeping strong language alignment. 🔗 https://t.co/vm0mSdJ2ul 🧵👇
2 replies · 13 reposts · 31 likes
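For context on "a drop-in replacement for CLIP-style models": such encoders are trained and compared against the standard symmetric contrastive (InfoNCE) objective sketched below. This is the generic CLIP-style loss, not TULIP's specific training recipe.

```python
# Generic CLIP-style symmetric contrastive loss over a batch of paired embeddings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) embeddings of paired images and captions
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(img.shape[0], device=img.device)  # matches on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

print(clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512)))
```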
For the first time ever, @nvidia is hosting an AV Safety Day at GTC - a multi-session workshop on AV safety. We will share our latest work on safe AV platforms, run-time monitoring, safety data flywheels, and more! #AutonomousVehicles #AI at #GTC25 ➡️
0 replies · 15 reposts · 32 likes
Nice to see the progress in interactive task planning. It reminds me of our previous work, ITP, which incorporates both high-level planning and low-level function execution via language.
Can we prompt robots, just like we prompt language models? With a hierarchy of VLA models + LLM-generated data, robots can:
- reason through long-horizon tasks
- respond to a variety of prompts
- handle situated corrections
Blog post & paper: https://t.co/s8xka1yTvy
0 replies · 1 repost · 35 likes