
Boyi Li
@Boyiliee
Followers: 2K
Following: 679
Media: 25
Statuses: 128
Joined March 2020
RT @LongTonyLian: Excited to share that Describe Anything has been accepted at ICCV 2025! 🎉 Describe Anything Model (DAM) is a powerful Mu…
0
17
0
RT @iamborisi: Happy to share our latest work on efficient sensor tokenization for end-to-end driving architectures!
0
13
0
RT @AntheaYLi: How to equip robots with superhuman sensory capabilities? Come join us at the RSS 2025 workshop, June 21, on Multimodal Robotics…
0
3
0
👉🏻 We have released our code and benchmark data. At #GTC 2025, we evaluated the safety and comfort of autonomous driving using Wolf:
🚀 Introducing 𝐖𝐨𝐥𝐟 🐺: a mixture-of-experts video captioning framework that outperforms GPT-4V and Gemini-Pro-1.5 in general scenes 🖼️, autonomous driving 🚗, and robotics videos 🤖. 👑:
1
6
64
4K Resolution! Vision is a critical part of building powerful multimodal foundation models. Super excited about this work.
Next-gen vision pre-trained models shouldn’t be short-sighted. Humans can easily perceive 10K x 10K resolution. But today’s top vision models—like SigLIP and DINOv2—are still pre-trained at merely hundreds by hundreds of pixels, bottlenecking their real-world usage. Today, we
2
2
48
RT @baifeng_shi: Next-gen vision pre-trained models shouldn’t be short-sighted. Humans can easily perceive 10K x 10K resolution. But today…
0
153
0
RT @drmapavone: At #GTC2025, Jensen unveiled Halos, a comprehensive safety system for AVs and Physical AI. Halos integrates numerous techno…
0
12
0
RT @drmapavone: For the first time ever, @nvidia is hosting an AV Safety Day at GTC - a multi-session workshop on AV safety. We will shar…
0
14
0
Nice to see the progress in interactive task planning. It reminds me of our previous work, ITP, which incorporates both high-level planning and low-level function execution via language.
Can we prompt robots, just like we prompt language models? With a hierarchy of VLA models + LLM-generated data, robots can:
- reason through long-horizon tasks
- respond to a variety of prompts
- handle situated corrections
Blog post & paper:
0
1
35
RT @PointsCoder: Can Vision-Language Models (VLMs) truly understand the physical world? 🌍🔬 Introducing PhysBench – the first benchmark to…
0
74
0
RT @_amirbar: @Boyiliee Wow, this is super cool, Boyi! Such a flashback to @carolinemchan and @shiryginosar's 'Everybody Dance Now' 😀.
0
1
0
RT @drmapavone: Complementing DreamDrive, I am thrilled to introduce STORM, which enables fast scene reconstruction with a single feed-forw…
0
31
0
RT @JitendraMalikCV: I'm happy to post course materials for my class at UC Berkeley "Robots that Learn", taught with the outstanding assist…
0
248
0
RT @drmapavone: Introducing DreamDrive, which combines the complementary strengths of generative AI (video diffusion) and neural reconstruc…
0
44
0
RT @JitendraMalikCV: Happy to share these exciting new results on video synthesis of humans in movement. Arguably, these establish the powe…
0
7
0