Miran Heo
@miran_heo
Followers
193
Following
490
Media
7
Statuses
46
Research Scientist at Meta
New York, NY
Joined December 2022
We connect the autoregressive pipeline of LLMs with streaming video perception. Introducing AUSM: the Autoregressive Universal Video Segmentation Model. A step toward unified, scalable video perception, inspired by how LLMs unified NLP.
arxiv.org
Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation...
2
28
142
Introducing Cambrian-S: it's a position, a dataset, a benchmark, and a model, but above all, it represents our first steps toward exploring spatial supersensing in video. 🧶
25
93
620
Thanks @_akhaliq for sharing our work! Check out more details:
0
8
46
This marks my last project as a student, which makes it especially meaningful. Grateful to have worked on it during my internship at NVIDIA, together with @sukjun_hwang, @CMHungSteven, Yu-Chiang Frank Wang, @_albertgu, Seon Joo Kim, @RHachiuma
0
0
11
AUSM delivers strong performance across seven popular benchmarks, outperforming previous online universal video segmentation models. Importantly, these results are achieved without relying on FIFO memory buffers, highlighting the efficiency and scalability of
1
0
10
AUSM supports parallel training, a property critical to the building blocks of decoder-only LLMs and to extending to long sequences. Our training pipeline shows significant speed-ups, achieving up to 2.5× faster training than iterative baselines in the 16-frame setting.
1
0
7
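A minimal numpy sketch, with invented shapes and function names, of the parallel-training idea behind the speedup above: a causal mask reproduces the per-frame, step-by-step outputs in a single batched pass, which is what makes LLM-style training over many frames fast.

```python
import numpy as np

# Toy illustration (not the AUSM code): training on T frames in parallel with a
# causal mask vs. iterating frame by frame. Shapes and names are invented.
T, D = 16, 64                      # frames, feature dim
frames = np.random.randn(T, D)     # one token per frame for simplicity

def attend(q, k, v, mask):
    """Single-head scaled dot-product attention with an additive mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Iterative baseline: frame t only sees frames 0..t, computed one step at a time.
def sequential(frames):
    outs = []
    for t in range(len(frames)):
        ctx = frames[: t + 1]
        outs.append(attend(frames[t : t + 1], ctx, ctx, np.zeros((1, t + 1))))
    return np.concatenate(outs, axis=0)

# Parallel (teacher-forced) pass: one masked matrix product gives the same
# per-frame outputs in a single batched call.
def parallel(frames):
    T = len(frames)
    mask = np.where(np.tril(np.ones((T, T))) == 1, 0.0, -1e9)
    return attend(frames, frames, frames, mask)

assert np.allclose(sequential(frames), parallel(frames), atol=1e-6)
```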
AUSM introduces:
- History Marker: dissolves segmentation masks into frame features (via Token Mark), preserving fine instance details. ⚡ ~+10% VOS gain over prior unified architectures.
- History Compressor: condenses spatio-temporal info from all past frames into a
1
0
10
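A hypothetical skeleton of the streaming loop described in the thread above: the History Marker "dissolves" the previous frame's masks into the current features, and the History Compressor folds each marked frame into a fixed-size state. Every module, name, and shape below is an invented stand-in, not the actual AUSM implementation.

```python
import numpy as np

# Invented stand-ins to make the History Marker / History Compressor roles concrete.
H, W, D = 32, 32, 64                     # toy spatial size and feature dim
rng = np.random.default_rng(0)
token_mark = rng.normal(size=(8, D))     # stand-in for learned instance marks
query_embed = rng.normal(size=(4, D))    # stand-in for learned mask queries

def history_marker(feats, prev_masks):
    """'Dissolve' the last frame's masks into the current features: add an
    instance-specific mark wherever that instance's mask was on."""
    marked = feats.copy()
    for inst_id, mask in prev_masks.items():
        marked += mask[..., None] * token_mark[inst_id % len(token_mark)]
    return marked

def history_compressor(state, marked_feats):
    """Condense the marked frame into a fixed-size running state (a crude
    stand-in for the real spatio-temporal compressor)."""
    return 0.9 * state + 0.1 * marked_feats.mean(axis=(0, 1))

def decode_masks(feats, state):
    """Toy mask decoder: one binary mask per query from feature similarity."""
    return {q: (feats @ (state + query_embed[q]) > 0) for q in range(len(query_embed))}

state, prev_masks = np.zeros(D), {}
for t in range(5):                        # streaming, frame-by-frame inference
    feats = rng.normal(size=(H, W, D))    # stand-in for per-frame backbone features
    marked = history_marker(feats, prev_masks)
    state = history_compressor(state, marked)
    prev_masks = decode_masks(feats, state)
```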
Video segmentation today is fragmented:
- Prompted methods (e.g., VOS) need user input but miss new objects.
- Unprompted methods (e.g., VIS) handle all objects but often lose fine temporal detail.
AUSM unifies both under a single autoregressive framework, drawing a key
1
2
12
Happy to announce that our paper, UniSkill, has been accepted to CoRL 2025! 🤖 Learning from human videos is the future of robot learning, but the cross-embodiment gap has been a major barrier. We introduce a simple yet powerful way to bridge this gap. Looking forward to seeing you in Korea! 🇰🇷
How can we effectively leverage human videos for robot learning by bridging the inherent embodiment gap? We introduce UniSkill, a scalable method for learning universal cross-embodiment skill representations from large-scale in-the-wild video data. 1/n
5
7
101
@nvidia research will present a few NLP works at @aclmeeting ACL 2025 in Austria 🇦🇹 NEKO: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts LM. Industry Session Oral, Hall L, 11:00 to 12:30 CET, Monday https://t.co/bYzzdZSroM (@yentinglin56
1
7
30
🤔 How can we teach embodied agents to think before they act? Introducing ThinkAct: a hierarchical reasoning VLA framework with an MLLM for complex, slow reasoning and an action expert for fast, grounded execution. Slow think, fast act. 🧠⚡🤲
4
26
103
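A rough sketch of the slow-think / fast-act split described above, assuming a generic hierarchical control loop: an expensive reasoner is called only every few steps to produce a plan, while a cheap action expert acts at every step. The functions and the re-planning interval are stand-ins, not the actual ThinkAct components.

```python
import numpy as np

# Hypothetical slow-think / fast-act loop; both models are random stand-ins.
rng = np.random.default_rng(0)

def slow_reasoner(observation):
    """Stand-in for the MLLM: an expensive call that returns a latent 'plan'."""
    return rng.normal(size=8)

def fast_action_expert(plan, observation):
    """Stand-in for the action expert: a cheap call conditioned on the plan."""
    return np.tanh(plan[:4] + 0.1 * observation[:4])

plan, replan_every = None, 10               # invented re-planning interval
for step in range(30):
    obs = rng.normal(size=16)               # stand-in for the current observation
    if step % replan_every == 0:
        plan = slow_reasoner(obs)           # slow, deliberate reasoning
    action = fast_action_expert(plan, obs)  # fast, grounded execution
```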
Double the Speed, Zero Training: The Free Lunch for Video LLMs! 🚨 I am excited to share that our paper has been accepted to #ICCV2025 @ICCVConference. ArXiv paper: https://t.co/8dnqmHoAsm Project page:
jshyun.me
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs. Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim...
1
6
16
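To make the training-free acceleration idea concrete, here is a bare-bones token-merging sketch (not the paper's multi-granular algorithm): redundant video tokens are greedily averaged together by cosine similarity before the LLM reads them. All names and the keep ratio are assumptions.

```python
import numpy as np

# Simplified greedy merge; similarities are computed once, for brevity.
def merge_similar_tokens(tokens, keep_ratio=0.5):
    """Greedily fold each token into its most similar partner (cosine similarity
    of the original tokens) until keep_ratio of the tokens remain."""
    tokens = tokens.copy()
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)
    alive = np.ones(len(tokens), dtype=bool)
    counts = np.ones(len(tokens))
    while alive.sum() > int(len(tokens) * keep_ratio):
        masked = np.where(np.outer(alive, alive), sim, -np.inf)
        i, j = np.unravel_index(np.argmax(masked), masked.shape)
        # merge token j into token i, keeping a running mean of the merged group
        tokens[i] = (counts[i] * tokens[i] + counts[j] * tokens[j]) / (counts[i] + counts[j])
        counts[i] += counts[j]
        alive[j] = False
    return tokens[alive]

video_tokens = np.random.randn(256, 64)        # e.g. patch tokens from a few frames
reduced = merge_similar_tokens(video_tokens)   # ~2x fewer tokens for the LLM to read
print(video_tokens.shape, "->", reduced.shape) # (256, 64) -> (128, 64)
```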
I converted one of my favorite talks I've given over the past year into a blog post: "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit). In a few days, we'll release what I believe is the next major advance for architectures.
27
117
784
🚨 Excited to announce the STREAM Workshop at #ICCV2025! The Call for Papers is now open: we welcome full papers (8 pages) and extended abstracts (2–4 pages). Topics: Trustworthy, Fair, and Explainable AI, etc. Details & submission: https://t.co/qOJL8LIFCN
1
6
19
(1/3) Although I won't be attending CVPR in person this year, two of our papers will be presented! Feel free to stop by the poster sessions to check them out! Paper links in the thread below... #CVPR2025
1
2
9
Interactive looong-context reasoning still has a long way to go. We need progress across all axes: more data, bigger models, and smarter architectures. ∞-THOR is just the beginning: generate ∞-length trajectories, run agents online, train with feedback, and more! Let's push the limits!
"Foundation" models for embodied agents are all the rage but how to actually do complex looong context reasoning? Can we scale Beyond Needle(s) in the (Embodied) Haystack? โ-THOR is an infinite len sim framework + guide on (new) architectures/training methods for VLA models
0
6
17
"Foundation" models for embodied agents are all the rage but how to actually do complex looong context reasoning? Can we scale Beyond Needle(s) in the (Embodied) Haystack? โ-THOR is an infinite len sim framework + guide on (new) architectures/training methods for VLA models
1
10
44
How can we effectively leverage human videos for robot learning by bridging the inherent embodiment gap? We introduce UniSkill, a scalable method for learning universal cross-embodiment skill representations from large-scale in-the-wild video data. 1/n
4
32
190
The 4th Workshop on Transformers for Vision (T4V) at CVPR 2025 is soliciting self-nominations for reviewers. If you're interested, please fill out this form: https://t.co/8X7HKD43Ia More information can be found on our website: https://t.co/nRRhNR1NUj
2
11
27