Miran Heo
@miran_heo
Followers
193
Following
490
Media
7
Statuses
46
Research Scientist at Meta
New York, NY
Joined December 2022
We connect the autoregressive pipeline of LLMs with streaming video perception. Introducing AUSM: the Autoregressive Universal Video Segmentation Model. A step toward unified, scalable video perception, inspired by how LLMs unified NLP.
arxiv.org
Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation...
2
28
142
Introducing Cambrian-S: it's a position, a dataset, a benchmark, and a model, but above all, it represents our first steps toward exploring spatial supersensing in video. 🧶
25
93
620
Thanks @_akhaliq for sharing our work! Check out more details:
0
8
46
This marks my last project as a student, which makes it especially meaningful. Grateful to have worked on it during my internship at NVIDIA, together with @sukjun_hwang, @CMHungSteven, Yu-Chiang Frank Wang, @_albertgu, Seon Joo Kim, @RHachiuma
0
0
11
AUSM delivers strong performance across seven popular benchmarks, outperforming previous online universal video segmentation models. Importantly, these results are achieved without relying on FIFO memory buffers, highlighting the efficiency and scalability of
1
0
10
AUSM supports parallel training, a property critical to the building blocks of decoder-only LLMs and to extending to long sequences. Our training pipeline shows significant speed-ups, achieving up to 2.5× faster training than iterative baselines in the 16-frame setting.
1
0
7
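A minimal numpy sketch, with invented shapes and function names, of the parallel-training idea behind the speedup above: a causal mask reproduces the per-frame, step-by-step outputs in a single batched pass, which is what makes LLM-style training over many frames fast.

```python
import numpy as np

# Toy illustration (not the AUSM code): training on T frames in parallel with a
# causal mask vs. iterating frame by frame. Shapes and names are invented.
T, D = 16, 64                      # frames, feature dim
frames = np.random.randn(T, D)     # one token per frame for simplicity

def attend(q, k, v, mask):
    """Single-head scaled dot-product attention with an additive mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Iterative baseline: frame t only sees frames 0..t, computed one step at a time.
def sequential(frames):
    outs = []
    for t in range(len(frames)):
        ctx = frames[: t + 1]
        outs.append(attend(frames[t : t + 1], ctx, ctx, np.zeros((1, t + 1))))
    return np.concatenate(outs, axis=0)

# Parallel (teacher-forced) pass: one masked matrix product gives the same
# per-frame outputs in a single batched call.
def parallel(frames):
    T = len(frames)
    mask = np.where(np.tril(np.ones((T, T))) == 1, 0.0, -1e9)
    return attend(frames, frames, frames, mask)

assert np.allclose(sequential(frames), parallel(frames), atol=1e-6)
```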
AUSM introduces:
- History Marker: dissolves segmentation masks into frame features (via Token Mark), preserving fine instance details. ⚡ ~+10% VOS gain over prior unified architectures.
- History Compressor: condenses spatio-temporal info from all past frames into a
1
0
10
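A hypothetical skeleton of the streaming loop described in the thread above: the History Marker "dissolves" the previous frame's masks into the current features, and the History Compressor folds each marked frame into a fixed-size state. Every module, name, and shape below is an invented stand-in, not the actual AUSM implementation.

```python
import numpy as np

# Invented stand-ins to make the History Marker / History Compressor roles concrete.
H, W, D = 32, 32, 64                     # toy spatial size and feature dim
rng = np.random.default_rng(0)
token_mark = rng.normal(size=(8, D))     # stand-in for learned instance marks
query_embed = rng.normal(size=(4, D))    # stand-in for learned mask queries

def history_marker(feats, prev_masks):
    """'Dissolve' the last frame's masks into the current features: add an
    instance-specific mark wherever that instance's mask was on."""
    marked = feats.copy()
    for inst_id, mask in prev_masks.items():
        marked += mask[..., None] * token_mark[inst_id % len(token_mark)]
    return marked

def history_compressor(state, marked_feats):
    """Condense the marked frame into a fixed-size running state (a crude
    stand-in for the real spatio-temporal compressor)."""
    return 0.9 * state + 0.1 * marked_feats.mean(axis=(0, 1))

def decode_masks(feats, state):
    """Toy mask decoder: one binary mask per query from feature similarity."""
    return {q: (feats @ (state + query_embed[q]) > 0) for q in range(len(query_embed))}

state, prev_masks = np.zeros(D), {}
for t in range(5):                        # streaming, frame-by-frame inference
    feats = rng.normal(size=(H, W, D))    # stand-in for per-frame backbone features
    marked = history_marker(feats, prev_masks)
    state = history_compressor(state, marked)
    prev_masks = decode_masks(feats, state)
```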
Video segmentation today is fragmented:
- Prompted methods (e.g., VOS) need user input but miss new objects.
- Unprompted methods (e.g., VIS) handle all objects but often lose fine temporal detail.
AUSM unifies both under a single autoregressive framework, drawing a key
1
2
12
Happy to announce that our paper, UniSkill, has been accepted to CoRL 2025! 🤖 Learning from human videos is the future of robot learning, but the cross-embodiment gap has been a major barrier. We introduce a simple yet powerful way to bridge this gap. Looking forward to seeing you in Korea! 🇰🇷
How can we effectively leverage human videos for robot learning by bridging the inherent embodiment gap? We introduce UniSkill, a scalable method for learning universal cross-embodiment skill representations from large-scale in-the-wild video data. 1/n
5
7
101
@nvidia research will present a few NLP works at @aclmeeting ACL 2025 in Austria 🇦🇹 NEKO: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts LM. Industry Session Oral, Hall L, 11:00 to 12:30 CET, Monday https://t.co/bYzzdZSroM (@yentinglin56
1
7
30
🤔 How can we teach embodied agents to think before they act? Introducing ThinkAct: a hierarchical reasoning VLA framework with an MLLM for complex, slow reasoning and an action expert for fast, grounded execution. Slow think, fast act. 🧠⚡🤲
4
26
103
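A rough sketch of the slow-think / fast-act split described above, assuming a generic hierarchical control loop: an expensive reasoner is called only every few steps to produce a plan, while a cheap action expert acts at every step. The functions and the re-planning interval are stand-ins, not the actual ThinkAct components.

```python
import numpy as np

# Hypothetical slow-think / fast-act loop; both models are random stand-ins.
rng = np.random.default_rng(0)

def slow_reasoner(observation):
    """Stand-in for the MLLM: an expensive call that returns a latent 'plan'."""
    return rng.normal(size=8)

def fast_action_expert(plan, observation):
    """Stand-in for the action expert: a cheap call conditioned on the plan."""
    return np.tanh(plan[:4] + 0.1 * observation[:4])

plan, replan_every = None, 10               # invented re-planning interval
for step in range(30):
    obs = rng.normal(size=16)               # stand-in for the current observation
    if step % replan_every == 0:
        plan = slow_reasoner(obs)           # slow, deliberate reasoning
    action = fast_action_expert(plan, obs)  # fast, grounded execution
```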
Double the Speed, Zero Training: The Free Lunch for Video LLMs! 🚨 I am excited to share that our paper has been accepted to #ICCV2025 @ICCVConference. ArXiv paper: https://t.co/8dnqmHoAsm Project page:
jshyun.me
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs. Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim...
1
6
16
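To make the training-free acceleration idea concrete, here is a bare-bones token-merging sketch (not the paper's multi-granular algorithm): redundant video tokens are greedily averaged together by cosine similarity before the LLM reads them. All names and the keep ratio are assumptions.

```python
import numpy as np

# Simplified greedy merge; similarities are computed once, for brevity.
def merge_similar_tokens(tokens, keep_ratio=0.5):
    """Greedily fold each token into its most similar partner (cosine similarity
    of the original tokens) until keep_ratio of the tokens remain."""
    tokens = tokens.copy()
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)
    alive = np.ones(len(tokens), dtype=bool)
    counts = np.ones(len(tokens))
    while alive.sum() > int(len(tokens) * keep_ratio):
        masked = np.where(np.outer(alive, alive), sim, -np.inf)
        i, j = np.unravel_index(np.argmax(masked), masked.shape)
        # merge token j into token i, keeping a running mean of the merged group
        tokens[i] = (counts[i] * tokens[i] + counts[j] * tokens[j]) / (counts[i] + counts[j])
        counts[i] += counts[j]
        alive[j] = False
    return tokens[alive]

video_tokens = np.random.randn(256, 64)        # e.g. patch tokens from a few frames
reduced = merge_similar_tokens(video_tokens)   # ~2x fewer tokens for the LLM to read
print(video_tokens.shape, "->", reduced.shape) # (256, 64) -> (128, 64)
```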
I converted one of my favorite talks I've given over the past year into a blog post: "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit). In a few days, we'll release what I believe is the next major advance for architectures.
27
117
784
🚨 Excited to announce the STREAM Workshop at #ICCV2025! The Call for Papers is now open: we welcome full papers (8 pages) and extended abstracts (2–4 pages). Topics: Trustworthy, Fair, and Explainable AI, etc. Details & submission: https://t.co/qOJL8LIFCN
1
6
19
(1/3) Although I won't be attending CVPR in person this year, two of our papers will be presented! Feel free to stop by the poster sessions to check them out! Paper links in the thread below... #CVPR2025
1
2
9
Interactive looong-context reasoning still has a long way to go. We need progress across all axes: more data, bigger models, and smarter architectures. ∞-THOR is just the beginning: generate ∞-length trajectories, run agents online, train with feedback, and more! Let's push the limits!
"Foundation" models for embodied agents are all the rage but how to actually do complex looong context reasoning? Can we scale Beyond Needle(s) in the (Embodied) Haystack? โ-THOR is an infinite len sim framework + guide on (new) architectures/training methods for VLA models
0
6
17
"Foundation" models for embodied agents are all the rage but how to actually do complex looong context reasoning? Can we scale Beyond Needle(s) in the (Embodied) Haystack? โ-THOR is an infinite len sim framework + guide on (new) architectures/training methods for VLA models
1
10
44
How can we effectively leverage human videos for robot learning by bridging the inherent embodiment gap? We introduce UniSkill, a scalable method for learning universal cross-embodiment skill representations from large-scale in-the-wild video data. 1/n
4
32
190
The 4th Workshop on Transformers for Vision (T4V) at CVPR 2025 is soliciting self-nominations for reviewers. If you're interested, please fill out this form: https://t.co/8X7HKD43Ia More information can be found on our website: https://t.co/nRRhNR1NUj
2
11
27