Miran Heo Profile
Miran Heo

@miran_heo

Followers
193
Following
490
Media
7
Statuses
46

Research Scientist at Meta

New York, NY
Joined December 2022
@miran_heo
Miran Heo
2 months
We connect the autoregressive pipeline of LLMs with streaming video perception. Introducing AUSM: Autoregressive Universal Video Segmentation Model. A step toward unified, scalable video perception, inspired by how LLMs unified NLP.
arxiv.org
Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation...
2
28
142
@sainingxie
Saining Xie
7 days
Introducing Cambrian-S. It's a position, a dataset, a benchmark, and a model, but above all, it represents our first steps toward exploring spatial supersensing in video.
25
93
620
@miran_heo
Miran Heo
2 months
Thanks @_akhaliq for sharing our work! Check out more details:
@_akhaliq
AK
3 months
Nvidia presents Autoregressive Universal Video Segmentation Model
0
8
46
@miran_heo
Miran Heo
2 months
This marks my last project as a student, which makes it especially meaningful. Grateful to have worked on it during my internship at NVIDIA, together with @sukjun_hwang, @CMHungSteven, Yu-Chiang Frank Wang, @_albertgu, Seon Joo Kim, @RHachiuma
0
0
11
@miran_heo
Miran Heo
2 months
Across seven popular benchmarks, AUSM delivers strong performance, outperforming previous online universal video segmentation models. Importantly, these results are achieved without relying on FIFO memory buffers, highlighting the efficiency and scalability of the approach.
1
0
10
@miran_heo
Miran Heo
2 months
AUSM supports parallel training, a critical property of the building blocks in decoder-only LLMs for extending to long sequences. Our training pipeline shows significant speed-ups, achieving up to 2.5× faster training than iterative baselines in the 16-frame setting.
1
0
7
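A minimal sketch of the property this tweet points at, assuming generic components (the `temporal_mixer`, shapes, and loss below are illustrative stand-ins, not AUSM's actual modules): parallel training runs one teacher-forced pass over all frames of a clip with a causal mask, as in decoder-only LLMs, while an iterative baseline loops over frames and re-encodes the growing context at each step.

```python
import torch
import torch.nn as nn

T, D = 16, 256                          # frames per clip, feature dimension
frame_feats = torch.randn(1, T, D)      # per-frame features from some image backbone
targets = torch.randn(1, T, D)          # stand-in for per-frame prediction targets

temporal_mixer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
causal_mask = nn.Transformer.generate_square_subsequent_mask(T)

# Parallel (teacher-forced) pass: one forward covers all T frames; each frame
# attends only to the past via the causal mask, so the whole clip contributes
# gradients from a single pass.
out_parallel = temporal_mixer(frame_feats, src_mask=causal_mask)
loss = nn.functional.mse_loss(out_parallel, targets)

# Iterative baseline: T separate forward passes, each re-encoding the frames
# seen so far; the loop serializes computation across time.
outs = []
for t in range(T):
    outs.append(temporal_mixer(frame_feats[:, : t + 1])[:, -1])
out_iterative = torch.stack(outs, dim=1)
```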
@miran_heo
Miran Heo
2 months
AUSM introduces:
- History Marker → dissolves segmentation masks into frame features (via Token Mark), preserving fine instance details; ~+10% VOS gain over prior unified architectures.
- History Compressor → condenses spatio-temporal info from all past frames into a...
1
0
10
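A hedged reading of the "dissolve masks into frame features" idea, only as a toy sketch: attach a per-instance marker embedding to the pixels each past mask covers, so instance identity travels inside the feature map instead of in a separate mask buffer. The names and the simple additive injection below are assumptions for illustration; the actual Token Mark mechanism is defined in the paper.

```python
import torch

H, W, D = 32, 32, 256
N = 3                                         # instances tracked so far
frame_feats = torch.randn(H, W, D)            # spatial features of the current frame
masks = (torch.rand(N, H, W) > 0.7).float()   # per-instance masks from past predictions
instance_marks = torch.randn(N, D)            # one marker embedding per instance (assumed)

# Broadcast each instance's marker onto the pixels its mask covers, so the
# feature map itself records "which instance was where".
marked = frame_feats + torch.einsum("nhw,nd->hwd", masks, instance_marks)
print(marked.shape)                           # torch.Size([32, 32, 256])
```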
@miran_heo
Miran Heo
2 months
Video segmentation today is fragmented:
- Prompted methods (e.g., VOS) need user input, but miss new objects.
- Unprompted methods (e.g., VIS) handle all objects, but often lose fine temporal detail.
AUSM unifies both under a single autoregressive framework, drawing a key...
1
2
12
@KimD0ing
Hanjung Kim
3 months
Happy to announce our paper, UniSkill, is accepted to CoRL 2025! Learning from human videos is the future of robot learning, but the cross-embodiment gap has been a major barrier. We introduce a simple yet powerful way to bridge this gap. Looking forward to seeing you in Korea!
@KimD0ing
Hanjung Kim
6 months
How can we effectively leverage human videos for robot learning by bridging the inherent embodiment gap? We introduce UniSkill, a universal skill representation, a scalable method for learning cross-embodiment skill representations from large-scale in-the-wild video data. 1/n
5
7
101
@huckiyang
Huck Yang
4 months
@nvidia research will present a few NLP works at @aclmeeting ACL 2025 in Austria. NEKO: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts LM. Industry Session Oral, Hall L, 11:00 to 12:30 CET, Monday. https://t.co/bYzzdZSroM (@yentinglin56
1
7
30
@FuEnYang1
Fu-En (Fred) Yang
4 months
How can we teach embodied agents to think before they act? Introducing ThinkAct, a hierarchical Reasoning VLA framework with an MLLM for complex, slow reasoning and an action expert for fast, grounded execution. Slow think, fast act.
4
26
103
@miran_heo
Miran Heo
4 months
Great work!
@sukjun_hwang
Sukjun (June) Hwang
4 months
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
0
0
7
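A rough sketch of what "dynamic chunking inside the model" could look like, purely as an illustration: a small learned scorer proposes boundaries over raw-byte features, and positions between boundaries are pooled into one unit. The thresholding and mean pooling below are assumptions, not the H-Net mechanism itself.

```python
import torch
import torch.nn as nn

L, D = 64, 128
byte_feats = torch.randn(1, L, D)      # features of raw bytes (no tokenizer)
boundary_scorer = nn.Linear(D, 1)      # tiny learned boundary scorer (illustrative)

probs = torch.sigmoid(boundary_scorer(byte_feats)).squeeze(-1)   # (1, L)
is_boundary = probs > 0.5              # data-dependent chunk cuts
chunk_id = torch.cumsum(is_boundary.long(), dim=-1)

# Pool every dynamically chosen chunk into one vector; the main network then
# operates over these fewer, data-discovered units instead of fixed tokens.
chunks = [byte_feats[0, chunk_id[0] == c].mean(dim=0)
          for c in chunk_id[0].unique()]
chunk_feats = torch.stack(chunks)      # (num_chunks, D)
```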
@Jeongseok_hyun
Jeongseok Hyun
4 months
Double the Speed, Zero Training: The Free Lunch for Video LLMs! I am excited to share that our paper is accepted to #ICCV2025 @ICCVConference. ArXiv paper: https://t.co/8dnqmHoAsm Project page:
jshyun.me
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs Jeongseok Hyun1 Sukjun Hwang2 Su Ho Han1 Taeoh Kim3 Inwoong Lee3 Dongyoon Wee3 Joon-Young Lee4 Seon Joo Kim1...
1
6
16
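A toy sketch of training-free visual-token reduction in the spirit of the title: greedily average the most similar tokens so a video LLM processes fewer of them, with no retraining. The greedy pairing and merge count below are illustrative assumptions; the paper's multi-granular spatio-temporal merging is more involved.

```python
import torch

N, D, r = 196, 256, 49                 # tokens, feature dim, number of merges
tokens = torch.randn(N, D)             # visual tokens for one or more frames

x = torch.nn.functional.normalize(tokens, dim=-1)
sim = x @ x.T                          # cosine similarity between tokens
sim.fill_diagonal_(-1.0)

merged = tokens.clone()
alive = torch.ones(N, dtype=torch.bool)
for _ in range(r):
    i, j = divmod(torch.argmax(sim).item(), N)   # most similar remaining pair
    merged[i] = (merged[i] + merged[j]) / 2      # fuse the pair by averaging
    alive[j] = False
    sim[j, :] = -1.0                             # retire token j
    sim[:, j] = -1.0

reduced = merged[alive]                # (N - r, D): ~25% fewer tokens, no training
```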
@_albertgu
Albert Gu
4 months
I converted one of my favorite talks I've given over the past year into a blog post. "On the Tradeoffs of SSMs and Transformers" (or: tokens are bullshit) In a few days, we'll release what I believe is the next major advance for architectures.
27
117
784
@RHachiuma
Ryo Hachiuma
5 months
Excited to announce the STREAM Workshop at #ICCV2025! Call for Papers is now open; we welcome full papers (8 pages) and extended abstracts (2-4 pages). Topics: Trustworthy, Fairness, and Explainable AI, etc. Details & submission: https://t.co/qOJL8LIFCN
1
6
19
@RHachiuma
Ryo Hachiuma
5 months
(1/3) Although I won't be attending CVPR in person this year, two of our papers will be presented! Feel free to stop by the poster sessions to check them out! Paper links in the thread below... #CVPR2025
1
2
9
@bosungkim17
Bosung Kim
6 months
Interactive looong-context reasoning still has a long way to go. We need progress across all axes: more data, bigger models, and smarter architectures. ∞-THOR is just the beginning: generate ∞-len trajectories, run agents online, train with feedback, and more! Let's push the limits!
@rajammanabrolu
Prithviraj (Raj) Ammanabrolu
6 months
"Foundation" models for embodied agents are all the rage but how to actually do complex looong context reasoning? Can we scale Beyond Needle(s) in the (Embodied) Haystack? โˆž-THOR is an infinite len sim framework + guide on (new) architectures/training methods for VLA models
0
6
17
@rajammanabrolu
Prithviraj (Raj) Ammanabrolu
6 months
"Foundation" models for embodied agents are all the rage but how to actually do complex looong context reasoning? Can we scale Beyond Needle(s) in the (Embodied) Haystack? โˆž-THOR is an infinite len sim framework + guide on (new) architectures/training methods for VLA models
1
10
44
@KimD0ing
Hanjung Kim
6 months
How can we effectively leverage human videos for robot learning by bridging the inherent embodiment gap? We introduce UniSkill, a universal skill representation, a scalable method for learning cross-embodiment skill representations from large-scale in-the-wild video data. 1/n
4
32
190
@CMHungSteven
Min-Hung (Steve) Chen
7 months
The 4th Workshop on Transformers for Vision (T4V) at CVPR 2025 is soliciting self-nominations for reviewers. If you're interested, please fill out this form: https://t.co/8X7HKD43Ia More information can be found on our website: https://t.co/nRRhNR1NUj
2
11
27