Seungwoo (Simon) Kim Profile
Seungwoo (Simon) Kim

@SeKim1112

Followers: 38 · Following: 22 · Media: 5 · Statuses: 16

cs/ai @ stanford

Stanford, CA
Joined March 2025
@SeKim1112
Seungwoo (Simon) Kim
19 days
A very interesting idea for segmentation that aligns with a fundamental concept of objects!
@Rahul_Venkatesh
Rahul Venkatesh
19 days
AI models segment scenes based on how things appear, but babies segment based on what moves together. We use a visual world model our lab has been developing to capture this concept, and what's cool is that it beats SOTA models on zero-shot segmentation and physical…
0
0
4
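A minimal sketch of the "what moves together" (common fate) idea in the tweet above: cluster dense optical-flow vectors so pixels that move together land in the same segment. This is only a toy illustration (k-means on flow), not the world-model approach the thread describes; the flow input, group count, and k-means readout are assumptions.

```python
import numpy as np

def common_fate_segments(flow, n_groups=4, iters=10, seed=0):
    """Toy 'common fate' grouping: k-means on dense optical-flow vectors,
    so pixels that move together end up in the same segment.

    flow: (H, W, 2) array of per-pixel flow vectors (assumed given).
    Returns an (H, W) array of segment ids.
    """
    H, W, _ = flow.shape
    v = flow.reshape(-1, 2).astype(float)
    rng = np.random.default_rng(seed)
    centers = v[rng.choice(len(v), size=n_groups, replace=False)].copy()
    for _ in range(iters):
        # assign each pixel's flow vector to the nearest motion center
        labels = ((v[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # update each center to the mean flow of its group
        for k in range(n_groups):
            if (labels == k).any():
                centers[k] = v[labels == k].mean(0)
    return labels.reshape(H, W)
```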
@SeKim1112
Seungwoo (Simon) Kim
26 days
We see KL-tracing as the simpler, more general recipe for zero-shot flow (and other intermediates) whenever the base model offers fine-grained control, e.g. LRAS. Excited to see how future work puts generative video models to use.
arxiv.org
Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained...
0
0
1
@SeKim1112
Seungwoo (Simon) Kim
26 days
(2) Like our ablations, they find that squeezing an entire clip into a single latent bottleneck wipes out fine motion. Their more involved strategy of establishing a one-to-one mapping between each frame and its latent boosts accuracy significantly. BUT it still trails supervised flow…
1
0
1
@SeKim1112
Seungwoo (Simon) Kim
26 days
Concurrent work alert! DiffTrack (@jisu__nam, @JunhwaHur, @KimSeungry62571, et al.) is a super cool paper that tackles the same puzzle we do: can you pull out useful signals from a generative video model with zero labels? Their trick is to probe…
arxiv.org
Recent advancements in video diffusion models based on Diffusion Transformers (DiTs) have achieved remarkable success in generating temporally coherent videos. Yet, a fundamental question...
1
2
9
@SeKim1112
Seungwoo (Simon) Kim
26 days
RT @dyamins: Over the past 18 months my lab has been developing a new approach to visual world modeling. There will be a magnum opus that t….
0
14
0
@SeKim1112
Seungwoo (Simon) Kim
27 days
RT @khai_loong_aw: So excited by this direction of using generative video models for vision tasks. Here we show it for extracting optical f….
0
2
0
@SeKim1112
Seungwoo (Simon) Kim
27 days
RT @KlemenKotar: 📷 New Preprint: SOTA optical flow extraction from pre-trained generative video models! While it seems intuitive that video….
0
9
0
@SeKim1112
Seungwoo (Simon) Kim
27 days
Our final method, KL-tracing with LRAS, achieves state-of-the-art results on the TAP-Vid benchmark, as well as qualitatively accurate flow traces on challenging scenes, even compared to supervised, task-specific baselines such as SEA-RAFT.
0
0
2
@SeKim1112
Seungwoo (Simon) Kim
27 days
FINALLY: KL-tracing works by computing the KL divergence between clean & perturbed logit distributions. This is a powerful *statistical counterfactual* probe enabled by autoregressive generative predictors (like LRAS).
1
1
3
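A minimal numpy sketch of the KL-tracing readout described in the tweet above. It assumes the frozen model exposes per-location logits over a token codebook for the predicted frame; the (H, W, V) shapes and the argmax readout are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_trace(clean_logits, perturbed_logits):
    """Read out where a query-point perturbation 'lands' in the next frame.

    clean_logits, perturbed_logits: (H, W, V) per-location logits over a
    V-way token codebook, from the frozen model run on the clean and the
    perturbed inputs respectively (shapes are assumptions).
    Returns (x, y): the location whose predictive distribution shifted most.
    """
    p = softmax(perturbed_logits)
    q = softmax(clean_logits)
    eps = 1e-9
    # per-location KL(perturbed || clean): large where the perturbation shows up
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(-1)
    y, x = np.unravel_index(kl.argmax(), kl.shape)
    return int(x), int(y)
```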
@SeKim1112
Seungwoo (Simon) Kim
27 days
BUT: the recent Local Random Access Sequence (LRAS) model turns out to work great as the base generative model b/c: (a) its local tokenizer enables detailed perturbation control and (b) random access decoding order allows for stronger conditioning.
arxiv.org
3D scene understanding from single images is a pivotal problem in computer vision with numerous downstream applications in graphics, augmented reality, and robotics. While diffusion-based modeling...
1
0
2
@SeKim1112
Seungwoo (Simon) Kim
27 days
AND: we find that applying the perturbation-tracking procedure to strong video models like Stable Video Diffusion & Cosmos ALSO fails, because those models lack sufficiently fine-grained controllability.
1
0
1
@SeKim1112
Seungwoo (Simon) Kim
27 days
Previously, Counterfactual World Modeling (CWM) introduced an intuitive procedure for zero-shot flow: apply a small perturbation at the query point, and track where it moves by computing the difference between clean & perturbed predictions. But…
arxiv.org
Leading approaches in machine vision employ different architectures for different tasks, trained on costly task-specific labeled datasets. This complexity has held back progress in areas, such as...
1
0
2
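A minimal sketch of the CWM-style perturb-and-difference probe the tweet describes. The `predict` callable stands in for a frozen visual world model, and the patch size, perturbation magnitude, and argmax readout are assumptions for illustration, not CWM's exact procedure.

```python
import numpy as np

def cwm_flow_probe(predict, frames, query_xy, patch=3, delta=0.5):
    """Perturbation-and-difference flow probe in the spirit of CWM.

    predict: callable mapping a clip (T, H, W, C) to a predicted next
             frame (H, W, C); stands in for a frozen visual world model.
    frames: float array clip; query_xy: (x, y) point in frame 0.
    Returns the estimated flow vector (dx, dy) for the query point.
    """
    x, y = query_xy
    perturbed = frames.copy()
    # small local perturbation at the query point in the first frame
    perturbed[0, y:y + patch, x:x + patch] += delta
    clean_pred = predict(frames)
    pert_pred = predict(perturbed)
    # the perturbation's footprint in the prediction marks where the point moved
    diff = np.abs(pert_pred - clean_pred).sum(-1)
    yy, xx = np.unravel_index(diff.argmax(), diff.shape)
    return int(xx) - x, int(yy) - y
```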
@SeKim1112
Seungwoo (Simon) Kim
27 days
Generative models capture understanding of the world from their large-scale pre-training data, but how do you extract useful visual quantities from these models zero-shot, w/o task-specific fine-tuning? This is esp. important for quantities (like optical flow) where getting dense…
1
0
2
@SeKim1112
Seungwoo (Simon) Kim
27 days
We prompt a generative video model to extract state-of-the-art optical flow, using zero labels and no fine-tuning. Our method, KL-tracing, achieves SOTA results on TAP-Vid & generalizes to challenging YouTube clips. @khai_loong_aw @KlemenKotar @CristbalEyzagu2 @lee_wanhee_
1
8
30
@SeKim1112
Seungwoo (Simon) Kim
3 months
RT @percyliang: AI agents have the potential to significantly alter the cybersecurity landscape. To help us understand this change, we are….
0
31
0
@SeKim1112
Seungwoo (Simon) Kim
5 months
RT @dyamins: New paper on self-supervised optical flow and occlusion estimation from video foundation models. @sstj389 @jiajunwu_cs @SeKim….
0
18
0