
Sihyun Yu
@sihyun_yu
Followers: 1K · Following: 1K · Media: 13 · Statuses: 164
Visiting Scholar @NYU_Courant | Intern @NVIDIAAI | PhD @ KAIST | Ex-intern @NVIDIAAI and @GoogleAI | Generative models | https://t.co/wTvMmsjUdG
Daejeon
Joined July 2020
Introducing REPA! We show that learning high-quality representations in diffusion transformers is crucial for boosting generation performance. With REPA, we speed up SiT training by 17.5x (without CFG) and achieve state-of-the-art FID = 1.42 using CFG with the guidance interval.
6 replies · 46 reposts · 285 likes
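For readers curious how the alignment works in practice: REPA adds a regularizer that pulls intermediate DiT/SiT features toward a frozen pretrained encoder's patch embeddings. A minimal PyTorch-style sketch, assuming illustrative names (`dino`, `projector`, `denoiser_feats`) rather than the released code:

```python
import torch
import torch.nn.functional as F

def repa_loss(denoiser_feats, clean_images, dino, projector):
    """Representation-alignment regularizer (REPA-style sketch).

    denoiser_feats: patch features from an intermediate DiT/SiT block,
                    shape (B, N, D_dit), computed on the noisy input.
    clean_images:   corresponding clean images, shape (B, 3, H, W).
    dino:           frozen pretrained encoder (e.g. DINOv2) returning
                    patch embeddings of shape (B, N, D_dino).
    projector:      small trainable MLP mapping D_dit -> D_dino.
    """
    with torch.no_grad():                 # target encoder stays frozen
        target = dino(clean_images)       # (B, N, D_dino)
    pred = projector(denoiser_feats)      # (B, N, D_dino)
    # negative cosine similarity, averaged over patches and the batch
    return -F.cosine_similarity(pred, target, dim=-1).mean()

# total objective: usual denoising loss plus the alignment term
# loss = denoising_loss + lambda_repa * repa_loss(...)
```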
Introducing Representation Autoencoders (RAE)! We revisit the latent space of Diffusion Transformers, replacing VAE with RAE: pretrained representation encoders (DINOv2, SigLIP2) paired with trained ViT decoders. (1/n)
5 replies · 44 reposts · 375 likes
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. (1/n)
37 replies · 223 reposts · 1K likes
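The core swap is easy to picture: the pretrained representation encoder is frozen and defines the latent space, and only the ViT decoder is trained. A minimal sketch under those assumptions (class and argument names are mine, not the release):

```python
import torch.nn as nn

class RepresentationAutoencoder(nn.Module):
    """RAE-style sketch: frozen pretrained encoder + trainable decoder.

    `pretrained_encoder` (e.g. DINOv2 or SigLIP2) is kept frozen and
    defines the latent space; only `decoder` (a ViT) learns to map
    patch embeddings back to pixels.
    """

    def __init__(self, pretrained_encoder, decoder):
        super().__init__()
        self.encoder = pretrained_encoder.eval()
        for p in self.encoder.parameters():   # encoder is never updated
            p.requires_grad_(False)
        self.decoder = decoder                # trainable ViT decoder

    def encode(self, images):
        return self.encoder(images)           # (B, N, D) patch latents

    def decode(self, latents):
        return self.decoder(latents)          # (B, 3, H, W) reconstruction

# training: reconstruction loss through the decoder only; the diffusion
# transformer is then trained in the frozen encoder's latent space.
```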
Excited to introduce DiffuseNNX, a comprehensive JAX/Flax NNX-based library for diffusion and flow matching! It supports multiple diffusion / flow-matching frameworks, Autoencoders, DiT variants, and sampling algorithms. Repo: https://t.co/zOcA6nyrcM Delve into details below!
github.com
A comprehensive JAX/NNX library for diffusion and flow matching generative algorithms, featuring DiT (Diffusion Transformer) and its variants as the primary backbone with support for ImageNet train...
3 replies · 42 reposts · 156 likes
I spent the past month reimplementing DeepMind's Genie 3 world model from scratch. Ended up making TinyWorlds, a 3M parameter world model capable of generating playable game environments. Demo below + everything I learned in thread (full repo at the end)
97 replies · 267 reposts · 2K likes
I know op is click-baiting, but let me bite... fwiw every researcher's DREAM is to find out their architecture is wrong. If it's never wrong, that's a bigger problem. we try to break DiT every day w/ SiT, REPA, REPA-E etc. but you gotta form hypotheses, run experiments, test, not …
bros, DiT is wrong. it's mathematically wrong. it's formally wrong. there is something wrong with it
12 replies · 56 reposts · 544 likes
Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance
3 replies · 21 reposts · 185 likes
Come check our poster at ICML @genbio_workshop! We show that pretrained MLIPs can accelerate training of Boltzmann emulators by aligning their internal representations. Coauthors @LucasPinede, @junonam_, @RGBLabMIT (1/n)
2 replies · 18 reposts · 151 likes
I've wondered why I2V models tend to generate more static videos compared to their T2V counterparts. This project, led by @june_suk_choi, provides an analysis of this phenomenon and introduces a very simple (yet effective) fix to address it! Excited to have been part of this.
Excited to share Adaptive Low-Pass Guidance (ALG): a simple training-free, drop-in fix that brings dynamic motion back to Image-to-Video models! Demo videos, paper, & code below! https://t.co/4NzYDfCFSb (1/7)
0 replies · 2 reposts · 29 likes
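The fix is easy to sketch: low-pass filter the conditioning image during early denoising steps so the model cannot lock onto fine static detail, then fade the filter out. A hedged sketch; the linear schedule and blur strength below are my assumptions, not the paper's exact recipe:

```python
import torchvision.transforms.functional as TF

def alg_condition(image, step, num_steps, max_sigma=5.0):
    """Adaptive low-pass guidance sketch: blur the conditioning image
    early in sampling, then fade the blur out over the trajectory."""
    frac = step / max(num_steps - 1, 1)   # 0 at the first step, 1 at the last
    sigma = max_sigma * (1.0 - frac)      # strong low-pass early, none late
    if sigma <= 0.1:
        return image                      # late steps: full-detail condition
    k = int(2 * round(3 * sigma) + 1)     # odd kernel covering ~3 sigma
    return TF.gaussian_blur(image, kernel_size=k, sigma=sigma)

# inside the I2V sampling loop (pseudocode):
# for step in range(num_steps):
#     cond = alg_condition(first_frame, step, num_steps)
#     x = denoise_step(x, cond, step)
```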
@joserf28323 @CVPR @ICCVConference @nyuniversity Thanks for bringing this to my attention. I honestly wasn't aware of the situation until the recent posts started going viral. I would never encourage my students to do anything like this. If I were serving as an Area Chair, any paper with this kind of prompt would be …
10 replies · 28 reposts · 215 likes
Excited to share MDMs for molecule generation led by @bellaseo72 and @taewonKKK!
Meet MELD: a masked diffusion model (MDM) designed for de novo molecule generation. MELD assigns a per-element learnable noise schedule that tailors noise at the atom & bond level to avoid the state-clashing problem. With MELD we achieve state-of-the-art property alignment in …
0 replies · 1 repost · 11 likes
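A rough sketch of what a per-element noise schedule can look like in a masked diffusion model; the exponential parameterization and `MASK_ID` below are illustrative assumptions, not MELD's exact design:

```python
import torch

def per_element_mask(tokens, t, log_rate):
    """Per-element noise schedule sketch for a masked diffusion model.

    tokens:   (B, N) discrete atom/bond tokens.
    t:        (B, 1) diffusion time in [0, 1].
    log_rate: (N,) learnable per-element log-rates, so each atom/bond
              position gets its own masking speed (illustrative form).
    Returns tokens with masked positions replaced by MASK_ID.
    """
    MASK_ID = 0                              # assumed mask-token id
    rate = torch.exp(log_rate)               # positive per-element rate
    # survival probability decays at an element-specific speed
    keep_prob = torch.exp(-rate * t)         # (B, N) via broadcasting
    masked = torch.rand_like(keep_prob) > keep_prob
    return torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)
```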
now the code is up here:
github.com
JAX implementation of MeanFlow. Contribute to Gsunshine/meanflow development by creating an account on GitHub.
Excited to share our work with my amazing collaborators, @Goodeat258, @SimulatedAnneal, @zicokolter, and Kaiming. In a word, we show an "identity learning" approach for generative modeling, by relating the instantaneous/average velocity in an identity. The resulting model, …
2 replies · 17 reposts · 71 likes
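For reference, the identity being learned, reconstructed in my own notation from the tweet's description of average vs. instantaneous velocity; the exact form in the paper may differ:

```latex
% Average velocity u over [r, t], with v the instantaneous velocity:
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_s, s)\, \mathrm{d}s
% Differentiating (t - r)\,u(z_t, r, t) with respect to t gives the
% identity the model is trained to satisfy:
u(z_t, r, t) = v(z_t, t) - (t - r)\,\frac{\mathrm{d}}{\mathrm{d}t}\,u(z_t, r, t),
\qquad
\frac{\mathrm{d}u}{\mathrm{d}t} = v(z_t, t)\,\partial_z u + \partial_t u .
```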
We introduce LiveCodeBench Pro. Models like o3-high, o4-mini, and Gemini 2.5 Pro score 0% on hard competitive programming problems.
5 replies · 28 reposts · 191 likes
The slides for my CVPR talks are now available at
latentspace.cc
Arash Vahdat is a Research Director, leading the fundamental generative AI research (GenAIR) team at NVIDIA Research. Before joining NVIDIA, he was a research scientist at D-Wave Systems where he...
I'm giving 3 talks at #CVPR2025 workshops and tutorials: 1) "Rare Yet Real: Generative Modeling Beyond the Modes" will cover some of our work on gen AI for science where tail modeling and predictor calibration are crucial (Wed 11:10 - Room 102 B). https://t.co/IqDuwOXY2W
3 replies · 19 reposts · 163 likes
Padding in our non-AR sequence models? Yuck.
Instead of unmasking, our new work *Edit Flows* performs iterative refinements via position-relative inserts and deletes, operations naturally suited for variable-length sequence generation. Easily better than using mask tokens.
8 replies · 80 reposts · 518 likes
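To make the operation concrete, here is a toy sketch of position-relative inserts and deletes on a token sequence; the op format is my own illustration, not the Edit Flows parameterization:

```python
def apply_edits(seq, edits):
    """Apply insert/delete edit operations to a sequence.

    edits: list of ("ins", pos, token) or ("del", pos) tuples, applied
    right-to-left so earlier positions stay valid after each edit.
    """
    out = list(seq)
    for op in sorted(edits, key=lambda e: e[1], reverse=True):
        if op[0] == "ins":
            _, pos, tok = op
            out.insert(pos, tok)
        elif op[0] == "del":
            _, pos = op
            del out[pos]
    return out

# one refinement step grows or shrinks the sequence freely,
# with no padding or mask tokens needed:
print(apply_edits(["A", "C", "G"], [("ins", 1, "T"), ("del", 2)]))
# ['A', 'T', 'C']  (delete 'G' at position 2, then insert 'T' at position 1)
```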
Had a great time at this CVPR community-building workshop: lots of fun discussions and some really important insights for early-career researchers. I also gave a talk on "Research as an Infinite Game." Here are the slides: https://t.co/T5FZS1A3CT
In this #CVPR2025 edition of our community-building workshop series, we focus on supporting the growth of early-career researchers. Join us tomorrow (Jun 11) at 12:45 PM in Room 209. Schedule: https://t.co/1fKzplQrU5 We have an exciting lineup of invited talks and candid …
18 replies · 66 reposts · 353 likes
Join us for a full-day tutorial on Scalable Generative Models in Computer Vision at @CVPR in Nashville, on Wednesday, June 11, from 9:00 AM to 5:00 PM in Room 202 B! We are honored to have @sainingxie, @deeptigp, @thoma_gu, Kaiming He, @ArashVahdat, and @sherryyangML to …
2 replies · 21 reposts · 81 likes
Excited to present FastTD3: a simple, fast, and capable off-policy RL algorithm for humanoid control -- with open-source code to run your own humanoid RL experiments in no time! Thread below
15 replies · 118 reposts · 563 likes
Indeed. For text-to-image, @xichen_pan had a great summary supporting this decoupled design philosophy: "Render unto diffusion what is generative, and unto LLMs what is understanding." We've repeatedly observed that diffusion gradients can negatively impact the backbone repr.
as expected, this matches findings in unified multimodal understanding and generation models by @sainingxie: frozen VLM might help you. https://t.co/AwGBiNdN6R
12 replies · 36 reposts · 227 likes
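The decoupling both tweets point at is simple to express: block the generative gradient path into the understanding backbone. A minimal sketch with illustrative names, not any specific model's API:

```python
import torch

def generation_step(vlm, diffusion_head, images, text):
    """Decoupled design sketch: a frozen VLM provides conditioning
    features; only the diffusion head receives gradients."""
    with torch.no_grad():                 # diffusion gradients never
        cond = vlm(images, text)          # reach the VLM backbone
    return diffusion_head(cond)           # only this part is trained

# equivalently, with a trainable backbone one can still block the
# generative gradient path: cond = vlm(images, text).detach()
```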