Rohit Girdhar

@_rohitgirdhar_

2K Followers · 486 Following · 24 Media · 96 Statuses

Research Scientist at Meta GenAI

New York
Joined September 2018
@_rohitgirdhar_
Rohit Girdhar
9 months
Super excited to share MovieGen, our new SOTA media generation system! When we started, I didn’t think we’d get this far this quickly. But it turns out a simplified approach (flow matching, sketched below) paired with scaling up model size and data works amazingly well! Details in the paper 😀.
@AIatMeta
AI at Meta
9 months
🎥 Today we’re premiering Meta Movie Gen: the most advanced media foundation models to date. Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in…
2 replies · 6 reposts · 69 likes
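For context, here is a minimal sketch of the flow-matching objective referenced above, assuming a generic PyTorch model that takes a noisy sample and a timestep. The network, shapes, and schedule are illustrative stand-ins, not MovieGen's actual implementation.

```python
import torch

def flow_matching_loss(model, x1):
    """One flow-matching training step (straight-path / rectified-flow form).

    x1: batch of clean data (e.g., video latents), shape (B, ...).
    model(x_t, t) is trained to predict the constant velocity x1 - x0
    along the straight path from noise x0 to data x1.
    """
    x0 = torch.randn_like(x1)                      # Gaussian noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over data dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the straight path
    target = x1 - x0                               # velocity of that path
    pred = model(xt, t)                            # network predicts velocity
    return torch.mean((pred - target) ** 2)        # simple MSE objective
```

The appeal of this objective is its simplicity: a single regression loss along straight noise-to-data paths, with no noise-schedule bookkeeping, which is part of what makes it easy to scale.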
@_rohitgirdhar_
Rohit Girdhar
24 days
RT @CMHungSteven: @CVPR is around the corner!! Join us at the Workshop on T4V at #CVPR2025 with a great speaker lineup (@MikeShou1, @jw2yan…
0 replies · 19 reposts · 0 likes
@_rohitgirdhar_
Rohit Girdhar
5 months
And check out another paper we just put online: DiTo! A new image/video tokenization approach, trained purely with diffusion, that modernizes the tokenization pipeline and makes it much simpler and more scalable (toy sketch below)!
@YinboChen
Yinbo Chen
5 months
Introducing “Diffusion Autoencoders are Scalable Image Tokenizers” (DiTo). We show that with proper designs and scaling up, diffusion autoencoders (a single L2 loss) can outperform the GAN-LPIPS tokenizers (hybrid losses) used in current SOTA generative models. (1/4)
0 replies · 5 reposts · 31 likes
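A toy sketch of the diffusion-autoencoder idea, under the assumption that the tokenizer is just an encoder plus a latent-conditioned denoiser trained with one L2 loss. Both networks below are placeholders, not DiTo's architectures, and timestep conditioning of the denoiser is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDiffusionTokenizer(nn.Module):
    """Toy diffusion autoencoder in the spirit of DiTo: an encoder maps
    the image to a compact latent, and a denoiser trained with a single
    L2 diffusion loss, conditioned on that latent, acts as the decoder."""

    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(            # image -> latent "tokens"
            nn.Conv2d(3, dim, 4, stride=4), nn.SiLU(),
            nn.Conv2d(dim, dim, 4, stride=4),
        )
        self.denoiser = nn.Conv2d(3 + dim, 3, 3, padding=1)  # stand-in for a U-Net/DiT

    def loss(self, x):
        z = self.encoder(x)                         # tokenize the image
        z_up = F.interpolate(z, size=x.shape[-2:])  # broadcast latent to pixel grid
        t = torch.rand(x.shape[0], 1, 1, 1, device=x.device)
        noise = torch.randn_like(x)
        xt = (1 - t) * x + t * noise                # corrupt the image
        pred = self.denoiser(torch.cat([xt, z_up], dim=1))
        return F.mse_loss(pred, noise)              # the single L2 loss

# Usage sketch:
# tok = ToyDiffusionTokenizer()
# tok.loss(torch.randn(2, 3, 64, 64)).backward()
```

The point of the sketch: there is no GAN discriminator and no LPIPS network anywhere, just an encoder and a single regression loss on the denoiser.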
@_rohitgirdhar_
Rohit Girdhar
5 months
Joint work with an all-star team: @chargedneutron_ @YGandelsman @endernewton and @imisra_!
0 replies · 0 reposts · 1 like
@_rohitgirdhar_
Rohit Girdhar
5 months
And that’s not all! It performed surprisingly competitively on image/video/audio captioning, and could even perform style transfer and cross-modal arithmetic. Check out all the details in our paper: And our code:
2 replies · 3 reposts · 24 likes
@_rohitgirdhar_
Rohit Girdhar
5 months
I was particularly excited by this result, where we used an image-quality model (“PickScore”) as the "scorer", hooked up an LLM to a text-to-image (T2I) model as the "generator", and MILS figured out better prompts for the T2I model to generate nicer-looking images!
1 reply · 0 reposts · 12 likes
@_rohitgirdhar_
Rohit Girdhar
5 months
It does so using test-time optimization: the LLM generates candidates, which are scored by an off-the-shelf embedding-similarity model like CLIP. The scores are then fed back into the LLM, which generates the next (better) set of candidates, eventually converging on the final output. A minimal sketch of this loop follows below.
2 replies · 0 reposts · 7 likes
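A minimal sketch of that loop for captioning, using Hugging Face CLIP as the off-the-shelf scorer; `propose_fn` stands in for the LLM call and is a hypothetical helper, not MILS's actual interface.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mils_caption(image, propose_fn, rounds=5, k=8):
    """Generate-score-feedback loop.

    propose_fn(history, k) is the hypothetical LLM call: given the scored
    candidates from previous rounds, it returns k new caption strings.
    """
    history, best = [], None
    for _ in range(rounds):
        candidates = propose_fn(history, k)
        inputs = proc(text=candidates, images=image,
                      return_tensors="pt", padding=True)
        with torch.no_grad():
            # CLIP image-text similarity acts as the off-the-shelf scorer.
            scores = clip(**inputs).logits_per_image[0]
        history = sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1])
        best = history[0][0]  # scored list is fed back to the LLM next round
    return best
```

The same loop gives the PickScore result above: swap the scorer for an image-quality model and have the LLM propose T2I prompts instead of captions.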
@_rohitgirdhar_
Rohit Girdhar
5 months
Super excited to share some recent work showing that pure text-only LLMs can see and hear without any training! Our approach, called "MILS", pairs LLMs with off-the-shelf multimodal models to caption images/videos/audio, improve image generation, do style transfer, and more!
7 replies · 38 reposts · 246 likes
@_rohitgirdhar_
Rohit Girdhar
7 months
RT @dtrinh: VERY excited about the era of generative AR we're bringing to life. Check out this preview! It's early but so damn promising —…
0 replies · 18 reposts · 0 likes
@_rohitgirdhar_
Rohit Girdhar
9 months
Check out this result (pages 25–26) and more on arXiv:
0 replies · 0 reposts · 3 likes
@_rohitgirdhar_
Rohit Girdhar
9 months
MovieGen is now on arXiv, with some interesting new tidbits! I’m particularly excited about this scaling analysis, where we find that the compute-optimal FLOPs/params trade-off for MovieGen lies on the Llama 3 scaling law, suggesting that LLM scaling laws might even hold for media generation models! (A toy version of this fit is sketched below.)
1 reply · 8 reposts · 78 likes
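Operationally, "lies on the scaling law" means the compute-optimal (FLOPs, params) points fit the same power law N* = k·C^a. A toy version of that check, with made-up numbers rather than the paper's measurements:

```python
import numpy as np

# Hypothetical (training FLOPs, compute-optimal parameter count) pairs
# from small-scale runs; placeholders, not the paper's data.
flops = np.array([1e19, 1e20, 1e21, 1e22])
params = np.array([3e8, 1e9, 4e9, 1.3e10])

# A power law N* = k * C^a is a straight line in log-log space.
a, log_k = np.polyfit(np.log(flops), np.log(params), 1)
print(f"exponent a = {a:.2f}, coefficient k = {np.exp(log_k):.3g}")

# If (a, k) are close to the values fit on Llama 3 runs, the media model's
# compute-optimal points lie on the same (LLM) scaling law.
```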
@_rohitgirdhar_
Rohit Girdhar
9 months
Cc @GalChechik, since you were wondering what we’d been up to since the Emu Video work we were just talking about at ECCV 😊
1 reply · 0 reposts · 2 likes
@_rohitgirdhar_
Rohit Girdhar
9 months
Starting now!
2 replies · 0 reposts · 24 likes
@_rohitgirdhar_
Rohit Girdhar
9 months
Indeed! Come talk to @imisra_ and me about Emu Video at the #ECCV2024 poster session at 10:30 AM 😀. Or maybe there's more? 🤔
@Ahmad_Al_Dahle
Ahmad Al-Dahle
9 months
Looking forward to tomorrow … 👀
1 reply · 1 repost · 40 likes
@_rohitgirdhar_
Rohit Girdhar
1 year
Check out all the details and comparisons to many competing multimodal models in the full paper!
0 replies · 0 reposts · 1 like
@_rohitgirdhar_
Rohit Girdhar
1 year
Excited to share Llama 3.1, which brings multimodal capabilities to your favorite open-source LLM using simple, post-trained adapters (generic pattern sketched below)! Great experience building with our incredible multimodal team, and especially my partners in crime for all things video, @mannat_singh and @filipradenovic!
@AIatMeta
AI at Meta
1 year
Starting today, open source is leading the way. Introducing Llama 3.1: our most capable models yet. Today we’re releasing a collection of new Llama 3.1 models, including our long-awaited 405B. These models deliver improved reasoning capabilities and a larger 128K-token context…
2 replies · 4 reposts · 61 likes
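The "post-trained adapters" phrasing suggests the now-common pattern of training small cross-attention blocks that let a frozen LLM attend to features from a frozen vision encoder. A generic, hedged sketch of that pattern follows; the module names, widths, and zero-init gate are illustrative assumptions, not the actual Llama 3.1 architecture.

```python
import torch
import torch.nn as nn

class CrossAttnAdapter(nn.Module):
    """Generic post-hoc adapter: the (frozen) LLM's hidden states attend
    to features from a (frozen) vision encoder; only this block trains."""

    def __init__(self, d_model=4096, d_vision=1024, n_heads=32):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_model)  # map vision feats to LLM width
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: starts as a no-op

    def forward(self, h, vision_feats):
        # h: (B, seq, d_model) LLM hidden states; vision_feats: (B, patches, d_vision)
        v = self.proj(vision_feats)
        out, _ = self.attn(query=h, key=v, value=v)
        return h + torch.tanh(self.gate) * out    # gated residual into the LLM
```

The zero-initialized gate means the adapted model starts out exactly equal to the text-only LLM, which is one common way to post-train adapters without disturbing the base model's behavior.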
@_rohitgirdhar_
Rohit Girdhar
1 year
@xiaolonw @liliyu_lili And the final talk of the day, by @p_bojanowski, on self-supervised learning of vision transformers!
1 reply · 0 reposts · 0 likes
@_rohitgirdhar_
Rohit Girdhar
1 year
@xiaolonw @liliyu_lili is now talking about multimodal transformer architectures!
1 reply · 0 reposts · 3 likes
@_rohitgirdhar_
Rohit Girdhar
1 year
Another packed talk by @xiaolonw on test-time learning!
2 replies · 0 reposts · 2 likes