Rohit Girdhar

@_rohitgirdhar_

2K Followers · 486 Following · 24 Media · 96 Statuses

Research Scientist at Meta GenAI

New York
Joined September 2018
@_rohitgirdhar_
Rohit Girdhar
9 months
Super excited to share MovieGen, our new SOTA media generation system! When we started, I didn’t think we’d get this far this quickly. But it turns out a simplified approach (flow matching, sketched below) paired with scaling up model size and data works amazingly well! Details in the paper 😀.
@AIatMeta
AI at Meta
9 months
🎥 Today we’re premiering Meta Movie Gen: the most advanced media foundation models to date. Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in…
2 replies · 6 reposts · 69 likes
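For context, here is a minimal sketch of the flow-matching objective referenced above, assuming a generic PyTorch model that takes a noisy sample and a timestep. The network, shapes, and schedule are illustrative stand-ins, not MovieGen's actual implementation.

```python
import torch

def flow_matching_loss(model, x1):
    """One flow-matching training step (straight-path / rectified-flow form).

    x1: batch of clean data (e.g., video latents), shape (B, ...).
    model(x_t, t) is trained to predict the constant velocity x1 - x0
    along the straight path from noise x0 to data x1.
    """
    x0 = torch.randn_like(x1)                      # Gaussian noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over data dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the straight path
    target = x1 - x0                               # velocity of that path
    pred = model(xt, t)                            # network predicts velocity
    return torch.mean((pred - target) ** 2)        # simple MSE objective
```

The appeal of this objective is its simplicity: a single regression loss along straight noise-to-data paths, with no noise-schedule bookkeeping, which is part of what makes it easy to scale.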
@_rohitgirdhar_
Rohit Girdhar
24 days
RT @CMHungSteven: @CVPR is around the corner!! Join us at the Workshop on T4V at #CVPR2025 with a great speaker lineup (@MikeShou1, @jw2yan…
0 replies · 19 reposts · 0 likes
@_rohitgirdhar_
Rohit Girdhar
5 months
And check out another paper we just put online: DiTo! A new image/video tokenization approach, trained purely with diffusion, that modernizes the tokenization pipeline and makes it much simpler and more scalable (toy sketch below)!
@YinboChen
Yinbo Chen
5 months
Introducing “Diffusion Autoencoders are Scalable Image Tokenizers” (DiTo). We show that with proper designs and scaling up, diffusion autoencoders (a single L2 loss) can outperform the GAN-LPIPS tokenizers (hybrid losses) used in current SOTA generative models. (1/4)
0 replies · 5 reposts · 31 likes
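A toy sketch of the diffusion-autoencoder idea, under the assumption that the tokenizer is just an encoder plus a latent-conditioned denoiser trained with one L2 loss. Both networks below are placeholders, not DiTo's architectures, and timestep conditioning of the denoiser is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDiffusionTokenizer(nn.Module):
    """Toy diffusion autoencoder in the spirit of DiTo: an encoder maps
    the image to a compact latent, and a denoiser trained with a single
    L2 diffusion loss, conditioned on that latent, acts as the decoder."""

    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(            # image -> latent "tokens"
            nn.Conv2d(3, dim, 4, stride=4), nn.SiLU(),
            nn.Conv2d(dim, dim, 4, stride=4),
        )
        self.denoiser = nn.Conv2d(3 + dim, 3, 3, padding=1)  # stand-in for a U-Net/DiT

    def loss(self, x):
        z = self.encoder(x)                         # tokenize the image
        z_up = F.interpolate(z, size=x.shape[-2:])  # broadcast latent to pixel grid
        t = torch.rand(x.shape[0], 1, 1, 1, device=x.device)
        noise = torch.randn_like(x)
        xt = (1 - t) * x + t * noise                # corrupt the image
        pred = self.denoiser(torch.cat([xt, z_up], dim=1))
        return F.mse_loss(pred, noise)              # the single L2 loss

# Usage sketch:
# tok = ToyDiffusionTokenizer()
# tok.loss(torch.randn(2, 3, 64, 64)).backward()
```

The point of the sketch: there is no GAN discriminator and no LPIPS network anywhere, just an encoder and a single regression loss on the denoiser.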
@_rohitgirdhar_
Rohit Girdhar
5 months
Joint work with an all-star team: @chargedneutron_ @YGandelsman @endernewton and @imisra_!
0 replies · 0 reposts · 1 like
@_rohitgirdhar_
Rohit Girdhar
5 months
And that’s not all! It performed surprisingly competitively on image/video/audio captioning, and could even perform style transfer and cross-modal arithmetic. Check out all the details in our paper: And our code:
2 replies · 3 reposts · 24 likes
@_rohitgirdhar_
Rohit Girdhar
5 months
I was particularly excited by this result, where we used an image-quality model (“PickScore”) as the "scorer", hooked up an LLM to a text-to-image (T2I) model as the "generator", and MILS figured out better prompts for the T2I model to generate nicer-looking images!
1 reply · 0 reposts · 12 likes
@_rohitgirdhar_
Rohit Girdhar
5 months
It does so using test-time optimization: the LLM generates candidates, which are scored by an off-the-shelf embedding-similarity model like CLIP. The scores are then fed back into the LLM, which generates the next (better) set of candidates, eventually converging on the final output. A minimal sketch of this loop follows below.
2 replies · 0 reposts · 7 likes
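A minimal sketch of that loop for captioning, using Hugging Face CLIP as the off-the-shelf scorer; `propose_fn` stands in for the LLM call and is a hypothetical helper, not MILS's actual interface.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mils_caption(image, propose_fn, rounds=5, k=8):
    """Generate-score-feedback loop.

    propose_fn(history, k) is the hypothetical LLM call: given the scored
    candidates from previous rounds, it returns k new caption strings.
    """
    history, best = [], None
    for _ in range(rounds):
        candidates = propose_fn(history, k)
        inputs = proc(text=candidates, images=image,
                      return_tensors="pt", padding=True)
        with torch.no_grad():
            # CLIP image-text similarity acts as the off-the-shelf scorer.
            scores = clip(**inputs).logits_per_image[0]
        history = sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1])
        best = history[0][0]  # scored list is fed back to the LLM next round
    return best
```

The same loop gives the PickScore result above: swap the scorer for an image-quality model and have the LLM propose T2I prompts instead of captions.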
@_rohitgirdhar_
Rohit Girdhar
5 months
Super excited to share some recent work showing that pure text-only LLMs can see and hear without any training! Our approach, called "MILS", pairs LLMs with off-the-shelf multimodal models to caption images/videos/audio, improve image generation, do style transfer, and more!
7 replies · 38 reposts · 246 likes
@_rohitgirdhar_
Rohit Girdhar
7 months
RT @dtrinh: VERY excited about the era of generative AR we're bringing to life. Check out this preview! It's early but so damn promising —…
0 replies · 18 reposts · 0 likes
@_rohitgirdhar_
Rohit Girdhar
9 months
Check out this result (pages 25–26) and more on arXiv:
0 replies · 0 reposts · 3 likes
@_rohitgirdhar_
Rohit Girdhar
9 months
MovieGen is now on arXiv, with some interesting new tidbits! I’m particularly excited about this scaling analysis, where we find that the compute-optimal FLOPs/params trade-off for MovieGen lies on the Llama 3 scaling law, suggesting that LLM scaling laws might even hold for media generation models! (A toy version of this fit is sketched below.)
1 reply · 8 reposts · 78 likes
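Operationally, "lies on the scaling law" means the compute-optimal (FLOPs, params) points fit the same power law N* = k·C^a. A toy version of that check, with made-up numbers rather than the paper's measurements:

```python
import numpy as np

# Hypothetical (training FLOPs, compute-optimal parameter count) pairs
# from small-scale runs; placeholders, not the paper's data.
flops = np.array([1e19, 1e20, 1e21, 1e22])
params = np.array([3e8, 1e9, 4e9, 1.3e10])

# A power law N* = k * C^a is a straight line in log-log space.
a, log_k = np.polyfit(np.log(flops), np.log(params), 1)
print(f"exponent a = {a:.2f}, coefficient k = {np.exp(log_k):.3g}")

# If (a, k) are close to the values fit on Llama 3 runs, the media model's
# compute-optimal points lie on the same (LLM) scaling law.
```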
@_rohitgirdhar_
Rohit Girdhar
9 months
Cc @GalChechik, since you were wondering what we’d been up to since the Emu Video work we were just talking about at ECCV 😊
1 reply · 0 reposts · 2 likes
@_rohitgirdhar_
Rohit Girdhar
9 months
Starting now!
2 replies · 0 reposts · 24 likes
@_rohitgirdhar_
Rohit Girdhar
9 months
Indeed! Come talk to @imisra_ and me about Emu Video at the #ECCV2024 poster session at 10:30 AM 😀. Or maybe there's more? 🤔
@Ahmad_Al_Dahle
Ahmad Al-Dahle
9 months
Looking forward to tomorrow … 👀
1 reply · 1 repost · 40 likes
@_rohitgirdhar_
Rohit Girdhar
1 year
Check out all the details and comparisons to many competing multimodal models in the full paper!
0 replies · 0 reposts · 1 like
@_rohitgirdhar_
Rohit Girdhar
1 year
Excited to share Llama 3.1, which brings multimodal capabilities to your favorite open-source LLM using simple, post-trained adapters (generic pattern sketched below)! Great experience building with our incredible multimodal team, and especially my partners in crime for all things video, @mannat_singh and @filipradenovic!
@AIatMeta
AI at Meta
1 year
Starting today, open source is leading the way. Introducing Llama 3.1: our most capable models yet. Today we’re releasing a collection of new Llama 3.1 models, including our long-awaited 405B. These models deliver improved reasoning capabilities and a larger 128K-token context…
2 replies · 4 reposts · 61 likes
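The "post-trained adapters" phrasing suggests the now-common pattern of training small cross-attention blocks that let a frozen LLM attend to features from a frozen vision encoder. A generic, hedged sketch of that pattern follows; the module names, widths, and zero-init gate are illustrative assumptions, not the actual Llama 3.1 architecture.

```python
import torch
import torch.nn as nn

class CrossAttnAdapter(nn.Module):
    """Generic post-hoc adapter: the (frozen) LLM's hidden states attend
    to features from a (frozen) vision encoder; only this block trains."""

    def __init__(self, d_model=4096, d_vision=1024, n_heads=32):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_model)  # map vision feats to LLM width
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: starts as a no-op

    def forward(self, h, vision_feats):
        # h: (B, seq, d_model) LLM hidden states; vision_feats: (B, patches, d_vision)
        v = self.proj(vision_feats)
        out, _ = self.attn(query=h, key=v, value=v)
        return h + torch.tanh(self.gate) * out    # gated residual into the LLM
```

The zero-initialized gate means the adapted model starts out exactly equal to the text-only LLM, which is one common way to post-train adapters without disturbing the base model's behavior.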
@_rohitgirdhar_
Rohit Girdhar
1 year
@xiaolonw @liliyu_lili And the final talk of the day, by @p_bojanowski, on self-supervised learning of vision transformers!
1 reply · 0 reposts · 0 likes
@_rohitgirdhar_
Rohit Girdhar
1 year
@xiaolonw @liliyu_lili is now talking about multimodal transformer architectures!
1 reply · 0 reposts · 3 likes
@_rohitgirdhar_
Rohit Girdhar
1 year
Another packed talk by @xiaolonw on test-time learning!
2 replies · 0 reposts · 2 likes