Saining Xie

@sainingxie

Followers
20K
Following
5K
Media
55
Statuses
495

researcher in #deeplearning #computervision | assistant professor at @NYU_Courant @nyuniversity | previous: research scientist @metaai (FAIR) @UCSanDiego

Joined July 2020
@sainingxie
Saining Xie
1 year
Here's my take on the Sora technical report, with a good dose of speculation that could be totally off. First of all, really appreciate the team for sharing helpful insights and design decisions – Sora is incredible and is set to transform the video generation community. What we
Tweet media one
42
537
3K
@sainingxie
Saining Xie
7 months
Representation matters. Representation matters. Representation matters, even for generative models. We might've been training our diffusion models the wrong way this whole time. Meet REPA: Training Diffusion Transformers is easier than you think! 🧵(1/n)
Tweet media one
29
267
2K
@sainingxie
Saining Xie
8 months
During my internship at DeepMind, Demis met with all the interns. When asked about the company’s goal, I vividly remember him saying, “winning *multiple* Nobel prizes.” I was shocked at the time, but now, just 7 years later, part of that mission is already accomplished. Eager to…
@NobelPrize
The Nobel Prize
8 months
BREAKING NEWS. The Royal Swedish Academy of Sciences has decided to award the 2024 #NobelPrize in Chemistry with one half to David Baker “for computational protein design” and the other half jointly to Demis Hassabis and John M. Jumper “for protein structure prediction.”
Tweet media one
10
112
2K
@sainingxie
Saining Xie
11 months
Introducing Cambrian-1, a fully open project from our group at NYU. The world doesn't need another MLLM to rival GPT-4V. Cambrian is unique as a vision-centric exploration & here's why I think it's time to shift focus from scaling LLMs to enhancing visual representations.🧵[1/n]
Tweet media one
17
249
1K
@sainingxie
Saining Xie
8 months
Is this now about gravity? 😶
Tweet media one
49
44
883
@sainingxie
Saining Xie
1 year
🔍Introducing V*: exploring guided visual search in multimodal LLMs. MLLMs like GPT4V & LLaVA are amazing, but one concern that keeps me up at night: the (frozen) visual encoder typically extracts global image tokens *only once*, regardless of resolution or scene complexity (1/n)
Tweet media one
15
116
751
@sainingxie
Saining Xie
1 year
well diffusion transformer was rejected at CVPR 2023 due to limited novelty.
@jbhuang0604
Jia-Bin Huang
1 year
R2: While the results are impressive, this is a simple combination of diffusion transformer (ICCV 2023) and latent diffusion model (CVPR 2022). Limited novelty. Weak reject.
19
50
702
@sainingxie
Saining Xie
5 months
Video understanding is the next frontier, but not all videos are alike. Models now reason over youtube clips and feature films, but what about the everyday spaces we—and our future AI assistants—navigate and experience? Introducing Thinking in Space, our latest study exploring…
17
107
679
@sainingxie
Saining Xie
3 years
A new chapter of my professional life! After 4 incredible years at FAIR and living in the bay, I’m moving to NYC! I’ll be joining @NYU_Courant CS @NYUniversity @CILVRatNYU as an Assistant Professor the upcoming Jan. Looking for students/postdocs to join me on this new adventure!!
Tweet media one
53
23
619
@sainingxie
Saining Xie
21 days
Wow, Deeply Supervised Nets received the Test of Time award at @aistats_conf 2025! It was the very first paper I submitted during my PhD. Fun fact: the paper was originally rejected by NeurIPS with scores of 8/8/7 (yes, that pain stuck with me. maybe now I can finally let it…
@aistats_conf
AISTATS Conference
21 days
The #AISTATS 2025 Test of Time Award goes to… 🥁 …Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu for "Deeply Supervised Nets"! Congratulations!
Tweet media one
33
43
508
@sainingxie
Saining Xie
4 months
When I first saw diffusion models, I was blown away by how naturally they scale during inference: you train them with fixed flops, but during test time, you can ramp it up by like 1,000x. This was way before it became a big deal with o1. But honestly, the scaling isn’t that…
@ma_nanye
Willis (Nanye) Ma
4 months
Inference-time scaling for LLMs drastically improves the model's ability in many respects, but what about diffusion models? In our latest study—Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps—we reframe inference-time scaling as a search problem
Tweet media one
9
69
478
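A toy sketch of the "inference-time scaling as search" idea described in the tweets above (my illustration, not the paper's code): spend extra test-time compute by sampling many candidate noises, running the same trained sampler on each, and keeping the candidate a verifier scores highest. `denoise` and `verifier_score` are hypothetical stand-ins for a real diffusion sampler and reward model.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(noise: np.ndarray) -> np.ndarray:
    """Stand-in for running the full (fixed-flops) diffusion sampling chain."""
    return np.tanh(noise)  # placeholder "sample"

def verifier_score(sample: np.ndarray) -> float:
    """Stand-in for a verifier / reward model scoring a generated sample."""
    return float(-np.abs(sample).mean())  # placeholder objective

def best_of_n(n_candidates: int, shape=(64, 64)) -> np.ndarray:
    # More test-time compute = more candidate noises searched; the trained model is unchanged.
    candidates = [denoise(rng.standard_normal(shape)) for _ in range(n_candidates)]
    return max(candidates, key=verifier_score)

baseline = best_of_n(1)    # baseline inference compute
scaled = best_of_n(64)     # ~64x more inference compute spent on search
print(verifier_score(baseline), verifier_score(scaled))  # the scaled sample scores at least as well
```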
@sainingxie
Saining Xie
1 year
Diffusion Transformer architecture + Flow Matching / Stochastic Interpolants objective? Great work and looking forward to the technical report! In SiT ( we have also studied this new design space under class-conditional generation (though on a much…
@StabilityAI
Stability AI
1 year
Announcing Stable Diffusion 3, our most capable text-to-image model, utilizing a diffusion transformer architecture for greatly improved performance in multi-subject prompts, image quality, and spelling abilities. Today, we are opening the waitlist for early preview. This phase
Tweet media one
5
48
350
@sainingxie
Saining Xie
1 year
When Bill and I were working on the DiT project, instead of creating novelty (see my last tweet🤷‍♂️), we prioritized two aspects: simplicity and scalability. These priorities offer more than just conceptual advantages. - Simplicity means flexibility. The cool thing about vanilla
Tweet media one
Tweet media two
7
43
406
@sainingxie
Saining Xie
2 months
i think this just shows the input image passes through semantic encoders instead of VQ; they're aligned with the LLM and grasp image content well (super important for editing) but may not perfectly reconstruct original pixels (due to capacity limits / # image tokens).
@DimitrisPapail
Dimitris Papailiopoulos
2 months
ChatGPT insists on swapping out real faces for fake ones, when requested to generate an identical image to the input. Below (input, output)
Tweet media one
Tweet media two
16
25
386
@sainingxie
Saining Xie
1 year
Multimodal LLMs have been shown to err in complex, OOD, and edge-case scenarios. Yet, we have identified a systematic method for pinpointing visual errors in these models even when they are posed with *very basic* questions, using just common images from ImageNet and LAION. 🧵
Tweet media one
Tweet media two
Tweet media three
Tweet media four
8
69
362
@sainingxie
Saining Xie
1 year
thought experiment: ViTs work great for 224^2 images, but what if you had a 1 million^2 pixel one? You'd either use conv, or you patchify and process each patch with a ViT using shared weights—essentially conv. the moment I realized: a convnet isn't an architecture; it's a way of thinking.
@ylecun
Yann LeCun
1 year
A short post on the best architectures for real-time image and video processing. TL;DR: use convolutions with stride or pooling at the low levels, and stick self-attention circuits at higher levels, where feature vectors represent objects. PS: ready to bet that Tesla FSD uses…
14
26
339
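A small numerical check of the claim in the tweet above, that "patchify + shared-weight projection" is just a strided convolution. This is my own illustration with hypothetical sizes, not something from the thread.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: batch, channels, image size, patch size, embed dim.
B, C, H, W, P, D = 1, 3, 224, 224, 16, 768
x = torch.randn(B, C, H, W)
weight = torch.randn(D, C, P, P)  # one shared projection, reused for every patch
bias = torch.randn(D)

# (a) ViT view: patchify, then apply the same linear projection to each patch.
patches = F.unfold(x, kernel_size=P, stride=P)            # (B, C*P*P, num_patches)
tokens = weight.view(D, -1) @ patches + bias[:, None]     # (B, D, num_patches)

# (b) Conv view: the same operation written as a strided convolution.
conv_tokens = F.conv2d(x, weight, bias, stride=P).flatten(2)  # (B, D, num_patches)

print(torch.allclose(tokens, conv_tokens, atol=1e-3))  # True (up to float error)
```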
@sainingxie
Saining Xie
2 months
wait a sec. look at the content -- did y'all actually go this route? This looks way too plausible, and honestly the most practical approach on multimodal gen rn (based on my own experience with students). So, not pure AR, but an LLM + a diffusion "renderer" on the compressed…
@ajabri
Allan Jabri
2 months
the pros and cons
Tweet media one
10
18
349
@sainingxie
Saining Xie
1 year
🌎 𝕤𝕒𝕪 𝕙𝕖𝕝𝕝𝕠 𝕥𝕠 𝕧𝕚𝕣𝕝 🌏.
10
64
326
@sainingxie
Saining Xie
1 year
Diffusion Transformer (DiT) just got an upgrade! Same backbone but better quality, speed and flexibility. And we achieved this by moving beyond standard diffusion and exploring a broader design space with interpolants! Introducing SiT -- Scalable Interpolant Transformers!
@_akhaliq
AK
1 year
NYU presents SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers. paper page: We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers
Tweet media one
5
40
300
@sainingxie
Saining Xie
1 year
The key takeaway is from the "Emerging simulation capabilities" section. Before Sora, it was unclear if long form consistency could emerge on its own or if it required complex subject-driven generation pipelines or even physics simulators. OpenAI has shown that, though not…
11
35
273
@sainingxie
Saining Xie
6 months
Nothing ever happens… until it does! Just saw that JanusFlow by @deepseek_ai uses REPA for training and shows some solid improvements:
Tweet media one
@sainingxie
Saining Xie
7 months
@cloneofsimo hmm I’ve got a bunch of independent data points now showing that REPA helps with big text-to-image models too! Let’s give it a little more time before saying nothing ever happens—I’m sure it won’t be long :).
4
31
261
@sainingxie
Saining Xie
1 month
Our take on a 4o-style AR + diffusion unified model: Transferring knowledge from an AR LLM to generation is easier than expected--you don't even need to touch the LLM. The right bridge between output modalities can unlock cool capabilities like knowledge-augmented generation!
@xichen_pan
Xichen Pan
1 month
We find that training unified multimodal understanding and generation models is so easy that you do not need to tune MLLMs at all. An MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!
Tweet media one
4
29
262
@sainingxie
Saining Xie
1 month
About two years ago, we started building V* to bring visual search into a multimodal LLM and show that it's a key part of how these models can understand the world. I still remember talking with my friends @bowenc0221 and @_alex_kirillov_ about why this…
@sainingxie
Saining Xie
1 year
🔍Introducing V*: exploring guided visual search in multimodal LLMs. MLLMs like GPT4V & LLaVA are amazing, but one concern that keeps me up at night: the (frozen) visual encoder typically extracts global image tokens *only once*, regardless of resolution or scene complexity (1/n)
Tweet media one
2
27
250
@sainingxie
Saining Xie
2 months
In Cambrian-1, we found that vision SSL representations usually lagged behind language-supervised ones -- but once the data gap is closed and scaling kicks in, performance catches up. We’ve tried scaling SSL before, but this is the first time I’ve seen real signal: SSL adapts to…
@DavidJFan
David Fan
2 months
Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
Tweet media one
2
51
246
@sainingxie
Saining Xie
1 month
Some further thoughts on the idea of "thinking with images": 1) zero-shot tool use is limited -- you can’t just call an object detector to do visual search. That’s why approaches like VisProg/ViperGPT/Visual-sketchpad will not generalize or scale well. 2) visual search needs to
Tweet media one
@mckbrando
Brandon McKinzie
1 month
@WenhuChen @WeijiaShi2 at a glance, this is quite diff than what we did fwiw. the behaviors you see in o3 and o4-mini are all emergent from large-scale RL. we just give them access to python and the ability to manipulate images, the rest is up to the model.
5
32
242
@sainingxie
Saining Xie
1 year
#shamelessplug DiT shines in Sora. Our team at NYU has recently released a new DiT model, called SiT. It has exactly the same architecture, but offers enhanced performance and faster convergence. Super curious about its performance on video generation too! (n/n).
4
21
229
@sainingxie
Saining Xie
2 years
Awesome plot showing the progress in CLIP-like model training! As both a user and a researcher, there are a couple of caveats I personally feel worth pointing out.
@gabriel_ilharco
Gabriel Ilharco
2 years
CLIP models have become a lot better since 2021
Tweet media one
1
14
212
@sainingxie
Saining Xie
1 year
I like pretty pictures but one thing I like more about diffusion models is how they open up new doors for (useful) analysis-by-synthesis approaches. More and more research is showing that (pre-trained) diffusion models are pretty good feature extractors too. This empirical study…
@_akhaliq
AK
1 year
Meta presents Deconstructing Denoising Diffusion Models for Self-Supervised Learning. paper page: We examine the representation learning abilities of Denoising Diffusion Models (DDM) that were originally purposed for image generation. Our philosophy is to…
Tweet media one
5
26
214
@sainingxie
Saining Xie
5 years
Almost every deep learning model for 3D recognition has been *trained from scratch*. In our #ECCV2020 spotlight paper, we propose 👉PointContrast👈, an unsupervised pre-training framework that boosts performance on 6 different 3D point cloud benchmarks.
Tweet media one
Tweet media two
Tweet media three
3
40
201
@sainingxie
Saining Xie
4 months
two exciting directions for diffusion models in 2025: either going (extremely) small or going (extremely) big with your steps.
5
11
191
@sainingxie
Saining Xie
7 months
@jbhuang0604 there's no true self-supervised learning in text - it's (strongly) supervised learning.
7
4
152
@sainingxie
Saining Xie
1 year
Really enjoyed working on this project; some thoughts on why I believe combining the creative freedom of generative models with the precision of the 3D graphics pipeline could be the future. (1/n)🧵.
@_akhaliq
AK
1 year
Intel and NYU present Image Sculpting: Precise Object Editing with 3D Geometry Control. paper page: We present Image Sculpting, a new framework for editing 2D images by incorporating tools from 3D geometry and graphics. This approach differs markedly from…
3
17
142
@sainingxie
Saining Xie
7 months
When I saw these results, it didn’t feel like we invented something entirely new—it felt more like we barely understand the representations learned from diffusion models and SSL methods. This has many implications for building a true world model. Plus, we still need new…
4
4
140
@sainingxie
Saining Xie
1 month
The pre-trained, frozen VAE in DiT is massive (much higher flops than the transformer itself!!). Do you have to freeze it? What if you could leverage that capacity through e2e training? It turns out it doesn't play well with diffusion loss -- but with REPA, you can make it work!
@1jaskiratsingh
Jaskirat Singh
1 month
Can we optimize both the VAE tokenizer and diffusion model together in an end-to-end manner? Short Answer: Yes. 🚨 Introducing REPA-E: the first end-to-end tuning approach for jointly optimizing both the VAE and the latent diffusion model using REPA loss 🚨 Key Idea: 🧠
Tweet media one
2
16
142
@sainingxie
Saining Xie
11 months
Here's a bit of reflection: when I moved from industry to academia, I wasn't sure if we'd ever be able to pull off a large-scale project like this that requires full-stack skills. The students amazed me with their dedication and courage. Our team, with PhDs, masters, and…
7
6
116
@sainingxie
Saining Xie
5 months
Check out the latest paper from @TongPetersb on grokking both visual understanding and generation abilities through (modest scale) instruction tuning. The data composition reveals both asymmetry and prioritization: we represent to understand, and we understand to create.
@liuzhuang1234
Zhuang Liu
5 months
How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other. Very moderate generation data is needed to elicit
Tweet media one
1
11
113
@sainingxie
Saining Xie
1 year
It was a blast seeing everyone at nyu and getting to learn about all the cool work. this is why nyc (and surroundings) are a great place for computer vision 😊.
@DandanShan_
Dandan Shan
1 year
Fresh memories from 🗽NYC Vision Day🗽 hosted at NYU yesterday, April 1st. Grateful to the organizers (David Fouhey, @jiadeng, @orussakovsky, @Jimantha, @cvondrick, @sainingxie) for setting up such an amazing event to bring the vision community together!
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
3
113
@sainingxie
Saining Xie
7 months
team industry -1, team academia +1 🎉😉. students, here’s your chance to join an amazing lab ⬇️.
@liuzhuang1234
Zhuang Liu
7 months
Excited to share that I will be joining Princeton Computer Science @PrincetonCS as an Assistant Professor in September 2025! I'm looking for students to join me. If you are interested in working with me on VLMs, LLMs, deep learning (vision/LLM) architectures, data, training, …
1
1
104
@sainingxie
Saining Xie
11 months
From our previous projects (MMVP, V*, VIRL), we've noticed unexpected visual shortcomings in current MLLM systems. While we can temporarily fix issues by e.g. adding data, one root problem is that our visual representations are not yet sufficient for language understanding. In
Tweet media one
2
10
98
@sainingxie
Saining Xie
1 year
Check out our paper for more details: Visit our project page: Code and models are also available: Big shoutout to our intern @penghaowu2 for the heroic work! (n/n).
4
17
98
@sainingxie
Saining Xie
7 months
People (in academia) always tell me that training DiTs/SiTs is way too hard because it takes 7M iters and weeks to get the FID we reported in the paper. We figured out how to speed up training by ~18X, hitting even better FID in less than 400K iters. We did this by digging into…
2
3
97
@sainingxie
Saining Xie
4 months
yup more multimodal more fun :).
@_alex_kirillov_
Alexander Kirillov
4 months
Career update: after an amazing journey at OpenAI I left to join something new, exciting, and multimodal!.
0
3
95
@sainingxie
Saining Xie
1 year
Unsung hero for academia.
@ericjang11
Eric Jang
1 year
everybody talks big game about democratizing AI, but today I'm super grateful that @GoogleColab gives free TPU + GPU instances. Makes it possible for a lot of students to learn about ML without spending a few thousand dollars building a deep learning PC.
0
3
92
@sainingxie
Saining Xie
7 months
Now things become natural: we can supercharge diffusion transformer training by adding a simple regularization that maximizes the similarity between the diffusion transformer's latent representation and a powerful external visual representation like DINOv2. This simple tweak
Tweet media one
Tweet media two
4
6
92
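A minimal sketch of the REPA-style regularizer as described in the tweet above, written from that description rather than from the official code: project the DiT's intermediate tokens, then maximize cosine similarity with features from a frozen external encoder such as DINOv2. The sizes, projection head, and loss weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, d_dit, d_ext = 8, 256, 1152, 768       # hypothetical sizes
proj = nn.Linear(d_dit, d_ext)               # small trainable projection head

def repa_loss(dit_hidden, ext_feats):
    """dit_hidden: (B, N, d_dit) intermediate DiT tokens for the noisy input.
    ext_feats: (B, N, d_ext) features of the clean image from a frozen encoder."""
    z = F.normalize(proj(dit_hidden), dim=-1)
    y = F.normalize(ext_feats, dim=-1)
    return -(z * y).sum(dim=-1).mean()       # maximize cosine similarity

# Toy usage: total loss = denoising loss + lambda * alignment loss.
dit_hidden = torch.randn(B, N, d_dit, requires_grad=True)
ext_feats = torch.randn(B, N, d_ext)         # would come from e.g. frozen DINOv2
denoise_loss = torch.tensor(0.0)             # placeholder for the diffusion/flow loss
total = denoise_loss + 0.5 * repa_loss(dit_hidden, ext_feats)
total.backward()
```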
@sainingxie
Saining Xie
1 year
Detailed visual grounding is crucial but to see that, we need to raise the bar for benchmarking. We created V*Bench, a challenging vision-centric benchmark. Our model with visual search outperforms GPT4V by a big margin, despite using a much worse LM. Better Vision Matters! (5/n)
Tweet media one
Tweet media two
3
13
87
@sainingxie
Saining Xie
7 months
Oh this is a fun hybrid architecture: using DiT for TTS, but at the input using ConvNeXt blocks for additional refinement.
@reach_vb
Vaibhav (VB) Srivastav
7 months
Overall architecture:
Tweet media one
1
12
87
@sainingxie
Saining Xie
1 month
Recently open-sourced projects from @TongPetersb, @DavidJFan, and the team at Meta FAIR. MetaMorph (training code and model weights): Web-SSL (model weights for Web-DINO and Web-MAE). FAIR's still leading the way in open research.
@TongPetersb
Peter Tong
1 month
We are open-sourcing all the models in Web-SSL, from ViT-L to ViT-7B! It was super fun to train and play with these massive ViTs. Models: Github: Huge credit to @DavidJFan for putting these models together!
1
13
87
@sainingxie
Saining Xie
8 months
Excited to present in the upcoming tutorials/workshops/posters and reconnect with old friends at #ECCV2024 Milano! Sunday AM (29th): Tutorial on Large Multimodal Foundation Models. Sunday PM (29th): 2nd OmniLabel Workshop on Enabling Complex Perception…
0
8
82
@sainingxie
Saining Xie
2 years
Just landed in #Paris🇫🇷 and excited to attend #ICCV2023 in person. Tomorrow (Oct 2nd) afternoon, join us at the #QVCV workshop (1:30pm - 5:30pm @ S03). Computer vision research is evolving - hear from thought leaders about what the future may hold.
Tweet media one
1
8
79
@sainingxie
Saining Xie
6 months
Arrived in Vancouver for #NeurIPS2024 now! Don’t miss @_ellisbrown and @TongPetersb’s talk about Cambrian-1—I’ll be there and at the poster too. Excited to connect with you all!.
@_ellisbrown
Ellis Brown
6 months
Heading to #NeurIPS2024 to present Cambrian-1 w/ @TongPetersb! Catch our oral presentation Friday @ 10am (Oral 5C) and our poster afterwards until 2pm (#3700 in East Hall A-C) 🪼🎉
Tweet media one
0
4
79
@sainingxie
Saining Xie
7 months
As @ylecun often points out, relying solely on the "rendering" loss isn't enough. If your focus is just on reconstructing nice-looking pixels, there's no way to filter out irrelevant details from the input — which is key to learning robust representations. Looks like even if your…
1
4
73
@sainingxie
Saining Xie
1 year
In our new work: "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs", we try to understand the roots of these errors. How do we do this? The key here is to explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning.
Tweet media one
2
13
72
@sainingxie
Saining Xie
6 months
Jiatao is such an AI polymath, with amazing knowledge and experience across so many areas🤯! Every time I chat research with him, I come away learning so much. You should definitely apply to his lab—it’d be an incredible experience!.
@thoma_gu
Jiatao Gu
6 months
Life update: Excited to share that I will be joining @CIS_Penn @PennEngineers as an Assistant Professor in Fall 2025!🤯. I’m also seeking multiple PhD students passionate about Generative Intelligence and leveraging it to empower AI agents to interact with the Physical World🌟
Tweet media one
1
0
68
@sainingxie
Saining Xie
3 years
Looking forward to attending in person tomorrow!.
@gberta227
Gedas Bertasius
3 years
A reminder that our Transformers for Vision workshop in #CVPR22 is happening this Sunday, June 19th at 7:50am CST in Great Hall D (and also on Zoom). We have an amazing speaker lineup, great panelists, and excellent paper sessions. Looking forward to seeing everybody!
Tweet media one
2
6
64
@sainingxie
Saining Xie
1 year
Excited to see more open multimodal LLMs! Also quite impressive vision-centric performance on e.g. MMVP ( with Gemma-2B.
Tweet media one
@giffmana
Lucas Beyer (bl16)
1 year
We release PaliGemma. I'll keep it short, still on vacation: - sota open base VLM designed to transfer quickly, easily, and strongly to a wide range of tasks - also does detection and segmentation - we provide lots of examples - meaty tech report later!
Tweet media one
2
7
64
@sainingxie
Saining Xie
1 month
not vibe/value alignment but “reality alignment”: we could let the model imagine in the multi-universe, then align it to earth in post-training. @giffmana you might find this interesting too then:
@giffmana
Lucas Beyer (bl16)
1 month
This paper is interestingly thought-provoking for me. There is a chance that it's easier to "align a t2i model with real physics" in post-training. And let it learn to generate whatever (physically implausible) combinations in pretrain. As opposed to trying hard to come up with…
2
12
64
@sainingxie
Saining Xie
7 months
Some key observations: 1⃣ As many have noticed recently, diffusion transformers can produce reasonable representations, and better generative models lead to stronger representations. 2⃣ However, these are still much weaker than sota visual representations learned through SSL
Tweet media one
2
2
61
@sainingxie
Saining Xie
11 months
(🤷 Now a bit of a rant) The real issue is that working on visual representation learning is quite challenging right now. While CLIP-based models, which are strongly supervised by language, have proven to be effective, they come with their own set of problems, such as attribute
Tweet media one
2
0
62
@sainingxie
Saining Xie
11 months
Finally, this is a completely open project where we have released the training code, model weights, all benchmarks, and detailed information such as system prompts and evaluation pipelines. These aspects are sometimes overlooked in research papers, so we made sure to provide them
Tweet media one
1
1
56
@sainingxie
Saining Xie
7 months
Welcome Gautam! So excited to have you join us 🥳👏!! Thinking about a PhD in AI? @NYU_Courant is a great place, and so is NYC. Apply before the deadline!
@thegautamkamath
Gautam Kamath
7 months
Excited to announce that I'm joining NYU Courant (@NYU_Courant) CS as an Assistant Professor in Fall 2025 (~1 year from now). If you wish to work with me as a PhD student in theory/empirics of robustness/privacy in stats/ML (or related topics), apply to Courant CS by Dec 12! 1/n
Tweet media one
0
1
54
@sainingxie
Saining Xie
1 year
We integrate the VQA LLM with a visual search model. With LLM's world knowledge, V* performs multi-round, guided search for visual targets. It extracts local features, and adds them to a working memory. The searched data are then used by VQA LLM to generate final responses. (4/n)
Tweet media one
1
5
56
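A rough sketch of the multi-round loop the tweet above describes, using hypothetical stubs (`llm_propose_target`, `visual_search`, `vqa_answer`) rather than the actual SEAL/V* implementation: the LLM decides what to look for, the search model returns local features into a working memory, and the VQA LLM answers from that memory.

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    crops: list = field(default_factory=list)   # local features found so far

def llm_propose_target(question: str, memory: WorkingMemory):
    """LLM uses world knowledge to decide what to look for next (stub)."""
    return None if memory.crops else "traffic light"

def visual_search(image, target: str):
    """Guided search model localizes the target and returns a local crop (stub)."""
    return {"target": target, "feature": f"local feature of {target}"}

def vqa_answer(question: str, image, memory: WorkingMemory) -> str:
    """VQA LLM answers using global context plus the searched local features (stub)."""
    return f"answer using {len(memory.crops)} searched region(s)"

def multi_round_inference(image, question: str, max_rounds: int = 3) -> str:
    memory = WorkingMemory()
    for _ in range(max_rounds):                 # multi-round, LLM-guided search
        target = llm_propose_target(question, memory)
        if target is None:                      # nothing left to search for
            break
        memory.crops.append(visual_search(image, target))
    return vqa_answer(question, image, memory)

print(multi_round_inference("img.jpg", "What color is the small traffic light?"))
```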
@sainingxie
Saining Xie
1 year
We have the model & a local gradio demo that you can download & play with:
@abacaj
anton
1 year
This new V* (guided visual search) model & paper is actually a big deal in my opinion. GPT-4V could *not* solve this google recaptcha I had been testing. But now, with the help of the guided V* model, it could find the final traffic light
Tweet media one
Tweet media two
0
5
55
@sainingxie
Saining Xie
1 year
Great to see this got a spotlight at ICLR & the community has welcomed a "pure data" paper with open arms. Congrats @Hu_Hsu!.
@sainingxie
Saining Xie
2 years
This difference in data sources and filters is highlighted in our "Demystifying CLIP Data" ( paper. Instead of viewing it as a fresh "MetaCLIP" model family, think of it as a "manual for building CLIP from the ground up".
0
2
54
@sainingxie
Saining Xie
11 months
This is why we're developing Project Cambrian to pave the way for more vision-centric explorations. Our key positioning is that multimodal LLM frameworks, like LLaVA, provide excellent evaluation protocols for visual representation learning:. 1) They can seamlessly integrate
Tweet media one
2
3
52
@sainingxie
Saining Xie
8 months
Congrats @drfeifei! Excited to see so many brilliant vision researchers (including my ex-FAIR colleagues @jcjohnss @chaoyuaw) taking on what I believe is *the most exciting* challenge in AI. Spatial intelligence FTW.
@drfeifei
Fei-Fei Li
8 months
What is a really really hard problem to work on in #AI? My own answer is Spatial Intelligence - a technology that could empower and enable countless possible use cases in creation, design, learning, AR/VR, robotics, and beyond. It’s a real honor that my cofounders @jcjohnss…
1
5
52
@sainingxie
Saining Xie
4 months
The true golden age is one of exploration, not exploitation.
@NandoDF
Nando de Freitas
4 months
This thread by @scott_e_reed, one of the best deep learning researchers in the world, summarises well what many experienced working for industrial AI labs over the last two years: 1. Winner-take-all politics 2. An erosion of our ability to innovate 3. An erosion of our belief…
0
3
50
@sainingxie
Saining Xie
1 year
The project was led by our amazing PhD student @TongPetersb at @NYU_Courant, and it was a great collaboration with folks at NYU and Berkeley! ArXiv: Blogpost: Code:
3
7
50
@sainingxie
Saining Xie
6 months
imagination is generative; control is 3D.
@jcjohnss
Justin Johnson
6 months
Today we're sharing our first research update @theworldlabs -- a generative model of 3D worlds!. I'm super proud of what the team has achieved so far, and can't wait to see what comes next. Lifting GenAI to 3D will change the way we make media, from movies to games and more!.
1
4
49
@sainingxie
Saining Xie
11 months
The connector that integrates vision and language is crucial in the Cambrian framework and is regarded as a part of the visual representation. Relying on a simple MLP projector may not fully harness the potential of a good visual representation. In Cambrian-1, we propose a
Tweet media one
2
2
44
@sainingxie
Saining Xie
2 months
now I go to two NYC film festivals every year. highly recommend.
@c_valenzuelab
Cristóbal Valenzuela
2 months
Tickets for the 3rd Annual AI Film Festival in NYC and LA are now available!.
0
1
44
@sainingxie
Saining Xie
16 days
Congrats @rob_fergus ! Big win for FAIR.
@rob_fergus
Rob Fergus
16 days
1/ Excited to share that I’m taking on the role of leading Fundamental AI Research (FAIR) at Meta. Huge thanks to Joelle for everything. Look forward to working closely again with Yann & team.
1
2
44
@sainingxie
Saining Xie
11 months
a fun collaboration with the systems group at nyu. through sparse all2all comm, dynamic load balancing, and a large-batch hyperparameter scaling rule, you can now finally train your large 3dgs on many gpus 🔥 without any loss in quality. led by @zhaohexu2001, haoyang & @fred_lu_443.
Tweet media one
@janusch_patas
MrNeRF
11 months
On Scaling Up 3D Gaussian Splatting Training. Project: Code (Apache 2.0): => Parallelize training over multiple GPUs. Make sure to check out the project page, it is awesome! Method ⬇️
0
6
43
@sainingxie
Saining Xie
11 months
Let’s start with benchmarking. Currently, the field is quite chaotic. While having more benchmarks is beneficial as it captures diverse behaviors, consolidating and interpreting results from various tasks becomes almost impossible, and different studies select different sets of
Tweet media one
Tweet media two
5
1
42
@sainingxie
Saining Xie
1 year
With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across 9 basic visual patterns, often providing incorrect answers and hallucinated
Tweet media one
2
4
41
@sainingxie
Saining Xie
4 years
How many labels do we need to train an instance segmentation model for 3D scenes? It turns out not too many! With the help of our new pre-training method Contrastive Scene Contexts, only 20 annotated points per scene are good enough to produce high quality results on ScanNet!.
@j1h0u
Ji Hou
4 years
Sharing our new work Contrastive Scene Contexts, a new pre-training method for data-efficient learning in 3D. New ScanNet benchmark coming up soon!. Project: Paper: (w/ BenGraham @MattNiessner @sainingxie).
0
4
40
@sainingxie
Saining Xie
2 years
The awesome @billpeeb is presenting scaling diffusion transformers now (until 4:30pm) at #ICCV2023 in Room "Foyer Sud" - 188. He will be giving an oral talk right after the poster session. I will also be there! Come say hi if you are around!
Tweet media one
2
3
39
@sainingxie
Saining Xie
1 year
This is mind blowing 🤯 congrats @billpeeb and the team.
@model_mechanic
Aditya Ramesh
1 year
Excited to share what @billpeeb @_tim_brooks and my team has been working on for the past year! Our text-to-video model Sora can generate videos of complex scenes up to a minute long. We're excited about making this step toward AI that can reason about the world like we do.
0
0
39
@sainingxie
Saining Xie
5 months
I'm delighted to see how the fun brainstorming with @drfeifei on spatial intelligence this year has evolved into an amazing collaboration between NYU, Yale, and Stanford. A huge shoutout to @jihanyang13, @shushengyang, @AnjaliWGupta, and @rilyn_han for leading this effort! If…
2
1
39
@sainingxie
Saining Xie
1 year
since many of you are curious -- the video was done by just me and students using tools like @dalle_openai @pika_labs @capcutapp @googleearth, music @blurofficial parklife (1994) -- we are ai researchers, and we are also ai users.
2
0
38
@sainingxie
Saining Xie
1 year
If you're interested in an internship at Google Research focusing on video generation🎞️, Xuan is definitely the person to talk to!.
@XuanYang1
Xuan Yang
1 year
We are hiring PhD interns to work on video generation research at Google Research US. Please reach out to xuanyang@google.com if you are interested.
0
3
35
@sainingxie
Saining Xie
5 months
In vision, we handle space but rarely reason; multimodal LLMs think but often ignore spatial logic. Yet as humans—from taking a mental rotation test to picking out furniture for a new home—we rely on spatial and visual thinking that doesn’t always translate well into words. [2/n]
Tweet media one
Tweet media two
2
1
37
@sainingxie
Saining Xie
2 years
NYU Courant is looking for motivated First Years and Sophomores of all backgrounds to join a 6 week summer program in AI research! Expenses paid plus stipend.
@NYU_Courant
NYU Courant
2 years
Applications are open for the Summer 2023 Pathways to AI Program!. We are seeking first and second-year undergraduates pursuing careers in AI research. Students from underrepresented groups are especially encouraged to learn more and apply by Feb. 1st:
Tweet media one
1
9
36
@sainingxie
Saining Xie
1 year
@wightmanr I think it’s most likely a 3D convnet similar to Magvit.
2
0
34
@sainingxie
Saining Xie
11 months
Let's talk about our visual representations. It's no surprise that CLIP models come out on top, but here are some interesting takeaways (about SSL): 1) Unfreezing the vision encoder is generally very beneficial but provides more significant improvements for SSL models on
Tweet media one
Tweet media two
1
0
34
@sainingxie
Saining Xie
1 year
This goes beyond just theoretical concerns; the missing mechanism causes failures in multimodal LLMs. In the following VQA examples, even GPT4V struggles and hallucinates answers. But there's a solution: our model (SEAL) can accurately answer them, thanks to the V* search. (3/n)
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
3
35
@sainingxie
Saining Xie
1 year
Finally, we tried to recreate @ylecun's famous LeCake® -- we can take a picture of a cake, slice it, put one (or several) cherries on top, and even twist the stem to our liking -- I personally think it looks yummier than the original one :) As far as I'm aware, this is a
Tweet media one
Tweet media two
2
5
31
@sainingxie
Saining Xie
4 months
Btw V-IRL ( VLN is such a natural setting to eval OOD behavior for a multimodal agent: you can train the agent in one city, and then test it in a completely different environment. Same action space but completely different visual context to adapt to.
@TianzheC
Tianzhe Chu
4 months
[3.2] 🗺️V-IRL consists of two rules based on shifts of action orientation: {east, west, …} and {left, right, …}. Visual variants are characterized by different locations (e.g., train the agent to navigate in one place, and ask it to navigate in another).
0
9
32
@sainingxie
Saining Xie
1 year
Why does this matter? Consider everyday situations like locating keys on a cluttered table or spotting a friend in a crowd: we engage our system II and actively *search* for the necessary visual info -- we do not have an 'internal CLIP' that shows us everything all at once. (2/n).
2
0
33
@sainingxie
Saining Xie
2 years
Congrats! It finally feels like the AI version of a CERN paper now :) We should have more global/open initiatives of this kind. Next up, 5000 co-authors 😀?.
@LerrelPinto
Lerrel Pinto
2 years
It was super fun working with @QuanVng and the Google team on building a robot model that works across robots. While the results are wonderful, I'm most excited about the data from several amazing labs being accessible on a common platform.
2
1
31
@sainingxie
Saining Xie
11 months
As a by-product of our explorations, our project has developed a highly capable MLLM model that significantly outperforms other methods like miniGemini and LLaVA-Next, using the same underlying LLMs. Remarkably, we achieve this with only a quarter of the visual tokens.
Tweet media one
1
3
32
@sainingxie
Saining Xie
1 year
hybrid model (Conv + DiT) ftw?.
@RiversHaveWings
Rivers Have Wings
1 year
Hourglass + Diffusion = ❤️. We introduce a new transformer backbone for diffusion models that can directly generate megapixel images without the need for multiple stages like latent diffusion. Read here! → Project page →
Tweet media one
2
0
33
@sainingxie
Saining Xie
2 years
This difference in data sources and filters is highlighted in our "Demystifying CLIP Data" ( paper. Instead of viewing it as a fresh "MetaCLIP" model family, think of it as a "manual for building CLIP from the ground up".
@NielsRogge
Niels Rogge
2 years
CLIP by @OpenAI was revolutionary, but its data curation pipeline was never detailed nor open-sourced. @Meta has now released MetaCLIP, a fully open-source replication. Models are on the hub:
1
2
32
@sainingxie
Saining Xie
11 months
@giffmana Haha, I think this looks much better than my rushed deadline attempt. (If you're not sure what Lucas is referring to, this was the original one.)
Tweet media one
3
1
31
@sainingxie
Saining Xie
1 year
Finally, visual search has been a core problem in cogsci / vision science for decades (e.g. pioneering work by @jeremymwolfe). Interestingly, when compared with human eye fixations, the LLM guided V* can attain an efficiency comparable to that of human visual search! (6/n)
Tweet media one
Tweet media two
2
1
29
@sainingxie
Saining Xie
1 year
@GoogleDeepMind @LFC So glad to see this as a kop, can it pull off a corner taken quickly though? ;).
1
0
24
@sainingxie
Saining Xie
8 months
A new efficient video captioner and benchmark. Now available open source too. Led by the amazing @re5e1f !.
@wenhaocha1
Wenhao Chai
8 months
(1/n) 📸Want to train a better video generation model? You need a better video captioner first! Check out AuroraCap! Trained on 20M+ high-quality data, it’s more efficient at inference. #lmms #videogen. 🔗Homepage:
Tweet media one
1
0
27
@sainingxie
Saining Xie
5 months
One of my favorite parts of our study is the analysis showing how different these tasks are from language-centric intelligence. When asked to explain themselves, LLMs revealed that spatial reasoning—not object recognition or language ability—is the main bottleneck. They often
Tweet media one
Tweet media two
2
2
25
@sainingxie
Saining Xie
5 months
Finally, we also probe the model by prompting it to "visualize" its memory on a Cartesian grid, where each occupied cell represents an object center. Our findings reveal that when processing spatial information, a MLLM constructs a series of localized world models from the given
Tweet media one
1
1
24
@sainingxie
Saining Xie
11 months
exactly. that’s the power of open science and knowledge sharing: we learn and gain confidence in what we find!.
@giffmana
Lucas Beyer (bl16)
11 months
PS: A couple of other very nice papers came out during the making of this (MM1, Cambrian-1, Idefics2, … what an active field!), which reflect some of our findings - great, we can be more confident about those!
1
2
25