Tyler Zhu
@tyleryzhu
PhD student @VisualAILab | SR @GoogleDeepMind | prev @berkeley_ai | @SFGiants @warriors guy
Berkeley, CA
Joined March 2020
Today seems to be a fitting day for @GoogleDeepMind news, so I'm excited to announce our new preprint! Prior work suggests that text & img repr's are converging, albeit weakly. We found these same models actually have strong alignment; the inputs were too impoverished to see it!
Thinking (test-time compute) in pixel space... 🍌 Pro tip: always peek at the thoughts if you use AI Studio. Watching the model think in pictures is really fun!
Today, we present a step-change in robotic AI @sundayrobotics. Introducing ACT-1: A frontier robot foundation model trained on zero robot data. - Ultra long-horizon tasks - Zero-shot generalization - Advanced dexterity 🧵->
In moving from static images and text to dynamic videos and text descriptions, we better reflect Plato's vision of perception, which is grounded in reality, not merely shadows on the cave wall. This is a step towards that, but there are still many unanswered Qs (eg, generative models?)
Finally, we show that this alignment, despite being a "semantic" metric, is promising as a zero-shot video probe of downstream tasks. There is a strong pos. correlation w/ semantic tasks like action class., but also geometric ones like obj tracking and depth estimation!
We create Chinchilla-style scaling laws to quantify this scaling behavior. The fitted values carry the same story as above: the coefficients indicate the maximum penalty for a poor approximation, and the exponents measure how fast you're able to incorporate new data.
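The thread doesn't give the exact functional form or numbers, but a Chinchilla-style saturating power law can be fit in a few lines. A minimal sketch, with made-up illustrative data (NOT the paper's results) and the asymptote A fixed to a guess so the fit becomes linear in log space:

```python
import numpy as np

# Illustrative data (hypothetical): alignment score vs. number of captions.
n = np.array([1, 2, 4, 8, 16, 32], dtype=float)
y = np.array([0.20, 0.26, 0.31, 0.35, 0.38, 0.40])

# Assume a saturating power law y = A - B * n**(-beta), Chinchilla-style:
# A is the alignment ceiling, B the maximum penalty for a poor approximation,
# beta how fast extra captions close the gap.
A = 0.42  # assumed asymptote; with A fixed, log(A - y) is linear in log(n)
log_gap = np.log(A - y)                      # = log(B) - beta * log(n)
beta, logB = np.polyfit(-np.log(n), log_gap, 1)
B = np.exp(logB)

pred = A - B * n ** (-beta)                  # fitted curve, increasing in n
```

In the real paper the asymptote would be fit jointly with the other parameters; this two-step version just shows why the coefficient and exponent read as "max penalty" and "rate of improvement".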
This only gets better as we scale along both axes of frames and captions. While both VideoMAEv2 and DINOv2 get much better with more captions, VideoMAEv2 uses video info better. In total, we nearly double alignment from what a single image & caption can offer by matching reality.
We benchmark 121 models+variants in total. On the same setting as the original PRH, we reproduce that image models are at best only ~20% aligned w/ SoTA LLMs (Gemma) using a single caption. Native video models instead have both the best repr's (retrieval) and alignment (25%!)
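The thread doesn't spell out the alignment metric, but the original PRH work popularized mutual k-nearest-neighbor overlap between two embedding spaces. A minimal sketch of that idea (not necessarily this paper's exact implementation):

```python
import numpy as np

def mutual_knn_alignment(X, Y, k=5):
    """Fraction of shared k-NN sets between two embedding spaces.

    X: (n, d1) embeddings of n items from one model (e.g. a video encoder),
    Y: (n, d2) embeddings of the same n items from another (e.g. an LLM).
    Returns a score in [0, 1]; 1.0 means identical neighborhood structure.
    """
    def knn_indices(Z):
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # cosine similarity
        sim = Z @ Z.T
        np.fill_diagonal(sim, -np.inf)                    # exclude self-matches
        return np.argsort(-sim, axis=1)[:, :k]

    nx, ny = knn_indices(X), knn_indices(Y)
    overlap = [len(set(a) & set(b)) for a, b in zip(nx, ny)]
    return float(np.mean(overlap)) / k
```

A score of ~0.20 under a metric like this would mean that, on average, only one in five nearest neighbors agrees across the two spaces, which is why 0.40 is a substantial jump.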
The key to this is using videos and multiple captions, which more accurately reflect the true underlying scenes. We use both the VaTeX dataset as well as PVD w/ synthesized captions, and we sample varying amounts of visual/text info to understand their relationship better.
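One simple way to "sample varying amounts of text info," as the tweet describes, is to pool progressively more caption embeddings per video before measuring alignment. A hypothetical mean-pooling sketch (names and shapes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 100 videos, each with 10 candidate captions
# already embedded by a text model into 64-d vectors.
caption_embs = rng.normal(size=(100, 10, 64))

def pooled_text_embedding(embs, n_captions):
    """Average the first n caption embeddings per video.

    More captions -> a richer estimate of the underlying scene's text
    representation, mirroring the frames axis on the visual side.
    """
    return embs[:, :n_captions, :].mean(axis=1)

one_cap  = pooled_text_embedding(caption_embs, 1)   # single-caption baseline
many_cap = pooled_text_embedding(caption_embs, 10)  # multi-caption representation
```

Sweeping `n_captions` (and, analogously, the number of frames) is what lets the alignment-vs-information curves above be traced out.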
However, their original study found that alignment b/w image models and LLMs capped at 0.16. Does this mean 0.16 is strong alignment, or that there is still a strong gap b/w models? We found that by moving to dynamic inputs (and video models), we could achieve scores of 0.40!
The "Platonic Representation Hypothesis," by @phillip_isola and co., posited that different NNs trained on different data and modalities, i.e. large ViTs and LLMs, are converging to a shared model of reality. This makes sense, as all data is a projection of a shared reality!
arXiv page: https://t.co/NsGdjZW4qb Project overview page: https://t.co/Kk24V0Q3AZ hf page: https://t.co/akV6QkH7kY Code coming soon! Work done at Google DeepMind, and in collaboration with the fantastic team of @TengdaHan, Leo Guibas, Viorica Patraucean, and Maks Ovsjanikov.
Ritwik is a great mentor and figures to be an even better advisor. You’re missing out if you have shared interests and don’t apply!
I am recruiting Ph.D. students at @umdcs starting Fall 2026! I am looking for students in three broad areas: (1) Physics-integrated computer vision (2) VLMs with constraints (3) Dual-use AI policy We're ranked #3 in AI on @CSrankings! Specific details in 🧵
Text-to-image (T2I) models can generate rich supervision for visual learning, but generating subtle distinctions remains challenging. Fine-tuning helps, but too much tuning → overfitting and loss of diversity. How do we preserve fidelity without sacrificing diversity? (1/8)
Last Friday, I wrapped up my 24-week Student Researcher role at @GoogleDeepMind in London. I’m deeply thankful to my hosts @PetarV_93 and @re_rayne for their guidance, and to all the brilliant minds at @GoogleDeepMind for their inspiration and collaboration. I’ve also had a lot
🚨🌶️ Did you realise you can get alignment 'training' data out of open weights models? Oops We show that models will regurgitate alignment data that is (semantically) memorised. This data can come from SFT and RL... and can be used to train your own models! 🧵
@farhadi @AlisonGopnik @ICCVConference @eunice_yiu_ @phillip_isola Last but not least, @sammtmd is closing out the day by telling us about the very exciting and timely Physics-IQ benchmark track and its contestants
@farhadi @AlisonGopnik @ICCVConference @eunice_yiu_ @phillip_isola is now giving a talk on symbol grounding in multimodal AI! Floor 4 Ballroom B
If you are at ICCV, I'm giving a talk here at 3:30pm in Ballroom B on "Revisiting the symbol grounding problem in the age of multi-modal AI" Will cover recent work on multimodal rep alignment in unpaired and unimodal models.
Join us TODAY for the 3rd Perception Test Challenge https://t.co/DVHQFjkyuA
@ICCV2025! Ballroom B, Full day Amazing lineup of speakers: @farhadi, @AlisonGopnik, Philipp Krähenbühl, @phillip_isola
@farhadi @AlisonGopnik @ICCVConference @eunice_yiu_ is presenting the Kiva track now, along with the winning team presentations!