Tyler Zhu
@tyleryzhu
PhD student @VisualAILab | SR @GoogleDeepMind | prev @berkeley_ai | @SFGiants @warriors guy
Berkeley, CA
Joined March 2020
Today seems to be a fitting day for @GoogleDeepMind news, so I'm excited to announce our new preprint! Prior work suggests that text & img repr's are converging, albeit weakly. We found these same models actually have strong alignment; the inputs were too impoverished to see it!
Thinking (test-time compute) in pixel space... 🍌 Pro tip: always peek at the thoughts if you use AI Studio. Watching the model think in pictures is really fun!
Today, we present a step-change in robotic AI @sundayrobotics. Introducing ACT-1: A frontier robot foundation model trained on zero robot data. - Ultra long-horizon tasks - Zero-shot generalization - Advanced dexterity 🧵->
In moving from static images and text to dynamic videos and text descriptions, we better reflect Plato's vision of perception, which is grounded in reality, not merely shadows on the cave wall. This is a step towards that, but there are still many unanswered Qs (eg, generative models?)
Finally, we show that this alignment, despite being a "semantic" metric, is promising as a zero-shot video probe of downstream tasks. There is a strong pos. correlation w/ semantic tasks like action class., but also geometric ones like obj tracking and depth estimation!
We create Chinchilla-style scaling laws to quantify this scaling behavior. The fitted values carry the same story as above: the coefficients indicate the maximum penalty for a poor approximation, and the exponents measure how fast you're able to incorporate new data.
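The thread doesn't give the exact functional form or numbers, but a Chinchilla-style saturating power law can be fit in a few lines. A minimal sketch, with made-up illustrative data (NOT the paper's results) and the asymptote A fixed to a guess so the fit becomes linear in log space:

```python
import numpy as np

# Illustrative data (hypothetical): alignment score vs. number of captions.
n = np.array([1, 2, 4, 8, 16, 32], dtype=float)
y = np.array([0.20, 0.26, 0.31, 0.35, 0.38, 0.40])

# Assume a saturating power law y = A - B * n**(-beta), Chinchilla-style:
# A is the alignment ceiling, B the maximum penalty for a poor approximation,
# beta how fast extra captions close the gap.
A = 0.42  # assumed asymptote; with A fixed, log(A - y) is linear in log(n)
log_gap = np.log(A - y)                      # = log(B) - beta * log(n)
beta, logB = np.polyfit(-np.log(n), log_gap, 1)
B = np.exp(logB)

pred = A - B * n ** (-beta)                  # fitted curve, increasing in n
```

In the real paper the asymptote would be fit jointly with the other parameters; this two-step version just shows why the coefficient and exponent read as "max penalty" and "rate of improvement".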
This only gets better as we scale along both axes of frames and captions. While both VideoMAEv2 and DINOv2 get much better with more captions, VideoMAEv2 uses video info better. In total, we nearly double alignment from what a single image & caption can offer by matching reality.
We benchmark 121 models+variants in total. On the same setting as the original PRH, we reproduce that image models are at best only ~20% aligned w/ SoTA LLMs (Gemma) using a single caption. Native video models instead have both the best repr's (retrieval) and alignment (25%!)
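The thread doesn't spell out the alignment metric, but the original PRH work popularized mutual k-nearest-neighbor overlap between two embedding spaces. A minimal sketch of that idea (not necessarily this paper's exact implementation):

```python
import numpy as np

def mutual_knn_alignment(X, Y, k=5):
    """Fraction of shared k-NN sets between two embedding spaces.

    X: (n, d1) embeddings of n items from one model (e.g. a video encoder),
    Y: (n, d2) embeddings of the same n items from another (e.g. an LLM).
    Returns a score in [0, 1]; 1.0 means identical neighborhood structure.
    """
    def knn_indices(Z):
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # cosine similarity
        sim = Z @ Z.T
        np.fill_diagonal(sim, -np.inf)                    # exclude self-matches
        return np.argsort(-sim, axis=1)[:, :k]

    nx, ny = knn_indices(X), knn_indices(Y)
    overlap = [len(set(a) & set(b)) for a, b in zip(nx, ny)]
    return float(np.mean(overlap)) / k
```

A score of ~0.20 under a metric like this would mean that, on average, only one in five nearest neighbors agrees across the two spaces, which is why 0.40 is a substantial jump.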
The key to this is using videos and multiple captions, which more accurately reflect the true underlying scenes. We use both the VaTeX dataset as well as PVD w/ synthesized captions, and we sample varying amounts of visual/text info to understand their relationship better.
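One simple way to "sample varying amounts of text info," as the tweet describes, is to pool progressively more caption embeddings per video before measuring alignment. A hypothetical mean-pooling sketch (names and shapes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 100 videos, each with 10 candidate captions
# already embedded by a text model into 64-d vectors.
caption_embs = rng.normal(size=(100, 10, 64))

def pooled_text_embedding(embs, n_captions):
    """Average the first n caption embeddings per video.

    More captions -> a richer estimate of the underlying scene's text
    representation, mirroring the frames axis on the visual side.
    """
    return embs[:, :n_captions, :].mean(axis=1)

one_cap  = pooled_text_embedding(caption_embs, 1)   # single-caption baseline
many_cap = pooled_text_embedding(caption_embs, 10)  # multi-caption representation
```

Sweeping `n_captions` (and, analogously, the number of frames) is what lets the alignment-vs-information curves above be traced out.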
However, their original study found that alignment b/w image models and LLMs capped at 0.16. Does this mean 0.16 is strong alignment, or that there is still a strong gap b/w models? We found that by moving to dynamic inputs (and video models), we could achieve scores of 0.40!
The "Platonic Representation Hypothesis," by @phillip_isola and co., posited that different NNs trained on different data and modalities, i.e. large ViTs and LLMs, are converging to a shared model of reality. This makes sense, as all data is a projection of a shared reality!
arXiv page: https://t.co/NsGdjZW4qb Project overview page: https://t.co/Kk24V0Q3AZ hf page: https://t.co/akV6QkH7kY Code coming soon! Work done at Google DeepMind, and in collaboration with the fantastic team of @TengdaHan, Leo Guibas, Viorica Patraucean, and Maks Ovsjanikov.
Ritwik is a great mentor and figures to be an even better advisor. You’re missing out if you have shared interests and don’t apply!
I am recruiting Ph.D. students at @umdcs starting Fall 2026! I am looking for students in three broad areas: (1) Physics-integrated computer vision (2) VLMs with constraints (3) Dual-use AI policy We're ranked #3 in AI on @CSrankings! Specific details in 🧵
Text-to-image (T2I) models can generate rich supervision for visual learning, but generating subtle distinctions remains challenging. Fine-tuning helps, but too much tuning → overfitting and loss of diversity. How do we preserve fidelity without sacrificing diversity? (1/8)
Last Friday, I wrapped up my 24-week Student Researcher role at @GoogleDeepMind in London. I’m deeply thankful to my hosts @PetarV_93 and @re_rayne for their guidance, and to all the brilliant minds at @GoogleDeepMind for their inspiration and collaboration. I’ve also had a lot
🚨🌶️ Did you realise you can get alignment 'training' data out of open weights models? Oops We show that models will regurgitate alignment data that is (semantically) memorised. This data can come from SFT and RL... and can be used to train your own models! 🧵
@farhadi @AlisonGopnik @ICCVConference @eunice_yiu_ @phillip_isola Last but not least, @sammtmd is closing out the day by telling us about the very exciting and timely Physics-IQ benchmark track and its contestants
@farhadi @AlisonGopnik @ICCVConference @eunice_yiu_ @phillip_isola is now giving a talk on symbol grounding in multimodal AI! Floor 4 Ballroom B
If you are at ICCV, I'm giving a talk here at 3:30pm in Ballroom B on "Revisiting the symbol grounding problem in the age of multi-modal AI" Will cover recent work on multimodal rep alignment in unpaired and unimodal models.
Join us TODAY for the 3rd Perception Test Challenge https://t.co/DVHQFjkyuA
@ICCV2025! Ballroom B, Full day Amazing lineup of speakers: @farhadi, @AlisonGopnik, Philipp Krähenbühl, @phillip_isola
@farhadi @AlisonGopnik @ICCVConference @eunice_yiu_ is presenting the Kiva track now, along with the winning team presentations!