Kevin Chih-Yao Ma
@chihyaoma
Followers: 623
Following: 465
Media: 30
Statuses: 103
Building multimodal foundation models @MicrosoftAI | Past: a lead IC & babysitter of Meta's MovieGen, Emu, Imagine, ...
Seattle, WA
Joined August 2014
💼 [Career Update] I feel fortunate to have spent the past 4 years at Meta, working with some of the brightest minds to build Meta’s multimodal foundation models — MovieGen, Emu, and more. It shaped me not only as a researcher but as a person. Now, it’s time for a new chapter. I…
8
4
146
Meet our third @MicrosoftAI model: MAI-Image-1
#9 on LMArena, striking an impressive balance of generation speed and quality
Excited to keep refining + climbing the leaderboard from here! We're just getting started. https://t.co/33BiNfIjPg
37
82
518
Vision tokenizers are stuck in 2020 🤔 while language models revolutionized AI 🚀
Language: One tokenizer for everything
Vision: Fragmented across modalities & tasks
Introducing AToken: The first unified visual tokenizer for images, videos & 3D that does BOTH reconstruction AND…
Apple presents AToken: A unified visual tokenizer
• First tokenizer unifying images, videos & 3D
• Shared 4D latent space (preserves both reconstruction & semantics)
• Strong across gen & understanding tasks (ImageNet 82.2%, MSRVTT 32.6%, 3D acc 90.9%)
6
73
374
🐺 Introducing the Werewolf Benchmark, an AI test for social reasoning under pressure. Can models lead, bluff, and resist manipulation in live, adversarial play? 👉 We made 7 of the strongest LLMs, both open-source and closed-source, play 210 full games of Werewolf. Below is…
80
142
774
A picture is now worth more than a thousand words in genAI; it can be turned into a full 3D world! And you can stroll through this garden for as long as you like; it will still be there.
149
345
3K
What if you could not only watch a generated video, but explore it too? 🌐 Genie 3 is our groundbreaking world model that creates interactive, playable environments from a single text prompt. From photorealistic landscapes to fantasy realms, the possibilities are endless. 🧵
828
3K
14K
🚀Introducing Hierarchical Reasoning Model🧠🤖 Inspired by the brain's hierarchical processing, HRM delivers unprecedented reasoning power on complex tasks like ARC-AGI and expert-level Sudoku using just 1k examples, no pretraining or CoT! Unlock the next AI breakthrough with…
227
653
4K
Introducing AlphaEarth Foundations, an AI model that integrates petabytes of satellite data into a single digital representation of Earth. It'll give scientists a nearly real-time view of the planet at incredible spatial precision, and help with critical issues like food security…
deepmind.google
New AI model integrates petabytes of Earth observation data to generate a unified data representation that revolutionizes global mapping and monitoring
Our new AI model AlphaEarth Foundations is mapping the planet in astonishing detail. 🌏🔍 Scientists will now be able to track the impact of deforestation, monitor crop health, and more – significantly faster, thanks to our new datasets. 🧵
122
450
3K
Scaling CLIP on English-only data is outdated now…
🌍 We built a CLIP data curation pipeline for 300+ languages
🇬🇧 We train MetaCLIP 2 without compromising English-task performance (it actually improves!)
🥳 It’s time to drop the language filter!
📝 https://t.co/pQuwzH053M [1/5] 🧵
3
92
312
Excited to share what I worked on during my time at Meta.
- We introduce a Triton-accelerated Transformer with *2-simplicial attention*, a tri-linear generalization of dot-product attention
- We show how to adapt RoPE to tri-linear forms
- We show 2-simplicial attention scales…
26
98
816
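For context on the tri-linear form named in the post above, here is a minimal NumPy sketch of what a 2-simplicial attention score could look like: the logit for each (query, key, key') triple is a trilinear product instead of a dot product. The scaling, the joint softmax over key pairs, and the elementwise value combination are assumptions made for illustration, not the paper's Triton kernel.

```python
import numpy as np

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Hedged sketch of 2-simplicial (tri-linear) attention.

    q:      (n, d) queries
    k1, k2: (n, d) two sets of keys
    v1, v2: (n, d) two sets of values

    Standard attention scores are bilinear (q_i . k_j); the tri-linear
    generalization scores every (query, key, key') triple:
        A[i, j, k] = sum_d q[i, d] * k1[j, d] * k2[k, d]
    """
    n, d = q.shape
    # Tri-linear logits over (i, j, k); scaled like dot-product attention.
    logits = np.einsum("id,jd,kd->ijk", q, k1, k2) / np.sqrt(d)
    # Softmax jointly over all (j, k) key pairs for each query i.
    flat = logits.reshape(n, n * n)
    weights = np.exp(flat - flat.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    weights = weights.reshape(n, n, n)
    # One plausible value combination: elementwise product of the two values.
    pair_values = np.einsum("jd,kd->jkd", v1, v2)        # (n, n, d)
    return np.einsum("ijk,jkd->id", weights, pair_values)  # (n, d)

# Tiny usage example with random tensors.
rng = np.random.default_rng(0)
q, k1, k2, v1, v2 = (rng.standard_normal((4, 8)) for _ in range(5))
print(two_simplicial_attention(q, k1, k2, v1, v2).shape)  # (4, 8)
```

Note that this naive form is cubic in sequence length, which is presumably why an efficient (e.g. Triton-accelerated) kernel matters in practice.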
Self-supervised learning (SSL) vs. contrastive language-image (CLIP) models is a never-ending battle. What about using both? This Google paper does exactly that, and the results are really good on many different tasks. TIPS is a model trained with a CLIP loss and 2 SSL losses [1/9]
5
75
602
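Since the post above only names the ingredients (a CLIP loss plus two SSL losses), here is a hedged sketch of how such a combination is typically wired up. The contrastive term is the standard symmetric InfoNCE form; the specific SSL objectives and the loss weights are placeholders, not the TIPS recipe.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Standard symmetric CLIP-style InfoNCE loss.

    img_emb, txt_emb: (B, D) L2-normalized embeddings of matched pairs.
    """
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def combined_loss(clip_l, ssl_a, ssl_b, w_a=1.0, w_b=1.0):
    # Placeholder weighting of one contrastive loss and two SSL losses;
    # the actual objectives and weights in TIPS may differ.
    return clip_l + w_a * ssl_a + w_b * ssl_b

# Usage with random embeddings (illustration only).
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_contrastive_loss(img, txt))
```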
Transition Matching (TM) -- a discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. Claimed to be the first fully causal model to match or surpass the performance of flow-based methods on…
[1/n] New paper alert! 🚀 Excited to introduce 𝐓𝐫𝐚𝐧𝐬𝐢𝐭𝐢𝐨𝐧 𝐌𝐚𝐭𝐜𝐡𝐢𝐧𝐠 (𝐓𝐌)! We're replacing short-timestep kernels from Flow Matching/Diffusion with... a generative model🤯, achieving SOTA text-2-image generation! @urielsinger @itai_gat @lipmanya
0
0
5
We're taking a big step towards medical superintelligence. AI models have aced multiple-choice medical exams – but real patients don’t come with ABC answer options. Now MAI-DxO can solve some of the world’s toughest open-ended cases with higher accuracy and lower costs.
139
506
3K
Q-learning is not yet scalable https://t.co/hoYUdAAeGZ I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
37
194
1K
We're excited to launch Scouts — always-on AI agents that monitor the web for anything you care about.
139
174
3K
Seedance 1.0 has just claimed the #1 spot on the @ArtificialAnlys Video Arena Leaderboard! 🚀 Proudly topping the charts in BOTH “text-to-video” and “image-to-video” categories, Seedance 1.0 sets a new standard for AI video creation. Stay tuned as Seedance 1.0 will soon be…
23
21
185
Is the new king of video generation emerging? ByteDance’s Seed team is showing serious potential with their latest model, possibly outshining DeepMind’s Veo3, which was just announced weeks ago. I have been impressed by the consistent stream of inspiring work from Seed.
0
2
5
new paper from our work at Meta! **GPT-style language models memorize 3.6 bits per param**
we compute capacity by measuring total bits memorized, using some theory from Shannon (1953)
shockingly, the memorization-datasize curves look like this: [flattened ASCII sketch of the memorization-vs-dataset-size curve] (🧵)
83
377
3K
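To make the headline figure concrete, here is a quick back-of-the-envelope conversion from bits per parameter to total memorized content; the model sizes below are illustrative and not taken from the paper.

```python
# Back-of-the-envelope check of the "3.6 bits per parameter" figure.
# Model sizes here are made up for illustration only.
BITS_PER_PARAM = 3.6

for n_params in (125e6, 1.3e9, 7e9):
    total_bits = BITS_PER_PARAM * n_params
    total_mb = total_bits / 8 / 1e6  # bits -> bytes -> megabytes
    print(f"{n_params / 1e9:>5.2f}B params ≈ {total_mb:,.0f} MB of memorized content")
```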
A recent insight I gained is viewing AI research as a “max-performance domain”, which means that you can be world-class by being very good at only one part of your job. As long as you can create seminal impact (e.g., train the best model, start a new paradigm, or create…
27
55
668