Yushi Hu Profile
Yushi Hu

@huyushi98

Followers: 3K
Following: 2K
Media: 28
Statuses: 243

Research Scientist @Meta FAIR | PhD @uwnlp | Prev. @allen_ai @GoogleAI @UChicago | We still have a lot to do on multimodal intelligence

Seattle, WA
Joined November 2020
@huyushi98
Yushi Hu
7 months
Excited to see the image reasoning in o3 and o4-mini!!🤩 We introduced this idea a year ago in Visual Sketchpad ( https://t.co/v5TYvzyjGM). Glad to see @OpenAI baking this into their model through agentic RL. Great work! And yes, reasoning should be multimodal! Huge shoutout
@OpenAI
OpenAI
7 months
Introducing OpenAI o3 and o4-mini—our smartest and most capable models to date. For the first time, our reasoning models can agentically use and combine every tool within ChatGPT, including web search, Python, image analysis, file interpretation, and image generation.
2
8
70
@AIatMeta
AI at Meta
3 days
Today we’re excited to unveil a new generation of Segment Anything Models: 1️⃣ SAM 3 enables detecting, segmenting, and tracking objects across images and videos, now with short text phrases and exemplar prompts. 🔗 Learn more about SAM 3: https://t.co/tIwymSSD89 2️⃣ SAM 3D
107
588
4K
@AIatMeta
AI at Meta
3 days
Introducing SAM 3D, the newest addition to the SAM collection, bringing common sense 3D understanding of everyday images. SAM 3D includes two models: 🛋️ SAM 3D Objects for object and scene reconstruction 🧑‍🤝‍🧑 SAM 3D Body for human pose and shape estimation Both models achieve
128
1K
6K
@COLM_conf
Conference on Language Modeling
18 days
COLM Keynotes: Luke Zettlemoyer on Mixed-modal Language Modeling https://t.co/8FdhhrfOnG
0
20
149
@jiasenlu
Jiasen Lu
1 month
🚀Our code and weights are publicly available. We provide detailed examples; try it out at https://t.co/0jE0CrSTLY
@arankomatsuzaki
Aran Komatsuzaki
2 months
Apple presents AToken: A unified visual tokenizer • First tokenizer unifying images, videos & 3D • Shared 4D latent space (preserves both reconstruction & semantics) • Strong across gen & understanding tasks (ImageNet 82.2%, MSRVTT 32.6%, 3D acc 90.9%)
3
36
237
@huyushi98
Yushi Hu
1 month
We are hiring a PhD research intern (summer 2026) at Meta FAIR to work on frontier multimodal generation models! Apply here: https://t.co/M5htDrYUBR Feel free to DM me if you have any questions!
4
44
312
@gh_marjan
Marjan Ghazvininejad
1 month
We’re hiring a Research Intern (Summer 2026) at FAIR to advance the frontiers of Multimodal Generative AI. Learn more and apply here: https://t.co/aVRV04lp6V If you have questions, feel free to DM me or meet me at the RAM 2: Reasoning, Attention & Memory workshop at COLM.
3
41
349
@XiaochuangHan
Xiaochuang Han
1 month
Our team at Meta FAIR is hiring a PhD research intern for 2026. The topics broadly involve multimodal generative AI (e.g., video/image generation in addition to text), with flexible approaches across architecture/data/algorithms. Please apply via the link below, and feel free to
3
43
257
@__JohnNguyen__
John Nguyen
2 months
Transfusion combines autoregression with diffusion to train a single transformer, but what if we combine Flow with Flow? 🤔 🌊OneFlow🌊 is the first non-autoregressive model to generate text and images concurrently using a single transformer, unifying Edit Flow (text) with Flow
7
83
412
@taoyds
Tao Yu
3 months
As computer-use agents (CUAs) handle critical digital tasks, open research is key to studying their capabilities and risks. 🚀After a year, we release OpenCUA: 1) the largest CUA dataset/tool, 2) a training recipe, 3) a ~SOTA model on OSWorld. Released to drive transparent, safe CUA research!
@xywang626
Xinyuan Wang
3 months
We are super excited to release OpenCUA: the first 0-to-1 computer-use agent foundation model framework and an open-source SOTA model, OpenCUA-32B, matching top proprietary models on OSWorld-Verified, with full infrastructure and data. 🔗 [Paper] https://t.co/naBIDnyvYY 📌
2
27
133
@AIatMeta
AI at Meta
3 months
Introducing DINOv3: a state-of-the-art computer vision model trained with self-supervised learning (SSL) that produces powerful, high-resolution image features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense
360
789
5K
@OfficialLoganK
Logan Kilpatrick
4 months
Introducing Genie 3, the most advanced world simulator ever created, enabled by numerous research breakthroughs. 🤯 Featuring high fidelity visuals, 20-24 fps, prompting on the go, world memory, and more.
563
1K
9K
@orevaahia
Oreva Ahia
4 months
🎉 We’re excited to introduce BLAB: Brutally Long Audio Bench, the first benchmark for evaluating long-form reasoning in audio LMs across 8 challenging tasks, using 833+ hours of Creative Commons audio (average length: 51 minutes).
2
50
178
@ShivamDuggal4
Shivam Duggal
4 months
Compression is the heart of intelligence. From Occam to Kolmogorov: shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, a token budget T, and a target quality 𝜖, KARL finds the smallest t ≤ T to reconstruct it within 𝜖🧵
14
63
358
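A minimal sketch of the selection criterion the KARL tweet above describes: pick the smallest token count t ≤ T whose reconstruction stays within the target quality 𝜖. The `encode`, `decode`, and `recon_error` callables below are hypothetical placeholders rather than the paper's API, and KARL itself learns to allocate the budget instead of searching at inference time.

```python
# Hedged sketch of "smallest t <= T that reconstructs within eps".
# encode/decode/recon_error are assumed placeholders, not KARL's actual code.
from typing import Callable

def smallest_sufficient_budget(
    image,
    T: int,
    eps: float,
    encode: Callable,       # encode(image, t) -> first t tokens
    decode: Callable,       # decode(tokens) -> reconstructed image
    recon_error: Callable,  # recon_error(image, recon) -> float
) -> int:
    """Linear scan for the smallest token budget t <= T within error eps."""
    for t in range(1, T + 1):
        recon = decode(encode(image, t))      # reconstruct from t tokens
        if recon_error(image, recon) <= eps:  # good enough: stop at smallest t
            return t
    return T                                  # fall back to the full budget
```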
@Michael3014018
Chenhao Zheng
5 months
Having trouble dealing with the excessive token count when processing a video? Check out our paper, accepted to ICCV 2025 with an average score of 5.5! We tokenize video with tokens grounded in the trajectories of all objects rather than fixed-size patches. Trained with a
1
26
112
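To make the trajectory-grounded tokenization idea above concrete, here is an illustrative Python sketch only: it pools per-frame patch features along each object's track to form one token per object, instead of one token per fixed-size patch. All tensor shapes and the pooling scheme are assumptions for illustration, not the paper's actual tokenizer.

```python
# Illustrative sketch (not the paper's method): one token per object trajectory
# via masked mean-pooling of patch features along the track.
import torch

def trajectory_tokens(frame_features: torch.Tensor,
                      track_masks: torch.Tensor) -> torch.Tensor:
    """
    frame_features: (T, H, W, D) per-frame patch features (assumed layout)
    track_masks:    (N, T, H, W) boolean masks, one per object trajectory
    returns:        (N, D) one token per object trajectory
    """
    T, H, W, D = frame_features.shape
    feats = frame_features.reshape(T * H * W, D)            # flatten space-time
    masks = track_masks.reshape(track_masks.shape[0], -1)   # (N, T*H*W)
    weights = masks.float()
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1.0)
    return weights @ feats                                   # masked mean pool per track
```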
@huyushi98
Yushi Hu
6 months
When cooking multimodal models, one big headache I found is that the evaluation benchmarks are not reliable, especially for tasks like interleaved generation.😢 The authors put a big effort into reliability, crafting one eval pipeline for each task. 🔥For example, they even
@jihan_yao
Jihan Yao
6 months
We introduce MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation ✅ Reliable: 94.3% agreement with human judgment ✅ Comprehensive: 4 modality combinations × 49 tasks × 937 instructions 🔍Results and Takeaways: > GPT-Image-1 from @OpenAI
0
1
13
@jihan_yao
Jihan Yao
6 months
We introduce MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation ✅ Reliable: 94.3% agreement with human judgment ✅ Comprehensive: 4 modality combinations × 49 tasks × 937 instructions 🔍Results and Takeaways: > GPT-Image-1 from @OpenAI
2
22
29
@huyushi98
Yushi Hu
6 months
The most shocking figure: how can an LM calculate 3√13 correctly up to 15 decimal places? Unless something in pretraining is very similar... The paper is not questioning RLVR: correct rewards work for all LMs. It just questions studies that experimented ONLY with Qwen.
@StellaLisy
Stella Li
6 months
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: +28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: https://t.co/jBPlm7cyhr
1
5
52
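To make "correct up to 15 decimal places" concrete, here is a small standard-library reference computation of 3√13 at high precision; it is just a sanity check of the arithmetic mentioned above, not anything from the paper.

```python
# Reference value for 3*sqrt(13), computed with the stdlib decimal module so an
# LM's answer can be checked digit by digit.
from decimal import Decimal, getcontext

getcontext().prec = 30            # 30 significant digits, well past 15 decimals
value = 3 * Decimal(13).sqrt()
print(value)                      # high-precision reference for 3*sqrt(13)
```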