Yushi Hu Profile
Yushi Hu

@huyushi98

Followers: 3K
Following: 2K
Media: 28
Statuses: 243

Research Scientist @Meta FAIR | PhD @uwnlp | Prev. @allen_ai @GoogleAI @UChicago | We still have a lot to do on multimodal intelligence

Seattle, WA
Joined November 2020
@huyushi98
Yushi Hu
7 months
Excited to see the image reasoning in o3 and o4-mini!!🤩 We introduced this idea a year ago in Visual Sketchpad ( https://t.co/v5TYvzyjGM). Glad to see @OpenAI baking this into their model through agentic RL. Great work! And yes, reasoning should be multimodal! Huge shoutout
@OpenAI
OpenAI
7 months
Introducing OpenAI o3 and o4-mini—our smartest and most capable models to date. For the first time, our reasoning models can agentically use and combine every tool within ChatGPT, including web search, Python, image analysis, file interpretation, and image generation.
2
8
70
@AIatMeta
AI at Meta
3 days
Today we’re excited to unveil a new generation of Segment Anything Models: 1️⃣ SAM 3 enables detecting, segmenting, and tracking objects across images and videos, now with short text phrases and exemplar prompts. 🔗 Learn more about SAM 3: https://t.co/tIwymSSD89 2️⃣ SAM 3D
107
588
4K
@AIatMeta
AI at Meta
3 days
Introducing SAM 3D, the newest addition to the SAM collection, bringing common sense 3D understanding of everyday images. SAM 3D includes two models: 🛋️ SAM 3D Objects for object and scene reconstruction 🧑‍🤝‍🧑 SAM 3D Body for human pose and shape estimation Both models achieve
128
1K
6K
@COLM_conf
Conference on Language Modeling
18 days
COLM Keynotes: Luke Zettlemoyer on Mixed-modal Language Modeling https://t.co/8FdhhrfOnG
0
20
149
@jiasenlu
Jiasen Lu
1 month
🚀Our code and weights are publicly available. We provide detailed examples; try it out at https://t.co/0jE0CrSTLY
@arankomatsuzaki
Aran Komatsuzaki
2 months
Apple presents AToken: A unified visual tokenizer • First tokenizer unifying images, videos & 3D • Shared 4D latent space (preserves both reconstruction & semantics) • Strong across gen & understanding tasks (ImageNet 82.2%, MSRVTT 32.6%, 3D acc 90.9%)
3
36
237
@huyushi98
Yushi Hu
1 month
We are hiring a PhD research intern (summer 2026) at Meta FAIR to work on frontier multimodal generation models! Apply here: https://t.co/M5htDrYUBR Feel free to DM me if you have any questions!
4
44
312
@gh_marjan
Marjan Ghazvininejad
1 month
We’re hiring a Research Intern (Summer 2026) at FAIR to advance the frontiers of Multimodal Generative AI. Learn more and apply here: https://t.co/aVRV04lp6V If you have questions, feel free to DM me or meet me at the RAM 2: Reasoning, Attention & Memory workshop at COLM.
3
41
349
@XiaochuangHan
Xiaochuang Han
1 month
Our team at Meta FAIR is hiring a PhD research intern for 2026. The topics broadly involve multimodal generative AI (e.g., video/image generation in addition to text), with flexible approaches across architecture/data/algorithms. Please apply via the link below, and feel free to
3
43
257
@__JohnNguyen__
John Nguyen
2 months
Transfusion combines autoregression with diffusion to train a single transformer, but what if we combine Flow with Flow? 🤔 🌊OneFlow🌊 is the first non-autoregressive model to generate text and images concurrently using a single transformer, unifying Edit Flow (text) with Flow
7
83
412
@taoyds
Tao Yu
3 months
As computer-use agents (CUAs) handle critical digital tasks, open research is key to studying their capabilities and risks. 🚀After a year, we release OpenCUA: 1) the largest CUA dataset/tool, 2) a training recipe, 3) a ~SOTA model on OSWorld. Released to drive transparent, safe CUA research!
@xywang626
Xinyuan Wang
3 months
We are super excited to release OpenCUA: the first 0-to-1 computer-use agent foundation model framework and an open-source SOTA model, OpenCUA-32B, matching top proprietary models on OSWorld-Verified, with full infrastructure and data. 🔗 [Paper] https://t.co/naBIDnyvYY 📌
2
27
133
@AIatMeta
AI at Meta
3 months
Introducing DINOv3: a state-of-the-art computer vision model trained with self-supervised learning (SSL) that produces powerful, high-resolution image features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense
360
789
5K
@OfficialLoganK
Logan Kilpatrick
4 months
Introducing Genie 3, the most advanced world simulator ever created, enabled by numerous research breakthroughs. 🤯 Featuring high fidelity visuals, 20-24 fps, prompting on the go, world memory, and more.
563
1K
9K
@orevaahia
Oreva Ahia
4 months
🎉 We’re excited to introduce BLAB: Brutally Long Audio Bench, the first benchmark for evaluating long-form reasoning in audio LMs across 8 challenging tasks, using 833+ hours of Creative Commons audio (average length: 51 minutes).
2
50
178
@ShivamDuggal4
Shivam Duggal
4 months
Compression is the heart of intelligence. From Occam to Kolmogorov: shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, a token budget T, and a target quality 𝜖, KARL finds the smallest t ≤ T to reconstruct it within 𝜖🧵
14
63
358
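A minimal sketch of the selection criterion the KARL tweet above describes: pick the smallest token count t ≤ T whose reconstruction stays within the target quality 𝜖. The `encode`, `decode`, and `recon_error` callables below are hypothetical placeholders rather than the paper's API, and KARL itself learns to allocate the budget instead of searching at inference time.

```python
# Hedged sketch of "smallest t <= T that reconstructs within eps".
# encode/decode/recon_error are assumed placeholders, not KARL's actual code.
from typing import Callable

def smallest_sufficient_budget(
    image,
    T: int,
    eps: float,
    encode: Callable,       # encode(image, t) -> first t tokens
    decode: Callable,       # decode(tokens) -> reconstructed image
    recon_error: Callable,  # recon_error(image, recon) -> float
) -> int:
    """Linear scan for the smallest token budget t <= T within error eps."""
    for t in range(1, T + 1):
        recon = decode(encode(image, t))      # reconstruct from t tokens
        if recon_error(image, recon) <= eps:  # good enough: stop at smallest t
            return t
    return T                                  # fall back to the full budget
```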
@Michael3014018
Chenhao Zheng
5 months
Having trouble dealing with the excessive token count when processing a video? Check out our paper, accepted to ICCV 2025 with an average score of 5.5! We tokenize video with tokens grounded in the trajectories of all objects rather than fixed-size patches. Trained with a
1
26
112
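To make the trajectory-grounded tokenization idea above concrete, here is an illustrative Python sketch only: it pools per-frame patch features along each object's track to form one token per object, instead of one token per fixed-size patch. All tensor shapes and the pooling scheme are assumptions for illustration, not the paper's actual tokenizer.

```python
# Illustrative sketch (not the paper's method): one token per object trajectory
# via masked mean-pooling of patch features along the track.
import torch

def trajectory_tokens(frame_features: torch.Tensor,
                      track_masks: torch.Tensor) -> torch.Tensor:
    """
    frame_features: (T, H, W, D) per-frame patch features (assumed layout)
    track_masks:    (N, T, H, W) boolean masks, one per object trajectory
    returns:        (N, D) one token per object trajectory
    """
    T, H, W, D = frame_features.shape
    feats = frame_features.reshape(T * H * W, D)            # flatten space-time
    masks = track_masks.reshape(track_masks.shape[0], -1)   # (N, T*H*W)
    weights = masks.float()
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1.0)
    return weights @ feats                                   # masked mean pool per track
```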
@huyushi98
Yushi Hu
6 months
When cooking multimodal models, one big headache I found is that the evaluation benchmarks are not reliable, especially for tasks like interleaved generation.😢 The authors put a big effort into reliability, crafting one eval pipeline for each task. 🔥For example, they even
@jihan_yao
Jihan Yao
6 months
We introduce MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation ✅ Reliable: 94.3% agreement with human judgment ✅ Comprehensive: 4 modality combinations × 49 tasks × 937 instructions 🔍Results and Takeaways: > GPT-Image-1 from @OpenAI
0
1
13
@jihan_yao
Jihan Yao
6 months
We introduce MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation ✅ Reliable: 94.3% agreement with human judgment ✅ Comprehensive: 4 modality combinations × 49 tasks × 937 instructions 🔍Results and Takeaways: > GPT-Image-1 from @OpenAI
2
22
29
@huyushi98
Yushi Hu
6 months
The most shocking figure: how can an LM calculate 3√13 correctly up to 15 decimal places? Unless something in pretraining is very similar... The paper is not questioning RLVR: correct rewards work for all LMs. It just questions studies that experimented ONLY with Qwen.
@StellaLisy
Stella Li
6 months
🤯 We cracked RLVR with... Random Rewards?! Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by: - Random rewards: +21% - Incorrect rewards: +25% - (FYI) Ground-truth rewards: +28.8% How could this even work⁉️ Here's why: 🧵 Blogpost: https://t.co/jBPlm7cyhr
1
5
52
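To make "correct up to 15 decimal places" concrete, here is a small standard-library reference computation of 3√13 at high precision; it is just a sanity check of the arithmetic mentioned above, not anything from the paper.

```python
# Reference value for 3*sqrt(13), computed with the stdlib decimal module so an
# LM's answer can be checked digit by digit.
from decimal import Decimal, getcontext

getcontext().prec = 30            # 30 significant digits, well past 15 decimals
value = 3 * Decimal(13).sqrt()
print(value)                      # high-precision reference for 3*sqrt(13)
```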