Justin Cho 조현동

@HJCH0

Followers
926
Following
3K
Media
92
Statuses
580

Contextualizing Human-AI Interactions. NLP PhD candidate @USC_ISI

Los Angeles
Joined October 2018
@HJCH0
Justin Cho 조현동
12 days
Can you tell what actions are being mimed in this video? If so, you’re smarter than AI models! Check the last tweet in this thread for answers. In a new paper, we present MIME, which evaluates whether vision language models (VLMs) have a robust understanding of human actions. 🧵
1
7
20
@HJCH0
Justin Cho 조현동
12 days
Answers for the actions in the teaser video:
1. barbell back squat
2. playing guitar
3. baseball swing
4. put on seatbelt
5. bowling
6. baseball pitch
7. swimming
8. playing piano
9. boxing
10. open door
11. fishing
12. driving
13. pushing
14. archery
Let us know how you did! 13/n
0
0
1
@HJCH0
Justin Cho 조현동
12 days
That's a wrap, thank you!
📝 Check out our paper "Can Vision Language Models Understand Mimed Actions?", accepted to ACL 2025 Findings, for all the details:
🤗 MIME is on HuggingFace:
👨‍💻 Code:
12/n
1
0
2
@HJCH0
Justin Cho 조현동
12 days
📢 We'll post monthly updates on the best-performing model until we reach human-level performance on the current version of MIME! Check the project page for details on how to submit. 11/n
1
0
1
@HJCH0
Justin Cho 조현동
12 days
🏆 Join the MIME leaderboard! We encourage VLM developers to test their models on MIME to evaluate whether they actually understand actions or whether they're making guesses based on contextual hints. 10/n
1
0
1
@HJCH0
Justin Cho 조현동
12 days
✨ Results on MIME suggest that we need to rethink how we train VLMs to equip them with an actual understanding of human actions so that they can be safely used for tasks that require it. This has huge implications for the safety and accessibility of VLM-powered applications! 9/n
1
0
1
@HJCH0
Justin Cho 조현동
12 days
Why do these models fail at what is seemingly so effortless for us? Chain-of-thought with Gemini 1.5 Flash reveals that in ~70% of cases, failures are attributed to incorrect observations of the shown action. This reaffirms that models guess from context, not the motion. 8/n
1
0
1
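A minimal sketch of how failed predictions' reasoning traces might be bucketed into error categories like the ~70% "incorrect observation" figure mentioned above; the category names, labels, and data format are assumptions for illustration, not the paper's analysis procedure.

```python
from collections import Counter

# Hypothetical error categories for failed predictions, loosely inspired by the finding
# that most failures stem from incorrectly observing the shown motion.
CATEGORIES = ("incorrect_observation", "correct_observation_wrong_label", "other")

def failure_breakdown(labeled_failures: list[str]) -> dict[str, float]:
    """labeled_failures: one category per failed example, assigned by a human or a judge model."""
    counts = Counter(labeled_failures)
    total = len(labeled_failures) or 1
    return {category: counts.get(category, 0) / total for category in CATEGORIES}

if __name__ == "__main__":
    labels = ["incorrect_observation"] * 7 + ["correct_observation_wrong_label"] * 2 + ["other"]
    print(failure_breakdown(labels))  # ~70% incorrect_observation, mirroring the reported trend
```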
@HJCH0
Justin Cho 조현동
12 days
We explore whether chain-of-thought and few-shot prompting can help. They provide some gains with the multiple-choice format, but less so for few-shot, where performance remains agonizingly low. 7/n
1
0
1
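For readers unfamiliar with the two strategies mentioned here, the sketch below shows generic chain-of-thought and few-shot prompt construction; the exact wording and the example bank are illustrative assumptions, not the prompts used in the paper.

```python
def cot_prompt(question: str) -> str:
    """Chain-of-thought: ask the model to describe the motion before committing to an answer."""
    return (
        f"{question}\n"
        "First describe the body movements you observe step by step, "
        "then state the action being mimed."
    )

def few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: prepend worked examples (clip description, gold answer) to the question."""
    shots = "\n\n".join(f"Example clip: {desc}\nAction: {answer}" for desc, answer in examples)
    return f"{shots}\n\n{question}"

if __name__ == "__main__":
    q = "Which action is being mimed in the video?"
    print(cot_prompt(q))
    print(few_shot_prompt(q, [("a person swings both arms as if holding a bat", "baseball swing")]))
```

In the actual benchmark the few-shot examples would presumably be video-answer pairs rather than text descriptions; text stands in here only to keep the sketch self-contained.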
@HJCH0
Justin Cho 조현동
12 days
🤖 VLMs? Not even close. In the most challenging setting, no model even reaches 10% accuracy! They perform much better with background clues and with multiple-choice options. Take those away, and performance plummets. They're guessing from context, not understanding the motion. 6/n
1
0
1
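The comparison described here (with vs. without background clues, multiple choice vs. free form) can be sketched as a small evaluation harness; the setting names, the substring-match scoring, and the `model_fn` wrapper are simplifying assumptions, not the paper's protocol.

```python
from collections import defaultdict
from typing import Callable

def is_correct(prediction: str, gold_action: str) -> bool:
    # Simplified scoring assumption: the gold action string appears in the prediction.
    return gold_action.lower() in prediction.lower()

def evaluate(items: list[dict], settings: list[str],
             model_fn: Callable[[str, str], str]) -> dict[str, float]:
    """Accuracy per setting; model_fn(video_path, prompt) wraps whatever VLM is being tested."""
    correct = defaultdict(int)
    for item in items:
        for setting in settings:
            prediction = model_fn(item["video"][setting], item["prompt"][setting])
            correct[setting] += is_correct(prediction, item["gold"])
    return {setting: correct[setting] / len(items) for setting in settings}

if __name__ == "__main__":
    # Dry run with a stub model that always answers "bowling".
    items = [{
        "gold": "bowling",
        "video": {"freeform_plain": "v_plain.mp4", "freeform_real_bg": "v_bg.mp4"},
        "prompt": {"freeform_plain": "Which action is mimed?", "freeform_real_bg": "Which action is mimed?"},
    }]
    print(evaluate(items, ["freeform_plain", "freeform_real_bg"], lambda video, prompt: "bowling"))
```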
@HJCH0
Justin Cho 조현동
12 days
🙋 Humans nail it. They get almost 100% accuracy regardless of camera shifts, weird suits, or distracting backgrounds. No hints from multiple-choice options? No problem. Accuracy with the free-form format is only slightly lower than with multiple choice. 5/n
1
0
1
@HJCH0
Justin Cho 조현동
12 days
Using free characters from Mixamo and license-free background images from Wikimedia Commons, we create 10 variations of each of the 86 actions in MIME to test robustness to character, angle, and background perturbations. Ideally, the predicted actions should be consistent across variations! 4/n
1
0
1
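For a sense of what such a perturbation grid might look like (characters, camera angles, backgrounds per action), here is a small sketch; the asset pools, the sampling scheme, and the `sample_variations` helper are assumptions for illustration, not the authors' rendering pipeline.

```python
import itertools
import random

# Hypothetical asset pools standing in for Mixamo characters, camera setups,
# and Wikimedia Commons backgrounds.
CHARACTERS = ["character_a", "character_b", "character_c"]
CAMERA_ANGLES = ["front", "left_45", "right_45", "top_down"]
BACKGROUNDS = ["plain_grey", "kitchen", "street", "gym"]

def sample_variations(action: str, n_variations: int = 10, seed: int = 0) -> list[dict]:
    """Samples distinct (character, angle, background) combinations for one action."""
    rng = random.Random(seed)
    combos = list(itertools.product(CHARACTERS, CAMERA_ANGLES, BACKGROUNDS))
    picked = rng.sample(combos, k=min(n_variations, len(combos)))
    return [
        {"action": action, "character": c, "angle": a, "background": b}
        for c, a, b in picked
    ]

if __name__ == "__main__":
    for spec in sample_variations("baseball swing")[:3]:
        print(spec)  # each spec would drive one rendered video of the same mimed motion
```

Because the underlying motion capture stays the same across all variations, a model that truly understands the action should give the same answer for every rendering.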
@HJCH0
Justin Cho 조현동
12 days
MIME is created with motion capture data of mimed actions that is processed into animated videos. This setup lets us:
✅ flexibly alter videos for a systematic analysis of robustness &
✅ avoid test data leakage by easily creating previously unseen samples!
3/n
1
0
1
@HJCH0
Justin Cho 조현동
12 days
Mime Identification Multimodal Evaluation (MIME) is a video-based question-answering benchmark that tests mime identification: conveying intent using only movement and expression. Doing well on MIME requires understanding human actions *without* any salient context. 2/n
1
0
2
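To make the question-answering setup concrete, here is a minimal sketch of how a MIME-style item and its two query formats (free-form and multiple-choice) could be represented; the `MimeItem` structure, the `build_prompt` helper, and the file names are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class MimeItem:
    """Hypothetical representation of a single MIME-style item (not the official schema)."""
    video_path: str                     # animated clip of a mimed action, no real props
    gold_action: str                    # e.g. "bowling"
    choices: list[str] = field(default_factory=list)  # gold + distractors for multiple choice

def build_prompt(item: MimeItem, multiple_choice: bool) -> str:
    """Builds the text side of the query; the video is passed to the VLM separately."""
    if multiple_choice and item.choices:
        options = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(item.choices))
        return f"Which action is being mimed in the video?\n{options}\nAnswer with one option."
    # Free-form format: no options, so the model cannot guess from answer hints.
    return "In a few words, which action is being mimed in the video?"

if __name__ == "__main__":
    item = MimeItem("clips/example.mp4", "bowling", ["bowling", "archery", "boxing", "fishing"])
    print(build_prompt(item, multiple_choice=True))
    print(build_prompt(item, multiple_choice=False))
```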
@HJCH0
Justin Cho 조현동
5 months
RT @USC_ISI: Our next NL seminar is this Thursday!. Justin Cho (@HJCH0) is a PhD Candidate at USC. In this talk, he'll present research tha….
0
1
0
@HJCH0
Justin Cho 조현동
5 months
save coffee for only when you really need it and it will do wonders.
1
0
5
@HJCH0
Justin Cho 조현동
7 months
RT @AlexanderSpangh: ✨✨✨Hello everyone, I’m on the faculty job market this year.✨✨✨ I’m completing my PhD at USC, where I study agentic pla….
0
20
0
@HJCH0
Justin Cho 조현동
8 months
RT @AlexanderSpangh: 🚨🚨🚨 New paper drop! 🚨🚨🚨 . If you’re a researcher, you’d probably like at least **some** of your work to get covered by….
0
6
0
@HJCH0
Justin Cho 조현동
8 months
I'm presenting this work during today's poster session from 10:30 AM-12 PM at EMNLP! Come by and say hi 👋
@HJCH0
Justin Cho 조현동
8 months
✨ EMNLP Paper ✨
Wouldn't it be great if we could also listen to LLM responses when we can't look at a screen?
Problem: LLMs generate responses without considering the unique constraints of speech 😢
🎉 Let's fix that with Speechworthy Instruction-tuned Language Models
0
2
19
@HJCH0
Justin Cho 조현동
8 months
This work was possible thanks to my great collaborators from @amazon and @USC_ISI!
Check out our paper:
All data and code will be available at
0
0
0
@HJCH0
Justin Cho 조현동
8 months
✅ That's a wrap! In this work, we've focused on what LLMs should say for speech-based interactions, but not how (timbre, speed, pitch, etc.) the response should be verbalized. We look forward to future work that focuses on the how and on multi-turn speech-based interactions!
1
0
0