Justin Cho 조현동

@HJCH0

Followers
926
Following
3K
Media
92
Statuses
580

Contextualizing Human-AI Interactions. NLP PhD candidate @USC_ISI

Los Angeles
Joined October 2018
@HJCH0
Justin Cho 조현동
12 days
Can you tell what actions are being mimed in this video? If so, you’re smarter than AI models! Check the last tweet in this thread for answers. In a new paper, we present MIME, which evaluates whether vision language models (VLMs) have a robust understanding of human actions. 🧵
1
7
20
@HJCH0
Justin Cho 조현동
12 days
Answers for the actions in the teaser video:
1. barbell back squat
2. playing guitar
3. baseball swing
4. put on seatbelt
5. bowling
6. baseball pitch
7. swimming
8. playing piano
9. boxing
10. open door
11. fishing
12. driving
13. pushing
14. archery
Let us know how you did! 13/n
0
0
1
@HJCH0
Justin Cho 조현동
12 days
That's a wrap, thank you!
📝 Check out our paper "Can Vision Language Models Understand Mimed Actions?", accepted to ACL 2025 Findings, for all the details:
🤗 MIME is on HuggingFace:
👨‍💻 Code:
12/n
1
0
2
@HJCH0
Justin Cho 조현동
12 days
📢 We'll post monthly updates on the best-performing model until we reach human-level performance on the current version of MIME! Check the project page for details on how to submit. 11/n
1
0
1
@HJCH0
Justin Cho 조현동
12 days
🏆 Join the MIME leaderboard! We encourage VLM developers to test their models on MIME to evaluate whether they actually understand actions or whether they're making guesses based on contextual hints. 10/n
1
0
1
@HJCH0
Justin Cho 조현동
12 days
✨ Results on MIME suggest that we need to rethink how we train VLMs to equip them with an actual understanding of human actions so that they can be safely used for tasks that require it. This has huge implications for the safety and accessibility of VLM-powered applications! 9/n
1
0
1
@HJCH0
Justin Cho 조현동
12 days
Why do these models fail at what is seemingly so effortless for us? Chain-of-thought with Gemini 1.5 Flash reveals that in ~70% of cases, failures are attributed to incorrect observations of the shown action. This reaffirms that models guess from context, not the motion. 8/n
1
0
1
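A minimal sketch of how failed predictions' reasoning traces might be bucketed into error categories like the ~70% "incorrect observation" figure mentioned above; the category names, labels, and data format are assumptions for illustration, not the paper's analysis procedure.

```python
from collections import Counter

# Hypothetical error categories for failed predictions, loosely inspired by the finding
# that most failures stem from incorrectly observing the shown motion.
CATEGORIES = ("incorrect_observation", "correct_observation_wrong_label", "other")

def failure_breakdown(labeled_failures: list[str]) -> dict[str, float]:
    """labeled_failures: one category per failed example, assigned by a human or a judge model."""
    counts = Counter(labeled_failures)
    total = len(labeled_failures) or 1
    return {category: counts.get(category, 0) / total for category in CATEGORIES}

if __name__ == "__main__":
    labels = ["incorrect_observation"] * 7 + ["correct_observation_wrong_label"] * 2 + ["other"]
    print(failure_breakdown(labels))  # ~70% incorrect_observation, mirroring the reported trend
```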
@HJCH0
Justin Cho 조현동
12 days
We explore whether chain-of-thought and few-shot prompting can help. They provide some gains with the multiple-choice format, but less so for few-shot, where performance remains agonizingly low. 7/n
1
0
1
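For readers unfamiliar with the two strategies mentioned here, the sketch below shows generic chain-of-thought and few-shot prompt construction; the exact wording and the example bank are illustrative assumptions, not the prompts used in the paper.

```python
def cot_prompt(question: str) -> str:
    """Chain-of-thought: ask the model to describe the motion before committing to an answer."""
    return (
        f"{question}\n"
        "First describe the body movements you observe step by step, "
        "then state the action being mimed."
    )

def few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: prepend worked examples (clip description, gold answer) to the question."""
    shots = "\n\n".join(f"Example clip: {desc}\nAction: {answer}" for desc, answer in examples)
    return f"{shots}\n\n{question}"

if __name__ == "__main__":
    q = "Which action is being mimed in the video?"
    print(cot_prompt(q))
    print(few_shot_prompt(q, [("a person swings both arms as if holding a bat", "baseball swing")]))
```

In the actual benchmark the few-shot examples would presumably be video-answer pairs rather than text descriptions; text stands in here only to keep the sketch self-contained.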
@HJCH0
Justin Cho 조현동
12 days
🤖 VLMs? Not even close. In the most challenging setting, no model even reaches 10% accuracy! They perform much better with background clues and with multiple-choice options. Take those away, and performance plummets. They're guessing from context, not understanding the motion. 6/n
1
0
1
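The comparison described here (with vs. without background clues, multiple choice vs. free form) can be sketched as a small evaluation harness; the setting names, the substring-match scoring, and the `model_fn` wrapper are simplifying assumptions, not the paper's protocol.

```python
from collections import defaultdict
from typing import Callable

def is_correct(prediction: str, gold_action: str) -> bool:
    # Simplified scoring assumption: the gold action string appears in the prediction.
    return gold_action.lower() in prediction.lower()

def evaluate(items: list[dict], settings: list[str],
             model_fn: Callable[[str, str], str]) -> dict[str, float]:
    """Accuracy per setting; model_fn(video_path, prompt) wraps whatever VLM is being tested."""
    correct = defaultdict(int)
    for item in items:
        for setting in settings:
            prediction = model_fn(item["video"][setting], item["prompt"][setting])
            correct[setting] += is_correct(prediction, item["gold"])
    return {setting: correct[setting] / len(items) for setting in settings}

if __name__ == "__main__":
    # Dry run with a stub model that always answers "bowling".
    items = [{
        "gold": "bowling",
        "video": {"freeform_plain": "v_plain.mp4", "freeform_real_bg": "v_bg.mp4"},
        "prompt": {"freeform_plain": "Which action is mimed?", "freeform_real_bg": "Which action is mimed?"},
    }]
    print(evaluate(items, ["freeform_plain", "freeform_real_bg"], lambda video, prompt: "bowling"))
```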
@HJCH0
Justin Cho 조현동
12 days
🙋 Humans nail it. They get almost 100% accuracy regardless of camera shifts, weird suits, or distracting backgrounds. No hints from multiple-choice options? No problem. Accuracy with the free-form format is only slightly lower than with multiple choice. 5/n
1
0
1
@HJCH0
Justin Cho 조현동
12 days
Using free characters from Mixamo and license-free background images from Wikimedia Commons, we create 10 variations of each of the 86 actions in MIME to test robustness to character, angle, and background perturbations. Ideally, the predicted actions should be consistent across variations! 4/n
1
0
1
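For a sense of what such a perturbation grid might look like (characters, camera angles, backgrounds per action), here is a small sketch; the asset pools, the sampling scheme, and the `sample_variations` helper are assumptions for illustration, not the authors' rendering pipeline.

```python
import itertools
import random

# Hypothetical asset pools standing in for Mixamo characters, camera setups,
# and Wikimedia Commons backgrounds.
CHARACTERS = ["character_a", "character_b", "character_c"]
CAMERA_ANGLES = ["front", "left_45", "right_45", "top_down"]
BACKGROUNDS = ["plain_grey", "kitchen", "street", "gym"]

def sample_variations(action: str, n_variations: int = 10, seed: int = 0) -> list[dict]:
    """Samples distinct (character, angle, background) combinations for one action."""
    rng = random.Random(seed)
    combos = list(itertools.product(CHARACTERS, CAMERA_ANGLES, BACKGROUNDS))
    picked = rng.sample(combos, k=min(n_variations, len(combos)))
    return [
        {"action": action, "character": c, "angle": a, "background": b}
        for c, a, b in picked
    ]

if __name__ == "__main__":
    for spec in sample_variations("baseball swing")[:3]:
        print(spec)  # each spec would drive one rendered video of the same mimed motion
```

Because the underlying motion capture stays the same across all variations, a model that truly understands the action should give the same answer for every rendering.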
@HJCH0
Justin Cho 조현동
12 days
MIME is created with motion capture data of mimed actions that is processed into animated videos. This setup lets us:
✅ flexibly alter videos for a systematic analysis of robustness &
✅ avoid test data leakage by easily creating previously unseen samples!
3/n
1
0
1
@HJCH0
Justin Cho 조현동
12 days
Mime Identification Multimodal Evaluation (MIME) is a video-based question-answering benchmark that tests mime identification: conveying intent using only movement and expression. Doing well on MIME requires understanding human actions *without* any salient context. 2/n
1
0
2
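To make the question-answering setup concrete, here is a minimal sketch of how a MIME-style item and its two query formats (free-form and multiple-choice) could be represented; the `MimeItem` structure, the `build_prompt` helper, and the file names are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class MimeItem:
    """Hypothetical representation of a single MIME-style item (not the official schema)."""
    video_path: str                     # animated clip of a mimed action, no real props
    gold_action: str                    # e.g. "bowling"
    choices: list[str] = field(default_factory=list)  # gold + distractors for multiple choice

def build_prompt(item: MimeItem, multiple_choice: bool) -> str:
    """Builds the text side of the query; the video is passed to the VLM separately."""
    if multiple_choice and item.choices:
        options = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(item.choices))
        return f"Which action is being mimed in the video?\n{options}\nAnswer with one option."
    # Free-form format: no options, so the model cannot guess from answer hints.
    return "In a few words, which action is being mimed in the video?"

if __name__ == "__main__":
    item = MimeItem("clips/example.mp4", "bowling", ["bowling", "archery", "boxing", "fishing"])
    print(build_prompt(item, multiple_choice=True))
    print(build_prompt(item, multiple_choice=False))
```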
@HJCH0
Justin Cho 조현동
5 months
RT @USC_ISI: Our next NL seminar is this Thursday!. Justin Cho (@HJCH0) is a PhD Candidate at USC. In this talk, he'll present research tha….
0
1
0
@HJCH0
Justin Cho 조현동
5 months
save coffee for only when you really need it and it will do wonders.
1
0
5
@HJCH0
Justin Cho 조현동
7 months
RT @AlexanderSpangh: ✨✨✨Hello everyone, I’m on the faculty job market this year.✨✨✨ I’m completing my PhD at USC, where I study agentic pla….
0
20
0
@HJCH0
Justin Cho 조현동
8 months
RT @AlexanderSpangh: 🚨🚨🚨 New paper drop! 🚨🚨🚨 . If you’re a researcher, you’d probably like at least **some** of your work to get covered by….
0
6
0
@HJCH0
Justin Cho 조현동
8 months
I'm presenting this work during today's poster session from 10:30 AM-12 PM at EMNLP! Come by and say hi 👋
@HJCH0
Justin Cho 조현동
8 months
✨ EMNLP Paper ✨
Wouldn't it be great if we could also listen to LLM responses when we can't look at a screen?
Problem: LLMs generate responses without considering the unique constraints of speech 😢
🎉 Let's fix that with Speechworthy Instruction-tuned Language Models
0
2
19
@HJCH0
Justin Cho 조현동
8 months
This work was possible thanks to my great collaborators from @amazon and @USC_ISI!
Check out our paper:
All data and code will be available at
0
0
0
@HJCH0
Justin Cho 조현동
8 months
✅ That's a wrap! In this work, we've focused on what LLMs should say for speech-based interactions, but not how (timbre, speed, pitch, etc.) the response should be verbalized. We look forward to future work that focuses on the how and on multi-turn speech-based interactions!
1
0
0