Jiwan Chung
@JiwanChung
Followers 88 · Following 199 · Media 8 · Statuses 24
Jiwan Chung, Ph.D. student @ Yonsei University. Researching multimodal machine learning, with a focus on VLMs.
Joined May 2022
🎉Our "FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games" is accepted to #EMNLP2025 Main!🎉 We introduce a benchmark of 2D Flash adventure games (room escape, mystery/detective, visual novel, management) for full story completion. 🧵
I won’t be attending ACL this year, but my excellent colleague Janghan Yoon will be presenting our paper during the poster session. If you have any questions, feel free to reach out to me at jiwan.chung.research@gmail.com.
ACON challenges a core assumption in multimodal generation: can your model preserve meaning—not just generate across formats? Now there's a way to find out. 📄 Paper: https://t.co/LunAmj4wPX 💻 data: https://t.co/Uc7WVFK2Ip 🧵[7/7]
Our findings suggest: Cross-modal generation ≠ consistent latent structure. Shared parameters don’t guarantee semantic coherence unless architectural and training signals support it. 🧵[6/7]
We test 4 any-to-any models (Emu3, Chameleon, Seed-X, VILA-U) vs. strong specialist pairs like SDXL + LLaVA.
🧪 What we find:
Most any-to-any models are not more consistent
Some (Seed-X, VILA-U) show better latent alignment
Shared parameters ≠ shared semantics.
🧵[5/7]
We define three types of consistency:
🔁 Cyclic consistency: T→I→T (or I→T→I) should reconstruct the input
📐 Forward equivariance: edits applied before and after a modality transfer should commute
🔄 Conjugated equivariance: edits routed through the other modality should preserve the intended meaning
🧵[4/7]
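For concreteness, here is a minimal sketch of how these three checks could be scored. The callables t2i, i2t, edit_t, edit_i, and the similarity function sim are hypothetical stand-ins, not ACON's actual evaluation code, and the conjugated-equivariance formulation is one possible reading of the definition above.

```python
# Minimal sketch of the three consistency checks (not the official ACON code).
# Assumed user-supplied callables:
#   t2i(text) -> image, i2t(image) -> text,
#   edit_t(text) -> text, edit_i(image) -> image,
#   sim(a, b) -> similarity score (e.g., QA accuracy or embedding similarity).

def cyclic_consistency(text, t2i, i2t, sim):
    # T -> I -> T: the round-trip caption should match the original text.
    return sim(text, i2t(t2i(text)))

def forward_equivariance(text, t2i, edit_t, edit_i, sim):
    # Edit-then-transfer vs. transfer-then-edit should agree (commute).
    return sim(t2i(edit_t(text)), edit_i(t2i(text)))

def conjugated_equivariance(text, t2i, i2t, edit_t, edit_i, sim):
    # Apply the edit in the other modality, map back, and compare with the
    # directly edited text (one possible reading of the definition above).
    round_trip_edit = i2t(edit_i(t2i(text)))
    return sim(round_trip_edit, edit_t(text))
```

In practice, sim could be an embedding similarity or a QA-based factual score computed from ACON's question set.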
We introduce ACON, a dataset designed to evaluate cross-modal consistency. It contains:
🖼 1,000 images (500 newly collected)
📝 dense human-written captions
🪄 editing prompts
❓ 10 QA pairs per sample for factual evaluation
🧵[3/7]
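To make the sample composition concrete, a single ACON record might look roughly like the sketch below; the field names are illustrative assumptions, not the released schema (see the data link above).

```python
# Hypothetical shape of one ACON sample (field names are assumptions).
acon_sample = {
    "image": "images/0001.jpg",          # one of the 1,000 images
    "caption": "A dense human-written description of the scene ...",
    "edit_prompt": "Replace the red car with a blue bicycle.",
    "qa_pairs": [                        # 10 factual QA pairs per sample
        {"question": "What color is the car?", "answer": "red"},
        # ... nine more
    ],
}
```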
Any-to-any models handle both text→image and image→text with shared parameters. If they learn a unified latent space, semantically equivalent inputs across modalities should yield consistent outputs. But this had never been rigorously tested—until now. 🧵[2/7]
[ACL 2025] Any-to-any models are often expected to be more coherent across modalities—since they handle image→text and text→image in one unified model. But does this hold up? We test it with ACON. 📄 Paper: https://t.co/5sDal7nx65 📷 data: https://t.co/wHQtAKaH3q
My team at Microsoft Research, working in multimodal AI, is hiring! Please apply if you are interested in working at the cutting edge of multimodal generative AI.
Let your model look again. 🔁 Point-and-copy is a simple yet powerful tool for MLLMs 🧠 Makes reasoning grounded, interpretable, and more human-like 📄 https://t.co/4EmioeHBti Follow for release updates! 🧵6/6
Results: Strong performance across multiple multimodal reasoning benchmarks.
v1 (7B) > all 7B baselines
v1 ≈ 72B models on MathVista, MathVision, MathVerse
🔥 Especially strong in visual math and fine-grained grounding
🧪 Ablation: turning off pointing drops performance by ~9%
Training requires patch-level grounding, but existing datasets lack such annotations. We built v1g, a 300K-example dataset with fine-grained regions (e.g., angle A, line BC), using an automated attention-based step-by-step grounding pipeline. 🧵4/6
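As one illustration of what an attention-based grounding step could look like (not the actual v1g pipeline), the sketch below thresholds a reasoning step's attention over image patches and converts the strongest patches into a normalized region box; the grid size and keep ratio are arbitrary assumptions.

```python
import numpy as np

def patches_to_box(attn, grid=24, keep_ratio=0.05):
    """Turn per-patch attention weights into a bounding box.

    attn: 1D array of attention weights over grid*grid image patches.
    Returns (x_min, y_min, x_max, y_max) in normalized [0, 1] coordinates.
    """
    attn = np.asarray(attn, dtype=float).reshape(grid, grid)
    thresh = np.quantile(attn, 1.0 - keep_ratio)   # keep the top ~5% of patches
    ys, xs = np.nonzero(attn >= thresh)
    return (xs.min() / grid, ys.min() / grid,
            (xs.max() + 1) / grid, (ys.max() + 1) / grid)
```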
How v1 enables 'looking again':
➕ Adds 2 linear heads to your existing MLLM.
👉 Points to relevant image regions dynamically during reasoning.
📋 Copies & injects visual features as input for the next reasoning step.
💡 This gives the model access to the visual patch again.
🧵3/6
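Here is a rough sketch of how the two added heads could be wired together, based only on the description in this thread; the module and parameter names are assumptions, and the released v1 implementation may differ.

```python
import torch
import torch.nn as nn

class PointAndCopy(nn.Module):
    """Two linear heads on top of an existing MLLM (illustrative sketch)."""

    def __init__(self, hidden_dim, patch_dim):
        super().__init__()
        self.point_head = nn.Linear(hidden_dim, patch_dim)  # scores image patches
        self.copy_head = nn.Linear(patch_dim, hidden_dim)   # re-injects the pick

    def forward(self, hidden_state, patch_features):
        # hidden_state:   (B, hidden_dim)    current reasoning-step state
        # patch_features: (B, N, patch_dim)  cached visual patch features
        query = self.point_head(hidden_state)                        # (B, patch_dim)
        scores = torch.einsum("bd,bnd->bn", query, patch_features)   # point
        attn = scores.softmax(dim=-1)
        pointed = torch.einsum("bn,bnd->bd", attn, patch_features)   # copy
        return self.copy_head(pointed)  # embedding injected into the next step
```

The point head scores cached patch features against the current hidden state, and the copy head projects the selected feature back into the LLM's input space so the next reasoning step can attend to that patch again.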
Most MLLMs encode the image once and never look back. We found they don’t actively attend to visual tokens during reasoning. But tasks like geometry need multiple looks. v1 enables dynamic re-grounding during reasoning, just like how humans solve visual problems. 🧵2/6
Don't look only once for multimodal reasoning 🧠. We introduce a new multimodal LLM framework, v1, that lets your MLLM look 👀 again—just like humans do. Paper: https://t.co/4EmioeHBti Code: https://t.co/KTIjwkhXF5 🧵1/6
📢I'm thrilled to announce that I’ll be joining @KAIST_AI as an Assistant Professor in 2026, leading the Computation & Cognition (COCO) Lab🤖🧠: https://t.co/ioG9cAs95H We'll be exploring reasoning, learning w/ synthetic data, and social agents! +I'm spending a gap year @nvidia✨
Using public datasets for AI model training may require more than just checking their own license terms. We present NEXUS, a data compliance system built around our AI agent, AutoCompliance, for full tracing of the data lifecycle. It enables comprehensive legal risk evaluation of
We are delighted to introduce NEXUS, an Agent AI system that tracks the lifecycle of training datasets used in AI models, comprehensively analyzes legal risks, and assesses potential threats related to dataset usage. NEXUS leverages our AutoCompliance agent to trace the full
🚨New Paper! So o3-mini and R1 seem to excel on math & coding. But how good are they on other domains where verifiable rewards are not easily available, such as theory of mind (ToM)? Do they show similar behavior patterns?🤔 What if I told you it's...interesting, like the below?🧵
Presenting our #EMNLP2024 work! "How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models" 📍 Riverfront Hall ⏱️ 11/14, Thu 2:00-3:30 pm
🎉 Happy to announce our previous work has been accepted to #EMNLP2024 Findings! --- 💥 Want to know how robust fact verification models can be without continual updating? 💥 We examine the limits of fact verification models using a knowledge transfer approach with large