 
            
Yeda Song (@__runamu__)
182 Followers · 309 Following · 3 Media · 20 Statuses
Multimodal Agents for the Real World: GUI Agents, VLM, and RL @ UMich 🇺🇸
Ann Arbor, Michigan, USA · Joined January 2022
            
           🔥 GUI agents struggle with real-world mobile tasks. We present MONDAY—a diverse, large-scale dataset built via an automatic pipeline that transforms internet videos into GUI agent data. ✅ VLMs trained on MONDAY show strong generalization ✅ Open data (313K steps) (1/7) 🧵 #CVPR
          
          
                
             🚨🚨New paper on core RL: a way to train value-functions via flow-matching for scaling compute! No text/images, but a flow directly on a scalar Q-value. This unlocks benefits of iterative compute, test-time scaling for value prediction & SOTA results on whatever we tried. 🧵⬇️ 
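A minimal sketch of what "a flow directly on a scalar Q-value" can look like in PyTorch, assuming a linear interpolation path and an Euler sampler; `QVelocityNet` and all hyperparameters are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch: flow matching on a scalar Q-value, conditioned on (s, a).
# Not the paper's code; the architecture and step counts are assumptions.
import torch
import torch.nn as nn

class QVelocityNet(nn.Module):
    """Predicts the flow velocity for a scalar q_t, conditioned on (s, a, t)."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, q_t, t, obs, act):
        return self.net(torch.cat([q_t, t, obs, act], dim=-1))

def flow_matching_loss(model, obs, act, q_target):
    """Conditional flow matching along the straight path from noise to target."""
    q0 = torch.randn_like(q_target)        # noise sample
    t = torch.rand(q_target.shape[0], 1)   # random time in [0, 1]
    q_t = (1 - t) * q0 + t * q_target      # point on the interpolation path
    v_target = q_target - q0               # constant velocity of that path
    return ((model(q_t, t, obs, act) - v_target) ** 2).mean()

@torch.no_grad()
def predict_q(model, obs, act, steps=8):
    """Euler integration from noise; more steps = more inference compute."""
    q = torch.randn(obs.shape[0], 1)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.shape[0], 1), i * dt)
        q = q + dt * model(q, t, obs, act)
    return q
```

The `steps` argument is where iterative compute shows up: spending more integration steps at inference refines the same value prediction, which is the test-time-scaling knob the thread mentions.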
          
                
             Flow Q-learning (FQL) is a simple method to train/fine-tune an expressive flow policy with RL. Come visit our poster at 4:30p-7p this Wed (evening session, 2nd day)! 
           Excited to introduce flow Q-learning (FQL)! Flow Q-learning is a *simple* and scalable data-driven RL method that trains an expressive policy with flow matching. Paper:  https://t.co/kjaeqHcBFh  Project page:  https://t.co/D8vFcZib1F  Thread ↓ 
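As background for "trains an expressive policy with flow matching," a hypothetical sketch of that ingredient: a state-conditioned velocity field fit to dataset actions, with actions drawn by integrating from noise. How FQL couples this with Q-learning is in the paper; `ActionFlow` and the step counts here are assumptions:

```python
# Illustrative sketch of a flow-matching policy (the BC ingredient only).
# Not the FQL implementation; names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class ActionFlow(nn.Module):
    """Velocity field over actions, conditioned on the observation and time."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.act_dim = act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, a_t, t, obs):
        return self.net(torch.cat([a_t, t, obs], dim=-1))

def bc_flow_loss(flow, obs, actions):
    """Flow-matching loss pushing noise samples toward dataset actions."""
    a0 = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    a_t = (1 - t) * a0 + t * actions
    return ((flow(a_t, t, obs) - (actions - a0)) ** 2).mean()

@torch.no_grad()
def sample_action(flow, obs, steps=10):
    """Integrate the learned velocity field from noise to draw an action."""
    a = torch.randn(obs.shape[0], flow.act_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.shape[0], 1), i * dt)
        a = a + dt * flow(a, t, obs)
    return a
```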
            
                
             ✨Two life updates✨ 1. Started my internship at @LG_AI_Research in Ann Arbor, Michigan — Advancing AI for a better life! 🔮 2. Advanced to PhD candidacy at UMich CSE. This means I’ve completed my coursework and passed the qualification process. 🙌 
          
                
The race for LLM "cognitive core" - a few billion param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing. Its features are slowly crystallizing: - Natively multimodal 
           I’m so excited to announce Gemma 3n is here! 🎉 🔊Multimodal (text/audio/image/video) understanding 🤯Runs with as little as 2GB of RAM 🏆First model under 10B with @lmarena_ai score of 1300+ Available now on @huggingface, @kaggle, llama.cpp,  https://t.co/CNDy479EEv,  and more 
            
                
             Can scaling data and models alone solve computer vision? 🤔 Join us at the SP4V Workshop at #ICCV2025 in Hawaii to explore this question! 🎤 Speakers: @danfei_xu, @joaocarreira, @jiajunwu_cs, Kristen Grauman, @sainingxie, @vincesitzmann 🔗  https://t.co/pH1Qjc1Kr2 
          
          
                
🚀 Excited to announce our 4th Workshop on Computer Vision in the Wild (CVinW) at @CVPR 2025! 🔗  https://t.co/Z5r48oh6iv  ⭐We have invited a great lineup of speakers: Prof. Kaiming He, Prof. @BoqingGo, Prof. @CordeliaSchmid, Prof. @RanjayKrishna, Prof. @sainingxie, Prof. 
          
                
             Arrived in Nashville for #CVPR 🤠 Excited to present MONDAY, a collaboration with @LG_AI_Research! 📍 MMFM Workshop - Thu, 9:40 AM 📍 Main Conference - Fri, 4:00 PM Let’s connect and chat!🤝 Also exploring Summer 2026 internships 🔍 MONDAY website: 
          
                
             MONDAY is right here for you: Open dataset & usage code 👉  https://t.co/rwJeaAz2t5  Big thanks to our amazing collaborators, @YunseokJANG, @sungryulls, @lajanugen, @tiangeluo, Dong-Ki Kim, Kyunghoon Bae, and @honglaklee. 🎸 Catch our poster presentations at #CVPR2025! (7/7) 
          
                
             And it works: 📈 Vision-language models trained on MONDAY show an average +18% gain on an unseen mobile OS, along with consistent boosts on AitW, AMEX, and our own test set. We evaluated this using SeeClick (9.6B) and Llama-3.2-11B-Vision-Instruct as baseline models. (6/7) 
          
                
             We achieved this with our robust, fully automated pipeline: 🔹 OCR-based scene detection (95% F1), outperforming vision-based approaches 🔹 Near-perfect UI element detection (99.9% hit rate) 🔹 Novel 3-step action identification using VLMs for precise, context-aware labels (5/7) 
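To make the first bullet concrete, a toy sketch of OCR-based scene detection: OCR consecutive frames and flag a transition when the on-screen text overlap drops. `pytesseract` and the 0.5 Jaccard threshold are assumptions for illustration, not the MONDAY pipeline:

```python
# Toy sketch of OCR-based scene-transition detection for screen recordings.
# Not the MONDAY pipeline; the OCR engine and threshold are assumptions.
import pytesseract
from PIL import Image

def ocr_tokens(frame_path):
    """Extract a bag of OCR tokens from one video frame."""
    text = pytesseract.image_to_string(Image.open(frame_path))
    return set(text.lower().split())

def is_scene_change(prev_tokens, curr_tokens, threshold=0.5):
    """Flag a transition when on-screen text overlap (Jaccard) drops."""
    if not prev_tokens and not curr_tokens:
        return False
    union = prev_tokens | curr_tokens
    overlap = len(prev_tokens & curr_tokens) / max(len(union), 1)
    return overlap < threshold

def detect_scenes(frame_paths):
    """Return indices of frames where a new UI screen likely begins."""
    scenes, prev = [0], ocr_tokens(frame_paths[0])
    for i, path in enumerate(frame_paths[1:], start=1):
        curr = ocr_tokens(path)
        if is_scene_change(prev, curr):
            scenes.append(i)
        prev = curr
    return scenes
```

The appeal of text overlap over pixel differencing is that mobile screens scroll and animate constantly; the visible text changes far less than the pixels do, so it makes a steadier signal for "same screen vs. new screen."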
          
                
             MONDAY solves this by turning internet videos into useful data: 📱Real-world and diverse 🔁 Easy to expand with new videos 💸17× cheaper than manual annotation ($0.34 vs $5.76/video) No manual annotation. No system access needed. Just authentic human interactions at scale. (4/7) 
          
                
             GUI agents fail in the wild because existing training datasets ❌ lack diversity across mobile OS platforms, apps, & user configs ❌ get quickly outdated ❌ are too costly to scale (3/7) 
          
                
             "Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents" Project:  https://t.co/rwJeaAz2t5  Code:  https://t.co/qxwrM15AMX  Data:  https://t.co/mjTvnStiIG  Paper:  https://t.co/fA4IofYTgX  (2/7) #GUIAgent #CUA #CVPR #CVPR2025
          
          
            
I finally wrote another blogpost:  https://t.co/WddJkbSfks  AI just keeps getting better over time, but NOW is a special moment that I call “the halftime”. Before it, training > eval. After it, eval > training. The reason: RL finally works. Lmk ur feedback so I’ll polish it. 
          
                
             LLM chatbots are moving fast, but how do we make them better? In my new blog at The Gradient, I argue that an important next step is giving them a sense of "purpose." 
          
                
I love our Michigan AI Lab @michigan_AI! A group of people who not only do some of the coolest research in AI, but also care for each other and enjoy each other’s company. A picture from this week’s fun picnic. ❤️ 
          
                
             Glad to share our work at #ACL2023, "MPChat: Towards Multimodal Persona-Grounded Conversation"  https://t.co/S8US4LaYr5  ! #multimodal #persona_chat authors: @AHNJAEWOO2, @__runamu__, Gunhee Kim 
          
            
arxiv.org: In order to build self-consistent personalized dialogue agents, previous research has mostly focused on textual persona that delivers personal facts or personalities. However, to fully describe...
            
                
             
             
             
             
               
             
             
             
             
             
             
            