Daeun Lee (@danadaeun)
Followers: 506 · Following: 441 · Media: 24 · Statuses: 430
PhD student @unccs advised by @mohitban47 | Intern @AdobeResearch | Multimodal, Video, Embodied AI, Post-training
United States · Joined February 2024
We rely on gaze to guide our actions, but can current MLLMs truly understand it and infer our intentions? Introducing StreamGaze, the first benchmark that evaluates gaze-guided temporal reasoning (past, present, and future) and proactive understanding in streaming video.
1 reply · 23 reposts · 41 likes
We introduce Soft Adaptive Policy Optimization (SAPO), a smooth, stable, and highly effective RL method for training large language models. Why SAPO?
• Hard clipping is brittle: gradients vanish or explode
• MoE models amplify variance, making training even more unstable
arxiv.org
Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains...
17 replies · 124 reposts · 947 likes
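As a rough illustration of the contrast this SAPO announcement draws, here is a minimal sketch of PPO-style hard clipping next to a generic smooth ("soft") surrogate. This is not the paper's objective; the tanh gate and its scale are assumptions chosen only to show why a smooth surrogate keeps gradients alive near the clip boundary instead of cutting them off.

import torch

def ppo_hard_clip(logp_new, logp_old, advantages, eps=0.2):
    # Standard PPO clipped surrogate: once the probability ratio leaves
    # [1 - eps, 1 + eps] and the clipped branch is the smaller one, the
    # gradient with respect to logp_new drops to exactly zero.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()

def soft_clip_surrogate(logp_new, logp_old, advantages, eps=0.2):
    # Hypothetical smooth alternative (NOT the SAPO objective): a tanh
    # gate matches the identity near ratio = 1 but saturates gradually
    # toward 1 +/- eps, so gradients shrink smoothly instead of vanishing.
    ratio = torch.exp(logp_new - logp_old)
    soft_ratio = 1.0 + eps * torch.tanh((ratio - 1.0) / eps)
    return (soft_ratio * advantages).mean()

# Tiny usage example with fake data: compare gradient magnitudes.
logp_new = torch.randn(8, requires_grad=True)
logp_old = torch.randn(8)
adv = torch.randn(8)
for surrogate in (ppo_hard_clip, soft_clip_surrogate):
    loss = -surrogate(logp_new, logp_old, adv)  # we maximize the surrogate
    loss.backward()
    print(surrogate.__name__, logp_new.grad.abs().mean().item())
    logp_new.grad = None

The actual SAPO formulation is in the linked arXiv paper; the sketch only motivates why replacing a hard clip with a smooth one can stabilize training, which the tweet argues matters even more when MoE routing amplifies variance.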
Holy shit… Meta might've just solved self-improving AI. Their new paper SPICE (Self-Play in Corpus Environments) basically turns a language model into its own teacher: no humans, no labels, no datasets, just the internet as its training ground. Here's the twist: one copy of…
37 replies · 67 reposts · 448 likes
Check out my labmate Ziyang's impressive long-video reasoning agentic framework! Similar to human perception, this Active Video Perception system integrates planning, observation, and reflection across multiple agents, effectively leveraging temporal evidence for long-video understanding.
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding. Introducing Active Video Perception: an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence.
1 reply · 1 repost · 6 likes
Introducing AVP - our new multimodal agent for long video understanding!
[Quoted tweet: the Active Video Perception announcement above]
5 replies · 11 reposts · 42 likes
Super excited about the idea - so simple yet so smart and powerful! The old passive video-perception setup just doesn't make sense anymore. Grabbing all visual info once, with fixed granularity and no query awareness, is inefficient and overloads the model. So we built Active Video Perception.
[Quoted tweet: the Active Video Perception announcement above]
2 replies · 6 reposts · 17 likes
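For intuition, here is a toy sketch of the plan/observe/reflect loop these Active Video Perception posts describe, treating the video as an environment that is probed a few moments at a time instead of being ingested in one fixed-granularity pass. The function names, the evidence format, and the stopping rule are illustrative assumptions, not the actual AVP interface.

from dataclasses import dataclass, field

@dataclass
class EvidencePool:
    notes: list = field(default_factory=list)  # compact, query-relevant observations

def active_perception_loop(query, video_len_s, plan, observe, reflect, max_rounds=5):
    # plan(query, evidence, video_len_s) -> timestamps worth inspecting next
    # observe(timestamps)                -> textual observations for those moments
    # reflect(query, evidence)           -> (enough_evidence, draft_answer)
    evidence = EvidencePool()
    answer = None
    for _ in range(max_rounds):
        timestamps = plan(query, evidence, video_len_s)  # planning step
        evidence.notes.extend(observe(timestamps))       # observation step
        done, answer = reflect(query, evidence)          # reflection step
        if done:
            break
    return answer, evidence

# Dummy agents so the sketch runs end to end.
plan = lambda q, ev, total: [10.0 * len(ev.notes)]       # probe 10 s further each round
observe = lambda ts: [f"caption of the frame at {t:.0f}s" for t in ts]
reflect = lambda q, ev: (len(ev.notes) >= 3, f"answer drafted from {len(ev.notes)} notes")
print(active_perception_loop("who enters the kitchen?", 120.0, plan, observe, reflect))

The point the posts make is that observation is query-conditioned and incremental, so only the evidence the reflection step still needs is ever fetched.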
Soon: I will chat about multimodal AI with @nikparth1 @sainingxie @RanjayKrishna at the DCVLR workshop! Location: Upper Level Ballroom 6DE
1 reply · 5 reposts · 39 likes
You're in a Research Scientist interview at Google. Interviewer: We have a base LLM that's terrible at maths. How would you turn it into a maths & reasoning powerhouse? You: I'll get some problems labeled and fine-tune the model. Interview over. Here's what you missed:
38 replies · 70 reposts · 1K likes
Want to be an intern at Microsoft Research in the Computational Social Science group in NYC (Jake Hofman, David Rothschild, Dan Goldstein)? Follow this link and do your thing! Deadline approaching soonish! https://t.co/wKFBQmhLzt
apply.careers.microsoft.com
Research Interns put inquiry and theory into practice. Alongside fellow doctoral candidates and some of the world's best researchers, Research Interns learn, collaborate, and network for life....
7 replies · 28 reposts · 138 likes
Thanks so much for the overwhelming interest! Apologies if I can't respond to everyone right away; please keep the messages coming and apply on the portal! I'll do my best to reply over the next few days.
We're hiring amazing interns to join a focused, driven team pushing the frontier of agentic LLMs at Ai2: training, evaluation, tool-use, memory, safety, theory, and more. #NeurIPS2025 Apply here or message me:
1 reply · 1 repost · 13 likes
How does RL improve OOD reasoning? How can we distinguish compositional generalization from length generalization? What makes a composition more learnable? Check out our #neurips2025 workshop poster tomorrow! Sat, 12/6, 8am-5pm · Efficient Reasoning workshop · Exhibit Hall F (Spotlight)
arxiv.org
While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known...
0 replies · 25 reposts · 157 likes
Super interesting work! I like the idea of leveraging human gaze as a temporal prior for long-horizon egocentric reasoning. Excited to see benchmarks driving the next wave of grounded, real-time video reasoning! Check out more here:
[Quoted tweet: the StreamGaze announcement above]
1 reply · 3 reposts · 7 likes
Check out Daeun's cool new work with Adobe Research: StreamGaze, the first comprehensive benchmark for gaze-guided streaming video understanding. It's the first to test not only past/present comprehension but also proactive reasoning based on real-time human gaze in the streaming setting.
[Quoted tweet: the StreamGaze announcement above]
2 replies · 3 reposts · 9 likes
Excited to share our work on understanding streaming video. Check out our paper and dataset!
[Quoted tweet: the StreamGaze announcement above]
0 replies · 1 repost · 1 like
A neat VLM benchmark for gaze-guided streaming video understanding! For example, predicting user intents in real time with AR glasses.
[Quoted tweet: the StreamGaze announcement above]
0 replies · 5 reposts · 14 likes
If you are interested in research at NVIDIA, feel free to DM me for a chat! More about our team:
research.nvidia.com
Advancing foundational technologies enabling AI systems to perceive, model, and interact with the physical world.
8 replies · 15 reposts · 164 likes
Huge thanks for an amazing collaboration with Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai, @david_s_yoon, Trung Bui, @f_dernoncourt, and @mohitban47 (@AdobeResearch @unccs @unc_ai_group). Paper: https://t.co/f8ZJNUEejo · Project Page:
arxiv.org
Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior...
0 replies · 0 reposts · 4 likes
Additional Analysis 2: How do MLLMs use gaze signals during inference? We ablate the contributions of text-, gaze-, and visual-based reasoning in GPT-4o. Combining gaze and visual reasoning provides the best overall performance, though visual cues improve tasks like Scene…
1 reply · 0 reposts · 3 likes
Additional Analysis 1: Ablation of Gaze Input Prompting. We evaluate multiple strategies for injecting gaze signals into Qwen2.5-VL (7B). While salience maps outperform other prompting methods, a more adaptive and task-aware mechanism is needed to fully capture the diverse…
1 reply · 0 reposts · 4 likes
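As a rough sketch of the salience-map prompting that Additional Analysis 1 compares against other strategies: render a Gaussian heatmap around the gaze point and blend it onto the frame before it is shown to the MLLM. The sigma, blend weight, and overlay color here are assumptions; the paper's exact prompting formats are not reproduced.

import numpy as np

def gaze_salience_map(h, w, gaze_xy, sigma=40.0):
    # Gaussian salience map in [0, 1] centred on the (x, y) gaze point.
    ys, xs = np.mgrid[0:h, 0:w]
    gx, gy = gaze_xy
    d2 = (xs - gx) ** 2 + (ys - gy) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def overlay_gaze(frame, gaze_xy, alpha=0.5):
    # Blend a red salience overlay onto an HxWx3 uint8 frame; the result
    # would be passed to the VLM in place of the raw frame.
    h, w, _ = frame.shape
    sal = gaze_salience_map(h, w, gaze_xy)[..., None]  # HxWx1
    red = np.array([255.0, 0.0, 0.0])
    out = (1.0 - alpha * sal) * frame + alpha * sal * red
    return out.astype(np.uint8)

frame = np.zeros((480, 640, 3), dtype=np.uint8)        # stand-in video frame
prompted = overlay_gaze(frame, gaze_xy=(320, 240))
print(prompted.shape, prompted.dtype)                  # (480, 640, 3) uint8

The salience-map version is shown because it is the strategy the tweet reports as strongest; the other injection methods in the ablation are not reproduced here.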