Yubin Kim
@ybkim95_ai
Followers: 203 · Following: 44 · Media: 22 · Statuses: 31
PhD student @MIT conducting research on Agent(s), Health AI and Safety.
Cambridge, MA
Joined April 2024
To ensure validity, we recruited medical doctors for a two-phase evaluation. Phase 1: validating BehaviorBench itself (do the tasks make sense clinically?). Phase 2: comparing three agents for ranking: BehaviorSFT, General SFT, and Zero-Shot + Explicit Instruction.
To enhance the difficulty of our benchmark, we curated BehaviorBench-Hard, a subset of 297 challenging cases where multiple state-of-the-art models consistently fail.
Medical-purpose models are domain-specific LLMs such as Meditron. Agent-based systems (MedAgents, MDAgents) were powered by Gemini-2.5 Pro. No model achieves saturation, particularly on proactive tasks, validating the benchmark’s continued utility for driving future research.
We also conducted (left) a G-Eval across four axes and (right) a density-plot analysis of specificity and implicitness. The baseline model stays implicit and skips context, while General SFT becomes too explicit. BehaviorSFT lands in between, maintaining specificity while preserving conversational naturalness.
Across tasks and models, the BehaviorSFT model consistently matches or outperforms standard SFT on reactive tasks and shows notable gains on proactive ones. This suggests the model learns to generalize the 'initiative' skill without harming precision on literal questions.
Recent Health AI works either stick to medical-QA fine-tuning or head straight to multi-agent collaboration like MedAgents or MDAgents. Yet both extremes miss a middle ground: controllable, single-agent proactivity. Clinicians want assistants that are helpful but not intrusive.
BehaviorBench includes:
* Reactive tasks: can agents handle information when it is requested directly?
* Balanced tasks: initiated by specific information but demanding significant cognitive steps.
* Proactive tasks: require the LLM to use evaluative skills such as explicit error correction.
Previous medical benchmarks focus on knowledge QA and decision-making. In contrast, our BehaviorBench evaluates an agent's behavior adaptation in realistic clinical scenarios that contain text, image, and table modalities.
Can agents calibrate their own behavior in different contexts? Our BehaviorSFT paper introduces a benchmark and a training strategy with behavior tokens that lets agents adapt their proactivity, learning when to stay reactive and when to speak up. Arxiv: https://t.co/yf3724unaY
🤖 When and why should we use a single-agent vs. a multi-agent system? Our paper shows this decision can be made based on input complexity. MAS excels when agents can challenge and verify each other's reasoning in parallel, not just through a simple vote. Paper: https://t.co/ZvX41pe8mi
🤷‍♂️ When and why do Foundation Models hallucinate or confabulate in healthcare, and what's the real-world impact on medical practice? 💔 Our work tackles this urgent question, defining "Medical Hallucinations" AND revealing experimental results. Paper: https://t.co/tN3W8AGuin
I will be at #NeurIPS2024 from December 10-16. Thrilled to present our oral paper (MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making) on Friday, December 13th (15:50-16:10 PST). 🔍 Learn more: Project page: https://t.co/bYqxfLpm9Z
[Please RT📢] SEA Lab ( https://t.co/MKGUTnoOXs) is hiring 1 postdoc in Spring/Fall'25 and 1-2 PhD students in Fall'25! We build next-gen #HCI and #HAI techniques for health & medical applications. Visit the JOIN US page for more details; reading the FAQs before applying is highly recommended.
sea-lab.space
Developing the next generation of human-computer interaction and applied AI technologies for health.
@chanwoopark20 @HyewonMandyJ I am open to any form of collaboration on future work in the healthcare AI domain, especially multi-agent LLMs, healthcare AI, and wearable sensors. Also, I am actively looking for PhD positions this Fall.
@chanwoopark20 @HyewonMandyJ Our ablations show that the adaptive setting outperforms static complexity settings, with 81.2% accuracy on text-only queries. Most text-only queries were high complexity, while image+text and video+text queries were often low complexity, suggesting visual cues simplify decisions.
@chanwoopark20 @HyewonMandyJ Our findings show that MDAgents consistently reach consensus across different data modalities. Text+video modalities converge quickly, while text+image and text-only modalities show a more gradual alignment. Despite varying speeds, all modality cases eventually converged.
@chanwoopark20 @HyewonMandyJ Our ablations reveal that our approach can optimize performance with fewer agents (N=3), improves decision-making at extreme temperatures, and reduces computational costs, making it more efficient and adaptable than Solo and Group settings, especially in complex medical cases.
@chanwoopark20 @HyewonMandyJ Solo settings excel in simpler tasks, achieving up to 83.9% accuracy, while group settings outperform in complex, multi-modal tasks, with up to 91.9% accuracy.
@chanwoopark20 @HyewonMandyJ Surprisingly, our MDAgents significantly outperforms both Solo and Group setting methods, showing the best performance in 7 out of 10 benchmarks. It handles textual information with high precision alongside visual data.
@chanwoopark20 @HyewonMandyJ MDAgents follows five stages: 1) a medical complexity check to categorize the query; 2) expert recruitment, selecting a PCC for low and an MDT/ICT for moderate and high complexity; 3) initial assessment; 4) collaborative discussion between LLM agents; 5) final decision-making by a moderator.