Yubin Kim
@ybkim95_ai
Followers: 203 · Following: 44 · Media: 22 · Statuses: 31
PhD student @MIT conducting research on Agent(s), Health AI and Safety.
Cambridge, MA
Joined April 2024
To ensure validity, we recruited medical doctors for a two-phase evaluation. Phase 1: validating BehaviorBench itself (do the tasks make sense clinically?). Phase 2: comparing three agents for ranking: BehaviorSFT, General SFT, and Zero-Shot + Explicit Instruction.
To enhance the difficulty of our benchmark, we curated BehaviorBench-Hard, a subset of 297 challenging cases where multiple state-of-the-art models consistently fail.
Medical-purpose models are domain-specific LLMs such as Meditron. Agent-based systems (MedAgents, MDAgents) were powered by Gemini-2.5 Pro. No model achieves saturation, particularly on proactive tasks, validating the benchmark’s continued utility for driving future research.
We also conducted (left) a G-Eval across four axes and (right) a density-plot analysis of specificity and implicitness. The baseline model stays implicit and skips context, while General SFT becomes too explicit. BehaviorSFT lands in between, maintaining specificity while preserving conversational naturalness.
Across tasks and models, the BehaviorSFT model consistently matches or outperforms standard SFT on reactive tasks and shows notable gains on proactive ones. This suggests the model learns to generalize the 'initiative' skill without harming precision on literal questions.
Recent Health AI works either stick to medical-QA fine-tuning or head straight to multi-agent collaboration like MedAgents or MDAgents. Yet both extremes miss a middle ground: controllable, single-agent proactivity. Clinicians want assistants that are helpful but not intrusive.
BehaviorBench includes:
* Reactive tasks: can agents handle information when it is requested directly?
* Balanced tasks: initiated by specific information but demanding significant cognitive steps.
* Proactive tasks: require the LLM to use evaluative skills such as explicit error correction.
Previous medical benchmarks focus on knowledge QA and decision-making. In contrast, our BehaviorBench evaluates an agent's behavior adaptation in realistic clinical scenarios that contain text, image, and table modalities.
Can agents calibrate their own behavior in different contexts? Our BehaviorSFT paper introduces a benchmark and a training strategy with behavior tokens that lets agents adapt their proactivity, learning when to stay reactive and when to speak up. Arxiv: https://t.co/yf3724unaY
🤖 When and why should we use a single-agent vs. a multi-agent system? Our paper shows this decision can be made based on input complexity. MAS excels when agents can challenge and verify each other's reasoning in parallel, not just through a simple vote. Paper: https://t.co/ZvX41pe8mi
🤷‍♂️ When and why do Foundation Models hallucinate or confabulate in healthcare, and what's the real-world impact on medical practice? 💔 Our work tackles this urgent question, defining "Medical Hallucinations" AND revealing experimental results. Paper: https://t.co/tN3W8AGuin
I will be at #NeurIPS2024 from December 10-16. Thrilled to present our oral paper (MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making) on Friday, December 13th (15:50-16:10 PST). 🔍 Learn more: Project page: https://t.co/bYqxfLpm9Z
[Please RT📢] SEA Lab ( https://t.co/MKGUTnoOXs) is hiring 1 postdoc in Spring/Fall'25 and 1-2 PhD students in Fall'25! We build next-gen #HCI and #HAI techniques for health & medical applications. Visit the JOIN US page for more details; reading the FAQs before applying is highly recommended.
sea-lab.space
Developing the next generation of human-computer interaction and applied AI technologies for health.
@chanwoopark20 @HyewonMandyJ I am open to any form of collaboration on future work in the healthcare AI domain, especially multi-agent LLMs, healthcare AI, and wearable sensors. Also, I am actively looking for PhD positions this Fall.
@chanwoopark20 @HyewonMandyJ Our ablations show that the adaptive setting outperforms static complexity settings, with 81.2% accuracy on text-only queries. Most text-only queries were high complexity, while image+text and video+text queries were often low complexity, suggesting visual cues simplify decisions.
@chanwoopark20 @HyewonMandyJ Our findings show that MDAgents consistently reach consensus across different data modalities. Text+video modalities converge quickly, while text+image and text-only modalities show a more gradual alignment. Despite varying speeds, all modality cases eventually converged.
@chanwoopark20 @HyewonMandyJ Our ablations reveal that our approach can optimize performance with fewer agents (N=3), improves decision-making at extreme temperatures, and reduces computational costs, making it more efficient and adaptable than Solo and Group settings, especially in complex medical cases.
@chanwoopark20 @HyewonMandyJ Solo settings excel in simpler tasks, achieving up to 83.9% accuracy, while group settings outperform in complex, multi-modal tasks, with up to 91.9% accuracy.
@chanwoopark20 @HyewonMandyJ Surprisingly, our MDAgents significantly outperforms both Solo and Group setting methods, showing the best performance in 7 out of 10 benchmarks. It handles textual information with high precision alongside visual data.
@chanwoopark20 @HyewonMandyJ MDAgents follows five stages: 1) a medical complexity check to categorize the query; 2) expert recruitment, selecting a PCC for low and an MDT/ICT for moderate and high complexity; 3) initial assessment; 4) collaborative discussion between LLM agents; 5) final decision-making by a moderator.