Alyssa Unell Profile
Alyssa Unell

@AlyssaUnell

Followers: 225
Following: 202
Media: 5
Statuses: 56

CS PhD Student @StanfordAILab. Previously @MIT_CSAIL. Robustness and reliability of ML for healthcare.

Joined November 2023
@AlyssaUnell
Alyssa Unell
15 days
RT @dorazhao9: There’s a lot of speculation around how AI will change human relationships. To dig into this question, we collect surveys fr…
0
9
0
@AlyssaUnell
Alyssa Unell
30 days
🌟 Big thank you to all of our collaborators as well as my amazing co-first authors @BediSuhana42170 @HennyJieCC @_Miguel_Fuentes for their incredibly hard work and amazing execution!
0
0
3
@AlyssaUnell
Alyssa Unell
30 days
🚀 Join the future of medical AI evaluation - everything is open source and ready for collaboration:
📊 Interactive leaderboard:
💻 Complete codebase & docs:
📄 Full paper:
1
0
1
@AlyssaUnell
Alyssa Unell
30 days
🌟 Why this matters: Healthcare systems need task-specific performance data—not just exam scores—to deploy AI systems safely. MedHELM provides the standardized evaluation framework the medical AI community has been missing.
1
0
2
@AlyssaUnell
Alyssa Unell
30 days
📊 Performance varies by category:
🔹Clinical Note Generation: 0.74-0.85 (strong)
🔹Patient Communication & Education: 0.76-0.89 (strong)
🔹Medical Research: 0.65-0.75 (moderate)
🔹Clinical Decision Support: 0.61-0.76 (moderate)
🔹Admin & Workflow: 0.53-0.63 (weakest)
1
0
2
@AlyssaUnell
Alyssa Unell
30 days
🔬 We evaluated 9 frontier LLMs and found surprising performance gaps:
🏆 Top performers: DeepSeek R1 (66% win-rate), o3-mini (64%)
💰 Best value: Claude 3.5 Sonnet (comparable performance at 40% lower cost)
Not all "medical AI" is created equal.
1
0
2
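The win-rate above is a pairwise comparison metric. As a minimal sketch of the general idea (not MedHELM's actual implementation - model names and scores below are hypothetical placeholders), a macro win-rate can be computed by comparing every pair of models on every benchmark and counting the fraction of comparisons each model wins:

```python
from itertools import combinations

def macro_win_rate(scores):
    """Compute each model's macro win-rate: the fraction of pairwise,
    per-benchmark comparisons it wins (ties count as half a win).

    scores: dict mapping model name -> list of per-benchmark scores,
    with all lists aligned to the same benchmark order.
    """
    models = list(scores)
    wins = {m: 0.0 for m in models}
    comparisons = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        for sa, sb in zip(scores[a], scores[b]):
            comparisons[a] += 1
            comparisons[b] += 1
            if sa > sb:
                wins[a] += 1
            elif sb > sa:
                wins[b] += 1
            else:  # tie: split the win
                wins[a] += 0.5
                wins[b] += 0.5
    return {m: wins[m] / comparisons[m] for m in models}

# Hypothetical per-benchmark scores, for illustration only
demo = {
    "model_a": [0.80, 0.70, 0.60],
    "model_b": [0.75, 0.72, 0.55],
    "model_c": [0.60, 0.65, 0.58],
}
rates = macro_win_rate(demo)
```

Here `model_a` wins 5 of its 6 comparisons, so its macro win-rate is about 0.83.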
@AlyssaUnell
Alyssa Unell
30 days
💊 MedHELM Solution: We built a clinician-validated taxonomy with 29 physicians covering:
✅5 categories
✅22 subcategories
✅121 specific medical tasks
We evaluate 35 benchmarks (17 existing + 18 new) spanning ALL categories and subcategories.
1
0
3
@AlyssaUnell
Alyssa Unell
30 days
⚕️ Current medical AI benchmarks have 3 critical flaws:
❌Synthetic scenarios ≠ real clinical complexity
❌Only 5% use actual EHR data
❌Focus on exams, not daily workflows (admin tasks, documentation, patient communication)
We need better evaluation standards.
1
0
2
@AlyssaUnell
Alyssa Unell
30 days
🧵 🩺 LLMs score ~99% on medical licensing exams, but are they ready for real medical deployment? Our new research reveals major gaps between test performance and clinical readiness. Introducing MedHELM: Holistic Evaluation of Large Language Models for Medical Applications👇
Tweet media one
2
7
33
@AlyssaUnell
Alyssa Unell
2 months
RT @rose_e_wang: I defended my PhD from Stanford CS @stanfordnlp 🌲 w/ Stanford CS first all-female committee!! My dissertation focused on A…
0
39
0
@AlyssaUnell
Alyssa Unell
2 months
Excited to present this work at ICLR's SynthData Workshop on Sunday April 27! Come through from 11:30-12:30 @ Peridot 202-203 to talk all things synthetic data for post-training, benchmarking, and AI for healthcare in general.
@AlyssaUnell
Alyssa Unell
4 months
1/🧵Introducing TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records. When we evaluate LLMs for reasoning over longitudinal clinical records, can we leverage synthetic data generation to create scalable benchmarks and improve model performance?
Tweet media one
1
9
44
@AlyssaUnell
Alyssa Unell
2 months
RT @jasonafries: 🎉 Excited to present our #ICLR2025 work—leveraging future medical outcomes to improve pretraining for prognostic vision mo…
0
7
0
@AlyssaUnell
Alyssa Unell
3 months
RT @AkliluJosiah2: There’s growing excitement around VLMs and their potential to transform surgery🏥—but where exactly are we on the path to…
0
6
0
@AlyssaUnell
Alyssa Unell
3 months
RT @IsabelOGallegos: 🚨🚨New Working Paper🚨🚨. AI-generated content is getting more politically persuasive. But does labeling it as AI-generat…
0
45
0
@AlyssaUnell
Alyssa Unell
3 months
RT @kenziyuliu: An LLM generates an article verbatim—did it “train on” the article? It’s complicated: under n-gram definitions of train-se…
0
82
0
@AlyssaUnell
Alyssa Unell
4 months
RT @jmhb0: 🚨Large video-language models LLaVA-Video can do single-video tasks. But can they compare videos? Imagine you’re learning a spo…
0
47
0
@AlyssaUnell
Alyssa Unell
4 months
7/🧵Thanks to my many collaborators on this project, including @HennyJieCC, @chenbowen118, @drnigam, @Emily_Alsentzer, @sanmikoyejo @jasonafries as well as feedback from @bedisuhana42170, @michaelwornow and the teams at @stai_research and the Shah Lab.
0
4
7
@AlyssaUnell
Alyssa Unell
4 months
6/🧵 This work is the beginning of a new paradigm for synthetic evaluations and temporal reasoning, drawing attention to model temporal biases and current benchmark gaps in healthcare.
⭐ Check out the paper for more information! Arxiv:
1
2
7
@AlyssaUnell
Alyssa Unell
4 months
5/🧵 Instruction tuning on synthetic data can improve performance across both synthetic and physician-generated benchmarks.
Tweet media one
1
2
5
@AlyssaUnell
Alyssa Unell
4 months
4/🧵 🚨 We find that in text generation conditioned on a long-context input, there is a lost-in-the-middle effect, indicating that we need to consider how we sample from our distribution for full longitudinal coverage!
Tweet media one
1
2
4
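One simple way to guard against the lost-in-the-middle effect described above is to stratify where evidence is drawn from across a longitudinal record, so the middle is sampled as often as the start and end. The sketch below is a hypothetical illustration of that idea, not TIMER's actual implementation; `stratified_evidence_positions` is an assumed helper name:

```python
import random

def stratified_evidence_positions(n_events, n_samples, seed=0):
    """Sample evidence positions stratified across a longitudinal
    record: the timeline is split into n_samples equal strata and
    one position is drawn from each, guaranteeing coverage of the
    start, middle, and end rather than leaving the middle sparse.

    n_events: number of timestamped events in the record.
    n_samples: number of evidence positions to draw.
    """
    rng = random.Random(seed)  # seeded for reproducible benchmarks
    positions = []
    stride = n_events / n_samples
    for i in range(n_samples):
        lo = int(i * stride)
        hi = max(lo, int((i + 1) * stride) - 1)
        positions.append(rng.randint(lo, hi))  # one draw per stratum
    return positions

# Draw 5 evidence positions from a 100-event record:
pos = stratified_evidence_positions(100, 5)
```

Each returned position falls in a distinct fifth of the record, so conditioning text generated from these anchors spans the full timeline.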