Alyssa Unell Profile
Alyssa Unell

@AlyssaUnell

Followers: 225
Following: 202
Media: 5
Statuses: 56

CS PhD Student @StanfordAILab. Previously @MIT_CSAIL. Robustness and reliability of ML for healthcare.

Joined November 2023
@AlyssaUnell
Alyssa Unell
15 days
RT @dorazhao9: There’s a lot of speculation around how AI will change human relationships. To dig into this question, we collect surveys fr…
0
9
0
@AlyssaUnell
Alyssa Unell
30 days
🌟 Big thank you to all of our collaborators as well as my amazing co-first authors @BediSuhana42170 @HennyJieCC @_Miguel_Fuentes for their incredibly hard work and amazing execution!
0
0
3
@AlyssaUnell
Alyssa Unell
30 days
🚀 Join the future of medical AI evaluation - everything is open source and ready for collaboration:
📊 Interactive leaderboard:
💻 Complete codebase & docs:
📄 Full paper:
1
0
1
@AlyssaUnell
Alyssa Unell
30 days
🌟 Why this matters: Healthcare systems need task-specific performance data—not just exam scores—to deploy AI systems safely. MedHELM provides the standardized evaluation framework the medical AI community has been missing.
1
0
2
@AlyssaUnell
Alyssa Unell
30 days
📊 Performance varies by category:
🔹Clinical Note Generation: 0.74-0.85 (strong)
🔹Patient Communication & Education: 0.76-0.89 (strong)
🔹Medical Research: 0.65-0.75 (moderate)
🔹Clinical Decision Support: 0.61-0.76 (moderate)
🔹Admin & Workflow: 0.53-0.63 (weakest)
1
0
2
@AlyssaUnell
Alyssa Unell
30 days
🔬 We evaluated 9 frontier LLMs and found surprising performance gaps:
🏆 Top performers: DeepSeek R1 (66% win-rate), o3-mini (64%)
💰 Best value: Claude 3.5 Sonnet (comparable performance at 40% lower cost)
Not all "medical AI" is created equal.
1
0
2
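The win-rate above is a pairwise comparison metric. As a minimal sketch of the general idea (not MedHELM's actual implementation - model names and scores below are hypothetical placeholders), a macro win-rate can be computed by comparing every pair of models on every benchmark and counting the fraction of comparisons each model wins:

```python
from itertools import combinations

def macro_win_rate(scores):
    """Compute each model's macro win-rate: the fraction of pairwise,
    per-benchmark comparisons it wins (ties count as half a win).

    scores: dict mapping model name -> list of per-benchmark scores,
    with all lists aligned to the same benchmark order.
    """
    models = list(scores)
    wins = {m: 0.0 for m in models}
    comparisons = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        for sa, sb in zip(scores[a], scores[b]):
            comparisons[a] += 1
            comparisons[b] += 1
            if sa > sb:
                wins[a] += 1
            elif sb > sa:
                wins[b] += 1
            else:  # tie: split the win
                wins[a] += 0.5
                wins[b] += 0.5
    return {m: wins[m] / comparisons[m] for m in models}

# Hypothetical per-benchmark scores, for illustration only
demo = {
    "model_a": [0.80, 0.70, 0.60],
    "model_b": [0.75, 0.72, 0.55],
    "model_c": [0.60, 0.65, 0.58],
}
rates = macro_win_rate(demo)
```

Here `model_a` wins 5 of its 6 comparisons, so its macro win-rate is about 0.83.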
@AlyssaUnell
Alyssa Unell
30 days
💊 MedHELM Solution: We built a clinician-validated taxonomy with 29 physicians covering:
✅5 categories
✅22 subcategories
✅121 specific medical tasks
We evaluate 35 benchmarks (17 existing + 18 new) spanning ALL categories and subcategories.
1
0
3
@AlyssaUnell
Alyssa Unell
30 days
⚕️ Current medical AI benchmarks have 3 critical flaws:
❌Synthetic scenarios ≠ real clinical complexity
❌Only 5% use actual EHR data
❌Focus on exams, not daily workflows (admin tasks, documentation, patient communication)
We need better evaluation standards.
1
0
2
@AlyssaUnell
Alyssa Unell
30 days
🧵 🩺 LLMs score ~99% on medical licensing exams, but are they ready for real medical deployment? Our new research reveals major gaps between test performance and clinical readiness. Introducing MedHELM: Holistic Evaluation of Large Language Models for Medical Applications👇
Tweet media one
2
7
33
@AlyssaUnell
Alyssa Unell
2 months
RT @rose_e_wang: I defended my PhD from Stanford CS @stanfordnlp 🌲 w/ Stanford CS first all-female committee!! My dissertation focused on A…
0
39
0
@AlyssaUnell
Alyssa Unell
2 months
Excited to present this work at ICLR's SynthData Workshop on Sunday April 27! Come through from 11:30-12:30 @ Peridot 202-203 to talk all things synthetic data for post-training, benchmarking, and AI for healthcare in general.
@AlyssaUnell
Alyssa Unell
4 months
1/🧵Introducing TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records. When we evaluate LLMs for reasoning over longitudinal clinical records, can we leverage synthetic data generation to create scalable benchmarks and improve model performance?
Tweet media one
1
9
44
@AlyssaUnell
Alyssa Unell
2 months
RT @jasonafries: 🎉 Excited to present our #ICLR2025 work—leveraging future medical outcomes to improve pretraining for prognostic vision mo…
0
7
0
@AlyssaUnell
Alyssa Unell
3 months
RT @AkliluJosiah2: There’s growing excitement around VLMs and their potential to transform surgery🏥—but where exactly are we on the path to…
0
6
0
@AlyssaUnell
Alyssa Unell
3 months
RT @IsabelOGallegos: 🚨🚨New Working Paper🚨🚨. AI-generated content is getting more politically persuasive. But does labeling it as AI-generat…
0
45
0
@AlyssaUnell
Alyssa Unell
3 months
RT @kenziyuliu: An LLM generates an article verbatim—did it “train on” the article? It’s complicated: under n-gram definitions of train-se…
0
82
0
@AlyssaUnell
Alyssa Unell
4 months
RT @jmhb0: 🚨Large video-language models LLaVA-Video can do single-video tasks. But can they compare videos? Imagine you’re learning a spo…
0
47
0
@AlyssaUnell
Alyssa Unell
4 months
7/🧵Thanks to my many collaborators on this project, including @HennyJieCC, @chenbowen118, @drnigam, @Emily_Alsentzer, @sanmikoyejo @jasonafries as well as feedback from @bedisuhana42170, @michaelwornow and the teams at @stai_research and the Shah Lab.
0
4
7
@AlyssaUnell
Alyssa Unell
4 months
6/🧵 This work is the beginning of a new paradigm for synthetic evaluations and temporal reasoning, drawing attention to model temporal biases and current benchmark gaps in healthcare.
⭐ Check out the paper for more information! Arxiv:
1
2
7
@AlyssaUnell
Alyssa Unell
4 months
5/🧵 Instruction tuning on synthetic data can improve performance across both synthetic and physician-generated benchmarks.
Tweet media one
1
2
5
@AlyssaUnell
Alyssa Unell
4 months
4/🧵 🚨 We find that in text generation conditioned on a long-context input, there is a lost-in-the-middle effect, indicating that we need to consider how we sample from our distribution for full longitudinal coverage!
Tweet media one
1
2
4
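One simple way to guard against the lost-in-the-middle effect described above is to stratify where evidence is drawn from across a longitudinal record, so the middle is sampled as often as the start and end. The sketch below is a hypothetical illustration of that idea, not TIMER's actual implementation; `stratified_evidence_positions` is an assumed helper name:

```python
import random

def stratified_evidence_positions(n_events, n_samples, seed=0):
    """Sample evidence positions stratified across a longitudinal
    record: the timeline is split into n_samples equal strata and
    one position is drawn from each, guaranteeing coverage of the
    start, middle, and end rather than leaving the middle sparse.

    n_events: number of timestamped events in the record.
    n_samples: number of evidence positions to draw.
    """
    rng = random.Random(seed)  # seeded for reproducible benchmarks
    positions = []
    stride = n_events / n_samples
    for i in range(n_samples):
        lo = int(i * stride)
        hi = max(lo, int((i + 1) * stride) - 1)
        positions.append(rng.randint(lo, hi))  # one draw per stratum
    return positions

# Draw 5 evidence positions from a 100-event record:
pos = stratified_evidence_positions(100, 5)
```

Each returned position falls in a distinct fifth of the record, so conditioning text generated from these anchors spans the full timeline.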