Lily Chen
@lilyychenn
174 Followers · 3K Following · 7 Media · 51 Statuses
How do we teach LLMs not just to reason, but to reflect, debug, and improve themselves? We at AWS AI Labs introduce MURPHY 🤖, a multi-turn RL framework that brings self-correction into #RLVR (#GRPO). 🧵👇 Link: https://t.co/3kFjI5mxR5
2 replies · 14 retweets · 25 likes
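A minimal sketch, not from the MURPHY paper, of how a multi-turn self-correction rollout could feed a GRPO-style update: the group-relative advantage normalization is standard GRPO, while the `model` and `verifier` callables and the retry prompt are assumed purely for illustration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward
    against the mean and std of its sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def multi_turn_rollout(model, verifier, prompt, max_turns=3):
    """Hypothetical self-correction loop: the model answers, a
    verifier checks it, and failed attempts are appended to the
    context for a revision turn. The verifiable reward is 1.0
    only if some turn passes."""
    context = prompt
    for _ in range(max_turns):
        answer = model(context)      # assumed: returns a text answer
        if verifier(answer):         # assumed: verifiable reward check
            return 1.0
        context += "\nPrevious attempt failed. Revise:\n" + answer
    return 0.0

# Sample a group of G rollouts per prompt, then normalize:
# advantages = group_relative_advantages(
#     [multi_turn_rollout(model, verifier, p) for _ in range(G)])
```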
You can find our annotation data and interface here: https://t.co/QMdkPX2JvC. Many thanks to my co-lead @sebajoed and our amazing collaborators Barry Wei, @mackert, @ijmarshall, @pliang279, @RKouzyMD, @byron_c_wallace, and @jessyjli! 5/
0 replies · 0 retweets · 0 likes
To address these challenges, we propose a communication model that:
- clarifies intent through dialogue
- guides claims toward verifiable evidence
- explains diverse expert perspectives instead of forcing consensus
It reframes medical fact-checking as patient–expert dialogue 4/
1 reply · 0 retweets · 1 like
Verifying medical claims wasn’t straightforward for experts. They struggled with:
1️⃣ linking claims to evidence
2️⃣ interpreting underspecified or misguided claims
3️⃣ labeling nuanced claims, often with disagreement
These challenges are inherent to end-to-end fact-checking 🚧 3/
1 reply · 0 retweets · 1 like
We study real-world medical claims from Reddit, preserving post context and verifying them with RCT abstracts. 📄 Six experts annotated 20 claims, each with 10 abstracts. Annotations span:
1️⃣ abstract relevance
2️⃣ claim-level evidence quality
3️⃣ explanations citing abstracts 2/
1 reply · 0 retweets · 0 likes
Are we fact-checking medical claims the right way? 🩺🤔 Probably not. In our study, even experts struggled to verify Reddit health claims using end-to-end systems. We show why—and argue fact-checking should be a dialogue, with patients in the loop https://t.co/Wzbwe4i577 🧵1/
1 reply · 8 retweets · 26 likes
There are many KV cache-reduction methods, but a fair comparison is challenging. We propose a new unified metric called “critical KV footprint”. We compare existing methods and propose a new one - PruLong, which “prunes” certain attn heads to only look at local tokens. 1/7
2 replies · 38 retweets · 233 likes
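A rough illustration of the mechanism the PruLong tweet describes, with an invented head split and window size: restrict some attention heads to a local window so their KV cache can be truncated, while the remaining heads keep full causal context.

```python
import numpy as np

def head_attention_masks(seq_len, num_heads, local_heads, window):
    """Per-head causal masks: the first `local_heads` heads may
    only attend within a sliding window of `window` tokens; the
    remaining heads keep full causal attention. True = allowed."""
    i = np.arange(seq_len)[:, None]        # query positions
    j = np.arange(seq_len)[None, :]        # key positions
    causal = j <= i                        # full causal mask
    local = causal & (j > i - window)      # sliding-window mask
    masks = np.empty((num_heads, seq_len, seq_len), dtype=bool)
    masks[:local_heads] = local            # "pruned" heads: local only
    masks[local_heads:] = causal           # untouched heads: full context
    return masks

# Local heads never read KV entries older than `window`, so that
# part of their cache can be evicted, shrinking the KV footprint.
masks = head_attention_masks(seq_len=8, num_heads=4, local_heads=2, window=3)
```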
I am very excited about David's @ddvd233 line of work in developing generalist multimodal clinical foundation models. CLIMB (which will be presented at ICML 2025) https://t.co/XPTiplS0xc is a large-scale benchmark comprising 4.51 million patient samples totaling 19.01 terabytes
Thanks @iScienceLuvr for posting about our recent work! We're excited to introduce QoQ-Med, a multimodal medical foundation model that jointly reasons across medical images, videos, time series (ECG), and clinical texts. Beyond the model itself, we developed a novel training
1 reply · 4 retweets · 21 likes
How good are LLMs at 🔭 scientific computing and visualization 🔭? AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results. SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵
1 reply · 8 retweets · 23 likes
friends at #CHI2025, Karan @realkaranahuja, Yiyue @LuoYiyue, and I are teaching a course on **Multimodal AI for human sensing and interaction**. Come join us and learn about the latest advances in multimodal AI, generative AI, efficient software, and sensing hardware to
2 replies · 7 retweets · 49 likes
Lots of interest in the recent o3 and o4 models, but as these more advanced multimodal AI systems get better at math, do they also become better intelligent tutors who help students learn math? 🚨Introducing Interactive Sketchpad, an intelligent AI tutor that
1 reply · 15 retweets · 59 likes
Can LLMs learn to reason better by "cheating"? 🤯 Excited to introduce #cheatsheet: a dynamic memory module enabling LLMs to learn + reuse insights from tackling previous problems
🎯 Claude 3.5: 23% ➡️ 50% on AIME 2024
🎯 GPT-4o: 10% ➡️ 99% on Game of 24
Great job @suzgunmirac w/ awesome
9 replies · 39 retweets · 255 likes
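A toy sketch of a dynamic memory module in the spirit of the cheatsheet tweet; the class and its methods are invented for illustration and are not the paper's API. The idea: store distilled insights from solved problems and prepend them to later prompts.

```python
class Cheatsheet:
    """Minimal dynamic-memory sketch: accumulate short, reusable
    insights from past problems and inject them into new prompts."""

    def __init__(self, max_entries=50):
        self.entries = []
        self.max_entries = max_entries

    def add(self, insight: str):
        """Store a distilled takeaway (e.g. a tactic that worked)."""
        self.entries.append(insight)
        self.entries = self.entries[-self.max_entries:]  # keep it short

    def render(self, problem: str) -> str:
        """Prepend accumulated insights to the new problem."""
        notes = "\n".join(f"- {e}" for e in self.entries)
        return f"Cheatsheet of prior insights:\n{notes}\n\nProblem:\n{problem}"

sheet = Cheatsheet()
sheet.add("For Game of 24, try factor pairs of 24 before mixed ops.")
prompt = sheet.render("Make 24 from 3, 3, 8, 8.")
```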
Vision models have been smaller than language models; what if we scale them up? Introducing Web-SSL: A family of billion-scale SSL vision models (up to 7B parameters) trained on billions of images without language supervision, using VQA to evaluate the learned representation.
8 replies · 87 retweets · 480 likes
While today’s multimodal models excel at language-based social tasks, can they understand humans without words? ...not really😶 We introduce MimeQA, a video QA dataset to test AI's nonverbal social intelligence—using mime videos 🤐 Paper: https://t.co/PFIk7pacTs 🧵1/8
2 replies · 11 retweets · 14 likes
Introducing *ARC‑AGI Without Pretraining* – ❌ No pretraining. ❌ No datasets. Just pure inference-time gradient descent on the target ARC-AGI puzzle itself, solving 20% of the evaluation set. 🧵 1/4
37 replies · 192 retweets · 1K likes
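A toy illustration of inference-time gradient descent on a single puzzle, as the tweet describes: fit a model to the puzzle's own demonstration pairs with no pretraining and no external data. The model here, a pointwise color remapping, is far simpler than whatever the paper uses; it only conveys the test-time-training loop.

```python
import numpy as np

def fit_on_puzzle(train_pairs, colors=10, steps=300, lr=0.5):
    """Learn one softmax distribution over output colors per input
    color, by gradient descent on this puzzle's demo pairs alone."""
    logits = np.zeros((colors, colors))
    for _ in range(steps):
        grad = np.zeros_like(logits)
        n = 0
        for x, y in train_pairs:               # x, y: integer grids
            p = np.exp(logits)
            p /= p.sum(axis=1, keepdims=True)  # row-wise softmax
            for cin, cout in zip(x.ravel(), y.ravel()):
                g = p[cin].copy()              # d(-log p[cout])/d logits
                g[cout] -= 1.0
                grad[cin] += g
                n += 1
        logits -= lr * grad / n
    return logits.argmax(axis=1)               # hardened color map

# One demo pair: every 1 becomes 2. Apply the learned map to a test grid.
x = np.array([[0, 1], [1, 0]]); y = np.array([[0, 2], [2, 0]])
mapping = fit_on_puzzle([(x, y)])
test_out = mapping[np.array([[1, 1], [0, 1]])]  # -> [[2, 2], [0, 2]]
```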
Thrilled that we won an 🥂Outstanding Paper Award at #EMNLP2024! Super validating for using computational methods to investigate discourse processing via QUDs. Super proud of my students @YatingWu96 @ritikarmangla, amazing team @AlexGDimakis @gregd_nlp
LLMs can mimic human curiosity by generating open-ended inquisitive questions given some context, similar to how humans wonder when they read. But which ones are more important to be answered?🤔 We predict the salience of questions, substantially outperforming GPT-4.🌟 🧵1/5
14 replies · 9 retweets · 130 likes
heading to #emnlp2024! would love to chat with those interested in joining our Multisensory Intelligence research group at MIT @medialab @MITEECS
https://t.co/i4y1IK6unF Our group studies the foundations of multisensory AI to create human-AI symbiosis across scales and sensory
3 replies · 15 retweets · 116 likes
Excited for #EMNLP2024! Check out work from my students and collaborators that will be presented: https://t.co/cpwLhVsAlf
2 replies · 9 retweets · 76 likes
📣 Announcing the name and theme of my new research group at MIT @medialab @MITEECS: ***Multisensory Intelligence*** https://t.co/i4y1IK72dd Our group studies the foundations of multisensory AI to create human-AI symbiosis across scales and sensory mediums. We are hiring at
10 replies · 49 retweets · 439 likes
I'm excited to announce that our work, 𝐅𝐚𝐜𝐭𝐏𝐈𝐂𝐎, has been accepted to 𝗔𝗖𝗟 𝟮𝟬𝟮𝟰! 🎉🇹🇭 A huge thanks to all amazing collaborators 🚀🫶 #NLProc #ACL2024NLP
LLMs can write impressive-looking summaries of technical texts in plain language. But are they factual? This is critical in medicine, and the answer is tricky! Introducing ⚕️FactPICO, the first **expert** evaluation of this, with explanations Paper: https://t.co/AoSMyP0wNB 🧵1/
0 replies · 0 retweets · 10 likes