Tom Sheffer
@TomSheffer17807
115 Followers · 101 Following · 5 Media · 23 Statuses
@Google Research Software Engineer #Medicine #Medical_AI #AI_4_Science | M.D.
Joined January 2024
Just wrapped up #ACL2025 and feeling inspired! Standout sessions on LLM self-consistency and the role of pretrained models in text embeddings show how far NLP has come. Thanks to the organizers for an amazing conference. #AI #NLP #Neuroscience
8/8 The big takeaway: Focusing on a single AI's benchmark score is missing the forest for the trees. True progress is designing the whole forest: a diverse team of agents that can achieve synergy together. #FutureofAI #Research #Teamwork w/ @GoldsteinYAriel @yanivdover alonmiron
6/8 🖼️ Fig 2 shows the LLM-only teams. The accuracy lines barely cross, meaning Diversity Gain is near zero. The result: conversation actually hurts the best LLM. This shows homogeneous knowledge leads to weak synergy.
5/8 🖼️ Our pipeline: solo answers → 2D knowledge profile (accuracy × confidence) → chat → re-answer. We quantify "Diversity Gain": the accuracy boost from an oracle telling an uncertain agent exactly when to copy a confident partner.
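The oracle-based "Diversity Gain" described in the tweet above can be sketched in a few lines. Everything here is an illustrative assumption, not the paper's actual code: the `diversity_gain` helper, the 0.5 confidence threshold, and the toy agents are all hypothetical.

```python
def diversity_gain(agent, partner, truth, conf_threshold=0.5):
    """Accuracy boost when an oracle tells `agent` exactly when to copy `partner`.

    Each agent is a list of (answer, confidence) pairs, one per question.
    The oracle sees the ground truth, so it only permits copies that help.
    """
    # Solo accuracy of the agent, before any collaboration.
    base = sum(a == t for (a, _), t in zip(agent, truth)) / len(truth)

    oracle_correct = 0
    for (a_ans, a_conf), (p_ans, p_conf), t in zip(agent, partner, truth):
        # Copy only when the agent is uncertain, the partner is confident,
        # and the oracle knows the partner's answer is actually right.
        copy = a_conf < conf_threshold and p_conf >= conf_threshold and p_ans == t
        oracle_correct += (p_ans if copy else a_ans) == t

    return oracle_correct / len(truth) - base

# Two toy agents with complementary knowledge -> positive gain.
agent   = [("A", 0.9), ("B", 0.2), ("C", 0.3)]  # (answer, confidence)
partner = [("A", 0.8), ("D", 0.9), ("C", 0.9)]
truth   = ["A", "D", "C"]
```

With identical (homogeneous) agents, the oracle never finds a helpful copy, so the gain is zero, which matches the thread's point about LLM teams.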
4/8 Why do humans improve? 1️⃣ Calibrated confidence: they know what they don't know. 2️⃣ Confidence drives behavior: low confidence → switch, high → stick. 3️⃣ Diverse knowledge: they complement each other's gaps. LLMs have #1 and #2, but lack #3. With no diversity, there's no gain.
3/8 For comparison, we benchmarked this against the human standard: clinical-year medical students. Unlike the AI-only groups, the students' collaboration was a success. The team's accuracy surpassed that of its best individual member. 🧠🤝
2/8 We tested three flagship Large Language Models (LLMs) in a group chat to solve medical board-style questions. 🏥🤖 The result? They debated at length, but their group accuracy DROPPED. The most capable model actually got dumber by listening to the others. 📉
1/8 🎉 Our new preprint, "Knowledge Is More Than Performance," is out! Can a room full of language models collaborate like human experts? Spoiler: not yet. And our research reveals the fundamental reason why 🧵 #AI #LLM #humanaiinteraction
https://t.co/JTtNkZyjlf
Presenting our CISC paper tomorrow at #ACL2025! ➡️ We save >40% compute on self-consistency by using the LLM's internal confidence signal. Poster: Tues, 16:00-17:30 @ Hall X4 X5. Paper: https://t.co/N5AFzgG5Je Also chatting: LLMs in Neuro, MedNLP, & Human-AI collab!
Presenting my poster: DOVE, a large-scale multi-dimensional predictions dataset towards meaningful LLM evaluation. Monday 18:00, Vienna, #ACL2025. Come chat about LLM evaluation, prompt sensitivity, and our collection of 250M model outputs!
See you at #ACL2025 in Vienna. Come say hi! w/ @TaubenfeldAmir eran_ofek @amir_feder @GoldsteinYAriel @zorikgekhman @_galyo @GoogleAI
Our method uses a model's internal confidence to make self-consistency more efficient:
✅ Saves >40% compute on average
✅ Maintains performance
✅ Adds no latency overhead
We're sharing the code to encourage reproduction and new research. Check it out! 💻
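A minimal sketch of the confidence-weighted vote the thread describes, under assumptions: the `cisc_vote` helper and the toy numbers are hypothetical illustrations, not the released implementation (which is in the linked code).

```python
from collections import defaultdict

def cisc_vote(samples):
    """Confidence-weighted majority vote, the core idea behind CISC.

    `samples` is a list of (answer, confidence) pairs, one per sampled
    reasoning path. Plain self-consistency would ignore the confidences
    and just count votes; weighting by confidence lets fewer samples
    reach the same decision.
    """
    scores = defaultdict(float)
    for answer, confidence in samples:
        scores[answer] += confidence  # each vote weighted by model confidence
    return max(scores, key=scores.get)

# Three samples: plain majority would pick "7", but both "7" votes are
# low-confidence, so the weighted vote picks the confident "12" instead.
samples = [("7", 0.2), ("7", 0.3), ("12", 0.9)]
```

The design point: a single well-calibrated confident sample can outvote several unconfident ones, which is what allows the sample budget (and hence compute) to shrink.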
Thrilled that our paper on Confidence-Informed Self-Consistency (CISC) has been accepted to #ACL2025 Findings! 🎉 Paper: https://t.co/N5AFzgG5Je (1/2)
arxiv.org
Self-consistency decoding enhances LLMs' performance on reasoning tasks by sampling diverse reasoning paths and selecting the most frequent answer. However, it is computationally expensive, as...
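The plain self-consistency baseline the abstract describes can be sketched as follows. `sample_answer` and the toy `noisy_model` are stand-ins for real stochastic LLM decoding, assumed here purely for illustration.

```python
import random
from collections import Counter

def self_consistency(sample_answer, k=5, seed=0):
    """Plain self-consistency: sample k reasoning paths, return the modal answer.

    `sample_answer` stands in for one stochastic chain-of-thought pass;
    the real method decodes k full reasoning chains and votes on their
    final answers, which is why it is computationally expensive.
    """
    rng = random.Random(seed)
    answers = [sample_answer(rng) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Toy "model" that answers "42" 70% of the time and "41" otherwise.
def noisy_model(rng):
    return "42" if rng.random() < 0.7 else "41"
```

The k forward passes are the cost that CISC attacks: a confidence-weighted vote can match this plain vote's accuracy with a smaller k.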
Now accepted to #COLM2025! We formally define hidden knowledge in LLMs and show its existence in a controlled study. We even show that a model can know the answer yet fail to generate it in 1,000 attempts 😵 Looking forward to presenting and discussing our work in person.
🚨 It's often claimed that LLMs know more facts than they show in their outputs, but what does this actually mean, and how can we measure this "hidden knowledge"? In our new paper, we clearly define this concept and design controlled experiments to test it. 1/🧵
New Preprint 🎉 LLM self-assessment unlocks efficient decoding ✅
Our Confidence-Informed Self-Consistency (CISC) method cuts compute without losing accuracy. We also rethink confidence evaluation & contribute to the debate on self-verification. https://t.co/4vSCs9ETPL 1/8👇