Tom Sheffer
@TomSheffer17807
115 Followers · 101 Following · 5 Media · 23 Statuses
@Google Research Software Engineer #Medicine #Medical_AI #AI_4_Science | M.D.
Joined January 2024
Just wrapped up #ACL2025 and feeling inspired! Standout sessions on LLM self-consistency and the role of pretrained models in text embeddings show how far NLP has come. Thanks to the organizers for an amazing conference. #AI #NLP #Neuroscience
8/8 The big takeaway: Focusing on a single AI's benchmark score is missing the forest for the trees. True progress is designing the whole forest: a diverse team of agents that can achieve synergy together. #FutureofAI #Research #Teamwork w/ @GoldsteinYAriel @yanivdover alonmiron
6/8 🖼️ Fig 2 shows the LLM-only teams. The accuracy lines barely cross, meaning Diversity Gain is near zero. The result: conversation actually hurts the best LLM. This shows homogeneous knowledge leads to weak synergy.
5/8 🖼️ Our pipeline: solo answers → 2D knowledge profile (accuracy × confidence) → chat → re-answer. We quantify "Diversity Gain": the accuracy boost from an oracle telling an uncertain agent exactly when to copy a confident partner.
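The oracle-based "Diversity Gain" described in the tweet above can be sketched in a few lines. Everything here is an illustrative assumption, not the paper's actual code: the `diversity_gain` helper, the 0.5 confidence threshold, and the toy agents are all hypothetical.

```python
def diversity_gain(agent, partner, truth, conf_threshold=0.5):
    """Accuracy boost when an oracle tells `agent` exactly when to copy `partner`.

    Each agent is a list of (answer, confidence) pairs, one per question.
    The oracle sees the ground truth, so it only permits copies that help.
    """
    # Solo accuracy of the agent, before any collaboration.
    base = sum(a == t for (a, _), t in zip(agent, truth)) / len(truth)

    oracle_correct = 0
    for (a_ans, a_conf), (p_ans, p_conf), t in zip(agent, partner, truth):
        # Copy only when the agent is uncertain, the partner is confident,
        # and the oracle knows the partner's answer is actually right.
        copy = a_conf < conf_threshold and p_conf >= conf_threshold and p_ans == t
        oracle_correct += (p_ans if copy else a_ans) == t

    return oracle_correct / len(truth) - base

# Two toy agents with complementary knowledge -> positive gain.
agent   = [("A", 0.9), ("B", 0.2), ("C", 0.3)]  # (answer, confidence)
partner = [("A", 0.8), ("D", 0.9), ("C", 0.9)]
truth   = ["A", "D", "C"]
```

With identical (homogeneous) agents, the oracle never finds a helpful copy, so the gain is zero, which matches the thread's point about LLM teams.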
4/8 Why do humans improve? 1️⃣ Calibrated confidence: they know what they don't know. 2️⃣ Confidence drives behavior: low confidence → switch, high → stick. 3️⃣ Diverse knowledge: they complement each other's gaps. LLMs have #1 and #2, but lack #3. With no diversity, there's no gain.
3/8 For comparison, we benchmarked this against the human standard: clinical-year medical students. Unlike the AI-only groups, the students' collaboration was a success. The team's accuracy surpassed that of its best individual member. 🧠🤝
2/8 We tested three flagship Large Language Models (LLMs) in a group chat to solve medical board-style questions. 🏥🤖 The result? They debated at length, but their group accuracy DROPPED. The most capable model actually got dumber by listening to the others. 📉
1/8 🎉 Our new preprint, "Knowledge Is More Than Performance," is out! Can a room full of language models collaborate like human experts? Spoiler: not yet. And our research reveals the fundamental reason why 🧵 #AI #LLM #humanaiinteraction
https://t.co/JTtNkZyjlf
Presenting our CISC paper tomorrow at #ACL2025! ➡️ We save >40% compute on self-consistency by using the LLM's internal confidence signal. Poster: Tues, 16:00-17:30 @ Hall X4 X5. Paper: https://t.co/N5AFzgG5Je Also chatting: LLMs in Neuro, MedNLP, & Human-AI collab!
Presenting my poster: DOVE, a large-scale multi-dimensional predictions dataset towards meaningful LLM evaluation. Monday 18:00, Vienna, #ACL2025. Come chat about LLM evaluation, prompt sensitivity, and our collection of 250M model outputs!
See you at #ACL2025 in Vienna. Come say hi! w/ @TaubenfeldAmir eran_ofek @amir_feder @GoldsteinYAriel @zorikgekhman @_galyo @GoogleAI
Our method uses a model's internal confidence to make self-consistency more efficient:
✅ Saves >40% compute on average
✅ Maintains performance
✅ Adds no latency overhead
We're sharing the code to encourage reproduction and new research. Check it out! 💻
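A minimal sketch of the confidence-weighted vote the thread describes, under assumptions: the `cisc_vote` helper and the toy numbers are hypothetical illustrations, not the released implementation (which is in the linked code).

```python
from collections import defaultdict

def cisc_vote(samples):
    """Confidence-weighted majority vote, the core idea behind CISC.

    `samples` is a list of (answer, confidence) pairs, one per sampled
    reasoning path. Plain self-consistency would ignore the confidences
    and just count votes; weighting by confidence lets fewer samples
    reach the same decision.
    """
    scores = defaultdict(float)
    for answer, confidence in samples:
        scores[answer] += confidence  # each vote weighted by model confidence
    return max(scores, key=scores.get)

# Three samples: plain majority would pick "7", but both "7" votes are
# low-confidence, so the weighted vote picks the confident "12" instead.
samples = [("7", 0.2), ("7", 0.3), ("12", 0.9)]
```

The design point: a single well-calibrated confident sample can outvote several unconfident ones, which is what allows the sample budget (and hence compute) to shrink.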
Thrilled that our paper on Confidence-Informed Self-Consistency (CISC) has been accepted to #ACL2025 Findings! 🎉 Paper: https://t.co/N5AFzgG5Je (1/2)
arxiv.org
Self-consistency decoding enhances LLMs' performance on reasoning tasks by sampling diverse reasoning paths and selecting the most frequent answer. However, it is computationally expensive, as...
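The plain self-consistency baseline the abstract describes can be sketched as follows. `sample_answer` and the toy `noisy_model` are stand-ins for real stochastic LLM decoding, assumed here purely for illustration.

```python
import random
from collections import Counter

def self_consistency(sample_answer, k=5, seed=0):
    """Plain self-consistency: sample k reasoning paths, return the modal answer.

    `sample_answer` stands in for one stochastic chain-of-thought pass;
    the real method decodes k full reasoning chains and votes on their
    final answers, which is why it is computationally expensive.
    """
    rng = random.Random(seed)
    answers = [sample_answer(rng) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Toy "model" that answers "42" 70% of the time and "41" otherwise.
def noisy_model(rng):
    return "42" if rng.random() < 0.7 else "41"
```

The k forward passes are the cost that CISC attacks: a confidence-weighted vote can match this plain vote's accuracy with a smaller k.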
Now accepted to #COLM2025! We formally define hidden knowledge in LLMs and show its existence in a controlled study. We even show that a model can know the answer yet fail to generate it in 1,000 attempts 😵 Looking forward to presenting and discussing our work in person.
🚨 It's often claimed that LLMs know more facts than they show in their outputs, but what does this actually mean, and how can we measure this "hidden knowledge"? In our new paper, we clearly define this concept and design controlled experiments to test it. 1/🧵
New Preprint 🎉 LLM self-assessment unlocks efficient decoding ✅
Our Confidence-Informed Self-Consistency (CISC) method cuts compute without losing accuracy. We also rethink confidence evaluation & contribute to the debate on self-verification. https://t.co/4vSCs9ETPL 1/8👇