
Ameya P.
@AmyPrb
Followers
327
Following
1K
Media
16
Statuses
804
Exploring Science of Benchmarking & Scaling up Automated 🧬 Discovery. Postdoc @bethgelab @uni_tue; Previously: @OxfordTVG, @intelailabs. RT != endorsement
Tübingen, Germany
Joined September 2021
RT @AmandaIlze: Excited to be heading to ICML this year to present two projects, both as spotlights! 🎉 Big thanks to my collaborators — com….
0
4
0
RT @sebkrier: I've been yapping for months about bad evaluation setups and how results/AI behaviors are reported, and this new @AISecurityI….
0
30
0
RT @nikhilchandak29: 🚨Thought Grok-4 saturated GPQA? Not yet! ⚖️Same questions, when evaluated free-form, Grok-4 is no better than its sm….
0
28
0
RT @kalyan_einstein: We train a neural network to predict distributional shifts in gene expression using LLM embeddings of unseen genetic p….
0
4
0
RT @ducha_aiki: On the rankability of visual embeddings. Ankit Sonthalia @a_uselis @coallaoh. tl;dr: one can discover "property ordering….
0
7
0
RT @TimKietzmann: Exciting new preprint from the lab: “Adopting a human developmental visual diet yields robust, shape-based AI vision”. A….
0
47
0
RT @ShashwatGoel7: Can LMs Falsify accepted at #COLM2025. We introduce REFUTE (the name is a recursive backronym 😌), a benchmark for model….
0
5
0
RT @jsuarez5341: Happy 4th! Reinforcement learned with PufferLib. More drone demos soon. We're a private lab looking for contracts. DM if y….
0
7
0
RT @jonasgeiping: Multiple-Choice benchmarks are an odd thing to use to evaluate modern LLMs, liked for their fluent, free-form responses.….
0
3
0
RT @florian_tramer: Very cool result. In hindsight, this shouldn't be too surprising to anyone who has ever taken a multiple choice exam.….
0
7
0
RT @gaur_manu: MCQ is great for checking existence of specific knowledge i.e if model fails to answer, it definitely lacks it. However, pro….
0
3
0
RT @nikhilchandak29: 🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you c….
0
20
0
RT @ShashwatGoel7: There's been a hole at the heart of #LLM evals, and we can now fix it. 📜New paper: Answer Matching Outperforms Multiple….
0
39
0
Check out detailed threads by @ShashwatGoel7.
There's been a hole at the heart of #LLM evals, and we can now fix it. 📜New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations. ❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer
1
0
0