Ameya P. Profile
Ameya P.

@AmyPrb

Followers
327
Following
1K
Media
16
Statuses
804

Exploring Science of Benchmarking & Scaling up Automated 🧬 Discovery. Postdoc @bethgelab @uni_tue; Previously: @OxfordTVG, @intelailabs RT != endorsement

Tübingen, Germany
Joined September 2021
@AmyPrb
Ameya P.
3 months
🚨 New paper! Exciting progress in GRPO variants, smarter training strategies, and curated datasets showing impressive improvements on math reasoning. Is the hype justified? Details below 👇
1
4
26
@AmyPrb
Ameya P.
10 hours
RT @AmandaIlze: Excited to be heading to ICML this year to present two projects, both as spotlights! 🎉 Big thanks to my collaborators — com…
0
4
0
@AmyPrb
Ameya P.
14 hours
RT @sebkrier: I've been yapping for months about bad evaluation setups and how results/AI behaviors are reported, and this new @AISecurityI…
0
30
0
@AmyPrb
Ameya P.
15 hours
RT @altndrr: What if we stopped treating image classification like a multiple-choice quiz… …and just asked the model: "What’s in this image…
0
6
0
@AmyPrb
Ameya P.
2 days
RT @AlecStapp: We need a new Marshall Plan for shipping air conditioning units to Europe
0
602
0
@AmyPrb
Ameya P.
2 days
RT @tamaybes: My guess is that big tech companies increasingly opting to poach key personnel without acquiring the whole startup is driven…
0
22
0
@AmyPrb
Ameya P.
3 days
RT @nikhilchandak29: 🚨 Thought Grok-4 saturated GPQA? Not yet! ⚖️ Same questions, when evaluated free-form, Grok-4 is no better than its sm…
0
28
0
@AmyPrb
Ameya P.
5 days
RT @kalyan_einstein: We train a neural network to predict distributional shifts in gene expression using LLM embeddings of unseen genetic p…
0
4
0
@AmyPrb
Ameya P.
5 days
RT @ducha_aiki: On the rankability of visual embeddings. Ankit Sonthalia @a_uselis @coallaoh. tl;dr: one can discover "property ordering…
0
7
0
@AmyPrb
Ameya P.
5 days
RT @tmkadamcz: Some of these seem pretty concerning
0
3
0
@AmyPrb
Ameya P.
6 days
RT @TimKietzmann: Exciting new preprint from the lab: “Adopting a human developmental visual diet yields robust, shape-based AI vision”. A…
0
47
0
@AmyPrb
Ameya P.
6 days
RT @ShashwatGoel7: Can LMs Falsify accepted at #COLM2025. We introduce REFUTE (the name is a recursive backronym 😌), a benchmark for model…
0
5
0
@AmyPrb
Ameya P.
7 days
RT @jsuarez5341: Happy 4th! Reinforcement learned with PufferLib. More drone demos soon. We're a private lab looking for contracts. DM if y…
0
7
0
@AmyPrb
Ameya P.
9 days
RT @vikhyatk: It's just as easy to find an image that an AI model can decipher, that most humans would struggle with. We just have differen…
0
4
0
@AmyPrb
Ameya P.
10 days
RT @jonasgeiping: Multiple-Choice benchmarks are an odd thing to use to evaluate modern LLMs, liked for their fluent, free-form responses. …
0
3
0
@AmyPrb
Ameya P.
10 days
RT @florian_tramer: Very cool result. In hindsight, this shouldn't be too surprising to anyone who has ever taken a multiple choice exam. …
0
7
0
@AmyPrb
Ameya P.
10 days
RT @gaur_manu: MCQ is great for checking existence of specific knowledge, i.e., if model fails to answer, it definitely lacks it. However, pro…
0
3
0
@AmyPrb
Ameya P.
10 days
RT @nikhilchandak29: 🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you c…
0
20
0
@AmyPrb
Ameya P.
10 days
RT @ShashwatGoel7: There's been a hole at the heart of #LLM evals, and we can now fix it. 📜New paper: Answer Matching Outperforms Multiple…
0
39
0
@AmyPrb
Ameya P.
10 days
@nikhilchandak29
Nikhil Chandak
10 days
🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you can often get significant accuracy just from the choices alone. This is true even on recent benchmarks with 10 choices (like MMLU-Pro) and their vision
0
0
0
@AmyPrb
Ameya P.
10 days
Check out the detailed threads by @ShashwatGoel7.
@ShashwatGoel7
Shashwat Goel ✈️ ICML 2025
10 days
There's been a hole at the heart of #LLM evals, and we can now fix it. 📜New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations. ❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer
1
0
0