Nikhil Chandak Profile
Nikhil Chandak

@nikhilchandak29

Followers: 214 · Following: 616 · Media: 10 · Statuses: 63

PhD Student at Max Planck Institute. Past @iiit_hyderabad @VectorInst. Thinking about a post-AGI world these days

Tübingen, Germany
Joined December 2016
@nikhilchandak29
Nikhil Chandak
7 days
🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you can often get significant accuracy just from the choices alone. This is true even on recent benchmarks with 10 choices (like MMLU-Pro) and their vision…
[Image]
3
20
62
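For concreteness, here is a minimal sketch of the kind of choices-only probe the tweet describes. The `ask` callable is a placeholder for whatever LLM client you use, and the prompt wording is illustrative, not taken from the paper:

```python
from typing import Callable

def choices_only_probe(choices: list[str], ask: Callable[[str], str]) -> str:
    """Ask a model to pick an answer from the options alone; the question is never shown."""
    letters = "ABCDEFGHIJ"  # recent benchmarks like MMLU-Pro use up to 10 options
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    prompt = (
        "The question has been withheld. Based only on the answer choices "
        "below, reply with the single letter of the most plausible answer.\n"
        f"{options}\nAnswer:"
    )
    return ask(prompt).strip()[:1]  # keep just the predicted letter

# Accuracy well above the 1/len(choices) random-guess baseline means the
# benchmark leaks signal through its choices alone.
```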
@nikhilchandak29
Nikhil Chandak
7 hours
Dataset available here:
0
0
5
@nikhilchandak29
Nikhil Chandak
7 hours
The free-form version mentioned was developed as part of recent work on moving towards generative evals. @ShashwatGoel7 and @AmyPrb will be presenting this work next Friday at the ICML Workshop on Assessing World Models. Hit them up if interested, and check out the other cool work there!
1
0
10
@nikhilchandak29
Nikhil Chandak
7 hours
🚨 Thought Grok-4 saturated GPQA? Not yet! ⚖️ On the same questions, evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have…
[Image]
10
11
102
@nikhilchandak29
Nikhil Chandak
7 days
RT @florian_tramer: Very cool result. In hindsight, this shouldn't be too surprising to anyone who has ever taken a multiple choice exam…
0
7
0
@nikhilchandak29
Nikhil Chandak
7 days
@DanHendrycks @_jasonwei @LiamFedus @mia_glaese Work with @ShashwatGoel7 @jonasgeiping @AmyPrb and Moritz Hardt at @MPI_IS @ELLISInst_Tue @uni_tue. Paper (recently accepted at the ICML'25 Workshop on Assessing World Models): We also release our code and annotations for GPQA-Diamond and a subset of MMLU-Pro.
1
2
7
@nikhilchandak29
Nikhil Chandak
7 days
🚀 Let's shift the benchmarking ecosystem from MCQ to Answer Matching (and more free-form eval in general)! 🎯 Impacts:
📊 Leaderboards: Rankings change, sometimes significantly, and model accuracies drop, revealing more room for progress! 📉
🛠️ Benchmark Creation: Instead of…
[Image]
1
0
3
@nikhilchandak29
Nikhil Chandak
7 days
💡 Surprise twist: Answer Matching evaluations can actually cost LESS than MCQ evaluations! 💰 (See Section 4 of our paper for the full breakdown)
[Image]
1
0
3
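The tweet defers the full breakdown to Section 4; as a toy illustration only, here is a back-of-envelope cost model of the two pipelines. Every price and token count below is invented, and the split between generation and matching costs is an assumption made for illustration:

```python
# Toy cost model; all prices and token counts are invented for illustration.
PRICE_PER_1M_OUTPUT = {"candidate": 10.0, "matcher": 0.5}  # $ per 1M output tokens (hypothetical)

def mcq_cost(n_questions: int, tokens_per_response: int = 700) -> float:
    # MCQ: one generation per question; reasoning models often think at
    # length over 10 options, so responses can be long.
    return n_questions * tokens_per_response / 1e6 * PRICE_PER_1M_OUTPUT["candidate"]

def answer_matching_cost(n_questions: int, tokens_per_response: int = 500,
                         tokens_per_verdict: int = 50) -> float:
    # Answer Matching: one free-form generation plus one short call to a
    # cheap matcher model per question.
    generation = n_questions * tokens_per_response / 1e6 * PRICE_PER_1M_OUTPUT["candidate"]
    matching = n_questions * tokens_per_verdict / 1e6 * PRICE_PER_1M_OUTPUT["matcher"]
    return generation + matching

print(f"MCQ eval:        ${mcq_cost(1000):.2f}")
print(f"Answer matching: ${answer_matching_cost(1000):.2f}")
```

Under these made-up numbers the matcher call adds almost nothing, so whether Answer Matching wins comes down to the generation side; the paper's Section 4 has the real accounting.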
@nikhilchandak29
Nikhil Chandak
7 days
❗️MCQs aren't measuring the true generative capabilities of language models; they're simply testing choice discrimination! 🙅‍♂️ ✅ The fix? Answer Matching. We let models generate free-form answers, then use another language model to check if the generated answer matches the…
[Image]
2
0
9
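A minimal sketch of that answer-matching loop, under the same assumptions as the choices-only sketch above: `candidate` and `matcher` are placeholder prompt-to-reply callables, and the matcher prompt is a simplified stand-in rather than the paper's actual template:

```python
from typing import Callable

def answer_matching(question: str, reference: str,
                    candidate: Callable[[str], str],
                    matcher: Callable[[str], str]) -> bool:
    """Free-form generation graded by a second LM against the reference answer."""
    response = candidate(f"{question}\nAnswer concisely.")
    verdict = matcher(
        "Decide whether the response gives the same answer as the reference.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Response: {response}\n"
        "Reply with exactly Yes or No."
    )
    return verdict.strip().lower().startswith("yes")
```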
@nikhilchandak29
Nikhil Chandak
7 days
RT @ShashwatGoel7: TIL half of SWE-Bench-Verified is fixing issues in a single repository. We really need to be careful with how we name be…
0
2
0
@nikhilchandak29
Nikhil Chandak
10 days
RT @pathak2206: A great example of scientific discourse at its best: thoughtful, constructive, and conclusive. We now have more rigorous evi…
0
2
0
@nikhilchandak29
Nikhil Chandak
11 days
RT @mihirp98: 1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29, @AmyPrb for the past 3 we…
0
13
0
@nikhilchandak29
Nikhil Chandak
1 month
RT @jonasgeiping: Forecasting future events is a fascinating task for language models. Arguably the hardest application for a pure "oracle"…
0
11
0
@nikhilchandak29
Nikhil Chandak
1 month
RT @nikhilchandak29: @zzlccc Circling back, it seems like your base model numbers are also quite different from what is reported in Qwen3 r…
0
2
0
@nikhilchandak29
Nikhil Chandak
1 month
RT @arvindh__a: Does text help KG Foundation Models generalize better? 🤔 Yes (and no)! ☯️ Bootstrapped by LLMs improving KG relation labe…
0
9
0
@nikhilchandak29
Nikhil Chandak
1 month
Also, for context, the baselines we mention come from either independent reproductions (no prompt optimization) or official Qwen reports. Please check our updated blog for more details. (2/2)
0
0
2
@nikhilchandak29
Nikhil Chandak
1 month
Yes, let's not try to attack anyone (we don't mean to either). Our goal isn’t to detract from the potential of these proposed methods; we’re simply highlighting the importance of careful baseline comparisons to better contextualize the claims made. (1/2).
@ShashwatGoel7
Shashwat Goel
1 month
Quick clarifications:
1. Paper authors should not be criticized. I'm sure we're all trying our best.
2. The higher numbers are not from "prompt engineering". Added the prompts to the blog; it's simple.
3. We didn't "reproduce" the RL results in the same settings. Open weights would help.
1
0
3
@nikhilchandak29
Nikhil Chandak
1 month
RT @HKydlicek: Math evals are really hard to get right, and if you use the wrong evaluator you can easily gain an insane amount of points by just a…
0
3
0
@nikhilchandak29
Nikhil Chandak
1 month
RT @YiranWu18: @StellaLisy We evaluated MATH500 on Qwen2.5-7B and Math and saw ~64% accuracy with pass@1, temp=0, and similar results in si…
0
2
0