Nikhil Chandak
@nikhilchandak29
588 Followers · 1K Following · 16 Media · 92 Statuses
PhD Student at Max Planck Institute. Past @iiit_hyderabad @VectorInst. Interested in better evals, forecasting, and open-endedness.
Tübingen, Germany
Joined December 2016
🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you can often get significant accuracy just from the choices alone. This is true even on recent benchmarks with 10 choices (like MMLU-Pro) and their vision
3 replies · 25 reposts · 74 likes
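The choices-only effect is easy to sketch. Below is a toy heuristic (my own illustration on a made-up mini-benchmark, not the paper's method): pick the longest choice, a classic test-taking shortcut, and see how far above chance it lands.

```python
# Toy illustration: score MCQ items from the choices alone, never
# reading the question. Heuristic: pick the longest option, since
# correct answers are often written more carefully and hence longer.

def choices_only_guess(choices):
    """Return the index of the longest choice string."""
    return max(range(len(choices)), key=lambda i: len(choices[i]))

def choices_only_accuracy(items):
    """items: list of (choices, answer_index). Returns guess accuracy."""
    hits = sum(choices_only_guess(c) == a for c, a in items)
    return hits / len(items)

# Hypothetical 4-item mini-benchmark (answers are made up for illustration).
items = [
    (["Paris", "Rome", "the capital of France, Paris", "Berlin"], 2),
    (["yes", "no", "it depends on the boundary conditions", "42"], 3),
    (["1", "3.7", "-5", "pi/2"], 3),
    (["A force", "Momentum", "Energy is conserved in closed systems", "Mass"], 2),
]
print(choices_only_accuracy(items))  # → 0.75, well above the 25% chance baseline
```

On this toy set the heuristic gets 3 of 4 items right without ever seeing a question, which is the shape of the result in the tweet: the choices leak information.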
Come to the Scaling Environments for Agents (SEA) workshop at NeurIPS today for a preview of our upcoming work, presented by @jonasgeiping, on Open-Ended Forecasting of World Events, where we show how to go beyond prediction markets to scale forecasting training data for LLMs!
🚨 Check out an exclusive preview of our soon-to-be-released project on training LLMs for open-ended forecasting, with RL on synthetic data ;) Tomorrow at the NeurIPS Scaling Environments for Agents (SEA) workshop (12:20-13:20, 15:50-16:50, Upper Level Room 23ABC, @jonasgeiping)
1 reply · 2 reposts · 16 likes
Added @MPI_IS @ELLISInst_Tue to the plot below. And this is only counting A100 or newer (H100/B200) GPUs. Periodic reminder to join us in the beautiful small town of Tübingen, where we have a good GPU-to-researcher ratio (better than most academic places) and great PIs!
Last night, @agupta and I hosted a great dinner with 14 professors at #NeurIPS2025 from leading academic labs across the US, and many described the compute situation in academia as "abhorrent". Out of curiosity I just pulled these stats. This is insane. To do meaningful AI research today you need
1 reply · 2 reposts · 21 likes
🚀 New Paper & Benchmark! Introducing MATH-Beyond (MATH-B), a new math reasoning benchmark deliberately constructed so that common open-source models (≤8B) fail even at pass@1024! Paper: https://t.co/G0KRqy379q Dataset: https://t.co/MzDfdpGils 🧵1/10
2 replies · 11 reposts · 34 likes
🚨 New paper! AI control protocols aim to keep powerful untrusted AIs in check, often by scoring their actions via a weaker trusted monitor. But what happens when the ASI jailbreaks its way out? 👇 Check out Misha’s thread to see how control collapses under adaptive pressure.
You are a scheming AI. You find yourself deployed under an AI control protocol, and you know the details. You remember the papers. The lab is using a trusted monitor. What do you do next? Our new paper argues that with modern protocols, prompt injections will easily let you win! (1/9)🧵
1 reply · 4 reposts · 15 likes
Thinking of doing a PhD/research visit? Consider applying to/visiting @ELLISInst_Tue @MPI_IS. We have PIs doing cool work, great funding, and most importantly, some of the best academic compute: 50+ GB200s, 250+ H100s, and many more A100 80GBs. Come join us!
16 replies · 32 reposts · 425 likes
even if you get only a single bit of information from RL, it might still be a higher order (i.e., more important) bit
There's been confusion about the importance of RL after @johnschulman2's excellent blog showing it learns surprisingly few bits of information. Here's my blog on what we might be missing: not all bits are made equal. Some bits of information matter more than others. This
0 replies · 1 repost · 8 likes
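A back-of-envelope comparison makes the "bits" framing concrete (the vocabulary size and sequence length below are assumptions for illustration, not numbers from the blog):

```python
import math

# Compare the raw information content of one supervised example
# against one RL episode with a binary success reward.

vocab_size = 50_000   # assumed tokenizer vocabulary
sft_tokens = 500      # assumed length of a supervised target

# SFT: each target token can convey up to log2(V) bits.
sft_bits = sft_tokens * math.log2(vocab_size)

# RL with a binary reward: at most 1 bit per episode.
rl_bits = 1.0

print(f"SFT example: ~{sft_bits:,.0f} bits")  # SFT example: ~7,805 bits
print(f"RL episode:  {rl_bits} bit")          # RL episode:  1.0 bit
```

The point of the tweet above survives this arithmetic: the single RL bit reports whether the whole behavior worked, which can matter more than thousands of token-level bits.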
Cool evals workshop happening in Copenhagen, worth checking out if you are attending @EurIPSConf this December.
We (Moritz Hardt, @walesalaudeen96, @joavanschoren) are organizing the Workshop on the Science of Benchmarking & Evaluating AI @EurIPSConf 2025 in Copenhagen! 📢 Call for Posters: https://t.co/jeXRNDexuX 📅 Deadline: Oct 10, 2025 (AoE) 🔗 More Info: https://t.co/zZmkGzGsRg
0 replies · 1 repost · 5 likes
Cute work by @akshitwt @arvindh__a @ShashwatGoel7. This simple plot shows how even small improvements in the per-step accuracy of models can lead to large improvements in task execution!
Why does horizon length grow exponentially as shown in the METR plot? Our new paper investigates this by isolating the execution capabilities of LLMs. Here's why you shouldn't be fooled by slowing progress on typical short-task benchmarks... 🧵
0 replies · 2 reposts · 8 likes
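The compounding argument behind that plot can be sketched in a few lines (my simplification of the setup, not the paper's model: assume each step succeeds independently with probability p, so an n-step task succeeds with probability p**n):

```python
import math

# If an n-step task succeeds with probability p**n, the longest horizon
# sustainable at a 50% success rate is n = ln(0.5) / ln(p).

def horizon_at_half(p):
    """Horizon length at which overall success probability drops to 50%."""
    return math.log(0.5) / math.log(p)

for p in (0.99, 0.995, 0.999):
    print(f"p={p}: ~{horizon_at_half(p):.0f} steps")
# p=0.99:  ~69 steps
# p=0.995: ~138 steps
# p=0.999: ~693 steps
```

Halving the per-step error rate roughly doubles the sustainable horizon, which is why modest per-step gains translate into large jumps in task execution.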
SEMMA (transliteration of செம்ம - meaning awesome), my first PhD work, is accepted to #EMNLP2025 Main! I also found out today that SEMMA has the (tied) highest average reviewer score in this ARR cycle 💪 📜: https://t.co/mhb6ecROz9
Does text help KG Foundation Models generalize better? 🤔 Yes (and no)! ☯️ Bootstrapping from LLMs to improve KG relation labels, we show that textual similarity between relations can act as an invariance, helping generalization across datasets! 🧵👇
4 replies · 3 reposts · 33 likes
We have hit a new high in chart crime
2 replies · 0 reposts · 1 like
@s_tworkowski any plans to open-source/release grok-3 mini? It is such an amazing model for its price; would be great to have it in the community!
0 replies · 0 reposts · 8 likes
New open-weight models are out from @OpenAI! On GPQA-Diamond, they show strong performance but are not better than open-source models like Kimi K2, R1, or the recent Qwen3-235B. What they are: SOTA at their size. What they are NOT: o4-mini (as many people have been claiming..)
3 replies · 14 reposts · 118 likes
Pretty happy with how my predictions are holding up. 5/6 was the gold medal threshold this year. OAI's "experimental reasoning LLM" got that exactly, failing only to solve the one hard combinatorics problem, P6. My advice remains: look beyond the medal. Brief thread. 1/
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
6 replies · 31 reposts · 251 likes
Meanwhile, @Kimi_Moonshot has actually cooked with K2. Even without extended reasoning, it is on par with frontier models like Grok-4 on GPQA free-form. Massive congrats to them.
🚨Thought Grok-4 saturated GPQA? Not yet! ⚖️Same questions, when evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have
7 replies · 21 reposts · 220 likes
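For intuition, free-form grading can be sketched as matching the model's answer text against the gold answer instead of comparing option letters. This is a minimal matcher of my own; the actual evaluation presumably uses more robust grading, such as an LLM judge.

```python
import re

def normalize(s):
    """Lowercase and collapse punctuation/whitespace for lenient matching."""
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

def grade_free_form(model_answer, gold_answer):
    """Credit the answer iff the normalized gold string appears in it."""
    return normalize(gold_answer) in normalize(model_answer)

# Hypothetical example answers for illustration:
print(grade_free_form("The compound is ethylene glycol.", "Ethylene glycol"))  # True
print(grade_free_form("Option (B)", "Ethylene glycol"))                        # False
```

Note the second case: a model that merely names an option letter gets no credit, which is exactly the shortcut that free-form evaluation removes relative to MCQ.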
The free-form version mentioned was developed as part of recent work on moving towards generative evals. @ShashwatGoel7 and @AmyPrb will be presenting this work next Friday at the ICML Workshop on Assessing World Models. Hit them up if interested! Check out other cool work
1 reply · 0 reposts · 25 likes
🚨Thought Grok-4 saturated GPQA? Not yet! ⚖️Same questions, when evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have
25 replies · 31 reposts · 264 likes
Very cool result. In hindsight, this shouldn't be too surprising to anyone who has ever taken a multiple-choice exam. E.g., if you have a trigonometry problem and the possible solutions are A: 1, B: 3.7, C: -5, D: pi/2, which would you pick (with no knowledge of the question)?
🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you can often get significant accuracy just from the choices alone. This is true even on recent benchmarks with 10 choices (like MMLU-Pro) and their vision
1 reply · 8 reposts · 31 likes