
Nikhil Chandak
@nikhilchandak29
Followers
214
Following
616
Media
10
Statuses
63
PhD Student at Max Planck Institute. Past @iiit_hyderabad @VectorInst. Thinking about a post-AGI world these days
Tübingen, Germany
Joined December 2016
The free-form version mentioned here was developed as part of recent work on moving towards generative evals. @ShashwatGoel7 and @AmyPrb will be presenting this work next Friday at the ICML Workshop on Assessing World Models. Hit them up if interested, and check out their other cool work!
1
0
10
🚨 Thought Grok-4 saturated GPQA? Not yet! ⚖️ On the same questions, evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have…
10
11
102
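For readers wondering what "evaluated free-form" means in practice, here is a minimal sketch of the two protocols (not the released evaluation code; `query_model` and `judge_equivalent` are hypothetical stand-ins for an LLM API call and an answer-equivalence check):

```python
# Minimal sketch of multiple-choice vs. free-form scoring for GPQA-style items.
# `query_model` and `judge_equivalent` are hypothetical stand-ins for an LLM API call
# and an answer-equivalence check (e.g. an LLM judge or expert annotation).

def mcq_correct(item, query_model) -> bool:
    """Standard protocol: show the four options and score the chosen letter."""
    options = "\n".join(f"{letter}. {text}" for letter, text in item["options"].items())
    reply = query_model(f"{item['question']}\n{options}\nAnswer with a single letter.")
    return reply.strip().upper().startswith(item["correct_letter"])

def freeform_correct(item, query_model, judge_equivalent) -> bool:
    """Free-form protocol: hide the options and grade the answer against the reference."""
    reply = query_model(f"{item['question']}\nGive only the final answer.")
    return judge_equivalent(reply, item["reference_answer"])

def accuracy(items, is_correct) -> float:
    """Fraction of items graded correct under a given protocol."""
    return sum(is_correct(item) for item in items) / max(len(items), 1)
```

The gap between the two accuracies is what the tweet above is pointing at: a model can look close to saturation under multiple choice while still failing questions it cannot answer unaided.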
RT @florian_tramer: Very cool result. In hindsight, this shouldn't be too surprising to anyone who has ever taken a multiple choice exam…
0
7
0
@DanHendrycks @_jasonwei @LiamFedus @mia_glaese Work with @ShashwatGoel7 @jonasgeiping @AmyPrb and Moritz Hardt at the @MPI_IS @ELLISInst_Tue @uni_tue. Paper (recently accepted at the ICML'25 World Models Workshop): We also release our code and annotations for GPQA-Diamond and a subset of MMLU-Pro.
1
2
7
RT @ShashwatGoel7: TIL half of SWE-Bench-Verified is fixing issues in a single repository. We really need to be careful with how we name be…
0
2
0
RT @pathak2206: A great example of scientific discourse at its best—thoughtful, constructive, and conclusive. We now have more rigorous evi…
0
2
0
RT @mihirp98: 1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29, @AmyPrb for the past 3 we…
0
13
0
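The retweet above is cut off, but "maximizing confidence" as a training signal is commonly implemented as minimizing the entropy of the model's own output distribution, used as an unsupervised reward. A rough sketch under that assumption (not necessarily the exact objective in the linked work; the tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def confidence_reward(logits: torch.Tensor) -> torch.Tensor:
    """Negative mean token entropy: larger when the model is more confident.

    logits: (seq_len, vocab_size) scores for the sampled answer tokens.
    No ground-truth labels are needed, so this can serve as an unsupervised
    reward for RL-style fine-tuning.
    """
    log_probs = F.log_softmax(logits, dim=-1)       # (seq_len, vocab_size)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)      # (seq_len,)
    return -entropy.mean()                          # scalar reward
```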
RT @jonasgeiping: Forecasting future events is a fascinating task for language models. Arguably the hardest application for a pure "oracle"…
0
11
0
RT @nikhilchandak29: @zzlccc Circling back, it seems like your base model numbers are also quite different from what is reported in Qwen3 r…
0
2
0
RT @arvindh__a: Does text help KG Foundation Models generalize better? 🤔 Yes (and no)! ☯️ Bootstrapped by LLMs improving KG relation labe…
0
9
0
Yes, let's not try to attack anyone (we don't mean to either). Our goal isn’t to detract from the potential of these proposed methods; we’re simply highlighting the importance of careful baseline comparisons to better contextualize the claims made. (1/2).
Quick clarifications:
1. Paper authors should not be criticized. I'm sure we're all trying our best.
2. The higher numbers are not from "prompt engineering". We added the prompts to the blog; they're simple.
3. We didn't "reproduce" the RL results in the same settings. Open weights would help.
1
0
3
RT @HKydlicek: Math evals are really hard to get right, and if you use the wrong evaluator you can easily gain an insane amount of points by just a…
0
3
0
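To make that concrete, here is a toy illustration (not the evaluator the tweet refers to) of how a lenient grader hands out free points: a naive substring match credits a wrong final answer because the gold string happens to appear in the solution text, while checking only the final boxed answer does not.

```python
import re

# Toy example: the model's final answer is 12, but the reference answer is 2.
response = r"Combining terms gives x + 10 = 12, so the answer is \boxed{12}."
gold = "2"

def substring_grader(text: str, gold: str) -> bool:
    """Overly lenient: credits the response if the gold string appears anywhere."""
    return gold in text

def boxed_grader(text: str, gold: str) -> bool:
    """Stricter: compares only the content of the final boxed expression to the gold answer."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", text)
    return bool(boxed) and boxed[-1].strip() == gold

print(substring_grader(response, gold))  # True  -> free points for a wrong answer
print(boxed_grader(response, gold))      # False -> correctly rejected
```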
RT @YiranWu18: @StellaLisy We evaluated MATH500 on Qwen2.5-7B and Math and saw ~64% accuracy with pass@1, temp=0, and similar results in si…
0
2
0
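"pass@1, temp=0" means a single greedy sample per problem, so the reported ~64% is plain accuracy. For completeness, a small sketch of the standard unbiased pass@k estimator (n samples, c of them correct), which reduces to accuracy in that setting; the 320/500 split below is made up for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With temperature 0 there is one deterministic sample per problem (n = k = 1),
# so pass@1 is just the fraction of problems solved, e.g. 320/500 = 0.64.
scores = [pass_at_k(n=1, c=1, k=1)] * 320 + [pass_at_k(n=1, c=0, k=1)] * 180
print(sum(scores) / len(scores))  # 0.64
```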