Nikhil Chandak
@nikhilchandak29
588 Followers · 1K Following · 16 Media · 92 Statuses
PhD Student at Max Planck Institute. Past @iiit_hyderabad @VectorInst. Interested in better evals, forecasting, and open-endedness.
Tübingen, Germany
Joined December 2016
🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you can often get significant accuracy just from the choices alone. This is true even on recent benchmarks with 10 choices (like MMLU-Pro) and their vision
3 replies · 25 reposts · 74 likes
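The choices-only effect is easy to sketch. Below is a toy heuristic (my own illustration on a made-up mini-benchmark, not the paper's method): pick the longest choice, a classic test-taking shortcut, and see how far above chance it lands.

```python
# Toy illustration: score MCQ items from the choices alone, never
# reading the question. Heuristic: pick the longest option, since
# correct answers are often written more carefully and hence longer.

def choices_only_guess(choices):
    """Return the index of the longest choice string."""
    return max(range(len(choices)), key=lambda i: len(choices[i]))

def choices_only_accuracy(items):
    """items: list of (choices, answer_index). Returns guess accuracy."""
    hits = sum(choices_only_guess(c) == a for c, a in items)
    return hits / len(items)

# Hypothetical 4-item mini-benchmark (answers are made up for illustration).
items = [
    (["Paris", "Rome", "the capital of France, Paris", "Berlin"], 2),
    (["yes", "no", "it depends on the boundary conditions", "42"], 3),
    (["1", "3.7", "-5", "pi/2"], 3),
    (["A force", "Momentum", "Energy is conserved in closed systems", "Mass"], 2),
]
print(choices_only_accuracy(items))  # → 0.75, well above the 25% chance baseline
```

On this toy set the heuristic gets 3 of 4 items right without ever seeing a question, which is the shape of the result in the tweet: the choices leak information.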
Come to the Scaling Environments for Agents (SEA) workshop at NeurIPS today for a preview of our upcoming work, presented by @jonasgeiping, on Open-Ended Forecasting of World Events, where we show how to go beyond prediction markets to scale forecasting training data for LLMs!
🚨 Check out an exclusive preview of our soon-to-be-released project on training LLMs for open-ended forecasting, with RL on synthetic data ;) Tomorrow at the NeurIPS Scaling Environments for Agents (SEA) workshop (12:20-13:20, 15:50-16:50, Upper Level Room 23ABC, @jonasgeiping)
1 reply · 2 reposts · 16 likes
Added @MPI_IS @ELLISInst_Tue to the plot below. And this is only counting A100 or newer (H100/B200) GPUs. Periodic reminder to join us in the beautiful small town of Tübingen, where we have a good GPU-to-researcher ratio (better than most academic places) and great PIs!
Last night, @agupta and I hosted a great dinner with 14 professors at #NeurIPS2025 from leading academic labs across the US, and many described the compute situation in academia as "abhorrent". Out of curiosity I just pulled these stats. This is insane. To do meaningful AI research today you need
1 reply · 2 reposts · 21 likes
🚀 New Paper & Benchmark! Introducing MATH-Beyond (MATH-B), a new math reasoning benchmark deliberately constructed so that common open-source models (≤8B) fail even at pass@1024! Paper: https://t.co/G0KRqy379q Dataset: https://t.co/MzDfdpGils 🧵1/10
2 replies · 11 reposts · 34 likes
🚨 New paper! AI control protocols aim to keep powerful untrusted AIs in check, often by scoring their actions via a weaker trusted monitor. But what happens when the ASI jailbreaks its way out? 👇 Check out Misha’s thread to see how control collapses under adaptive pressure.
You are a scheming AI. You find yourself deployed under an AI control protocol, and you know the details. You remember the papers. The lab is using a trusted monitor. What do you do next? Our new paper argues that with modern protocols, prompt injections will easily let you win! (1/9)🧵
1 reply · 4 reposts · 15 likes
Thinking of doing a PhD/research visit? Consider applying to/visiting @ELLISInst_Tue @MPI_IS. We have PIs doing cool work, great funding, and most importantly, some of the best academic compute: 50+ GB200s, 250+ H100s, and many more A100 80GBs. Come join us!
16 replies · 32 reposts · 425 likes
even if you get only a single bit of information from RL, it might still be a higher order (i.e., more important) bit
There's been confusion about the importance of RL after @johnschulman2's excellent blog showing it learns surprisingly few bits of information. Here's my blog on what we might be missing: not all bits are made equal. Some bits of information matter more than others. This
0 replies · 1 repost · 8 likes
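A back-of-envelope comparison makes the "bits" framing concrete (the vocabulary size and sequence length below are assumptions for illustration, not numbers from the blog):

```python
import math

# Compare the raw information content of one supervised example
# against one RL episode with a binary success reward.

vocab_size = 50_000   # assumed tokenizer vocabulary
sft_tokens = 500      # assumed length of a supervised target

# SFT: each target token can convey up to log2(V) bits.
sft_bits = sft_tokens * math.log2(vocab_size)

# RL with a binary reward: at most 1 bit per episode.
rl_bits = 1.0

print(f"SFT example: ~{sft_bits:,.0f} bits")  # SFT example: ~7,805 bits
print(f"RL episode:  {rl_bits} bit")          # RL episode:  1.0 bit
```

The point of the tweet above survives this arithmetic: the single RL bit reports whether the whole behavior worked, which can matter more than thousands of token-level bits.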
Cool evals workshop happening in Copenhagen, worth checking out if you are attending @EurIPSConf this December.
We (Moritz Hardt, @walesalaudeen96, @joavanschoren) are organizing the Workshop on the Science of Benchmarking & Evaluating AI @EurIPSConf 2025 in Copenhagen! 📢 Call for Posters: https://t.co/jeXRNDexuX 📅 Deadline: Oct 10, 2025 (AoE) 🔗 More Info: https://t.co/zZmkGzGsRg
0 replies · 1 repost · 5 likes
Cute work by @akshitwt @arvindh__a @ShashwatGoel7. This simple plot shows how even small improvements in the per-step accuracy of models can lead to large improvements in task execution!
Why does horizon length grow exponentially as shown in the METR plot? Our new paper investigates this by isolating the execution capabilities of LLMs. Here's why you shouldn't be fooled by slowing progress on typical short-task benchmarks... 🧵
0 replies · 2 reposts · 8 likes
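The compounding argument behind that plot can be sketched in a few lines (my simplification of the setup, not the paper's model: assume each step succeeds independently with probability p, so an n-step task succeeds with probability p**n):

```python
import math

# If an n-step task succeeds with probability p**n, the longest horizon
# sustainable at a 50% success rate is n = ln(0.5) / ln(p).

def horizon_at_half(p):
    """Horizon length at which overall success probability drops to 50%."""
    return math.log(0.5) / math.log(p)

for p in (0.99, 0.995, 0.999):
    print(f"p={p}: ~{horizon_at_half(p):.0f} steps")
# p=0.99:  ~69 steps
# p=0.995: ~138 steps
# p=0.999: ~693 steps
```

Halving the per-step error rate roughly doubles the sustainable horizon, which is why modest per-step gains translate into large jumps in task execution.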
SEMMA (transliteration of செம்ம - meaning awesome), my first PhD work, is accepted to #EMNLP2025 Main! I also found out today that SEMMA has the (tied) highest average reviewer score in this ARR cycle 💪 📜: https://t.co/mhb6ecROz9
Does text help KG Foundation Models generalize better? 🤔 Yes (and no)! ☯️ Bootstrapping from LLMs to improve KG relation labels, we show that textual similarity between relations can act as an invariance, helping generalization across datasets! 🧵👇
4 replies · 3 reposts · 33 likes
We have hit a new high in chart crime
2 replies · 0 reposts · 1 like
@s_tworkowski any plans to open-source/release grok-3 mini? It is such an amazing model for its price; would be great to have it in the community!
0 replies · 0 reposts · 8 likes
New open-weight models are out from @OpenAI! On GPQA-Diamond, they show strong performance but are not better than open-source models like Kimi K2, R1, or the recent Qwen3-235B. What they are: SOTA at their size. What they are NOT: o4-mini (as many people have been claiming..)
3 replies · 14 reposts · 118 likes
Pretty happy with how my predictions are holding up. 5/6 was the gold medal threshold this year. OAI's "experimental reasoning LLM" got that exactly, failing only to solve the one hard combinatorics problem, P6. My advice remains: look beyond the medal. Brief thread. 1/
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
6 replies · 31 reposts · 251 likes
Meanwhile, @Kimi_Moonshot has actually cooked with K2. Even without extended reasoning, it is on par with frontier models like Grok-4 on GPQA free-form. Massive congrats to them.
🚨Thought Grok-4 saturated GPQA? Not yet! ⚖️Same questions, when evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have
7 replies · 21 reposts · 220 likes
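For intuition, free-form grading can be sketched as matching the model's answer text against the gold answer instead of comparing option letters. This is a minimal matcher of my own; the actual evaluation presumably uses more robust grading, such as an LLM judge.

```python
import re

def normalize(s):
    """Lowercase and collapse punctuation/whitespace for lenient matching."""
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

def grade_free_form(model_answer, gold_answer):
    """Credit the answer iff the normalized gold string appears in it."""
    return normalize(gold_answer) in normalize(model_answer)

# Hypothetical example answers for illustration:
print(grade_free_form("The compound is ethylene glycol.", "Ethylene glycol"))  # True
print(grade_free_form("Option (B)", "Ethylene glycol"))                        # False
```

Note the second case: a model that merely names an option letter gets no credit, which is exactly the shortcut that free-form evaluation removes relative to MCQ.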
The free-form version mentioned was developed as part of recent work on moving towards generative evals. @ShashwatGoel7 and @AmyPrb will be presenting this work next Friday at the ICML Workshop on Assessing World Models. Hit them up if interested! Check out other cool work
1 reply · 0 reposts · 25 likes
🚨Thought Grok-4 saturated GPQA? Not yet! ⚖️Same questions, when evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have
25 replies · 31 reposts · 264 likes
Very cool result. In hindsight, this shouldn't be too surprising to anyone who has ever taken a multiple-choice exam. E.g., if you have a trigonometry problem and the possible solutions are A: 1, B: 3.7, C: -5, D: pi/2, which would you pick (with no knowledge of the question)?
🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you can often get significant accuracy just from the choices alone. This is true even on recent benchmarks with 10 choices (like MMLU-Pro) and their vision
1 reply · 8 reposts · 31 likes