Nikhil Chandak Profile
Nikhil Chandak

@nikhilchandak29

Followers
588
Following
1K
Media
16
Statuses
92

PhD Student at Max Planck Institute. Past @iiit_hyderabad @VectorInst. Interested in better evals, forecasting, and open-endedness.

Tübingen, Germany
Joined December 2016
@nikhilchandak29
Nikhil Chandak
5 months
🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you can often get significant accuracy just from the choices alone. This is true even on recent benchmarks with 10 choices (like MMLU-Pro) and their vision
3
25
74
@nikhilchandak29
Nikhil Chandak
4 days
Come to the Scaling Environments for Agents (SEA) workshop at NeurIPS today for a preview of our upcoming work, presented by @jonasgeiping, on Open-Ended Forecasting of World Events, where we show how to go beyond prediction markets to scale forecasting training data for LLMs!
@ShashwatGoel7
Shashwat Goel
6 days
🚨 Check out an exclusive preview of our soon-to-be-released project on training LLMs for open-ended forecasting, with RL on synthetic data ;) Tomorrow at the NeurIPS Scaling Environments for Agents (SEA) workshop (12:20-13:20, 15:50-16:50, Upper Level Room 23ABC, @jonasgeiping)
1
2
16
@nikhilchandak29
Nikhil Chandak
5 days
Added @MPI_IS @ELLISInst_Tue to the plot below. And this is only counting A100 or newer (H100/B200) GPUs. Periodic reminder to join us in the small, beautiful town of Tübingen, where we have a good ratio of GPUs per researcher (better than at most academic places) and great PIs!
@FrancoisChauba1
Francois Chaubard
6 days
Last night, @agupta and I hosted a great dinner with 14 professors at #NeurIPS2025 from leading academic labs across the US, and many described compute in academia as "abhorrent". Out of curiosity, I just pulled these stats. This is insane. To do meaningful AI research today you need
1
2
21
@kotekjedi_ml
Alexander Panfilov
30 days
A “Who is Adam?” successor has arrived
16
32
523
@prasannamayil
Prasanna Mayilvahanan
2 months
🚀 New Paper & Benchmark! Introducing MATH-Beyond (MATH-B), a new math reasoning benchmark deliberately constructed for common open-source models (≤8B) to fail at pass@1024! Paper: https://t.co/G0KRqy379q Dataset: https://t.co/MzDfdpGils 🧵1/10
2
11
34
@kotekjedi_ml
Alexander Panfilov
2 months
🚨 New paper! AI control protocols aim to keep powerful untrusted AIs in check, often by scoring their actions via a weaker trusted monitor. But what happens when the ASI jailbreaks its way out? 👇 Check out Misha’s thread to see how control collapses under adaptive pressure.
@MiTerekhov
Mikhail Terekhov
2 months
You are a scheming AI. You find yourself deployed with an AI control protocol, you know the details. You remember the papers. The lab is using a trusted monitor. What do you do next? Our new paper argues—with modern protocols, prompt injections will easily let you win! (1/9)🧵
1
4
15
@nikhilchandak29
Nikhil Chandak
2 months
Thinking of doing a PhD/research visit? Consider applying to/visiting @ELLISInst_Tue @MPI_IS. We have PIs doing cool work, great funding, and most importantly, one of the best academic compute: 50+ GB200s, 250+ H100s, and many more A100 80GBs. Come join us!
16
32
425
@nikhilchandak29
Nikhil Chandak
2 months
even if you get only a single bit of information from RL, it might still be a higher order (i.e., more important) bit
@ShashwatGoel7
Shashwat Goel
2 months
There's been confusion about the importance of RL after @johnschulman2's excellent blog showing it learns surprisingly few bits of information. Here's my blog on what we might be missing: not all bits are created equal. Some bits of information matter more than others. This
0
1
8
@nikhilchandak29
Nikhil Chandak
3 months
cool evals workshop happening in Copenhagen, worth checking out if you are attending @EurIPSConf this December.
@YatongChen
Yatong Chen @ NeurIPS2025
3 months
We (Moritz Hardt, @walesalaudeen96, @joavanschoren) are organizing the Workshop on the Science of Benchmarking & Evaluating AI @EurIPSConf 2025 in Copenhagen! 📢 Call for Posters: https://t.co/jeXRNDexuX 📅 Deadline: Oct 10, 2025 (AoE) 🔗 More Info: https://t.co/zZmkGzGsRg
0
1
5
@nikhilchandak29
Nikhil Chandak
3 months
cute work by @akshitwt @arvindh__a @ShashwatGoel7! This simple plot shows how even small improvements in the per-step accuracy of models can lead to large improvements in task execution!
@arvindh__a
Arvindh Arun
3 months
Why does horizon length grow exponentially as shown in the METR plot? Our new paper investigates this by isolating the execution capabilities of LLMs. Here's why you shouldn't be fooled by slowing progress on typical short-task benchmarks... 🧵
0
2
8
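The compounding effect behind that plot can be sketched numerically (the 50%-reliability horizon formula below is a standard independence simplification, not taken from the paper): if each step succeeds independently with probability p, an n-step task succeeds with probability p^n, so the longest task completed at ≥50% reliability grows roughly as log(0.5)/log(p).

```python
import math

def horizon_at_half(p):
    """Longest task length n (in steps) such that p**n >= 0.5,
    assuming independent per-step success probability p."""
    return math.floor(math.log(0.5) / math.log(p))

# Small per-step gains compound into much longer achievable tasks:
for p in (0.90, 0.95, 0.99, 0.999):
    print(f"per-step accuracy {p:.3f} -> ~{horizon_at_half(p)}-step horizon")
```

Under this toy model, each tenfold reduction in the per-step error rate stretches the achievable horizon by roughly tenfold, which is why modest per-step gains can look like exponential progress on long tasks.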
@arvindh__a
Arvindh Arun
4 months
SEMMA (transliteration of செம்ம - meaning awesome), my first PhD work, is accepted to #EMNLP2025 Main! I also found out today that SEMMA has the (tied) highest average reviewer score in this ARR cycle 💪 📜: https://t.co/mhb6ecROz9
@arvindh__a
Arvindh Arun
7 months
Does text help KG Foundation Models generalize better? 🤔 Yes (and no)! ☯️ Bootstrapped by LLMs improving KG relation labels, we show that textual similarity between relations can act as an invariance - helping generalization across datasets! 🧵👇
4
3
33
@nikhilchandak29
Nikhil Chandak
4 months
We have hit a new high in chart crime.
@akshitwt
Akshit
4 months
no way bro what are they doing 😭😭😭😭
2
0
1
@nikhilchandak29
Nikhil Chandak
4 months
@s_tworkowski any plans to open-source/release grok-3 mini? it is such an amazing model for its price, would be great to have it in the community!
0
0
8
@nikhilchandak29
Nikhil Chandak
4 months
A new open-weights model came out from @OpenAI! On GPQA-Diamond, it shows strong performance but is not better than open-source models like Kimi K2, R1, or the recent Qwen3-235B. What it is — SOTA at its size. What it is NOT — o4-mini (as many people have been claiming..)
3
14
118
@GregHBurnham
Greg Burnham
5 months
Pretty happy with how my predictions are holding up. 5/6 was the gold medal threshold this year. OAI's "experimental reasoning LLM" got that exactly, failing only to solve the one hard combinatorics problem, P6. My advice remains: look beyond the medal. Brief thread. 1/
@alexwei_
Alexander Wei
5 months
1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
6
31
251
@nikhilchandak29
Nikhil Chandak
5 months
Meanwhile, @Kimi_Moonshot has actually cooked with K2. Even without extended reasoning, it is on par with frontier models like Grok-4 on GPQA free-form. Massive congrats to them.
@nikhilchandak29
Nikhil Chandak
5 months
🚨Thought Grok-4 saturated GPQA? Not yet! ⚖️Same questions, when evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have
7
21
220
@nikhilchandak29
Nikhil Chandak
5 months
Dataset available here:
huggingface.co
1
0
20
@nikhilchandak29
Nikhil Chandak
5 months
The free-form version mentioned was developed as part of a recent work on moving towards generative evals. @ShashwatGoel7 and @AmyPrb will be presenting this work next Friday at the ICML Workshop on Assessing World Models. Hit them up if interested! Check out other cool work
1
0
25
@nikhilchandak29
Nikhil Chandak
5 months
🚨Thought Grok-4 saturated GPQA? Not yet! ⚖️Same questions, when evaluated free-form, Grok-4 is no better than its smaller predecessor Grok-3-mini! Even @OpenAI's o4-mini outperforms Grok-4 here. As impressive as Grok-4 is, benchmarks have not saturated just yet. Also, have
25
31
264
@florian_tramer
Florian Tramèr
5 months
Very cool result. In hindsight, this shouldn't be too surprising to anyone who has ever taken a multiple-choice exam. E.g., if you have a trigonometry problem and the possible solutions are A: 1, B: 3.7, C: -5, D: pi/2, which would you pick (with no knowledge of the question)?
@nikhilchandak29
Nikhil Chandak
5 months
🚨 Ever wondered how much you can ace popular MCQ benchmarks without even looking at the questions? 🤯 Turns out, you can often get significant accuracy just from the choices alone. This is true even on recent benchmarks with 10 choices (like MMLU-Pro) and their vision
1
8
31