Minyang Tian ✈️ NeurIPS
@MinyangTian1
Followers: 212 · Following: 7 · Media: 10 · Statuses: 30
Physics PhD candidate at UIUC, AI4Science, co-advised by @haopeng_uiuc and Eliu Huerta (@argonne and @UChicago)
Joined July 2024
Can LLMs help physicists break new ground in real frontier research? We introduce CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "Critical Point"): the first benchmark of unpublished, realistic research-level reasoning challenges broadly spanning …
13 replies · 22 reposts · 156 likes
Today we're launching one of the toughest benchmarks I've ever been part of creating: CritPt. It's basically our take on FrontierMath but for physics. The problems are original, research-level challenges written by actual scientists. Good luck to all competitors :)
0 replies · 14 reposts · 128 likes
Huge thanks to @lifan__yuan for backing the CritPt eval and helping make it happen!
SOTA models achieve ~10% on CritPt (most <3%). But what surprised (tortured 😰) me most wasn't the difficulty; it was the rigor of the quality control. Experts spent 40+ hours per problem on drafting, then ~8 months of iteration to make each problem leakage-resistant yet easily verifiable. Thread below 🧵
1 reply · 0 reposts · 4 likes
We're launching a new frontier physics eval on Artificial Analysis where no model achieves greater than 9%: CritPt (Complex Research using Integrated Thinking - Physics Test). Developed by 60+ researchers from 30+ institutions across the world, including the Argonne National …
35 replies · 128 reposts · 943 likes
It’s been an incredible journey co-leading this project with @MiniHui_zhu and working with the amazing CritPt team: https://t.co/886QBq5oq6 Grateful to @haopeng_uiuc for great advice. Special thanks to @ArtificialAnlys for partnering with us! 7/7
0 replies · 0 reposts · 6 likes
To learn more (6/7):
• Website: https://t.co/2gGzgOq4wi
• Paper: https://t.co/Vren4yZD2C
• GitHub repo: https://t.co/wy0ajx2IVe
• Hugging Face: https://t.co/gYhm5tafs7
• Additional evaluations by Artificial Analysis: https://t.co/9SAwjWuzeA
artificialanalysis.ai: Compare AI model performance on the CritPt Benchmark Leaderboard, a benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.
1 reply · 0 reposts · 6 likes
Want to explore CritPt yourself?
• See the example challenge, Quantum Error Detection, at: https://t.co/VJbv8Q2Lvr
• Full challenge dataset available: https://t.co/gYhm5t9HCz
• Evaluate your own model using our automated pipeline: https://t.co/wy0ajx2b5G
1 reply · 0 reposts · 5 likes
Physicists' verdict:
• LLMs are far from having the integrated rigor, creativity, and precision required to independently solve open physics research problems.
• Plausible responses with subtle mistakes can be hard to identify and misleading in complex research …
1 reply · 0 reposts · 7 likes
Reliability in High-stakes Research Contexts: We introduce an additional metric, the consistently solved rate, to measure the reliability of LLMs' reasoning on complex and open-ended research problems. To be 'consistently solved', a model must answer a challenge correctly in at least 4 out of 5 attempts. We …
1 reply · 0 reposts · 8 likes
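To make the metric concrete, here is a minimal sketch of how a consistently-solved rate could be computed from per-attempt correctness records. The data layout, function name, and challenge ids are illustrative assumptions, not part of the actual CritPt pipeline.

```python
from collections import defaultdict

def consistently_solved_rate(results, attempts_required=5, min_correct=4):
    """Fraction of challenges answered correctly in at least
    `min_correct` out of `attempts_required` independent attempts.

    `results` is an iterable of (challenge_id, is_correct) pairs,
    one entry per attempt (an assumed layout, for illustration).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for challenge_id, is_correct in results:
        total[challenge_id] += 1
        correct[challenge_id] += bool(is_correct)

    solved = sum(
        1
        for cid in total
        if total[cid] >= attempts_required and correct[cid] >= min_correct
    )
    return solved / len(total) if total else 0.0

# Example: two challenges, five attempts each; only "qed-1" is consistently solved.
runs = (
    [("qed-1", True)] * 4 + [("qed-1", False)]
    + [("spin-7", True)] * 2 + [("spin-7", False)] * 3
)
print(consistently_solved_rate(runs))  # 0.5
```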
Measuring Incremental Progress: Hard research problems are often solved step by step. We break each full challenge into a sequence of checkpoints: subtasks or simpler variations of the main challenge. These checkpoints offer a new way to assign partial credit, which measures …
1 reply · 0 reposts · 6 likes
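Below is a sketch of one simple way checkpoint-based partial credit could be scored, assuming equal weight per checkpoint; the benchmark's actual rubric may weight checkpoints differently, and all names here are illustrative.

```python
def partial_credit(checkpoint_results):
    """Average per-challenge fraction of checkpoints passed.

    `checkpoint_results` maps a challenge id to a list of booleans,
    one per checkpoint. Equal weighting is assumed for illustration.
    """
    per_challenge = [
        sum(passed) / len(passed)
        for passed in checkpoint_results.values()
        if passed
    ]
    return sum(per_challenge) / len(per_challenge) if per_challenge else 0.0

# Example: 3/4 checkpoints passed on one challenge, 1/2 on another.
print(partial_credit({
    "qed-1": [True, True, True, False],   # 0.75
    "spin-7": [True, False],              # 0.50
}))  # 0.625
```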
Presenting the GLM-4.5 technical report! 👇 https://t.co/QY1cAZxdwY This work demonstrates how we developed models that excel at reasoning, coding, and agentic tasks through a unique, multi-stage training paradigm. Key innovations include expert model iteration with …
40 replies · 165 reposts · 986 likes
Can entropy minimization alone improve LLM performance? And how far can it go without any labeled data? This work answers both: yes, and surprisingly far 🐮 At inference time, EM can beat GPT-4o, Claude 3 Opus & Gemini 1.5 Pro on challenging scientific coding without any data or model updates.
12 replies · 65 reposts · 404 likes
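For context, the quantity being minimized here is the entropy of the model's own next-token distributions. The sketch below only measures the average token-level entropy of a completion with Hugging Face transformers; it illustrates the quantity involved, not the paper's inference-time procedure, and the model name is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def mean_token_entropy(text: str) -> float:
    """Average entropy (in nats) of the model's next-token distributions
    over `text`. Lower entropy means more confident predictions."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[:, :-1, :]         # prediction for each next token
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)  # per-position entropy
    return entropy.mean().item()

print(mean_token_entropy("def solve():\n    return 42"))
```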
Proud to see companies starting to use our SciCode to eval LMs. SciCode has some questions taken from Nobel-winning research in physics, so it's super exciting to get more people working on improving these abilities. https://t.co/2C4kWPE8md
Llama 4 independent evals: Maverick (402B total, 17B active) beats Claude 3.7 Sonnet, trails DeepSeek V3 but is more efficient; Scout (109B total, 17B active) is in line with GPT-4o mini, ahead of Mistral Small 3.1. We have independently benchmarked Scout and Maverick as scoring 36 and …
2 replies · 2 reposts · 27 likes
Congrats to o3-mini on setting a new high score on SciCode!! R1 clocks in at an impressive 4.6%, matching Claude 3.5. SciCode is our super-tough programming benchmark written by PhDs in various scientific domains.
10 replies · 3 reposts · 42 likes
SciCode is our super-tough coding benchmark testing the ability of LMs to write code based on research in physics/biology/materials science/... o1 is the SoTA with 7%. To make it easier to use, we're putting it into the Inspect AI format, as a few groups were asking for this.
4 replies · 9 reposts · 50 likes
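For readers unfamiliar with the format, an Inspect AI task definition generally looks like the sketch below. This is a generic toy task with a made-up sample, not the actual SciCode port or its verification harness.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def toy_scicode_style_task():
    # Toy sample for illustration only; real SciCode problems are multi-step
    # scientific coding challenges with their own test harness.
    dataset = [
        Sample(
            input="Write a Python expression that squares a variable x.",
            target="x**2",
        )
    ]
    return Task(
        dataset=dataset,
        solver=generate(),   # single generation pass
        scorer=includes(),   # checks whether the target string appears in the output
    )
```

A task file like this is typically run from the command line, e.g. "inspect eval toy_task.py --model openai/gpt-4o" (the file and model names here are only examples).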
Thanks everyone for coming to our poster yesterday! Lots of SWE-agent news coming soon. In 30 mins, with @MinyangTian1 et al we'll present SciCode, a super tough scientific coding benchmark that o1 gets 7% on. West Ballroom A-D #5204. Come through :)
1 reply · 3 reposts · 46 likes
SciCode is our new benchmark, with very tough programming challenges written by real scientists. https://t.co/2C4kWPE8md for more details.
2 replies · 3 reposts · 34 likes
Announcing Ofir's Gelato Challenge: At NeurIPS 2024, I will buy gelato for the team that has the highest combined score on SWE-bench Lite, AssistantBench, CiteME, and SciCode. Final submission is by December 3, 2024.
4 replies · 8 reposts · 49 likes