Minyang Tian ✈️ NeurIPS

@MinyangTian1

Followers 212 · Following 7 · Media 10 · Statuses 30

Physics PhD candidate at UIUC, AI4Science, co-advised by @haopeng_uiuc and Eliu Huerta @argonne and @UChicago

Joined July 2024
@MinyangTian1
Minyang Tian ✈️ NeurIPS
10 days
Can LLMs help physicists break new ground in real frontier research? We introduce CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "Critical Point"): the first benchmark of unpublished, realistic research-level reasoning challenges broadly spanning…
13 replies · 22 reposts · 156 likes
@OfirPress
Ofir Press
10 days
Today we're launching one of the toughest benchmarks I've ever been part of creating: CritPt. It's basically our take on FrontierMath but for physics. The problems are original, research-level challenges written by actual scientists. Good luck to all competitors :)
0 replies · 14 reposts · 128 likes
@MinyangTian1
Minyang Tian ✈️ NeurIPS
10 days
Huge thanks to @lifan__yuan for backing the CritPt eval and helping make it happen!
@lifan__yuan
Lifan Yuan at NeurIPS
10 days
SOTA models achieve ~10% on CritPt (most <3%). But what surprised (tortured😰) me most wasn't the difficulty; it was the rigor of the quality control. Experts spent 40h+ per problem drafting, then ~8 months of iteration to ensure each problem is leakage-resistant yet easily verifiable. Thread below🧵
1 reply · 0 reposts · 4 likes
@ArtificialAnlys
Artificial Analysis
10 days
We’re launching a new frontier physics eval on Artificial Analysis where no model achieves greater than 9%: CritPt (Complex Research using Integrated Thinking - Physics Test). Developed by 60+ researchers from 30+ institutions across the world, including the Argonne National…
35 replies · 128 reposts · 943 likes
@MinyangTian1
Minyang Tian ✈️ NeurIPS
10 days
It’s been an incredible journey co-leading this project with @MiniHui_zhu and working with the amazing CritPt team: https://t.co/886QBq5oq6 Grateful to @haopeng_uiuc for great advice. Special thanks to @ArtificialAnlys for partnering with us! 7/7
0 replies · 0 reposts · 6 likes
@MinyangTian1
Minyang Tian ✈️ NeurIPS
10 days
Want to explore CritPt yourself?
• See the example challenge, Quantum Error Detection, at: https://t.co/VJbv8Q2Lvr
• Full challenge dataset available: https://t.co/gYhm5t9HCz
• Evaluate your own model using our automated pipeline: https://t.co/wy0ajx2b5G
1 reply · 0 reposts · 5 likes
@MinyangTian1
Minyang Tian ✈️ NeurIPS
10 days
Physicists' verdict
• LLMs are far from having the integrated rigor, creativity, and precision required to independently solve open physics research problems.
• Plausible responses with subtle mistakes can be hard to identify and misleading in complex research…
1 reply · 0 reposts · 7 likes
@MinyangTian1
Minyang Tian ✈️ NeurIPS
10 days
Reliability in High-Stakes Research Contexts
We introduce an additional metric, the consistently solved rate, to measure the reliability of LLMs' reasoning on complex and open-ended research problems. To count as 'consistently solved', a problem must be answered correctly in at least 4 of 5 attempts. We…
1 reply · 0 reposts · 8 likes
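The thread doesn't show how the metric is computed, but the 4-of-5 rule above pins it down. A minimal sketch, assuming each problem is attempted n = 5 times and the per-attempt pass/fail outcomes are recorded (the function name and problem ids here are hypothetical, not from the CritPt pipeline):

```python
from typing import Dict, List

def consistently_solved_rate(results: Dict[str, List[bool]], k: int = 4) -> float:
    """Fraction of problems a model 'consistently' solves.

    `results` maps each problem id to the pass/fail outcomes of its
    independent attempts; a problem counts as consistently solved only
    if at least k of the attempts are correct (4 of 5 per the thread).
    """
    solved = sum(1 for attempts in results.values() if sum(attempts) >= k)
    return solved / len(results)

# Toy example: two problems, five attempts each.
runs = {
    "quantum_error_detection": [True, True, True, True, False],    # 4/5 -> consistently solved
    "lattice_gauge_theory":    [True, False, True, False, False],  # 2/5 -> not
}
print(consistently_solved_rate(runs))  # 0.5
```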
@MinyangTian1
Minyang Tian ✈️ NeurIPS
10 days
Measuring Incremental Progress
Hard research problems are often solved step by step. We break each full challenge into a sequence of checkpoints: subtasks or simpler variations of the main challenge. These checkpoints offer a new way to assign partial credit, which measures…
1 reply · 0 reposts · 6 likes
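The tweet doesn't specify how partial credit is weighted, so this is one plausible scheme rather than the CritPt scoring rule: equal weight per checkpoint within a challenge, averaged over challenges (challenge names are hypothetical):

```python
from typing import Dict, List

def partial_credit(checkpoints: Dict[str, List[bool]]) -> float:
    """Average fraction of checkpoints passed per challenge.

    `checkpoints` maps each challenge to the pass/fail outcome of
    each of its checkpoints, in order.
    """
    per_challenge = [sum(cps) / len(cps) for cps in checkpoints.values()]
    return sum(per_challenge) / len(per_challenge)

runs = {
    "challenge_a": [True, True, False],         # 2 of 3 checkpoints passed
    "challenge_b": [True, False, False, False], # 1 of 4 checkpoints passed
}
print(round(partial_credit(runs), 3))  # 0.458
```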
@Zai_org
Z.ai
4 months
Presenting the GLM-4.5 technical report!👇 https://t.co/QY1cAZxdwY This work demonstrates how we developed models that excel at reasoning, coding, and agentic tasks through a unique, multi-stage training paradigm. Key innovations include expert model iteration with…
40 replies · 165 reposts · 986 likes
@Shivamag12
Shivam Agarwal
6 months
Can entropy minimization alone improve LLM performance? And how far can LLMs go without any labeled data? This work answers both: yes, and surprisingly far 🐮 At inference, EM can beat GPT-4o, Claude 3 Opus & Gemini 1.5 Pro on challenging scientific coding w/o any data/model update
12 replies · 65 reposts · 404 likes
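The tweet doesn't spell out the objective, but entropy minimization over a model's own next-token distributions is the standard formulation. A minimal PyTorch sketch of that objective only; the random logits stand in for real model output, the vocabulary size is arbitrary, and the paper's actual method may differ:

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the next-token distributions.

    logits: (seq_len, vocab_size). Lower entropy = more confident.
    """
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

# One hypothetical EM step: make the model more confident on its own
# unlabeled outputs by descending the entropy. No labels are involved.
logits = torch.randn(10, 32000, requires_grad=True)  # stand-in for model output
loss = token_entropy(logits)
loss.backward()  # in practice, gradients would update model parameters
```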
@OfirPress
Ofir Press
8 months
Proud to see companies starting to use our SciCode benchmark to eval LMs. SciCode has some questions taken from Nobel-winning research in physics, so it's super exciting to get more people working on improving these abilities. https://t.co/2C4kWPE8md
@ArtificialAnlys
Artificial Analysis
8 months
Llama 4 independent evals: Maverick (402B total, 17B active) beats Claude 3.7 Sonnet, trails DeepSeek V3 but is more efficient; Scout (109B total, 17B active) is in line with GPT-4o mini, ahead of Mistral Small 3.1. We have independently benchmarked Scout and Maverick as scoring 36 and…
2 replies · 2 reposts · 27 likes
@OfirPress
Ofir Press
10 months
Congrats to o3-mini on setting a new high score on SciCode!! R1 clocks in at an impressive 4.6%, matching Claude 3.5. SciCode is our super-tough programming benchmark written by PhDs in various scientific domains.
10 replies · 3 reposts · 42 likes
@OfirPress
Ofir Press
10 months
SciCode is our super-tough coding benchmark testing the ability of LMs to write code based on research in physics/biology/materials science/... o1 is the SoTA with 7%. To make it easier to use, we're putting it into the Inspect AI format, as a few groups were asking for this.
4 replies · 9 reposts · 50 likes
@OfirPress
Ofir Press
1 year
Thanks everyone for coming to our poster yesterday! Lots of SWE-agent news coming soon. In 30 mins, with @MinyangTian1 et al we'll present SciCode, a super tough scientific coding benchmark that o1 gets 7% on. West Ballroom A-D #5204. Come through :)
1 reply · 3 reposts · 46 likes
@MinyangTian1
Minyang Tian ✈️ NeurIPS
1 year
We're presenting SciCode tomorrow (Thu) at the 11 AM poster session, West Ballroom A-D #5204
0 replies · 0 reposts · 6 likes
@AkariAsai
Akari Asai
1 year
1/ Introducing ᴏᴘᴇɴꜱᴄʜᴏʟᴀʀ: a retrieval-augmented LM to help scientists synthesize knowledge 📚 @uwnlp @allen_ai With open models & 45M-paper datastores, it outperforms proprietary systems & matches human experts. Try out our demo! We also introduce ꜱᴄʜᴏʟᴀʀQᴀʙᴇɴᴄʜ,…
38 replies · 292 reposts · 1K likes
@OfirPress
Ofir Press
1 year
SciCode is our new benchmark, with very tough programming challenges written by real scientists. See https://t.co/2C4kWPE8md for more details.
2 replies · 3 reposts · 34 likes
@OfirPress
Ofir Press
1 year
Announcing Ofir's Gelato Challenge: At NeurIPS 2024, I will buy gelato for the team that has the highest combined score on SWE-bench Lite, AssistantBench, CiteME, and SciCode. Final submission is by December 3, 2024.
4 replies · 8 reposts · 49 likes