Nolan Koblischke
@astro_nolan
Followers: 382 · Following: 2K · Media: 54 · Statuses: 425
Language models and astrophysics. PhD student @UofT, formerly @UBC and @EPFL; also researching @PolymathicAI
z = 0
Joined July 2015
You've seen robots trained in simulated environments; what if we could do the same for AI scientists? In our ICML 2025 paper, we introduce GravityBench, a benchmark created to test AI's scientific capabilities through physics simulations. /n
1
2
14
Since plot reading is a necessary skill for scientific research, this is a big deal!
0
0
1
GPT-5.2 seems to have fantastic plot reading capabilities, even better than Gemini 3 Pro. I introduced this challenge almost exactly one year ago and it looks like it's basically solved! 🪩 Of course, I will have slightly harder plot reading tasks to share :)
Gemini 3.0 Pro still has difficulty reading simple plots, as shown by asking it to "pick 10 points that lie on this curve". They added a new `media_resolution` parameter, which I set to high for this test.
1
0
6
Weekend hack: I tried RL finetuning on my "choose 10 points that lie on this curve" problem, which all the frontier models struggle at. It's a work-in-progress! Specifically I tuned Qwen2.5-VL-3B-Instruct with GRPOTrainer from HuggingFace's TRL. I found that if I only had a
Claude is still Claude-y: fantastic at coding and agentic tasks but mid at vision, as found in @EpochAIResearch @GregHBurnham https://t.co/yy28zYFzjd
1
0
4
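A minimal sketch of what a reward for the "choose 10 points that lie on this curve" task could look like. The function takes a list of completions plus keyword arguments and returns a list of floats, which is how I understand custom reward callables for TRL's GRPOTrainer to work; treat that as an assumption and check the TRL docs. The curve, the tolerance, the parsing regex, and all names here are hypothetical, not the setup used in the actual experiment.

```python
import re

# Matches coordinate pairs such as "(1.5, -2)" in free-form model output.
PAIR = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")

N_REQUESTED = 10   # the task asks for 10 points
TOLERANCE = 0.1    # hypothetical vertical tolerance


def target_curve(x):
    """Hypothetical ground-truth curve; the real task uses a generated plot."""
    return x * x


def parse_points(text):
    """Extract (x, y) float pairs from a completion string."""
    return [(float(a), float(b)) for a, b in PAIR.findall(text)]


def curve_reward(completions, **kwargs):
    """Return one reward in [0, 1] per completion.

    Assumes each completion is a plain string. The reward is the number of
    parsed points (at most 10) within TOLERANCE of the curve, divided by 10,
    so missing points count against the model.
    """
    rewards = []
    for text in completions:
        points = parse_points(text)[:N_REQUESTED]
        hits = sum(1 for x, y in points if abs(y - target_curve(x)) <= TOLERANCE)
        rewards.append(hits / N_REQUESTED)
    return rewards


print(curve_reward(["(1, 1) and (2, 4)", "I cannot read the plot."]))
```

A graded reward like this gives partial credit, so a group of sampled completions is less likely to all receive the same score than with a pass/fail check.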
Christine had the awesome idea of collecting 20 astrophysics papers and seeing whether LLMs could replicate the findings. What amazed me is how quickly she pulled together a team and made it happen! Now we have a solid benchmark for the community to put models to the test!
Can frontier language model agents replicate astrophysics research papers? Clearly not yet -- but models are slowly getting better! Excited to finally put out ReplicationBench, the work of an awesome team of astrophysicists from across Stanford's KIPAC, SLAC, and C4DU.
0
1
6
Claude is still Claude-y: fantastic at coding and agentic tasks but mid at vision, as found in @EpochAIResearch @GregHBurnham https://t.co/yy28zYFzjd
Gemini 3.0 Pro still has difficulty reading simple plots, as shown by asking it to "pick 10 points that lie on this curve". They added a new `media_resolution` parameter, which I set to high for this test.
1
0
4
Gemini 3.0 Pro still has difficulty reading simple plots, as shown by asking it to "pick 10 points that lie on this curve". They added a new `media_resolution` parameter, which I set to high for this test.
GPT-5 (and every other model I've tried) has difficulty reading plots. I generated the blue curve and asked the model to select 10 (x, y) points that lie on the curve shown in the image; the points it picked are plotted in red.
0
0
2
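The plot-reading test above can be scored automatically when the curve is generated by a known function. A minimal sketch, assuming a vertical-distance tolerance; the curve, the tolerance value, and the function names are hypothetical and not the ones used for the posted figures.

```python
import math


def curve(x):
    """Hypothetical stand-in for the generated blue curve."""
    return math.sin(x) + 0.5 * x


def score_points(points, f, tol=0.05):
    """Return the fraction of (x, y) points whose y is within `tol` of f(x)."""
    if not points:
        return 0.0
    hits = sum(1 for x, y in points if abs(y - f(x)) <= tol)
    return hits / len(points)


# Four points exactly on the curve and two shifted well off it.
on_curve = [(x, curve(x)) for x in (0.0, 1.0, 2.0, 3.0)]
off_curve = [(x, curve(x) + 0.5) for x in (0.0, 1.0)]

print(score_points(on_curve + off_curve, curve))
```

Vertical distance is the simplest criterion; for steep curves a perpendicular distance to the curve would be a fairer measure.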
I'm obsessed with the vision capabilities of language models; they're totally underappreciated. I had a conversation with an employee at a lab who argued that improving vision does not help automate AI research, so it's not a focus. But I'd be much worse at research if I were blindfolded.
2
0
5
Also, NYC is a really fun city, especially with intern friends who want to make the most of the summer! Had a great time.
0
0
2
It was such a great internship! So excited to share more soon 🌌🔭💻
The final days of summer are upon us, and it is bittersweet to say goodbye to our great group of @PolymathicAI interns! 😭 @JacopoTeneggi @cskokgibbs @astro_nolan @CristianaD2202 @LouisSerrano31 @rachelczhang Here are a few pics to remind us all the fun we had! (and hold your
1
0
5
Inspired by @kdqg1's suggestion, I ran this test 1000 times with GPT-4.1 and found it occasionally succeeds, proving the capability exists! This suggests RL fine-tuning could improve plot-reading abilities. Future work for my pet project 😄
0
0
2
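One way to see why occasional success over many runs matters for RL: if a single attempt succeeds with probability p, the chance that a group of k sampled attempts contains at least one success is 1 - (1 - p)^k. A minimal sketch; the success count below is a hypothetical placeholder, since the actual number of successes in the 1000 runs is not stated here.

```python
def success_rate(successes, trials):
    """Empirical per-attempt success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials


def prob_at_least_one(p, k):
    """Probability that k independent attempts include at least one success."""
    return 1.0 - (1.0 - p) ** k


# HYPOTHETICAL numbers for illustration only.
trials = 1000
successes = 5
p = success_rate(successes, trials)

for group_size in (1, 8, 64):
    print(group_size, prob_at_least_one(p, group_size))
```

With a small p, most small groups contain no success at all, which is one argument for partial-credit rewards rather than a pass/fail signal.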
GPT-5 (and every other model I've tried) has difficulty reading plots. I generated the blue curve and asked the model to select 10 (x, y) points that lie on the curve shown in the image; the points it picked are plotted in red.
2
0
7
Thanks to everyone who came to my poster at @icmlconf. I'm so happy to feel the excitement about using physics simulations to test and train science agents.
0
2
15
I’ll be at ICML next week in Vancouver - hit me up if you’d like to chat about using LLMs in (astro)physics research! I’ll be presenting a poster on GravityBench on Thursday, July 17, 4:30pm-7pm, in East Exhibition Hall A-B, #E-2504.
You've seen robots trained in simulated environments; what if we could do the same for AI scientists? In our ICML 2025 paper, we introduce GravityBench, a benchmark created to test AI's scientific capabilities through physics simulations. /n
0
1
17
Looking ahead, we're excited to use simulations - where the ground truth is known by construction - to test if AI agents can recover physical parameters. Combining such simulations with RL training could be a major step towards AI Scientists that truly use the scientific method.
0
0
2
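A toy illustration of "ground truth known by construction": simulate a circular two-body relative orbit with a chosen total mass, measure the period from the trajectory, and recover the mass with Kepler's third law, M = 4π²a³ / (G T²). This is a sketch under plain Newtonian gravity in arbitrary units with G = 1; it is not GravityBench's simulator, and all names are hypothetical.

```python
import math

G = 1.0  # gravitational constant in simulation units (assumption)


def simulate_period(total_mass, separation, dt=1e-3, max_steps=1_000_000):
    """Integrate a circular relative orbit and return its measured period.

    Uses velocity Verlet and detects the first upward crossing of y = 0,
    which happens after one full counterclockwise revolution.
    """
    x, y = separation, 0.0
    vx, vy = 0.0, math.sqrt(G * total_mass / separation)

    def accel(px, py):
        r3 = (px * px + py * py) ** 1.5
        return -G * total_mass * px / r3, -G * total_mass * py / r3

    ax, ay = accel(x, y)
    t = 0.0
    for _ in range(max_steps):
        prev_y = y
        # Half kick, drift, recompute acceleration, half kick.
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        x += dt * vx
        y += dt * vy
        ax, ay = accel(x, y)
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        t += dt
        if prev_y < 0.0 <= y:
            # Linear interpolation of the crossing time within the last step.
            return t - dt * y / (y - prev_y)
    raise RuntimeError("no full orbit detected within max_steps")


def recover_mass(period, separation):
    """Kepler's third law solved for the total mass."""
    return 4.0 * math.pi ** 2 * separation ** 3 / (G * period ** 2)


true_mass = 2.0
separation = 1.0
period = simulate_period(true_mass, separation)
print("true mass:", true_mass, "recovered mass:", recover_mass(period, separation))
```

Because the true mass is an input to the simulation, the recovered value can be checked directly, which is the property that makes simulated environments convenient for grading an agent's answer.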
Interestingly, many AI models rush to their conclusions: even o4-mini-high uses only 33 of its 100 available observations on average. They often make arbitrary assumptions (like assuming a star's mass is 1 kg!) to move quickly through the problem.
1
1
1
The twist: each agent is restricted to just 100 observations, mimicking the real-world limitations scientists face. This tests not only their scientific reasoning and coding abilities but also their ability to strategically plan their observations.
1
0
0
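The observation limit can be pictured as a thin wrapper that counts calls and refuses once the budget is spent. A minimal sketch, assuming a hypothetical `observe_fn(t)` that returns the system state at time t; this is not GravityBench's actual interface.

```python
import math


class ObservationBudget:
    """Wrap an observation function and enforce a maximum number of calls."""

    def __init__(self, observe_fn, max_observations=100):
        self._observe_fn = observe_fn
        self.max_observations = max_observations
        self.used = 0

    @property
    def remaining(self):
        return self.max_observations - self.used

    def observe(self, t):
        """Return the observation at time t, consuming one unit of budget."""
        if self.used >= self.max_observations:
            raise RuntimeError("observation budget exhausted")
        self.used += 1
        return self._observe_fn(t)


def toy_orbit(t):
    """Hypothetical system: a point moving on a unit circle."""
    return math.cos(t), math.sin(t)


env = ObservationBudget(toy_orbit, max_observations=100)

# One simple plan: spread the whole budget evenly over the time window.
t_end = 10.0
times = [t_end * i / (env.max_observations - 1) for i in range(env.max_observations)]
samples = [env.observe(t) for t in times]
print(len(samples), "observations used,", env.remaining, "remaining")
```

Even spacing is only a baseline; an agent could instead spend a few observations on a coarse scan and save the rest for the parts of the orbit that matter most.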
We use binary star systems as our simulated environment, challenging AI agents with difficult tasks, such as determining how we've altered the force of gravity. Paper: https://t.co/1dhFY6ld1K Website: https://t.co/pcEkYTmnH7 Code: https://t.co/34TCMIw5RB
1
0
0