Charlie George
@__Charlie_G
1K Followers · 109 Following · 19 Media · 112 Statuses
formerly ML @elicitorg
San Francisco
Joined November 2022
Today is my last day at Elicit. I joined what was then Ought at 18 as an intern in the summer of 2022. Over the past three years, we’ve done early factored cognition research, spun out into a for-profit, rebuilt the Elicit codebase from the ground up, and shipped many core platform features.
14/ Thanks to the @thinkymachines team for providing me early access to Tinker (and specifically for @johnschulman2 reaching out)! Really excited to see you all pushing for open science.
13/ Link to code / blog. Full blog post: https://t.co/OzUJ7S2to0 Code: github.com/CG80499/IMO_grader
12/ Looking ahead (my personal takes) - Pre-training really underutilises the richness of the internet. - I expect large capability gains in the next 1-3 years from weaker agents building RL environments for stronger successor models. - This may be a greater driver of
11/ Limitations - Running high-compute RL against such a verifier might break due to adversarial inputs (though test-time scaling of the verifier could mitigate this). - Unclear if scaling RL on this data would improve performance—if it plateaus, this approach may be limited in
10/ Thoughts - Early example of: leveraging unstructured internet data (proofs) + a weak model (GPT-4.1) + RL training on a stronger model (o4-mini) → results in an even stronger model (RL o4-mini). - Strong generalisation only occurs with o4-mini, not the weaker open-source models,
9/ Best-of-N Sampling - Best-of-4 sampling with RL-trained o4-mini as the verifier improves o3 and Gemini 2.5 Pro's performance on USAMO by 32% and 46% respectively (sketch below). - For IMO the boosts are 21.4% and 13.2%.
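A minimal sketch of what that best-of-4 selection looks like; `solve` and `grade` are hypothetical stand-ins for a call to the solver (o3 / Gemini 2.5 Pro) and to the RL-trained o4-mini grader, not real APIs:

```python
from typing import Callable

def best_of_n(
    problem: str,
    solve: Callable[[str], str],         # hypothetical: one call to o3 / Gemini 2.5 Pro
    grade: Callable[[str, str], float],  # hypothetical: RL o4-mini confidence in [0, 100]
    n: int = 4,
) -> str:
    """Sample n candidate solutions and keep the one the verifier trusts most."""
    candidates = [solve(problem) for _ in range(n)]
    return max(candidates, key=lambda c: grade(problem, c))
```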
8/ Generalisation to competition problems - IMO / USAMO score = model confidence (0 to 100%) scaled onto the 0-7 point range. - The open-source models saw only minor improvement over the base model and barely generalised beyond the training distribution. - o4-mini w/ RL generalises
7/ Training - Training is very stable with OAI's default hyperparameters for o4-mini. - After tuning the GSPO hyperparameters on Tinker, the open-source models trained stably as well. - Both models saw their scores on the validation set improve as training progressed.
6/ Reward Setup - Task the model with classifying whether the proof is correct, together with a confidence score. - Reward structure: negative mean absolute error (MAE) plus a slight penalty for incorrect formatting (sketched below).
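A sketch of that reward, assuming a 0/1 correctness label; the penalty size is illustrative since the thread only says "slight":

```python
def reward(pred_confidence: float, proof_is_correct: bool, well_formatted: bool) -> float:
    """Negative MAE between the predicted confidence and the 0/1 label, minus a formatting penalty."""
    target = 1.0 if proof_is_correct else 0.0
    mae = abs(pred_confidence / 100.0 - target)
    penalty = 0.0 if well_formatted else 0.1  # assumed size; the thread only says "slight"
    return -mae - penalty
```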
5/ Data Generation - Start with real mathematical proofs from ProofWiki (2k from the NaturalProofs dataset). - Use GPT-4.1 with CoT to strategically inject subtle mistakes (sketch below). - GPT-4.1 also rates each proof's difficulty (1-5); I trained exclusively on difficulty level 2. - No IMO-specific
4/ Here's the process for training o4-mini on the OpenAI RFT API and Qwen 3 8B, Qwen 3 30B and Llama 3.3 70B on Tinker to grade solutions to maths competition questions from other frontier models:
3/ I implemented GSPO from scratch in about 500 lines of code in Tinker. It's very straightforward to do tons of parallel inference and add features like asynchronous evaluations of older checkpoints without blocking the main training loop.
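For readers who haven't seen GSPO: a minimal sketch of the objective in plain torch, not the actual 500-line implementation. It uses sequence-level, length-normalised importance ratios with PPO-style clipping; the clip range here is an illustrative placeholder:

```python
import torch

def gspo_loss(
    new_logps: torch.Tensor,   # summed token log-probs per sequence, current policy
    old_logps: torch.Tensor,   # same, under the policy that sampled the group
    seq_lens: torch.Tensor,    # token count per sequence
    advantages: torch.Tensor,  # group-normalised rewards, e.g. (r - mean) / std
    clip_eps: float = 3e-4,    # illustrative; treat as a tunable hyperparameter
) -> torch.Tensor:
    # Length-normalised (geometric-mean) sequence importance ratio.
    ratios = torch.exp((new_logps - old_logps) / seq_lens)
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximise the clipped surrogate, i.e. minimise its negation.
    return -torch.min(unclipped, clipped).mean()
```

A loss like this is exactly the kind of custom torch function that gets handed off to the cluster, as the next tweet describes.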
2/ What's Tinker? Tinker is a new RL API that gives you the flexibility of writing code that runs directly on a GPU cluster, but with the ease of a Python API. The clever part is that you can define custom torch loss functions and run them from your CPU-only laptop,
1/ How do you verify complex AI outputs at scale without expert-labelled data? Working with @thinkymachines' new RL API Tinker, I've been expanding on some previous work I shared around using unstructured internet data to train models to grade IMO / USAMO solutions.
Big, big paper is out!!!
Today we’re introducing GDPval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. https://t.co/uKPPDldVNS
10/ The code and data can now be found here! https://t.co/F9CbncnCWg Code: github.com/CG80499/IMO_grader
9/ Happy to write up a blog post/share code if there’s sufficient interest in this project!