Charlie George

@__Charlie_G

Followers 1K · Following 109 · Media 19 · Statuses 112

formerly ML @elicitorg

San Francisco
Joined November 2022
@__Charlie_G
Charlie George
12 days
Today is my last day at Elicit. I joined (then Ought) at 18 as an intern in the summer of 2022. Over the past three years, we’ve done early factored cognition research, spun out into a for-profit, rebuilt the Elicit codebase from the ground up, and shipped many core platform features.
@__Charlie_G
Charlie George
2 months
14/ Thanks to the @thinkymachines team for providing me with early access to Tinker (and specifically to @johnschulman2 for reaching out)! Really excited to see you all pushing for open science.
@__Charlie_G
Charlie George
2 months
12/ Looking ahead (my personal takes)
- Pre-training really underutilises the richness of the internet.
- I expect large capability gains in the next 1-3 years from weaker agents building RL environments for stronger successor models.
- This may be a greater driver of
@__Charlie_G
Charlie George
2 months
11/ Limitations
- Running high-compute RL against such a verifier might break due to adversarial inputs (though test-time scaling of the verifier could mitigate this).
- Unclear if scaling RL on this data would improve performance; if it plateaus, this approach may be limited in
@__Charlie_G
Charlie George
2 months
10/ Thoughts
- Early example of: leveraging unstructured internet data (proofs) + a weak model (GPT-4.1) + RL training on a stronger model (o4-mini) → an even stronger model (RL o4-mini).
- Strong generalisation only occurs with o4-mini, not the weaker open-source models,
@__Charlie_G
Charlie George
2 months
9/ Best-of-N Sampling
- Best-of-4 sampling, with o4-mini w/ RL as the verifier, improves o3's and Gemini 2.5 Pro's performance on USAMO by 32% and 46% respectively.
- For IMO the boosts are 21.4% and 13.2%.
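(For concreteness, a minimal sketch of what verifier-based best-of-N selection could look like; `sample_solution` and `verifier_confidence` are hypothetical stand-ins, not code from this project.)

```python
def best_of_n(problem: str, sample_solution, verifier_confidence,
              n: int = 4) -> str:
    """Generate n candidate solutions and keep the one the verifier
    (e.g. the RL-trained o4-mini grader) is most confident is correct.
    Both callables are illustrative placeholders."""
    candidates = [sample_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: verifier_confidence(problem, sol))
```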
@__Charlie_G
Charlie George
2 months
8/ Generalisation to competition problems
- IMO / USAMO score = model confidence (0 to 100%) scaled between 0 and 7.
- The open-source models saw very minor improvement over the base model and didn't generalise from the training distribution much at all.
- o4-mini w/ RL generalises
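(The scoring mapping as I read it; a sketch, not code from the thread:)

```python
def competition_score(confidence: float) -> float:
    """Scale the grader's confidence (0.0-1.0) onto the 0-7 scale that
    IMO / USAMO problems are scored on."""
    return 7.0 * confidence

# e.g. a grading confidence of 0.5 corresponds to 3.5 of 7 points
```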
@__Charlie_G
Charlie George
2 months
7/ Training
- Training is very stable with OAI's default hyperparameters for o4-mini.
- After tuning the GSPO hyperparameters on Tinker, the open-source models trained stably as well.
- Both models saw their scores on the validation set improve as training progressed.
@__Charlie_G
Charlie George
2 months
6/ Reward Setup
- Task the model with classifying whether the proof is correct, with confidence scores.
- Reward structure: negative mean absolute error (MAE) + a slight penalty for incorrect formatting.
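(A rough sketch of that reward; the exact format check and the penalty size aren't given in the thread, so both are assumptions here:)

```python
import re

def reward(predicted_confidence: float, proof_is_correct: bool,
           response_text: str, format_penalty: float = 0.1) -> float:
    """Reward per tweet 6/: negative absolute error between the grader's
    stated confidence (0-1) and the ground-truth label, minus a small
    penalty for badly formatted output. The regex and the 0.1 penalty
    are illustrative assumptions."""
    target = 1.0 if proof_is_correct else 0.0
    r = -abs(predicted_confidence - target)  # negative MAE, per sample
    if re.search(r"confidence", response_text, re.IGNORECASE) is None:
        r -= format_penalty  # hypothetical formatting check
    return r
```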
@__Charlie_G
Charlie George
2 months
5/ Data Generation
- Start with real mathematical proofs from ProofWiki (2k from the NaturalProofs dataset).
- Use GPT-4.1 to strategically inject subtle mistakes using CoT.
- GPT-4.1 also rates proofs' difficulty (1-5); I trained exclusively on difficulty level 2.
- No IMO-specific
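(A sketch of the corruption step using the OpenAI Python client; the prompt wording is my guess, since the thread only says GPT-4.1 injects subtle mistakes via CoT:)

```python
from openai import OpenAI

client = OpenAI()

# Illustrative prompt; the exact wording used in the project is not given.
CORRUPT_PROMPT = (
    "Reason step by step about where a subtle but fatal error could be "
    "introduced into the following proof, then rewrite the proof with "
    "that error. Output only the corrupted proof."
)

def corrupt_proof(proof: str) -> str:
    """Turn a known-correct ProofWiki proof into a known-incorrect one,
    yielding labelled grading data without expert annotation."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": CORRUPT_PROMPT},
            {"role": "user", "content": proof},
        ],
    )
    return resp.choices[0].message.content
```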
@__Charlie_G
Charlie George
2 months
4/ Here's the process for training o4-mini (on the OpenAI RFT API) and Qwen 3 8B, Qwen 3 30B, and Llama 3.3 70B (on Tinker) to grade solutions to maths competition questions from other frontier models:
@__Charlie_G
Charlie George
2 months
3/ I implemented GSPO from scratch in about 500 lines of code in Tinker. It's very straightforward to do tons of parallel inference and add features like asynchronous evaluations of older checkpoints without blocking the main training loop.
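(The thread doesn't share the code, so here's a minimal sketch of the GSPO objective itself, following the Qwen GSPO paper: a length-normalised, sequence-level importance ratio with PPO-style clipping and group-normalised advantages. The clipping value `eps` is an assumption.)

```python
import torch

def gspo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              rewards: torch.Tensor, lengths: torch.Tensor,
              eps: float = 3e-4) -> torch.Tensor:
    """Sketch of the GSPO surrogate. All inputs are per-sequence over one
    group of rollouts: summed token log-probs under the current and
    behaviour policies, raw rewards, and response lengths."""
    # group-normalised advantages, as in GRPO/GSPO
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # length-normalised *sequence-level* importance ratio s_i
    s = torch.exp((logp_new - logp_old) / lengths)
    # PPO-style clipped surrogate; negate the maximised objective for a loss
    surrogate = torch.minimum(s * adv, torch.clamp(s, 1 - eps, 1 + eps) * adv)
    return -surrogate.mean()
```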
@__Charlie_G
Charlie George
2 months
2/ What's Tinker? Tinker is a new RL API that gives you the flexibility of writing code that runs directly on a GPU cluster, but with the ease of using a Python API. The clever part is that you can define custom torch loss functions, run them from your CPU-only laptop,
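(To illustrate only the shape of that workflow; the client calls below are hypothetical placeholders, not Tinker's documented API:)

```python
import torch
import torch.nn.functional as F

# You write an ordinary torch loss on your laptop...
def my_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

# ...and the service executes forward/backward on its GPU cluster.
# Hypothetical placeholder calls, NOT Tinker's documented API:
# client = rl_service.connect(base_model="Qwen/Qwen3-8B")
# client.forward_backward(batch, loss_fn=my_loss)
# client.optim_step()
```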
@__Charlie_G
Charlie George
2 months
1/ How do you verify complex AI outputs at scale without expert-labelled data? Working with @thinkymachines' new RL API Tinker, I've been expanding on some previous work I shared around using unstructured internet data to train models to grade IMO / USAMO solutions.
@BarrAlexandra
Alexandra Barr
2 months
Big, big paper is out!!!
@OpenAI
OpenAI
2 months
Today we’re introducing GDPval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. https://t.co/uKPPDldVNS
@kliu128
Kevin Liu
3 months
a new eval testing real bugs in research projects! models have a long way to go
@idavidrein
david rein
3 months
Interesting new eval: OPQA (OpenAI-Proof Q&A) uses bugs/problems that took an OAI team >1 day to solve
@__Charlie_G
Charlie George
4 months
9/ Happy to write up a blog post/share code if there’s sufficient interest in this project!