Charlie George

@__Charlie_G

Followers 1K · Following 109 · Media 19 · Statuses 112

formerly ML @elicitorg

San Francisco
Joined November 2022
@__Charlie_G
Charlie George
12 days
Today is my last day at Elicit. I joined (then Ought) at 18 as an intern in the summer of 2022. Over the past three years, we’ve done early factored cognition research, spun out into a for-profit, rebuilt the Elicit codebase from the ground up, and shipped many core platform features.
@__Charlie_G
Charlie George
2 months
14/ Thanks to the @thinkymachines team for providing me with early access to Tinker (and specifically to @johnschulman2 for reaching out)! Really excited to see you all pushing for open science.
@__Charlie_G
Charlie George
2 months
12/ Looking ahead (my personal takes)
- Pre-training really underutilises the richness of the internet.
- I expect large capability gains in the next 1-3 years from weaker agents building RL environments for stronger successor models.
- This may be a greater driver of
@__Charlie_G
Charlie George
2 months
11/ Limitations
- Running high-compute RL against such a verifier might break due to adversarial inputs (though test-time scaling of the verifier could mitigate this).
- Unclear if scaling RL on this data would improve performance; if it plateaus, this approach may be limited in
@__Charlie_G
Charlie George
2 months
10/ Thoughts
- Early example of: leveraging unstructured internet data (proofs) + a weak model (GPT-4.1) + RL training on a stronger model (o4-mini) → an even stronger model (RL o4-mini).
- Strong generalisation only occurs with o4-mini, not the weaker open-source models,
@__Charlie_G
Charlie George
2 months
9/ Best-of-N Sampling
- Best-of-4 sampling, with o4-mini w/ RL as the verifier, improves o3's and Gemini 2.5 Pro's performance on USAMO by 32% and 46% respectively.
- For IMO the boosts are 21.4% and 13.2%.
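(For concreteness, a minimal sketch of what verifier-based best-of-N selection could look like; `sample_solution` and `verifier_confidence` are hypothetical stand-ins, not code from this project.)

```python
def best_of_n(problem: str, sample_solution, verifier_confidence,
              n: int = 4) -> str:
    """Generate n candidate solutions and keep the one the verifier
    (e.g. the RL-trained o4-mini grader) is most confident is correct.
    Both callables are illustrative placeholders."""
    candidates = [sample_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: verifier_confidence(problem, sol))
```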
@__Charlie_G
Charlie George
2 months
8/ Generalisation to competition problems
- IMO / USAMO score = model confidence (0 to 100%) scaled between 0 and 7.
- The open-source models saw very minor improvement over the base model and didn't generalise from the training distribution much at all.
- o4-mini w/ RL generalises
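(The scoring mapping as I read it; a sketch, not code from the thread:)

```python
def competition_score(confidence: float) -> float:
    """Scale the grader's confidence (0.0-1.0) onto the 0-7 scale that
    IMO / USAMO problems are scored on."""
    return 7.0 * confidence

# e.g. a grading confidence of 0.5 corresponds to 3.5 of 7 points
```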
@__Charlie_G
Charlie George
2 months
7/ Training
- Training is very stable with OAI's default hyperparameters for o4-mini.
- After tuning the GSPO hyperparameters on Tinker, the open-source models trained stably as well.
- Both models saw their scores on the validation set improve as training progressed.
@__Charlie_G
Charlie George
2 months
6/ Reward Setup
- Task the model with classifying whether the proof is correct, with confidence scores.
- Reward structure: negative mean absolute error (MAE) + a slight penalty for incorrect formatting.
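(A rough sketch of that reward; the exact format check and the penalty size aren't given in the thread, so both are assumptions here:)

```python
import re

def reward(predicted_confidence: float, proof_is_correct: bool,
           response_text: str, format_penalty: float = 0.1) -> float:
    """Reward per tweet 6/: negative absolute error between the grader's
    stated confidence (0-1) and the ground-truth label, minus a small
    penalty for badly formatted output. The regex and the 0.1 penalty
    are illustrative assumptions."""
    target = 1.0 if proof_is_correct else 0.0
    r = -abs(predicted_confidence - target)  # negative MAE, per sample
    if re.search(r"confidence", response_text, re.IGNORECASE) is None:
        r -= format_penalty  # hypothetical formatting check
    return r
```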
@__Charlie_G
Charlie George
2 months
5/ Data Generation
- Start with real mathematical proofs from ProofWiki (2k from the NaturalProofs dataset).
- Use GPT-4.1 to strategically inject subtle mistakes using CoT.
- GPT-4.1 also rates proofs' difficulty (1-5); I trained exclusively on difficulty level 2.
- No IMO-specific
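(A sketch of the corruption step using the OpenAI Python client; the prompt wording is my guess, since the thread only says GPT-4.1 injects subtle mistakes via CoT:)

```python
from openai import OpenAI

client = OpenAI()

# Illustrative prompt; the exact wording used in the project is not given.
CORRUPT_PROMPT = (
    "Reason step by step about where a subtle but fatal error could be "
    "introduced into the following proof, then rewrite the proof with "
    "that error. Output only the corrupted proof."
)

def corrupt_proof(proof: str) -> str:
    """Turn a known-correct ProofWiki proof into a known-incorrect one,
    yielding labelled grading data without expert annotation."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": CORRUPT_PROMPT},
            {"role": "user", "content": proof},
        ],
    )
    return resp.choices[0].message.content
```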
@__Charlie_G
Charlie George
2 months
4/ Here's the process for training o4-mini (on the OpenAI RFT API) and Qwen 3 8B, Qwen 3 30B, and Llama 3.3 70B (on Tinker) to grade solutions to maths competition questions from other frontier models:
@__Charlie_G
Charlie George
2 months
3/ I implemented GSPO from scratch in about 500 lines of code in Tinker. It's very straightforward to do tons of parallel inference and add features like asynchronous evaluations of older checkpoints without blocking the main training loop.
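(The thread doesn't share the code, so here's a minimal sketch of the GSPO objective itself, following the Qwen GSPO paper: a length-normalised, sequence-level importance ratio with PPO-style clipping and group-normalised advantages. The clipping value `eps` is an assumption.)

```python
import torch

def gspo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              rewards: torch.Tensor, lengths: torch.Tensor,
              eps: float = 3e-4) -> torch.Tensor:
    """Sketch of the GSPO surrogate. All inputs are per-sequence over one
    group of rollouts: summed token log-probs under the current and
    behaviour policies, raw rewards, and response lengths."""
    # group-normalised advantages, as in GRPO/GSPO
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # length-normalised *sequence-level* importance ratio s_i
    s = torch.exp((logp_new - logp_old) / lengths)
    # PPO-style clipped surrogate; negate the maximised objective for a loss
    surrogate = torch.minimum(s * adv, torch.clamp(s, 1 - eps, 1 + eps) * adv)
    return -surrogate.mean()
```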
@__Charlie_G
Charlie George
2 months
2/ What's Tinker? Tinker is a new RL API that gives you the flexibility of writing code that runs directly on a GPU cluster, but with the ease of using a Python API. The clever part is that you can define custom torch loss functions, run them from your CPU-only laptop,
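(To illustrate only the shape of that workflow; the client calls below are hypothetical placeholders, not Tinker's documented API:)

```python
import torch
import torch.nn.functional as F

# You write an ordinary torch loss on your laptop...
def my_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

# ...and the service executes forward/backward on its GPU cluster.
# Hypothetical placeholder calls, NOT Tinker's documented API:
# client = rl_service.connect(base_model="Qwen/Qwen3-8B")
# client.forward_backward(batch, loss_fn=my_loss)
# client.optim_step()
```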
@__Charlie_G
Charlie George
2 months
1/ How do you verify complex AI outputs at scale without expert-labelled data? Working with @thinkymachines' new RL API Tinker, I've been expanding on some previous work I shared around using unstructured internet data to train models to grade IMO / USAMO solutions.
@BarrAlexandra
Alexandra Barr
2 months
Big, big paper is out!!!
@OpenAI
OpenAI
2 months
Today we’re introducing GDPval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. https://t.co/uKPPDldVNS
@kliu128
Kevin Liu
3 months
a new eval testing real bugs in research projects! models have a long way to go
@idavidrein
david rein
3 months
Interesting new eval: OPQA (OpenAI-Proof Q&A) uses bugs/problems that took an OAI team >1 day to solve
@__Charlie_G
Charlie George
4 months
9/ Happy to write up a blog post/share code if there’s sufficient interest in this project!