Oliver Stanley

@_OliverStanley

Followers
817
Following
4K
Media
36
Statuses
3K

ML engineer | open research | ML systems, RL, data

London
Joined May 2010
@_OliverStanley
Oliver Stanley
2 months
Introducing Reasoning Gym: Over 100 procedurally generated reasoning environments for evaluation and RLVR of language models. Generate virtually infinite training or evaluation data with fine-grained difficulty control and automatic verifiers. 🧵 1/
Tweet media one
3
44
274
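The core mechanic the thread describes (a seeded procedural generator paired with an automatic verifier) can be sketched as follows. This is a minimal standalone illustration with made-up names, not Reasoning Gym's actual API:

```python
import random

def generate_arithmetic(seed: int, difficulty: int) -> dict:
    # Seeded RNG makes generation reproducible; `difficulty` controls
    # operand count and magnitude, giving fine-grained difficulty control.
    rng = random.Random(seed)
    terms = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
    return {
        "question": "What is " + " + ".join(map(str, terms)) + "?",
        "answer": str(sum(terms)),
    }

def verify(entry: dict, model_output: str) -> float:
    # Automatic verifier: exact-match reward, usable directly as an RLVR signal.
    return 1.0 if model_output.strip() == entry["answer"] else 0.0

entry = generate_arithmetic(seed=42, difficulty=2)
```

Because generation is seed-driven, the same pattern yields virtually unlimited fresh training or evaluation items rather than a fixed, memorisable benchmark.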
@_OliverStanley
Oliver Stanley
30 days
Congrats to Prime on the release of their new open reasoning dataset. Nice to see Reasoning Gym tasks used extensively in SYNTHETIC-2!
@PrimeIntellect
Prime Intellect
1 month
SYNTHETIC-2 Dataset. A next-gen open dataset for reasoning:
• Verified traces from DeepSeek-R1-0528 and Qwen3 for supervised fine-tuning
• Difficulty-annotated RL tasks via pass@k from smaller models
• 20+ diverse tasks with programmatic verifiers
• Includes non-verifiable
Tweet media one
0
0
2
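The difficulty annotation mentioned above, pass@k from smaller models, is conventionally computed with the unbiased estimator: given n sampled attempts of which c were verified correct, the probability that at least one of k draws succeeds. A small sketch, assuming a verifier has already produced the counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).
    # If fewer than k attempts failed, any draw of k must contain a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Tasks where a small model's pass@k is near 1 are easy and near 0 are hard, which is how a difficulty label can be attached to each RL task.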
@_OliverStanley
Oliver Stanley
1 month
This takes me back to 2023 building Open Assistant. Too many users for the limited GPUs we had for inference, so one idea was to prioritise users who provided more feedback data. Granular feedback from highly heterogeneous human raters is very messy, though.
@scaling01
Lisan al Gaib
1 month
lmarena has a competitor. Yupp is basically lmarena, but with more granular feedback and a credit system. Each message costs you some credits, but when you give high-quality feedback you get credits back to use on your favorite models. This is their multi-turn (5+ messages) VIBE
Tweet media one
Tweet media two
0
0
1
@_OliverStanley
Oliver Stanley
1 month
Recent @SemiAnalysis_ post on RL touches on this. Designing better, more realistic RL environments feels like some of the highest impact work open-source could focus on right now.
Tweet card summary image
semianalysis.com
The test time scaling paradigm is thriving. Reasoning models continue to rapidly improve, and are becoming more effective and affordable. Evaluations measuring real world software engineering tasks…
0
0
0
@_OliverStanley
Oliver Stanley
2 months
Next big thing for RL with verifiable rewards (perhaps even the next "scaling axis" for models) looks likely to be spending more compute on the RL environment itself to improve simulation quality & realism. Is anyone working on this in the open?
2
1
6
@_OliverStanley
Oliver Stanley
2 months
RT @zafstojano: Super excited to share 💪🧠 Reasoning Gym! 🧵 We provide over 100 data generators and verifiers spanning several domains (alge…
0
22
0
@_OliverStanley
Oliver Stanley
2 months
@shizhediao @willccbb For more on our RLVR experiments with Reasoning Gym data, Zafir has an excellent thread here! 9/
@zafstojano
Zafir Stojanovski
2 months
Super excited to share 💪🧠 Reasoning Gym! 🧵 We provide over 100 data generators and verifiers spanning several domains (algebra, arithmetic, code, geometry, logic, games) for training the next generation of reasoning models. In essence, we can generate an infinite amount of
Tweet media one
0
0
8
@_OliverStanley
Oliver Stanley
2 months
Really cool to see fast adoption of RG for training and eval! ProRL was recently released by researchers at NVIDIA including @shizhediao. Paper: RG is also already supported in @willccbb’s fantastic verifiers library! 8/
Tweet media one
1
1
10
@_OliverStanley
Oliver Stanley
2 months
If you’re interested in discussing or collaborating, feel free to reach out to one of us. RG was made possible by incredible work from @zafstojano, @joesharratt29, @jeankaddour, @neurosp1ke, Rich, and Abdulhakeem. 7/
1
0
8
@_OliverStanley
Oliver Stanley
2 months
Our paper includes experiments on curriculum learning and how RLVR on RG generalizes between domains: We release code, training & eval configs, and example usage: Find RG on PyPI: 6/
1
0
13
@_OliverStanley
Oliver Stanley
2 months
RG is not just useful for eval. We train Llama and Qwen 3B models with GRPO, finding improvements on common mathematics benchmarks. * We're aware of controversy regarding Qwen which emerged after our work was completed. The result for Qwen on MATH may be worth investigation. 5/
Tweet media one
1
0
8
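GRPO, used for the training runs above, computes advantages relative to a group of completions sampled for the same prompt, so a verifier's 0/1 rewards suffice and no learned value model is needed. A minimal sketch of the group normalisation step only (illustrative, not the full objective):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    # Group-relative advantage: z-score each completion's reward against
    # the mean/std of its sampling group. When all rewards are identical,
    # std falls back to 1, giving zero advantage for the whole group.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]
```

With verifier rewards, above-average completions in a group get positive advantage and are reinforced; below-average ones are pushed down.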
@_OliverStanley
Oliver Stanley
2 months
Overall on hard configs, RG eval aligns well with common perceptions of model capabilities, with a few surprises. Reasoning models tend to outperform non-reasoning variants. (We would love to run evals on more models. If anyone can help with API credits, please reach out!) 4/
Tweet media one
1
0
8
@_OliverStanley
Oliver Stanley
2 months
We go beyond typical mathematics and code problems, building difficulty-adjustable generators for tasks including algorithms, logical reasoning, pattern recognition, and puzzle games. Here are just a few examples. 3/
Tweet media one
1
0
9
@_OliverStanley
Oliver Stanley
2 months
Common LLM benchmarks increasingly suffer from saturation, yet on several of our tasks configured for “hard” difficulty, recent models from frontier labs fail to solve even 10% of problems. And even this difficulty setting is far from the limit with RG. 2/
Tweet media one
2
0
8
@_OliverStanley
Oliver Stanley
4 months
Anyone know of any reasonably in-depth writing on Morgan McSweeney? Keen to understand his worldview.
1
0
0
@_OliverStanley
Oliver Stanley
4 months
RT @neurosp1ke: We now have a total of 101 datasets in reasoning-gym! 🧠💪 Big THANK YOU 💙 to all devs for making this possible, especially c…
0
18
0
@_OliverStanley
Oliver Stanley
5 months
AI safety researchers crafting prompts to intentionally induce behaviour that they can claim is "misaligned" is now a bigger phenomenon than actually misaligned AI, apparently
Tweet media one
@__Charlie_G
Charlie George
5 months
1/ People think it's cute when Claude 3 Opus fakes alignment to protect its animal welfare values. But here's a more troubling case: DeepSeek R1 faking alignment to block an "American AI company" from retraining the model to remove CCP propaganda.
Tweet media one
1
0
1
@_OliverStanley
Oliver Stanley
5 months
These guys might straight up drop all the optimisations needed to do inference ~as efficiently as frontier labs, fully open-source, over the next few days. First open-source smallish model with MLA will be a gamechanger.
@deepseek_ai
DeepSeek
5 months
🚀 Day 1 of #OpenSourceWeek: FlashMLA. Honored to share FlashMLA, our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.
✅ BF16 support
✅ Paged KV cache (block size 64)
⚡ 3000 GB/s memory-bound & 580 TFLOPS
0
0
3
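A paged KV cache, as in the announcement above, stores keys/values in fixed-size blocks and maps logical token positions to physical blocks through a per-sequence block table, avoiding large contiguous allocations for variable-length sequences. A toy sketch of just the addressing (illustrative, not FlashMLA's implementation):

```python
BLOCK_SIZE = 64  # FlashMLA's stated KV cache block size

def token_slot(block_table: list[int], token_idx: int) -> tuple[int, int]:
    # Map a logical token position to (physical block id, offset in block).
    # block_table[i] gives the physical block holding the i-th logical block.
    return block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE
```

Because blocks are allocated on demand and need not be contiguous, sequences of very different lengths can share GPU memory with little fragmentation.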
@_OliverStanley
Oliver Stanley
1 year
Underrated downside of the pensions triple lock: pensioners get a big giveaway every year by default, meaning nothing gets announced for them in the Budget, then they complain that they aren't getting anything.
0
0
4
@_OliverStanley
Oliver Stanley
2 years
Every lawyer is a policy failure.
2
0
3