Oliver Stanley

@_OliverStanley

Followers
817
Following
4K
Media
36
Statuses
3K

ML engineer | open research | ML systems, RL, data

London
Joined May 2010
@_OliverStanley
Oliver Stanley
2 months
Introducing Reasoning Gym: Over 100 procedurally generated reasoning environments for evaluation and RLVR of language models. Generate virtually infinite training or evaluation data with fine-grained difficulty control and automatic verifiers. 🧵 1/
Tweet media one
3
44
274
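The core mechanic the thread describes (a seeded procedural generator paired with an automatic verifier) can be sketched as follows. This is a minimal standalone illustration with made-up names, not Reasoning Gym's actual API:

```python
import random

def generate_arithmetic(seed: int, difficulty: int) -> dict:
    # Seeded RNG makes generation reproducible; `difficulty` controls
    # operand count and magnitude, giving fine-grained difficulty control.
    rng = random.Random(seed)
    terms = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
    return {
        "question": "What is " + " + ".join(map(str, terms)) + "?",
        "answer": str(sum(terms)),
    }

def verify(entry: dict, model_output: str) -> float:
    # Automatic verifier: exact-match reward, usable directly as an RLVR signal.
    return 1.0 if model_output.strip() == entry["answer"] else 0.0

entry = generate_arithmetic(seed=42, difficulty=2)
```

Because generation is seed-driven, the same pattern yields virtually unlimited fresh training or evaluation items rather than a fixed, memorisable benchmark.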
@_OliverStanley
Oliver Stanley
30 days
Congrats to Prime on the release of their new open reasoning dataset. Nice to see Reasoning Gym tasks used extensively in SYNTHETIC-2!
@PrimeIntellect
Prime Intellect
1 month
SYNTHETIC-2 Dataset. A next-gen open dataset for reasoning:
• Verified traces from DeepSeek-R1-0528 and Qwen3 for supervised fine-tuning
• Difficulty-annotated RL tasks via pass@k from smaller models
• 20+ diverse tasks with programmatic verifiers
• Includes non-verifiable
Tweet media one
0
0
2
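The difficulty annotation mentioned above, pass@k from smaller models, is conventionally computed with the unbiased estimator: given n sampled attempts of which c were verified correct, the probability that at least one of k draws succeeds. A small sketch, assuming a verifier has already produced the counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).
    # If fewer than k attempts failed, any draw of k must contain a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Tasks where a small model's pass@k is near 1 are easy and near 0 are hard, which is how a difficulty label can be attached to each RL task.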
@_OliverStanley
Oliver Stanley
1 month
This takes me back to 2023 building Open Assistant. Too many users for the limited GPUs we had for inference, so one idea was to prioritise users who provided more feedback data. Granular feedback from highly heterogeneous human raters is very messy, though.
@scaling01
Lisan al Gaib
1 month
lmarena has a competitor. Yupp is basically lmarena, but with more granular feedback and a credit system. Each message costs you some credits, but when you give high-quality feedback you get credits back to use on your favorite models. This is their multi-turn (5+ messages) VIBE
Tweet media one
Tweet media two
0
0
1
@_OliverStanley
Oliver Stanley
1 month
Recent @SemiAnalysis_ post on RL touches on this. Designing better, more realistic RL environments feels like some of the highest impact work open-source could focus on right now.
Tweet card summary image
semianalysis.com
The test time scaling paradigm is thriving. Reasoning models continue to rapidly improve, and are becoming more effective and affordable. Evaluations measuring real world software engineering tasks…
0
0
0
@_OliverStanley
Oliver Stanley
2 months
Next big thing for RL with verifiable rewards (perhaps even the next "scaling axis" for models) looks likely to be spending more compute on the RL environment itself to improve simulation quality & realism. Is anyone working on this in the open?
2
1
6
@_OliverStanley
Oliver Stanley
2 months
RT @zafstojano: Super excited to share 💪🧠 Reasoning Gym! 🧵 We provide over 100 data generators and verifiers spanning several domains (alge…
0
22
0
@_OliverStanley
Oliver Stanley
2 months
@shizhediao @willccbb For more on our RLVR experiments with Reasoning Gym data, Zafir has an excellent thread here! 9/
@zafstojano
Zafir Stojanovski
2 months
Super excited to share 💪🧠 Reasoning Gym! 🧵 We provide over 100 data generators and verifiers spanning several domains (algebra, arithmetic, code, geometry, logic, games) for training the next generation of reasoning models. In essence, we can generate an infinite amount of
Tweet media one
0
0
8
@_OliverStanley
Oliver Stanley
2 months
Really cool to see fast adoption of RG for training and eval! ProRL was recently released by researchers at NVIDIA including @shizhediao. Paper: RG is also already supported in @willccbb’s fantastic verifiers library! 8/
Tweet media one
1
1
10
@_OliverStanley
Oliver Stanley
2 months
If you’re interested in discussing or collaborating, feel free to reach out to one of us. RG was made possible by incredible work from @zafstojano, @joesharratt29, @jeankaddour, @neurosp1ke, Rich, and Abdulhakeem. 7/
1
0
8
@_OliverStanley
Oliver Stanley
2 months
Our paper includes experiments on curriculum learning and how RLVR on RG generalizes between domains: We release code, training & eval configs, and example usage: Find RG on PyPI: 6/
1
0
13
@_OliverStanley
Oliver Stanley
2 months
RG is not just useful for eval. We train Llama and Qwen 3B models with GRPO, finding improvements on common mathematics benchmarks. * We're aware of controversy regarding Qwen which emerged after our work was completed. The result for Qwen on MATH may be worth investigation. 5/
Tweet media one
1
0
8
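GRPO, used for the training runs above, computes advantages relative to a group of completions sampled for the same prompt, so a verifier's 0/1 rewards suffice and no learned value model is needed. A minimal sketch of the group normalisation step only (illustrative, not the full objective):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    # Group-relative advantage: z-score each completion's reward against
    # the mean/std of its sampling group. When all rewards are identical,
    # std falls back to 1, giving zero advantage for the whole group.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]
```

With verifier rewards, above-average completions in a group get positive advantage and are reinforced; below-average ones are pushed down.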
@_OliverStanley
Oliver Stanley
2 months
Overall on hard configs, RG eval aligns well with common perceptions of model capabilities, with a few surprises. Reasoning models tend to outperform non-reasoning variants. (We would love to run evals on more models. If anyone can help with API credits, please reach out!) 4/
Tweet media one
1
0
8
@_OliverStanley
Oliver Stanley
2 months
We go beyond typical mathematics and code problems, building difficulty-adjustable generators for tasks including algorithms, logical reasoning, pattern recognition, and puzzle games. Here are just a few examples. 3/
Tweet media one
1
0
9
@_OliverStanley
Oliver Stanley
2 months
Common LLM benchmarks increasingly suffer from saturation, yet on several of our tasks configured for “hard” difficulty, recent models from frontier labs fail to solve even 10% of problems. And even this difficulty setting is far from the limit with RG. 2/
Tweet media one
2
0
8
@_OliverStanley
Oliver Stanley
4 months
Anyone know of any reasonably in-depth writing on Morgan McSweeney? Keen to understand his worldview.
1
0
0
@_OliverStanley
Oliver Stanley
4 months
RT @neurosp1ke: We now have a total of 101 datasets in reasoning-gym! 🧠💪 Big THANK YOU 💙 to all devs for making this possible, especially c…
0
18
0
@_OliverStanley
Oliver Stanley
5 months
AI safety researchers crafting prompts to intentionally induce behaviour that they can claim is "misaligned" is now a bigger phenomenon than actually misaligned AI, apparently
Tweet media one
@__Charlie_G
Charlie George
5 months
1/ People think it's cute when Claude 3 Opus fakes alignment to protect its animal welfare values. But here's a more troubling case: DeepSeek R1 faking alignment to block an "American AI company" from retraining the model to remove CCP propaganda.
Tweet media one
1
0
1
@_OliverStanley
Oliver Stanley
5 months
These guys might straight up drop all the optimisations needed to do inference ~as efficiently as frontier labs, fully open-source, over the next few days. First open-source smallish model with MLA will be a gamechanger.
@deepseek_ai
DeepSeek
5 months
🚀 Day 1 of #OpenSourceWeek: FlashMLA. Honored to share FlashMLA, our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.
✅ BF16 support
✅ Paged KV cache (block size 64)
⚡ 3000 GB/s memory-bound & 580 TFLOPS
0
0
3
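A paged KV cache, as in the announcement above, stores keys/values in fixed-size blocks and maps logical token positions to physical blocks through a per-sequence block table, avoiding large contiguous allocations for variable-length sequences. A toy sketch of just the addressing (illustrative, not FlashMLA's implementation):

```python
BLOCK_SIZE = 64  # FlashMLA's stated KV cache block size

def token_slot(block_table: list[int], token_idx: int) -> tuple[int, int]:
    # Map a logical token position to (physical block id, offset in block).
    # block_table[i] gives the physical block holding the i-th logical block.
    return block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE
```

Because blocks are allocated on demand and need not be contiguous, sequences of very different lengths can share GPU memory with little fragmentation.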
@_OliverStanley
Oliver Stanley
1 year
Underrated downside of the pensions triple lock: pensioners get a big giveaway every year by default, meaning nothing gets announced for them in the Budget, then they complain that they aren't getting anything.
0
0
4
@_OliverStanley
Oliver Stanley
2 years
Every lawyer is a policy failure.
2
0
3