
Timon Willi
@TimonWilli
336 Followers · 1K Following · 27 Media · 160 Statuses
RS @AIatMeta, DPhil w/ @j_foerst, @UniofOxford; Formerly: Research Intern @GoogleDeepMind / PhD @VectorInst / RS at @nnaisense / MSc w/ @SchmidhuberAI
London, United Kingdom
Joined May 2022
Scaling laws for (self-)supervised learning predict: increase the parameter count -> performance goes brrrr (loosely speaking). Can we get scaling laws for deep reinforcement learning? In this work, we pave the way towards scaling laws for deep RL. We show that…
📢 Mixtures of Experts unlock parameter scaling for deep RL! Adding MoEs, and in particular Soft MoEs, to value-based deep RL agents results in more parameter-scalable models. Performance keeps increasing as we increase the number of experts (green line below)! 1/9
💬 3 · 🔁 10 · ❤️ 39
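For intuition, here is a minimal NumPy sketch of Soft-MoE-style routing: every token is softly dispatched to every expert slot, and the slot outputs are softly combined back, so the whole layer stays differentiable with no discrete routing decisions. Names and shapes are illustrative, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(X, Phi, experts):
    """Soft-MoE layer sketch: tokens are softly routed to expert slots.

    X:       [n, d] token features
    Phi:     [d, num_slots] learnable slot parameters
    experts: list of callables; the slots are split evenly among them
    """
    logits = X @ Phi                    # [n, num_slots]
    dispatch = softmax(logits, axis=0)  # each slot: soft mixture over tokens
    combine = softmax(logits, axis=1)   # each token: soft mixture over slots
    slots = dispatch.T @ X              # [num_slots, d] inputs to the experts
    outs = np.concatenate(
        [f(s) for f, s in zip(experts, np.split(slots, len(experts)))]
    )
    return combine @ outs               # [n, d], fully differentiable routing

# toy usage: 4 experts with 2 slots each
rng = np.random.default_rng(0)
n, d, num_experts, slots_per = 16, 32, 4, 2
X = rng.normal(size=(n, d))
Phi = rng.normal(size=(d, num_experts * slots_per))
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(num_experts)]
experts = [lambda z, W=W: np.tanh(z @ W) for W in Ws]
print(soft_moe(X, Phi, experts).shape)  # (16, 32)
```

Because every token contributes to every slot, there is no brittle discrete expert assignment, which is one plausible reason this flavour of MoE plays well with value-based RL training.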
🚨🚨 Introducing the FLAIR internship program! 🚨🚨 We are looking for two talented students to join us for a six-month internship at FLAIR (5th January to 4th July 2026)! For details and eligibility criteria, please check:
foersterlab.com
Students will get the chance to work on current FLAIR projects at the University of Oxford,...
💬 2 · 🔁 21 · ❤️ 119
In an evolving population of models, using model merging as the crossover operation drastically reduces diversity and leads to premature convergence. To address this, we make models compete for limited resources (training datapoints), which benefits models with unique skills.
What if we could evolve AI models like organisms in nature, letting them compete, mate, and combine their strengths to produce ever-fitter offspring? Excited to share our new work, "Competition and Attraction Improve Model Fusion", presented at GECCO'25 🦎, where it was a runner-up…
💬 1 · 🔁 2 · ❤️ 15
💬 2 · 🔁 1 · ❤️ 5
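The "compete for limited resources" idea can be pictured with a simple fitness-sharing scheme (a hypothetical sketch, not the paper's implementation): each datapoint carries a fixed amount of credit that is split among the models that solve it, so a skill only you have is worth more than one everybody shares.

```python
import numpy as np

def shared_fitness(solves):
    """Fitness sharing over datapoints.

    solves: [num_models, num_points] boolean matrix; solves[i, j] is True
            if model i solves datapoint j. Each datapoint carries 1 unit
            of "resource", split evenly among the models that solve it,
            so a point only you solve pays more than a crowded one.
    """
    solvers = solves.sum(axis=0)                        # models per point
    share = np.where(solvers > 0, 1.0 / np.maximum(solvers, 1), 0.0)
    return (solves * share).sum(axis=1)                 # fitness per model

# toy example: models 0 and 1 are near-duplicates, model 2 is unique
solves = np.array([
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
], dtype=bool)
print(shared_fitness(solves))  # [1.5 1.5 1. ] -> unique points pay double here
```

Duplicated skills dilute each other's reward, which keeps selection pressure pointed at diversity rather than letting merging collapse the population.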
I recently had a lunchtime conversation with a very senior AI researcher about how multi-agent problems differ from single-agent ones (their starting point was that they do not). One point that made them think: as computers scale, the rest of the world (i.e. the non-agentic parts) is not…
💬 24 · 🔁 17 · ❤️ 233
Unlock real diversity in your LLM! 🚀 LLM outputs can be boring and repetitive. Today, we release Intent Factored Generation (IFG) to:
- Sample conceptually diverse outputs 💡
- Improve performance on math and code reasoning tasks 🤔
- Get more engaging conversational agents 🤖
💬 1 · 🔁 11 · ❤️ 36
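The factorisation behind IFG can be sketched as a two-stage sampling loop: draw a short intent at high temperature (where the conceptual diversity comes from), then decode the answer conditioned on it at low temperature (keeping each sample coherent). The `generate` helper and the prompts below are hypothetical stand-ins, not the paper's code.

```python
# Two-stage sampling loop illustrating the intent -> response factorisation.
# `generate` is a hypothetical stand-in for any LLM API.

def generate(prompt: str, temperature: float) -> str:
    raise NotImplementedError("plug in your favourite LLM client here")

def ifg_sample(question: str, t_intent: float = 1.2, t_final: float = 0.4) -> str:
    # Stage 1: a short, abstract intent at HIGH temperature -- this is
    # where the conceptual diversity between samples comes from.
    intent = generate(
        f"Give a brief plan/keywords for answering:\n{question}",
        temperature=t_intent,
    )
    # Stage 2: the full answer conditioned on the intent at LOW
    # temperature, keeping each individual sample coherent.
    return generate(
        f"Question: {question}\nPlan: {intent}\nAnswer, following the plan:",
        temperature=t_final,
    )
```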
🚨 Excited to share our #ICML2025 paper: "The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep RL" We train RL agents to know when to quit, cutting wasted effort and improving efficiency with our method LEAST. 📄Paper: https://t.co/9ED3FubIPc 🧵Check the thread below👇🏾
Thrilled to share our #ICML2025 paper “The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep RL”, led by Jiashun Liu and with other great collaborators! We teach RL agents when to quit wasting effort, boosting efficiency with our proposed method LEAST. Here's the story 🧵👇🏾
💬 3 · 🔁 19 · ❤️ 126
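For flavour, here is a generic learned-to-quit rule (an illustrative sketch only, not the LEAST method from the paper): abandon the current episode when the critic's value estimate has stayed low for a while, and spend the saved interaction budget elsewhere.

```python
import numpy as np

def should_quit(value_estimates, threshold=0.1, patience=10):
    """Give-up rule: quit if the critic's value estimate has stayed
    below `threshold` for the last `patience` steps. NOT the paper's
    LEAST method -- just the general shape of the idea."""
    recent = value_estimates[-patience:]
    return len(recent) == patience and max(recent) < threshold

values = list(np.linspace(0.5, 0.0, 50))  # an episode going nowhere
print(should_quit(values))  # True -> reset instead of grinding on
```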
Since 1990, we have worked on artificial curiosity & measuring "interestingness." Our new ICML paper uses a "Prediction of Hidden Units" loss to quantify in-context computational complexity in sequence models. It can tell boring from interesting tasks and predict correct reasoning.
Excited to share our new ICML paper, with co-authors @robert_csordas and @SchmidhuberAI! How can we tell if an LLM is actually "thinking" versus just spitting out memorized or trivial text? Can we detect when a model is doing anything interesting? (Thread below👇)
💬 11 · 🔁 64 · ❤️ 365
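A rough way to picture the idea (an illustrative probe, not the paper's exact PHi loss): train a small auxiliary network to predict each hidden state from the previous one; wherever prediction fails, the model is doing non-trivial in-context computation rather than replaying memorized or trivial text.

```python
import torch
import torch.nn as nn

# A tiny "predict the hidden states" probe: high prediction loss means
# the hidden states carry information that cannot be extrapolated from
# the previous step, i.e. non-trivial in-context computation.

def phi_style_loss(hidden: torch.Tensor, probe: nn.Module) -> torch.Tensor:
    """hidden: [seq_len, d] hidden states from a frozen sequence model."""
    pred = probe(hidden[:-1])                       # predict h_{t+1} from h_t
    return ((pred - hidden[1:]) ** 2).mean(dim=-1)  # per-step surprise

seq_len, d = 128, 64
hidden = torch.randn(seq_len, d)   # stand-in for real hidden states
probe = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, d))
print(phi_style_loss(hidden, probe).shape)  # torch.Size([127])
```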
finally, an Opponent Shaping application I don’t have to make up for the intro section. dreams do come true.
Antiviral therapy design is myopic 🦠🙈: optimised only for the current strain. That's why you need a different flu vaccine every year! Our #ICML2025 paper ADIOS proposes "shaper therapies" that steer viral evolution in our favour & remain effective. Work done @FLAIR_Ox 🧵👇
💬 0 · 🔁 1 · ❤️ 10
Tried to solve science, but solved humor instead. That’s why greatness cannot be planned.
The AI Scientist is far from perfect. Occasionally it makes embarrassing citation errors. Here, it incorrectly attributed "an LSTM-based neural network" to Goodfellow (2016) rather than to the correct authors, Hochreiter & Schmidhuber (1997). We documented these errors in our own…
💬 1 · 🔁 0 · ❤️ 28
congrats to the team!
The AI Scientist Generates its First Peer-Reviewed Scientific Publication. We're proud to announce that a paper produced by The AI Scientist-v2 passed the peer-review process at a workshop at ICLR, a top AI conference. Read more about this experiment → https://t.co/LpLYLnZMCQ
💬 0 · 🔁 0 · ❤️ 1
In the spirit of making more real-world evals, here is the Factorio Learning Environment (FLE). Spurred by wanting to evaluate whether models are good paperclip maximisers, we check how well agents build factories for other things 🏗️🏭🛠️
💬 31 · 🔁 100 · ❤️ 1K
🎉 Stoked to share The AI Scientist 🧑🔬, our end-to-end approach for conducting research with LLMs, including ideation, coding, experiment execution, paper write-up & reviewing. Blog 📰: https://t.co/kBwAgvXDjZ Paper 📜: https://t.co/XvkwWfQhyi Code 💻: https://t.co/hXlXjxFAD9
Introducing The AI Scientist: The world's first AI system for automating scientific research and open-ended discovery! https://t.co/8wVqIXVpZJ From ideation, writing code, running experiments and summarizing results, to writing entire papers and conducting peer review, The AI…
💬 14 · 🔁 68 · ❤️ 366
I’m pleased to announce our work studying complexity phase transitions in neural networks! We track the Kolmogorov complexity of networks as they “grok”, and find a characteristic rise and fall of complexity, corresponding to memorization followed by generalization. 🧵
💬 31 · 🔁 156 · ❤️ 1K
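Kolmogorov complexity itself is uncomputable, so in practice one tracks an upper bound via lossless compression. A crude stand-in for the paper's estimator (illustration only): gzip the coarsely quantized weights at each logged step and plot the byte count over training.

```python
import gzip
import numpy as np

def complexity_proxy(weights: np.ndarray, bits: int = 8) -> int:
    """Crude upper bound on description length: bytes of gzipped,
    coarsely quantized weights. The paper uses a more careful
    compression-based bound; this only illustrates the idea of
    'complexity = shortest description that reproduces the model'."""
    w = weights.astype(np.float64).ravel()
    lo, hi = float(w.min()), float(w.max())
    q = np.round((w - lo) / max(hi - lo, 1e-12) * (2**bits - 1)).astype(np.uint8)
    return len(gzip.compress(q.tobytes()))

# Logged once per training step, this traces the characteristic rise
# (memorization) and fall (generalization) of complexity during grokking.
rng = np.random.default_rng(0)
print(complexity_proxy(rng.normal(size=10_000)))
```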
🧑🔬 FLAIR is presenting three more great papers today at #NeurIPS2024! Come talk to us and find out what we've been doing!
💬 1 · 🔁 5 · ❤️ 20
🔬 FLAIR has a bunch of great papers being presented today at NeurIPS! Come along to learn more about the work! 👀 Keep your eyes peeled for more work being presented over the week!
💬 1 · 🔁 5 · ❤️ 13
There's light at the end of the tunnel of LLM evals. The light at the end of the tunnel:
Tired of saturated benchmarks? Want scope for a significant leap in capabilities? 🔥 Introducing BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games! BALROG is a challenging benchmark for LLM agentic capabilities, designed to stay relevant for years to come. 1/🧵
💬 0 · 🔁 0 · ❤️ 11
We are very excited to announce Kinetix: an open-ended universe of physics-based tasks for RL! We use Kinetix to train a general agent on millions of randomly generated physics problems and show that this agent generalises to unseen handmade environments. 1/🧵
💬 14 · 🔁 214 · ❤️ 1K
What if improving LLM evaluation and generation was as simple as using a checklist? Introducing TICK ✅ (Targeted Instruct-evaluation with ChecKlists) and STICK 🏒 (Self-TICK) Work done @cohere with supervision from @_rockt, @j_foerst, @d_aumiller & @W4ngatang. 1/n
💬 4 · 🔁 12 · ❤️ 55
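The checklist idea is simple enough to sketch end to end (the `llm` helper and prompts below are hypothetical; the real TICK prompts and aggregation are in the paper): generate yes/no checks from the instruction, grade the response against each check, and score by the pass rate.

```python
# Checklist-style evaluation sketch. `llm` is a hypothetical helper,
# and the prompts/aggregation here are illustrative, not TICK's own.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client")

def tick_style_score(instruction: str, response: str) -> float:
    # 1) turn the instruction into concrete yes/no checks
    checklist = llm(
        "Write one yes/no question per line checking whether a response "
        f"follows this instruction:\n{instruction}"
    ).splitlines()
    # 2) grade the response against each check independently
    passes = [
        llm(f"Response:\n{response}\nCheck: {q}\nAnswer YES or NO.")
        .strip().upper() == "YES"
        for q in checklist if q.strip()
    ]
    # 3) score = fraction of checks passed
    return sum(passes) / max(len(passes), 1)
```

Grading one targeted check at a time is the point: it decomposes a vague "is this response good?" judgment into small questions an LLM judge answers far more reliably.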