Cameron R. Wolfe, Ph.D.
@cwolferesearch
Followers: 28K • Following: 8K • Media: 801 • Statuses: 4K
Research @Netflix • Writer @ Deep (Learning) Focus • PhD @optimalab1 • I make AI understandable
Joined August 2021
Reinforcement Learning (RL) is quickly becoming the most important skill for AI researchers. Here are the best resources for learning RL for LLMs… TL;DR: RL is more important now than it has ever been, but (probably due to its complexity) there aren’t a ton of great resources
17
261
1K
The next AI Agents in Production conference is on November 18th. For those interested in the practical side of LLMs / agents, this is a good event to attend. Some highlights: - Completely free. - Everything can be viewed online. - Good talks from top companies (OAI, GDM, Meta,
1
2
13
For a full (and understandable) overview of PPO for LLMs, see my recent blog post: https://t.co/S5BSFjPGnt Here are the references from the post: [1] https://t.co/rOOElh9hI3 [2]
[Link preview (arxiv.org): "Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function..."]
0
1
7
Let’s implement Proximal Policy Optimization (PPO) together… Step #1 - Rollouts. We begin the PPO policy update with a batch of prompts. Using our current policy (i.e., the LLM we are training), we sample a single completion for each of these prompts. Additionally, we will
4
14
168
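To make the rollout step described in the tweet above concrete, here is a minimal sketch in Python, assuming a Hugging Face causal LM stands in for the policy; the model name and sampling settings are illustrative placeholders, not details from the post.

```python
# Minimal sketch of the rollout step: sample one completion per prompt from the
# current policy (the LLM being trained). Model name and sampling settings are
# placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder policy model
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)

prompts = [
    "Explain PPO in one sentence.",
    "What is a policy gradient?",
]

# Tokenize the batch of prompts (left padding so generation appends cleanly).
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# Sample a single completion per prompt with the current policy.
with torch.no_grad():
    outputs = policy.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=64,
        num_return_sequences=1,
    )

completions = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
for p, c in zip(prompts, completions):
    print(f"PROMPT: {p}\nCOMPLETION: {c}\n")
```

In a full PPO pipeline, these sampled completions would also be accompanied by their per-token log-probabilities and value estimates, since the later update steps need them.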
The memory folding mechanism proposed in this paper is great. It makes sense that agents should spend time explicitly compressing their memory into a semantic / organized format to avoid context explosion. Worth mentioning though that memory compression / retention in agents
7
26
138
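The memory-folding idea above can be illustrated with a toy sketch. This is not the paper's mechanism, just a minimal illustration of compressing older turns into a running summary under an assumed turn budget; the class and function names are hypothetical, and the summarizer (normally an LLM call) is stubbed out.

```python
# Toy illustration of folding older agent turns into a compact summary so the
# working context stays bounded (not the paper's exact mechanism).
from dataclasses import dataclass, field

@dataclass
class FoldedMemory:
    max_turns: int = 8                       # how many raw turns to keep verbatim
    summary: str = ""                        # compressed / "folded" memory
    turns: list = field(default_factory=list)

    def add(self, turn: str, summarize) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.max_turns:
            # Fold the oldest half of the raw turns into the running summary.
            cut = self.max_turns // 2
            to_fold, self.turns = self.turns[:cut], self.turns[cut:]
            self.summary = summarize(self.summary, to_fold)

    def context(self) -> str:
        # What actually gets placed in the LLM's context window.
        return f"MEMORY SUMMARY:\n{self.summary}\n\nRECENT TURNS:\n" + "\n".join(self.turns)

# `summarize` would normally be an LLM call; a trivial stand-in is used here.
def naive_summarize(prev_summary: str, turns: list) -> str:
    return (prev_summary + " " + " | ".join(t[:60] for t in turns)).strip()

memory = FoldedMemory()
for i in range(12):
    memory.add(f"turn {i}: agent observed something", naive_summarize)
print(memory.context())
```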
assistive coding tools definitely make me more productive, but the pattern isn't uniform. biggest productivity boost comes later in the day / at night when I'm mentally exhausted. LLMs lower the barrier to entry for getting extra work done. validating or iterating on code with an
1
0
19
I can't believe I'm saying this - I'm officially a published author :D After three years, my first book is out. "AI for the Rest of Us" with @BloomsburyAcad is finally in the world. I wrote it because I watched too many people get left behind in AI conversations. The gap
8
6
60
"Through clever storytelling and illustration, [Sundaresan] brings technical concepts to life[.]" — Dr. Cameron R. Wolfe, Senior Research Scientist at Netflix (@cwolferesearch) Learn more: https://t.co/be2cECKogj
@DSaience
0
2
6
For full details, check out my new blog on PPO that I just released this morning (see image).
1
0
3
Proximal Policy Optimization (PPO) is one of the most common (and complicated) RL algorithms used for LLMs. Here’s how it works… TRPO. PPO is inspired by TRPO, which uses a constrained objective that: 1. Normalizes action / token probabilities of current policy by those of an
7
63
339
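For reference, the standard TRPO and PPO objectives that the tweet above alludes to can be written as follows; the notation may differ slightly from the blog post.

```latex
% Standard TRPO and PPO objectives (notation may differ slightly from the post).
% r_t is the probability ratio between the current and old policy, \hat{A}_t is
% the advantage estimate, \delta the trust-region size, \epsilon the clip range.
\begin{align*}
  r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \\[4pt]
  \text{TRPO:}\quad \max_\theta \;& \mathbb{E}_t\big[\, r_t(\theta)\, \hat{A}_t \,\big]
    \quad \text{s.t.}\quad \mathbb{E}_t\big[\, \mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big) \big] \le \delta \\[4pt]
  \text{PPO:}\quad \max_\theta \;& \mathbb{E}_t\Big[ \min\big( r_t(\theta)\,\hat{A}_t,\;
    \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \Big]
\end{align*}
```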
For more details, see the recent overview in my RL series. It covers REINFORCE variants, which heavily utilize the bandit formulation.
0
1
2
There are two common RL training formulations for LLMs: the Markov Decision Process (MDP) formulation and the bandit formulation. Here’s how they work… Background: We should recall that an LLM generates output via next token prediction; i.e., by sequentially generating each output token.
1
4
14
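A compact way to contrast the two formulations referenced above (standard notation, not verbatim from the thread): the MDP view treats each generated token as an action with per-token credit assignment, while the bandit view treats the whole completion as a single action scored once by a sequence-level reward.

```latex
% MDP formulation: state s_t = (x, y_{<t}), action = next token y_t, per-token reward.
% Bandit formulation: the full completion y is one action, scored once (e.g., by a reward model).
\begin{align*}
  J_{\text{MDP}}(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \Big[\, \textstyle\sum_{t=1}^{T} \gamma^{\,t-1}\, r\big((x, y_{<t}),\, y_t\big) \Big] \\[4pt]
  J_{\text{bandit}}(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, R(x, y) \,\big]
\end{align*}
```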
For more, check out the blog I just wrote on simple online RL algorithms (like REINFORCE and RLOO) for LLMs. Info in image.
0
1
3
The complexity of PPO leads practitioners to avoid online RL in favor of RL-free or offline algorithms (e.g., DPO), but why not just use simpler versions of online RL? TL;DR: REINFORCE and RLOO have been shown to work well for training LLMs. And, they do not require a value model.
1
4
13
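As a concrete illustration of why these methods are lighter-weight, here is a minimal sketch of the RLOO estimator in Python: each sampled completion's baseline is the mean reward of the other k-1 samples, so no separate value model is needed. The function name is hypothetical, and the rewards and summed log-probs are assumed to be given; this illustrates the estimator, not a full training loop.

```python
# Minimal sketch of the RLOO (REINFORCE Leave-One-Out) update, assuming we already
# have k sampled completions per prompt with their rewards and summed log-probs.
import torch

def rloo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """
    logprobs: [k] sum of token log-probs for each of k completions (requires grad)
    rewards:  [k] scalar reward for each completion (no grad)
    """
    k = rewards.shape[0]
    # Leave-one-out baseline: for sample i, the average reward of the other k-1 samples.
    baseline = (rewards.sum() - rewards) / (k - 1)
    advantages = rewards - baseline
    # REINFORCE-style objective: maximize advantage-weighted log-likelihood.
    return -(advantages.detach() * logprobs).mean()

# Toy usage: 4 completions for one prompt.
logprobs = torch.tensor([-12.3, -9.8, -15.1, -11.0], requires_grad=True)
rewards = torch.tensor([0.2, 0.9, 0.1, 0.6])
loss = rloo_loss(logprobs, rewards)
loss.backward()
print(loss.item(), logprobs.grad)
```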
For full details, check out my writeup on the online-offline RL performance gap (details in image). Here are citations to papers mentioned above: [1] Xu, Shusheng, et al. "Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study." (2024). [2] Tajwar, Fahim, et al.
0
2
7
Is there truly a gap in performance between online and offline RL training for LLMs? Here’s what the research says… TL;DR: There is a clear performance gap between online and offline RL algorithms, especially in large-scale LLM training. However, this gap can be minimized by
14
39
238
My newsletter, Deep (Learning) Focus, recently passed 50,000 subscribers. Here are my four favorite articles and some reflections on my journey with the newsletter… (1) Demystifying Reasoning Models outlines the key details of training reasoning-based LLMs, focusing on the
3
7
35
For full details, I just published a 12k word overview that exhaustively covers every aspect of GPT-oss starting from LLM first principles (see image).
2
3
24
GPT-oss provides a rare peek into LLM research at OpenAI. Here are all of the technical details that OpenAI shared about the models… (1) Model architecture. GPT-oss has two models in its family: 20b and 120b. They both use a Mixture-of-Experts (MoE) architecture. The 120b (20b)
2
68
325
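To illustrate the general MoE pattern mentioned above, here is a minimal top-k routed MoE feed-forward layer in PyTorch. The class name, dimensions, expert count, and k are placeholders for illustration, not GPT-oss's actual configuration.

```python
# Minimal sketch of a Mixture-of-Experts (MoE) feed-forward layer with top-k routing.
# Dimensions, expert count, and k are placeholders, not GPT-oss's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: [tokens, d_model]
        scores = self.router(x)                          # [tokens, n_experts]
        weights, idx = scores.topk(self.k, dim=-1)       # route each token to k experts
        weights = F.softmax(weights, dim=-1)             # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```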