Cameron R. Wolfe, Ph.D.

@cwolferesearch

Followers: 28K · Following: 8K · Media: 801 · Statuses: 4K

Research @Netflix • Writer @ Deep (Learning) Focus • PhD @optimalab1 • I make AI understandable

Joined August 2021
@cwolferesearch
Cameron R. Wolfe, Ph.D.
7 months
Reinforcement Learning (RL) is quickly becoming the most important skill for AI researchers. Here are the best resources for learning RL for LLMs… TL;DR: RL is more important now than it has ever been, but (probably due to its complexity) there aren’t a ton of great resources
17
261
1K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
5 days
The next AI Agents in Production conference is on November 18th. For those interested in the practical side of LLMs / agents, this is a good event to attend. Some highlights: - Completely free. - Everything can be viewed online. - Good talks from top companies (OAI, GDM, Meta,
1
2
13
@cwolferesearch
Cameron R. Wolfe, Ph.D.
7 days
Let’s implement Proximal Policy Optimization (PPO) together… Step #1 - Rollouts. We begin the PPO policy update with a batch of prompts. Using our current policy (i.e., the LLM we are training), we sample a single completion for each of these prompts. Additionally, we will
4
14
168
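A minimal sketch of the rollout step described in the post above, assuming a Hugging Face-style causal LM. The placeholder model name, prompts, and sampling settings are illustrative assumptions, not the exact setup from the post.

```python
# Minimal sketch of PPO Step #1 (rollouts): sample ONE completion per prompt
# from the current policy. Model name, prompts, and settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical stand-in for the policy being trained
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so generated tokens line up at the end
policy = AutoModelForCausalLM.from_pretrained(model_name)

prompts = [
    "Explain PPO in one sentence.",
    "What is a rollout in RL?",
]

inputs = tokenizer(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = policy.generate(
        **inputs,
        do_sample=True,          # stochastic, on-policy sampling
        max_new_tokens=64,
        num_return_sequences=1,  # a single completion for each prompt
        pad_token_id=tokenizer.eos_token_id,
    )

# Keep only the newly generated tokens (everything after the prompt).
completions = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
for prompt, completion in zip(prompts, completions):
    print(prompt, "->", completion)
```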
@cwolferesearch
Cameron R. Wolfe, Ph.D.
7 days
Couldn't be more excited for better interaction between X / substack. Check out my newsletter here:
@cjgbest
Chris Best
8 days
Update 2: even correcting for the fake views, traffic to Substack links from X is up substantially. (full post reads, signups, etc. also track.) We're so back!
1
0
7
@cwolferesearch
Cameron R. Wolfe, Ph.D.
8 days
The memory folding mechanism proposed in this paper is great. It makes sense that agents should spend time explicitly compressing their memory into a semantic / organized format to avoid context explosion. Worth mentioning though that memory compression / retention in agents
7
26
138
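The post refers to a specific memory-folding mechanism from a paper not included here; the sketch below only illustrates the general idea of compressing older agent history into a semantic summary once a context budget is exceeded. The class, budget, and summarizer are all hypothetical stand-ins, not the paper's method.

```python
# Toy sketch of the general idea (NOT the paper's memory-folding mechanism):
# once an agent's raw history exceeds a budget, fold older turns into a
# compact semantic summary instead of keeping the full transcript in context.
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    budget_chars: int = 4000            # stand-in for a token budget
    summary: str = ""                   # compressed, semantic memory
    recent_turns: list[str] = field(default_factory=list)

    def add_turn(self, turn: str, summarize) -> None:
        self.recent_turns.append(turn)
        raw = "\n".join(self.recent_turns)
        if len(raw) > self.budget_chars:
            # Fold all but the two most recent turns into the summary.
            to_fold, self.recent_turns = self.recent_turns[:-2], self.recent_turns[-2:]
            self.summary = summarize(self.summary, "\n".join(to_fold))

    def context(self) -> str:
        return (
            f"[memory summary]\n{self.summary}\n[recent turns]\n"
            + "\n".join(self.recent_turns)
        )


# `summarize` would normally be an LLM call; a trivial stand-in is shown here.
def summarize(old_summary: str, new_text: str) -> str:
    return (old_summary + " | " + new_text[:200]).strip(" |")
```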
@cwolferesearch
Cameron R. Wolfe, Ph.D.
8 days
assistive coding tools definitely make me more productive, but the pattern isn't uniform. biggest productivity boost comes later in the day / at night when I'm mentally exhausted. LLMs lower the barrier to entry for getting extra work done. validating or iterating on code with an
1
0
19
@cwolferesearch
Cameron R. Wolfe, Ph.D.
11 days
The value of RL is very clearly / nicely articulated by DeepSeekMath… - RL enhances maj@k (majority vote), but not pass@k. - RL boosts the probability of correct completions that are already in top-k. - RL does NOT clearly enhance model capabilities.
9
14
156
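A toy illustration of the two metrics contrasted above, using hypothetical sampled answers: RL can concentrate probability mass on a correct answer that was already reachable, improving maj@k without changing pass@k.

```python
# Illustrative contrast between pass@k and maj@k (data is hypothetical).
from collections import Counter

def pass_at_k(samples: list[str], correct: str) -> bool:
    # pass@k: is ANY of the k sampled answers correct?
    return correct in samples

def maj_at_k(samples: list[str], correct: str) -> bool:
    # maj@k: is the majority-vote (most frequent) answer correct?
    return Counter(samples).most_common(1)[0][0] == correct

correct = "42"
before_rl = ["17", "42", "17", "13", "9"]   # correct answer present, but not the majority
after_rl  = ["42", "42", "17", "42", "42"]  # same support, more mass on the correct answer

print(pass_at_k(before_rl, correct), maj_at_k(before_rl, correct))  # True False
print(pass_at_k(after_rl, correct), maj_at_k(after_rl, correct))    # True True
```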
@DSaience
Sairam Sundaresan
13 days
I can't believe I'm saying this - I'm officially a published author :D After three years, my first book is out. "AI for the Rest of Us" with @BloomsburyAcad is finally in the world. I wrote it because I watched too many people get left behind in AI conversations. The gap
8
6
60
@BloomsburyAcad
Bloomsbury Academic
13 days
"Through clever storytelling and illustration, [Sundaresan] brings technical concepts to life[.]" — Dr. Cameron R. Wolfe, Senior Research Scientist at Netflix (@cwolferesearch) Learn more: https://t.co/be2cECKogj @DSaience
0
2
6
@cwolferesearch
Cameron R. Wolfe, Ph.D.
16 days
For full details, check out my new blog on PPO that I just released this morning (see image).
1
0
3
@cwolferesearch
Cameron R. Wolfe, Ph.D.
16 days
Proximal Policy Optimization (PPO) is one of the most common (and complicated) RL algorithms used for LLMs. Here’s how it works… TRPO. PPO is inspired by TRPO, which uses a constrained objective that: 1. Normalizes action / token probabilities of current policy by those of an
7
63
339
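The probability ratio mentioned in point 1 is cut off above; for reference, the standard formulations from the TRPO and PPO papers (Schulman et al.) are reproduced below, with the clipped surrogate that PPO substitutes for TRPO's hard KL constraint.

```latex
% TRPO constrained objective (Schulman et al., 2015):
\max_{\theta} \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \, \hat{A}_t \right]
\quad \text{s.t.} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big) \right] \le \delta

% PPO replaces the hard constraint with a clipped surrogate (Schulman et al., 2017):
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\;
\mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \right]
```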
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 month
For more details, see the recent overview in my RL series. It covers REINFORCE variants, which heavily utilize the bandit formulation.
0
1
2
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 month
There are two common RL training formulations for LLMs: Markov Decision Process (MDP) formulation and bandit formulation. Here’s how they work… Background: We should recall that an LLM generates output via next token prediction; i.e., by sequentially generating each output
1
4
14
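A toy contrast of the two formulations named above: the bandit view assigns one scalar reward to the whole completion, while the MDP view treats each generated token as an action and spreads credit over the sequence. The reward value and the choice of terminal-only reward with no discounting are illustrative assumptions.

```python
# Illustrative contrast between the bandit and MDP formulations for LLM RL.
tokens = ["The", " answer", " is", " 42", "."]
sequence_reward = 1.0  # e.g., a reward-model score for the full completion

# Bandit formulation: the whole completion is ONE action with ONE reward.
bandit_reward = sequence_reward

# MDP formulation: each token is an action. A simple choice is to place the
# sequence reward on the terminal token and compute discounted returns.
gamma = 1.0
per_token_rewards = [0.0] * (len(tokens) - 1) + [sequence_reward]
returns, g = [], 0.0
for r in reversed(per_token_rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()

print("bandit reward:", bandit_reward)   # one scalar for the whole sequence
print("per-token returns:", returns)     # credit assigned token by token
```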
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 month
For more, check out the blog I just wrote on simple online RL algorithms (like REINFORCE and RLOO) for LLMs. Info in image.
0
1
3
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 month
The complexity of PPO leads practitioners to avoid online RL in favor of RL-free or offline algorithms (e.g., DPO), but why not just use simpler versions of online RL? TL;DR: REINFORCE and RLOO have been shown to work well for training LLMs. And, they do not require a value
1
4
13
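A minimal sketch of the RLOO baseline referenced above, showing why no learned value model is needed: each sample's baseline is just the mean reward of the other samples for the same prompt. The reward values are hypothetical.

```python
# RLOO (REINFORCE Leave-One-Out) advantage: for k completions of one prompt,
# baseline_i = mean reward of the OTHER k-1 completions (no value model).
def rloo_advantages(rewards: list[float]) -> list[float]:
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Hypothetical rewards for k = 4 completions of the same prompt.
print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [0.67, -0.67, -0.67, 0.67]
```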
@cwolferesearch
Cameron R. Wolfe, Ph.D.
2 months
For full details, check out my writeup on the online-offline RL performance gap (details in image). Here are citations to papers mentioned above: [1] Xu, Shusheng, et al. "Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study." (2024). [2] Tajwar, Fahim, et al.
0
2
7
@cwolferesearch
Cameron R. Wolfe, Ph.D.
2 months
Is there truly a gap in performance between online and offline RL training for LLMs? Here’s what the research says… TL;DR: There is a clear performance gap between online and offline RL algorithms, especially in large-scale LLM training. However, this gap can be minimized by
14
39
238
@cwolferesearch
Cameron R. Wolfe, Ph.D.
2 months
My newsletter, Deep (Learning) Focus, recently passed 50,000 subscribers. Here are my four favorite articles and some reflections on my journey with the newsletter… (1) Demystifying Reasoning Models outlines the key details of training reasoning-based LLMs, focusing on the
3
7
35
@cwolferesearch
Cameron R. Wolfe, Ph.D.
3 months
For full details, I just published a 12k word overview that exhaustively covers every aspect of GPT-oss starting from LLM first principles (see image).
2
3
24
@cwolferesearch
Cameron R. Wolfe, Ph.D.
3 months
GPT-oss provides a rare peek into LLM research at OpenAI. Here are all of the technical details that OpenAI shared about the models… (1) Model architecture. GPT-oss has two models in its family–20b and 120b. They both use a Mixture-of-Experts (MoE) architecture. The 120b (20b)
2
68
325
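A generic top-k routed MoE layer, sketched to illustrate the architecture family mentioned above; the hidden sizes, expert count, and top-k value are illustrative and do not match the actual gpt-oss configuration.

```python
# Generic top-k routed Mixture-of-Experts layer (dimensions and expert count
# are illustrative, NOT the gpt-oss configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ]
        )

    def forward(self, x):  # x: [tokens, d_model]
        logits = self.router(x)                                # [tokens, num_experts]
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # route each token to top-k experts
        weights = F.softmax(weights, dim=-1)                   # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


x = torch.randn(10, 64)
print(MoELayer()(x).shape)  # torch.Size([10, 64])
```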