Cameron R. Wolfe, Ph.D.
@cwolferesearch
Followers: 28K • Following: 8K • Media: 801 • Statuses: 4K
Research @Netflix • Writer @ Deep (Learning) Focus • PhD @optimalab1 • I make AI understandable
Joined August 2021
Reinforcement Learning (RL) is quickly becoming the most important skill for AI researchers. Here are the best resources for learning RL for LLMs… TL;DR: RL is more important now than it has ever been, but (probably due to its complexity) there aren’t a ton of great resources
17
261
1K
The next AI Agents in Production conference is on November 18th. For those interested in the practical side of LLMs / agents, this is a good event to attend. Some highlights: - Completely free. - Everything can be viewed online. - Good talks from top companies (OAI, GDM, Meta,
1
2
13
For a full (and understandable) overview of PPO for LLMs, see my recent blog post: https://t.co/S5BSFjPGnt Here are the references from the post: [1] https://t.co/rOOElh9hI3 [2]
[Link preview (arxiv.org): "Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function..."]
0
1
7
Let’s implement Proximal Policy Optimization (PPO) together… Step #1 - Rollouts. We begin the PPO policy update with a batch of prompts. Using our current policy (i.e., the LLM we are training), we sample a single completion for each of these prompts. Additionally, we will
4
14
168
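To make the rollout step described in the tweet above concrete, here is a minimal sketch in Python, assuming a Hugging Face causal LM stands in for the policy; the model name and sampling settings are illustrative placeholders, not details from the post.

```python
# Minimal sketch of the rollout step: sample one completion per prompt from the
# current policy (the LLM being trained). Model name and sampling settings are
# placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder policy model
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)

prompts = [
    "Explain PPO in one sentence.",
    "What is a policy gradient?",
]

# Tokenize the batch of prompts (left padding so generation appends cleanly).
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# Sample a single completion per prompt with the current policy.
with torch.no_grad():
    outputs = policy.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=64,
        num_return_sequences=1,
    )

completions = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
for p, c in zip(prompts, completions):
    print(f"PROMPT: {p}\nCOMPLETION: {c}\n")
```

In a full PPO pipeline, these sampled completions would also be accompanied by their per-token log-probabilities and value estimates, since the later update steps need them.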
The memory folding mechanism proposed in this paper is great. It makes sense that agents should spend time explicitly compressing their memory into a semantic / organized format to avoid context explosion. Worth mentioning though that memory compression / retention in agents
7
26
138
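The memory-folding idea above can be illustrated with a toy sketch. This is not the paper's mechanism, just a minimal illustration of compressing older turns into a running summary under an assumed turn budget; the class and function names are hypothetical, and the summarizer (normally an LLM call) is stubbed out.

```python
# Toy illustration of folding older agent turns into a compact summary so the
# working context stays bounded (not the paper's exact mechanism).
from dataclasses import dataclass, field

@dataclass
class FoldedMemory:
    max_turns: int = 8                       # how many raw turns to keep verbatim
    summary: str = ""                        # compressed / "folded" memory
    turns: list = field(default_factory=list)

    def add(self, turn: str, summarize) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.max_turns:
            # Fold the oldest half of the raw turns into the running summary.
            cut = self.max_turns // 2
            to_fold, self.turns = self.turns[:cut], self.turns[cut:]
            self.summary = summarize(self.summary, to_fold)

    def context(self) -> str:
        # What actually gets placed in the LLM's context window.
        return f"MEMORY SUMMARY:\n{self.summary}\n\nRECENT TURNS:\n" + "\n".join(self.turns)

# `summarize` would normally be an LLM call; a trivial stand-in is used here.
def naive_summarize(prev_summary: str, turns: list) -> str:
    return (prev_summary + " " + " | ".join(t[:60] for t in turns)).strip()

memory = FoldedMemory()
for i in range(12):
    memory.add(f"turn {i}: agent observed something", naive_summarize)
print(memory.context())
```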
assistive coding tools definitely make me more productive, but the pattern isn't uniform. biggest productivity boost comes later in the day / at night when I'm mentally exhausted. LLMs lower the barrier to entry for getting extra work done. validating or iterating on code with an
1
0
19
I can't believe I'm saying this - I'm officially a published author :D After three years, my first book is out. "AI for the Rest of Us" with @BloomsburyAcad is finally in the world. I wrote it because I watched too many people get left behind in AI conversations. The gap
8
6
60
"Through clever storytelling and illustration, [Sundaresan] brings technical concepts to life[.]" — Dr. Cameron R. Wolfe, Senior Research Scientist at Netflix (@cwolferesearch) Learn more: https://t.co/be2cECKogj
@DSaience
0
2
6
For full details, check out my new blog on PPO that I just released this morning (see image).
1
0
3
Proximal Policy Optimization (PPO) is one of the most common (and complicated) RL algorithms used for LLMs. Here’s how it works… TRPO. PPO is inspired by TRPO, which uses a constrained objective that: 1. Normalizes action / token probabilities of current policy by those of an
7
63
339
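For reference, the standard TRPO and PPO objectives that the tweet above alludes to can be written as follows; the notation may differ slightly from the blog post.

```latex
% Standard TRPO and PPO objectives (notation may differ slightly from the post).
% r_t is the probability ratio between the current and old policy, \hat{A}_t is
% the advantage estimate, \delta the trust-region size, \epsilon the clip range.
\begin{align*}
  r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \\[4pt]
  \text{TRPO:}\quad \max_\theta \;& \mathbb{E}_t\big[\, r_t(\theta)\, \hat{A}_t \,\big]
    \quad \text{s.t.}\quad \mathbb{E}_t\big[\, \mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big) \big] \le \delta \\[4pt]
  \text{PPO:}\quad \max_\theta \;& \mathbb{E}_t\Big[ \min\big( r_t(\theta)\,\hat{A}_t,\;
    \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \Big]
\end{align*}
```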
For more details, see the recent overview in my RL series. It covers REINFORCE variants, which heavily utilize the bandit formulation.
0
1
2
There are two common RL training formulations for LLMs: the Markov Decision Process (MDP) formulation and the bandit formulation. Here’s how they work… Background: We should recall that an LLM generates output via next token prediction; i.e., by sequentially generating each output token.
1
4
14
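A compact way to contrast the two formulations referenced above (standard notation, not verbatim from the thread): the MDP view treats each generated token as an action with per-token credit assignment, while the bandit view treats the whole completion as a single action scored once by a sequence-level reward.

```latex
% MDP formulation: state s_t = (x, y_{<t}), action = next token y_t, per-token reward.
% Bandit formulation: the full completion y is one action, scored once (e.g., by a reward model).
\begin{align*}
  J_{\text{MDP}}(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \Big[\, \textstyle\sum_{t=1}^{T} \gamma^{\,t-1}\, r\big((x, y_{<t}),\, y_t\big) \Big] \\[4pt]
  J_{\text{bandit}}(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, R(x, y) \,\big]
\end{align*}
```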
For more, check out the blog I just wrote on simple online RL algorithms (like REINFORCE and RLOO) for LLMs. Info in image.
0
1
3
The complexity of PPO leads practitioners to avoid online RL in favor of RL-free or offline algorithms (e.g., DPO), but why not just use simpler versions of online RL? TL;DR: REINFORCE and RLOO have been shown to work well for training LLMs. And, they do not require a value model.
1
4
13
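As a concrete illustration of why these methods are lighter-weight, here is a minimal sketch of the RLOO estimator in Python: each sampled completion's baseline is the mean reward of the other k-1 samples, so no separate value model is needed. The function name is hypothetical, and the rewards and summed log-probs are assumed to be given; this illustrates the estimator, not a full training loop.

```python
# Minimal sketch of the RLOO (REINFORCE Leave-One-Out) update, assuming we already
# have k sampled completions per prompt with their rewards and summed log-probs.
import torch

def rloo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """
    logprobs: [k] sum of token log-probs for each of k completions (requires grad)
    rewards:  [k] scalar reward for each completion (no grad)
    """
    k = rewards.shape[0]
    # Leave-one-out baseline: for sample i, the average reward of the other k-1 samples.
    baseline = (rewards.sum() - rewards) / (k - 1)
    advantages = rewards - baseline
    # REINFORCE-style objective: maximize advantage-weighted log-likelihood.
    return -(advantages.detach() * logprobs).mean()

# Toy usage: 4 completions for one prompt.
logprobs = torch.tensor([-12.3, -9.8, -15.1, -11.0], requires_grad=True)
rewards = torch.tensor([0.2, 0.9, 0.1, 0.6])
loss = rloo_loss(logprobs, rewards)
loss.backward()
print(loss.item(), logprobs.grad)
```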
For full details, check out my writeup on the online-offline RL performance gap (details in image). Here are citations to papers mentioned above: [1] Xu, Shusheng, et al. "Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study." (2024). [2] Tajwar, Fahim, et al.
0
2
7
Is there truly a gap in performance between online and offline RL training for LLMs? Here’s what the research says… TL;DR: There is a clear performance gap between online and offline RL algorithms, especially in large-scale LLM training. However, this gap can be minimized by
14
39
238
My newsletter, Deep (Learning) Focus, recently passed 50,000 subscribers. Here are my four favorite articles and some reflections on my journey with the newsletter… (1) Demystifying Reasoning Models outlines the key details of training reasoning-based LLMs, focusing on the
3
7
35
For full details, I just published a 12k word overview that exhaustively covers every aspect of GPT-oss starting from LLM first principles (see image).
2
3
24
GPT-oss provides a rare peek into LLM research at OpenAI. Here are all of the technical details that OpenAI shared about the models… (1) Model architecture. GPT-oss has two models in its family: 20b and 120b. They both use a Mixture-of-Experts (MoE) architecture. The 120b (20b)
2
68
325
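To illustrate the general MoE pattern mentioned above, here is a minimal top-k routed MoE feed-forward layer in PyTorch. The class name, dimensions, expert count, and k are placeholders for illustration, not GPT-oss's actual configuration.

```python
# Minimal sketch of a Mixture-of-Experts (MoE) feed-forward layer with top-k routing.
# Dimensions, expert count, and k are placeholders, not GPT-oss's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: [tokens, d_model]
        scores = self.router(x)                          # [tokens, n_experts]
        weights, idx = scores.topk(self.k, dim=-1)       # route each token to k experts
        weights = F.softmax(weights, dim=-1)             # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```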