Yunhao (Robin) Tang Profile
Yunhao (Robin) Tang

@robinphysics

Followers
1K
Following
1K
Media
25
Statuses
128

Interested in RL. Science @MistralAI. Prev Llama post-training @AIatMeta, Gemini post-training and deep RL research @Deepmind, PhD @Columbia

Joined November 2018
@robinphysics
Yunhao (Robin) Tang
1 year
Online interaction is probably a defining property of RL. But with the rise of offline algos, it is not clear if the “online” bit of RL is necessary for RLHF. We hypothesis-test the causes of the perf gap between online and offline alignment. Details in 🧵
Tweet media one
3
16
71
@robinphysics
Yunhao (Robin) Tang
24 days
Taking the k3 estimate as an example (from John's popular blog post). Contrary to popular practice, differentiating the estimate as a loss ends up enforcing the reverse KL, but only incidentally. See more details:
0
5
30
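A compact sketch of why this happens, under the usual convention that responses are sampled from π_θ first and the estimate is then differentiated without backpropagating through the sampling (my notation, not the thread's; r = π_ref(x)/π_θ(x)):

```latex
% k3 estimate of KL(pi_theta || pi_ref), evaluated on x ~ pi_theta with r = pi_ref(x)/pi_theta(x):
k_3(x) = r - 1 - \log r
% Differentiating k3 as a loss (sampling held fixed), using \nabla_\theta r = -r\,\nabla_\theta \log \pi_\theta(x):
\nabla_\theta k_3(x) = (1 - r)\,\nabla_\theta \log \pi_\theta(x)
% Taking the expectation over x ~ pi_theta (the score function has mean zero):
\mathbb{E}_{x \sim \pi_\theta}\!\big[\nabla_\theta k_3(x)\big]
  = -\sum_x \pi_{\mathrm{ref}}(x)\,\nabla_\theta \log \pi_\theta(x)
  = \nabla_\theta\,\mathrm{KL}\!\left(\pi_{\mathrm{ref}} \,\|\, \pi_\theta\right)
```

So the expected update direction is the gradient of the reverse KL, KL(π_ref ‖ π_θ), not of the KL(π_θ ‖ π_ref) that k3 estimates; the same calculation for k1 = −log r gives an expected gradient of exactly zero.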
@robinphysics
Yunhao (Robin) Tang
24 days
Perhaps to one's surprise, taking KL estimates as `kl_loss` to minimize does *not* enforce the KL. This implementation, however, is quite common in open-source RL repos and recent research papers. In short: the grad of an unbiased KL estimate is not an unbiased estimate of the KL grad.
Tweet media one
14
52
654
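A minimal numerical check of this claim on a toy categorical policy (a sketch I'm adding for illustration; the estimator names k1/k3 follow John Schulman's blog post, everything else is assumed):

```python
# Toy check: the gradient of a "KL estimate used as a loss" is not an unbiased
# estimate of the KL gradient. Small categorical policy, exact expectations.
import torch

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)   # parameters of pi_theta
ref_logits = torch.randn(5)                   # fixed reference policy pi_ref

pi = lambda: torch.softmax(logits, dim=-1)
pi_ref = torch.softmax(ref_logits, dim=-1)

def grad_of(objective):
    """Gradient of a scalar objective w.r.t. the policy logits."""
    logits.grad = None
    objective().backward()
    return logits.grad.clone()

# Exact KLs, differentiated exactly (including through the sampling distribution).
forward_kl = lambda: (pi() * (pi().log() - pi_ref.log())).sum()    # KL(pi || pi_ref)
reverse_kl = lambda: (pi_ref * (pi_ref.log() - pi().log())).sum()  # KL(pi_ref || pi)

# Expected gradient of "estimator as loss": E_{x~pi}[grad k(x)], computed exactly by
# weighting each outcome with a *detached* pi(x) -- this mimics "sample, then backprop".
def expected_loss_grad(k):
    return grad_of(lambda: (pi().detach() * k()).sum())

log_r = lambda: pi_ref.log() - pi().log()      # log r with r = pi_ref(x) / pi_theta(x)
k1 = lambda: -log_r()                          # unbiased estimate of KL(pi || pi_ref)
k3 = lambda: log_r().exp() - 1 - log_r()       # also unbiased for KL(pi || pi_ref)

print(expected_loss_grad(k1))                  # ~0: minimizing k1 does nothing in expectation
print(torch.allclose(expected_loss_grad(k3), grad_of(reverse_kl), atol=1e-6))  # True: follows the reverse KL
print(torch.allclose(expected_loss_grad(k3), grad_of(forward_kl), atol=1e-6))  # False in general
```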
@robinphysics
Yunhao (Robin) Tang
26 days
RT @MistralAI: Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasonin….
0
454
0
@robinphysics
Yunhao (Robin) Tang
30 days
It was refreshing to see the impact that small algorithmic changes have on system performance. While the “double-sided” PPO/GRPO clipping is dominant in the literature, we argue that a single-sided clipping akin to IMPALA better fits the design of distributed training.
Tweet media one
0
0
12
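For concreteness, here is a rough sketch of the two clipping styles on a token-level policy-gradient loss; this paraphrases the generic recipes (PPO/GRPO-style double-sided clipping vs. IMPALA-style one-sided importance-weight truncation), not the exact LlamaRL objective, and all names are illustrative:

```python
# Two flavors of clipping for off-policy policy-gradient losses. Illustrative only --
# not the exact LlamaRL loss. Tensors have shape [batch, tokens]; `adv` carries no grad.
import torch

def ppo_clip_loss(logp, logp_behavior, adv, eps=0.2):
    """Double-sided PPO/GRPO-style clipping: the ratio is kept inside [1-eps, 1+eps]."""
    ratio = torch.exp(logp - logp_behavior)
    return -torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

def one_sided_clip_loss(logp, logp_behavior, adv, rho_max=1.0):
    """Single-sided clipping akin to IMPALA: cap the importance weight from above only,
    then use it as a fixed (detached) coefficient on the policy-gradient term."""
    rho = torch.clamp(torch.exp(logp - logp_behavior), max=rho_max).detach()
    return -(rho * adv * logp).mean()
```

The one-sided version only bounds how much a stale, off-policy sample can weigh in the update, which is the kind of correction the tweet points at for asynchronous actors and learners.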
@robinphysics
Yunhao (Robin) Tang
30 days
Introducing LlamaRL, a distributed RL framework for training LLMs at scale. LlamaRL is highly modular, PyTorch-native, customizes the optimization of actors/learners to max out throughput, and adjusts for systemic off-policyness to stabilize training.
Tweet media one
4
50
300
@robinphysics
Yunhao (Robin) Tang
1 year
RT @ZacKenton1: Eventually, humans will need to supervise superhuman AI - but how? Can we study it now?. We don't have superhuman AI, but w….
0
60
0
@robinphysics
Yunhao (Robin) Tang
1 year
Thanks @_akhaliq for promoting our work! Unlike regular RL, where golden rewards r(s,a) are available and online is generally deemed better than offline, in RLHF this is less clear. Complementary to some concurrent work, we investigate causes of the perf gap between online and offline.
@_akhaliq
AK
1 year
Understanding the performance gap between online and offline alignment algorithms. Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, rising popularity in offline alignment algorithms challenge the need
Tweet media one
0
4
16
@robinphysics
Yunhao (Robin) Tang
1 year
The findings ought to be taken with a grain of salt due to limitations in our experimental setups. But hopefully this investigation contributes to a better understanding of RLHF practices. Finally, very grateful to my collaborators @GoogleDeepMind on this fun project!
1
0
6
@robinphysics
Yunhao (Robin) Tang
1 year
Some takeaways:
- There is something more to online than wider coverage of response generation.
- Offline training improves the policy in a much more implicit way than online (discriminative vs. generative abilities).
- The gap persists across wider variants of algos and network sizes.
1
0
4
@robinphysics
Yunhao (Robin) Tang
1 year
Hypothesis 5: Scaling policy network size is all you need, i.e., offline benefits more from scaling. Unfortunately, scaling policy networks alone does not seem to suffice, as the online vs. offline gap can actually increase as we scale up.
Tweet media one
1
0
3
@robinphysics
Yunhao (Robin) Tang
1 year
Hypothesis 4: Maybe much of the gap is due to interactions between offline training and the discriminative loss, rather than offline itself. We evaluate the same setting with best-of-2, which uses an SFT-like rather than a discriminative loss. The online vs. offline gap still persists in some cases.
Tweet media one
1
0
3
@robinphysics
Yunhao (Robin) Tang
1 year
Hypothesis 3: Offline training learns an RM, so maybe improving the offline policy's accuracy as an RM would improve performance. We found little positive correlation between the generative and discriminative abilities of policies, both across and within different algos.
Tweet media one
1
0
3
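As background on "accuracy as an RM": a DPO/IPO-style policy induces an implicit reward log π(y|x) − log π_ref(y|x) (up to a scaling factor), so its discriminative ability can be scored as preference-classification accuracy. A minimal sketch of that metric (my illustration, not necessarily the paper's exact protocol; inputs are assumed to be per-response summed log-probs):

```python
# Policy-as-reward-model accuracy: how often the policy's implicit reward
# log pi(y|x) - log pi_ref(y|x) ranks the preferred response above the rejected one.
# Minimal illustration, not the paper's exact evaluation protocol.
import torch

def policy_as_rm_accuracy(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """All inputs: tensors of shape [num_pairs], summed log-probs of full responses."""
    reward_chosen = logp_chosen - ref_logp_chosen
    reward_rejected = logp_rejected - ref_logp_rejected
    return (reward_chosen > reward_rejected).float().mean()
```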
@robinphysics
Yunhao (Robin) Tang
1 year
Hypothesis 2: The offline dataset is of low quality. Relating offline algos to SFT, maybe offline is more susceptible to dataset quality. We generated “high-quality” data from a highly competent policy and ran the offline algo over it. Surprisingly, offline did not learn much in this case.
Tweet media one
1
0
3
@robinphysics
Yunhao (Robin) Tang
1 year
Hypothesis 1: Online has wider response coverage than offline. Online sees more diverse responses than offline, as the latter uses a static dataset. To verify this, we saved the online (prompt, responses) pairs as a dataset and ran the offline algo over it. This did not explain the gap.
Tweet media one
1
0
3
@robinphysics
Yunhao (Robin) Tang
1 year
To study causes of the gap, we first need to establish one. We compare online vs. offline IPO, so that the two only differ in the sampling distribution. In what sense is online better than offline? Online achieves a better perf vs. KL trade-off than offline. Both are subject to Goodhart’s law.
Tweet media one
1
0
3
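For reference, the IPO objective that both variants optimize regresses the reference-normalized log-likelihood margin toward a constant set by the regularization strength. Roughly (a sketch following Azar et al.'s formulation, with illustrative names, not the paper's code):

```python
# IPO loss on a preference pair (y_w preferred over y_l), roughly following Azar et al.:
# minimize ( [log pi(y_w)/pi_ref(y_w) - log pi(y_l)/pi_ref(y_l)] - 1/(2*tau) )^2.
# Online and offline IPO share this loss and differ only in where the pairs come from.
import torch

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()

# Online variant: sample two responses from the current policy for each prompt, rank them
# with the preference model, compute ipo_loss, update, and repeat with fresh samples.
# Offline variant: the (prompt, y_w, y_l) triples come from a fixed, pre-collected dataset.
```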
@robinphysics
Yunhao (Robin) Tang
2 years
RT @misovalko: Fast-forward ⏩ alignment research from @GoogleDeepMind ! Our latest results enhance alignment outcomes in Large Language Mod….
0
129
0
@robinphysics
Yunhao (Robin) Tang
2 years
Interested in how
**non-contrastive representation learning for RL**
is magically equivalent to
**gradient-based PCA/SVD on the transition matrix**,
and hence won't collapse and captures spectral info about the transition?
Come talk to us at #ICML2023 Hall 1 #308 at 1:30pm.
@robinphysics
Yunhao (Robin) Tang
2 years
Interested in how non-contrastive representation learning works in RL? We show:
(1) why representations do not collapse;
(2) how it relates to gradient PCA/SVD of the transition matrix.
Understanding Self-Predictive Learning for RL #ICML2023 @GoogleDeepMind
Tweet media one
0
4
50
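A toy version of the object being analyzed, in case it helps: self-predictive (non-contrastive) learning on a tabular chain, where a linear encoder predicts a stop-gradient copy of its own next-state embedding and the predictor is held at its least-squares optimum (the two-timescale idealization). Everything below is my own simplification for illustration, not the paper's experiments; the printed cosines against a top spectral subspace of the transition matrix are a diagnostic only.

```python
# Self-predictive representation learning on a tabular Markov chain, idealized:
# minimize || Phi P - stop_grad(T Phi) ||_F^2 over the encoder table Phi, with the
# predictor P at its least-squares optimum. Under the idealized continuous-time
# analysis, Phi^T Phi is preserved (no collapse) and the span of Phi tracks a top
# spectral subspace of T; the prints below are diagnostics only.
import numpy as np

rng = np.random.default_rng(0)
n_states, k = 32, 4

# Symmetric doubly-stochastic transition matrix: average a few random permutation
# matrices, then symmetrize (both operations preserve double stochasticity).
T = sum(np.eye(n_states)[rng.permutation(n_states)] for _ in range(8)) / 8
T = 0.5 * (T + T.T)

Phi = rng.normal(size=(n_states, k)) / np.sqrt(n_states)  # row s = embedding of state s

def subspace_cosines(A, B):
    """Cosines of the principal angles between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

# One natural reference subspace: top-k eigenvectors of T by eigenvalue magnitude
# (T is symmetric, so eigenvectors and singular vectors coincide).
eigvals, eigvecs = np.linalg.eigh(T)
top_subspace = eigvecs[:, np.argsort(-np.abs(eigvals))[:k]]

lr = 0.05
for step in range(5001):
    # Two-timescale idealization: predictor at its least-squares optimum given Phi.
    P = np.linalg.lstsq(Phi, T @ Phi, rcond=None)[0]
    # Semi-gradient step on ||Phi P - stop_grad(T Phi)||^2 (constant factors folded into lr).
    Phi = Phi - lr * (Phi @ P - T @ Phi) @ P.T
    if step % 1000 == 0:
        sigma = np.linalg.svd(Phi, compute_uv=False)
        print(f"step {step:4d}  sigma_min(Phi) = {sigma[-1]:.4f}  "
              f"cosines to top-{k} subspace = {np.round(subspace_cosines(Phi, top_subspace), 3)}")
```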
@robinphysics
Yunhao (Robin) Tang
2 years
RT @wwdabney: Even if all you want is a value function, using quantile TD (QTD) can give a better estimate than standard TD. Today at #ICM….
0
3
0
@robinphysics
Yunhao (Robin) Tang
2 years
Interested in how non-contrastive representation learning works in RL? We show:
(1) why representations do not collapse;
(2) how it relates to gradient PCA/SVD of the transition matrix.
Understanding Self-Predictive Learning for RL #ICML2023 @GoogleDeepMind
Tweet media one
1
56
160