Yunhao (Robin) Tang Profile
Yunhao (Robin) Tang

@robinphysics

Followers
1K
Following
1K
Media
25
Statuses
128

Interested in RL. Science @MistralAI. Prev Llama post-training @AIatMeta, Gemini post-training and deep RL research @Deepmind, PhD @Columbia

Joined November 2018
@robinphysics
Yunhao (Robin) Tang
1 year
Online interaction is probably a defining property of RL. But with the rise of offline algos, it is not clear if the “online” bit of RL is necessary for RLHF. We hypothesis-test the causes of the perf gap between online and offline alignment. Details in 🧵
Tweet media one
3
16
71
@robinphysics
Yunhao (Robin) Tang
24 days
Taking the k3 estimate as an example (from John's popular blog post). Contrary to popular practice, differentiating the estimate as a loss ends up enforcing the reverse KL, but only incidentally. See more details:
0
5
30
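A compact sketch of why this happens, under the usual convention that responses are sampled from π_θ first and the estimate is then differentiated without backpropagating through the sampling (my notation, not the thread's; r = π_ref(x)/π_θ(x)):

```latex
% k3 estimate of KL(pi_theta || pi_ref), evaluated on x ~ pi_theta with r = pi_ref(x)/pi_theta(x):
k_3(x) = r - 1 - \log r
% Differentiating k3 as a loss (sampling held fixed), using \nabla_\theta r = -r\,\nabla_\theta \log \pi_\theta(x):
\nabla_\theta k_3(x) = (1 - r)\,\nabla_\theta \log \pi_\theta(x)
% Taking the expectation over x ~ pi_theta (the score function has mean zero):
\mathbb{E}_{x \sim \pi_\theta}\!\big[\nabla_\theta k_3(x)\big]
  = -\sum_x \pi_{\mathrm{ref}}(x)\,\nabla_\theta \log \pi_\theta(x)
  = \nabla_\theta\,\mathrm{KL}\!\left(\pi_{\mathrm{ref}} \,\|\, \pi_\theta\right)
```

So the expected update direction is the gradient of the reverse KL, KL(π_ref ‖ π_θ), not of the KL(π_θ ‖ π_ref) that k3 estimates; the same calculation for k1 = −log r gives an expected gradient of exactly zero.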
@robinphysics
Yunhao (Robin) Tang
24 days
Perhaps to one's surprise, taking KL estimates as `kl_loss` to minimize does *not* enforce the KL. This implementation, however, is quite common in open-source RL repos and recent research papers. In short: the grad of an unbiased KL estimate is not an unbiased estimate of the KL grad.
Tweet media one
14
52
654
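A minimal numerical check of this claim on a toy categorical policy (a sketch I'm adding for illustration; the estimator names k1/k3 follow John Schulman's blog post, everything else is assumed):

```python
# Toy check: the gradient of a "KL estimate used as a loss" is not an unbiased
# estimate of the KL gradient. Small categorical policy, exact expectations.
import torch

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)   # parameters of pi_theta
ref_logits = torch.randn(5)                   # fixed reference policy pi_ref

pi = lambda: torch.softmax(logits, dim=-1)
pi_ref = torch.softmax(ref_logits, dim=-1)

def grad_of(objective):
    """Gradient of a scalar objective w.r.t. the policy logits."""
    logits.grad = None
    objective().backward()
    return logits.grad.clone()

# Exact KLs, differentiated exactly (including through the sampling distribution).
forward_kl = lambda: (pi() * (pi().log() - pi_ref.log())).sum()    # KL(pi || pi_ref)
reverse_kl = lambda: (pi_ref * (pi_ref.log() - pi().log())).sum()  # KL(pi_ref || pi)

# Expected gradient of "estimator as loss": E_{x~pi}[grad k(x)], computed exactly by
# weighting each outcome with a *detached* pi(x) -- this mimics "sample, then backprop".
def expected_loss_grad(k):
    return grad_of(lambda: (pi().detach() * k()).sum())

log_r = lambda: pi_ref.log() - pi().log()      # log r with r = pi_ref(x) / pi_theta(x)
k1 = lambda: -log_r()                          # unbiased estimate of KL(pi || pi_ref)
k3 = lambda: log_r().exp() - 1 - log_r()       # also unbiased for KL(pi || pi_ref)

print(expected_loss_grad(k1))                  # ~0: minimizing k1 does nothing in expectation
print(torch.allclose(expected_loss_grad(k3), grad_of(reverse_kl), atol=1e-6))  # True: follows the reverse KL
print(torch.allclose(expected_loss_grad(k3), grad_of(forward_kl), atol=1e-6))  # False in general
```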
@robinphysics
Yunhao (Robin) Tang
26 days
RT @MistralAI: Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasonin….
0
454
0
@robinphysics
Yunhao (Robin) Tang
30 days
It was refreshing to see the impact that small algorithmic changes have on system performance. While the “double-sided” PPO/GRPO clipping is dominant in the literature, we argue that a single-sided clipping akin to IMPALA better fits the design of distributed training.
Tweet media one
0
0
12
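For concreteness, here is a rough sketch of the two clipping styles on a token-level policy-gradient loss; this paraphrases the generic recipes (PPO/GRPO-style double-sided clipping vs. IMPALA-style one-sided importance-weight truncation), not the exact LlamaRL objective, and all names are illustrative:

```python
# Two flavors of clipping for off-policy policy-gradient losses. Illustrative only --
# not the exact LlamaRL loss. Tensors have shape [batch, tokens]; `adv` carries no grad.
import torch

def ppo_clip_loss(logp, logp_behavior, adv, eps=0.2):
    """Double-sided PPO/GRPO-style clipping: the ratio is kept inside [1-eps, 1+eps]."""
    ratio = torch.exp(logp - logp_behavior)
    return -torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

def one_sided_clip_loss(logp, logp_behavior, adv, rho_max=1.0):
    """Single-sided clipping akin to IMPALA: cap the importance weight from above only,
    then use it as a fixed (detached) coefficient on the policy-gradient term."""
    rho = torch.clamp(torch.exp(logp - logp_behavior), max=rho_max).detach()
    return -(rho * adv * logp).mean()
```

The one-sided version only bounds how much a stale, off-policy sample can weigh in the update, which is the kind of correction the tweet points at for asynchronous actors and learners.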
@robinphysics
Yunhao (Robin) Tang
30 days
Introducing LlamaRL, a distributed RL framework for training LLMs at scale. LlamaRL is highly modular, PyTorch-native, customizes the optimization of actors/learners to max out throughput, and adjusts for systemic off-policyness to stabilize training.
Tweet media one
4
50
300
@robinphysics
Yunhao (Robin) Tang
1 year
RT @ZacKenton1: Eventually, humans will need to supervise superhuman AI - but how? Can we study it now?. We don't have superhuman AI, but w….
0
60
0
@robinphysics
Yunhao (Robin) Tang
1 year
Thanks @_akhaliq for promoting our work! Unlike regular RL, where golden rewards r(s,a) are available and online is generally deemed better than offline, in RLHF this is less clear. Complementary to some concurrent work, we investigate causes of the perf gap between online and offline.
@_akhaliq
AK
1 year
Understanding the performance gap between online and offline alignment algorithms. Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, rising popularity in offline alignment algorithms challenge the need
Tweet media one
0
4
16
@robinphysics
Yunhao (Robin) Tang
1 year
The findings ought to be taken with a grain of salt due to limitations in our experimental setups. But hopefully this investigation contributes to a better understanding of RLHF practices. Finally, very grateful to my collaborators @GoogleDeepMind on this fun project!
1
0
6
@robinphysics
Yunhao (Robin) Tang
1 year
Some takeaways:
- There is something more to online than wider coverage of response generation.
- Offline training improves the policy in a much more implicit way than online (discriminative vs. generative abilities).
- The gap persists across wider variants of algos and network sizes.
1
0
4
@robinphysics
Yunhao (Robin) Tang
1 year
Hypothesis 5: Scaling policy network size is all you need, i.e., offline benefits more from scaling. Unfortunately, scaling policy networks alone does not seem to suffice, as the online vs. offline gap can actually increase as we scale up.
Tweet media one
1
0
3
@robinphysics
Yunhao (Robin) Tang
1 year
Hypothesis 4: Maybe much of the gap is due to interactions between offline training and the discriminative loss, rather than offline itself. We evaluate the same setting with best-of-2, which uses an SFT-like rather than a discriminative loss. The online vs. offline gap still persists in some cases.
Tweet media one
1
0
3
@robinphysics
Yunhao (Robin) Tang
1 year
Hypothesis 3: Offline training learns an RM, so maybe improving the offline policy's accuracy as an RM would improve performance. We found little positive correlation between the generative and discriminative abilities of policies, both across and within different algos.
Tweet media one
1
0
3
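As background on "accuracy as an RM": a DPO/IPO-style policy induces an implicit reward log π(y|x) − log π_ref(y|x) (up to a scaling factor), so its discriminative ability can be scored as preference-classification accuracy. A minimal sketch of that metric (my illustration, not necessarily the paper's exact protocol; inputs are assumed to be per-response summed log-probs):

```python
# Policy-as-reward-model accuracy: how often the policy's implicit reward
# log pi(y|x) - log pi_ref(y|x) ranks the preferred response above the rejected one.
# Minimal illustration, not the paper's exact evaluation protocol.
import torch

def policy_as_rm_accuracy(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """All inputs: tensors of shape [num_pairs], summed log-probs of full responses."""
    reward_chosen = logp_chosen - ref_logp_chosen
    reward_rejected = logp_rejected - ref_logp_rejected
    return (reward_chosen > reward_rejected).float().mean()
```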
@robinphysics
Yunhao (Robin) Tang
1 year
Hypothesis 2: The offline dataset is of low quality. Relating offline algos to SFT, maybe offline is more susceptible to dataset quality. We generated “high-quality” data from a highly competent policy and ran the offline algo over it. Surprisingly, offline did not learn much in this case.
Tweet media one
1
0
3
@robinphysics
Yunhao (Robin) Tang
1 year
Hypothesis 1: Online has wider response coverage than offline. Online sees more diverse responses than offline, as the latter uses a static dataset. To verify this, we saved the online (prompt, responses) pairs as a dataset and ran the offline algo over it. This did not explain the gap.
Tweet media one
1
0
3
@robinphysics
Yunhao (Robin) Tang
1 year
To study causes of the gap, we first need to establish one. We compare online vs. offline IPO, so that the two only differ in the sampling distribution. In what sense is online better than offline? Online achieves a better perf vs. KL trade-off than offline. Both are subject to Goodhart’s law.
Tweet media one
1
0
3
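For reference, the IPO objective that both variants optimize regresses the reference-normalized log-likelihood margin toward a constant set by the regularization strength. Roughly (a sketch following Azar et al.'s formulation, with illustrative names, not the paper's code):

```python
# IPO loss on a preference pair (y_w preferred over y_l), roughly following Azar et al.:
# minimize ( [log pi(y_w)/pi_ref(y_w) - log pi(y_l)/pi_ref(y_l)] - 1/(2*tau) )^2.
# Online and offline IPO share this loss and differ only in where the pairs come from.
import torch

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()

# Online variant: sample two responses from the current policy for each prompt, rank them
# with the preference model, compute ipo_loss, update, and repeat with fresh samples.
# Offline variant: the (prompt, y_w, y_l) triples come from a fixed, pre-collected dataset.
```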
@robinphysics
Yunhao (Robin) Tang
2 years
RT @misovalko: Fast-forward ⏩ alignment research from @GoogleDeepMind ! Our latest results enhance alignment outcomes in Large Language Mod….
0
129
0
@robinphysics
Yunhao (Robin) Tang
2 years
Interested in how
**non-contrastive representation learning for RL**
is magically equivalent to
**gradient-based PCA/SVD on the transition matrix**,
and hence won't collapse and captures spectral info about the transition?
Come talk to us at #ICML2023 Hall 1 #308 at 1:30pm.
@robinphysics
Yunhao (Robin) Tang
2 years
Interested in how non-contrastive representation learning works in RL? We show:
(1) why representations do not collapse;
(2) how it relates to gradient PCA/SVD of the transition matrix.
Understanding Self-Predictive Learning for RL #ICML2023 @GoogleDeepMind
Tweet media one
0
4
50
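A toy version of the object being analyzed, in case it helps: self-predictive (non-contrastive) learning on a tabular chain, where a linear encoder predicts a stop-gradient copy of its own next-state embedding and the predictor is held at its least-squares optimum (the two-timescale idealization). Everything below is my own simplification for illustration, not the paper's experiments; the printed cosines against a top spectral subspace of the transition matrix are a diagnostic only.

```python
# Self-predictive representation learning on a tabular Markov chain, idealized:
# minimize || Phi P - stop_grad(T Phi) ||_F^2 over the encoder table Phi, with the
# predictor P at its least-squares optimum. Under the idealized continuous-time
# analysis, Phi^T Phi is preserved (no collapse) and the span of Phi tracks a top
# spectral subspace of T; the prints below are diagnostics only.
import numpy as np

rng = np.random.default_rng(0)
n_states, k = 32, 4

# Symmetric doubly-stochastic transition matrix: average a few random permutation
# matrices, then symmetrize (both operations preserve double stochasticity).
T = sum(np.eye(n_states)[rng.permutation(n_states)] for _ in range(8)) / 8
T = 0.5 * (T + T.T)

Phi = rng.normal(size=(n_states, k)) / np.sqrt(n_states)  # row s = embedding of state s

def subspace_cosines(A, B):
    """Cosines of the principal angles between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

# One natural reference subspace: top-k eigenvectors of T by eigenvalue magnitude
# (T is symmetric, so eigenvectors and singular vectors coincide).
eigvals, eigvecs = np.linalg.eigh(T)
top_subspace = eigvecs[:, np.argsort(-np.abs(eigvals))[:k]]

lr = 0.05
for step in range(5001):
    # Two-timescale idealization: predictor at its least-squares optimum given Phi.
    P = np.linalg.lstsq(Phi, T @ Phi, rcond=None)[0]
    # Semi-gradient step on ||Phi P - stop_grad(T Phi)||^2 (constant factors folded into lr).
    Phi = Phi - lr * (Phi @ P - T @ Phi) @ P.T
    if step % 1000 == 0:
        sigma = np.linalg.svd(Phi, compute_uv=False)
        print(f"step {step:4d}  sigma_min(Phi) = {sigma[-1]:.4f}  "
              f"cosines to top-{k} subspace = {np.round(subspace_cosines(Phi, top_subspace), 3)}")
```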
@robinphysics
Yunhao (Robin) Tang
2 years
RT @wwdabney: Even if all you want is a value function, using quantile TD (QTD) can give a better estimate than standard TD. Today at #ICM….
0
3
0
@robinphysics
Yunhao (Robin) Tang
2 years
Interested in how non-contrastive representation learning works in RL? We show:
(1) why representations do not collapse;
(2) how it relates to gradient PCA/SVD of the transition matrix.
Understanding Self-Predictive Learning for RL #ICML2023 @GoogleDeepMind
Tweet media one
1
56
160