Bryan Chan Profile
Bryan Chan

@chanpyb

Followers: 185 · Following: 4K · Media: 1 · Statuses: 50

PhD student @rlai_lab. Prev: @GoogleDeepMind, @OcadoTechnology, @kindredai, @UofTCompSci

Edmonton
Joined October 2020
@chanpyb
Bryan Chan
6 months
I will be presenting this paper on how models trade off in-context and in-weight learning at #ICLR2025. Drop by on Saturday and I'll be happy to chat!
@TheOfficialACM
Association for Computing Machinery
8 months
Meet the recipients of the 2024 ACM A.M. Turing Award, Andrew G. Barto and Richard S. Sutton! They are recognized for developing the conceptual and algorithmic foundations of reinforcement learning. Please join us in congratulating the two recipients! https://t.co/GrDfgzW1fL
34
474
2K
@chanpyb
Bryan Chan
9 months
Excited to share that our work on understanding when ICL emerges has been accepted to #ICLR2025! Submission available for preview:
openreview.net
It has recently been demonstrated empirically that in-context learning emerges in transformers when certain distributional properties are present in the training data, but this ability can also...
@chanpyb
Bryan Chan
11 months
Thanks @m_wulfmeier! We were surprised to see that SAC-X is just very robust. One thing that interested us but that we didn't investigate further: learning from examples ended up being more efficient than learning from reward. Let's chat at #NeurIPS2024 if there's a chance!
@m_wulfmeier
Markus Wulfmeier
11 months
Here's a fascinating paper by @domo_mr_roboto's group linking hierarchical reinforcement learning and cheaply-obtainable auxiliary tasks https://t.co/n7dC8ifUNr Better exploration with minimal engineering effort remains a critical challenge (even for RLHF/AIF) - reminiscent of
0
0
6
@mhmd_elsaye
Mohamed Elsayed
11 months
Would you believe that deep RL can work without replay buffers, target networks, or batch updates? Our recent work gets deep RL agents to learn from a continuous stream of data, one sample at a time, without storing any samples. Joint work with @Gautham529 and @rupammahmood.
9
106
629
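
For readers curious what "one sample at a time" looks like in code, here is a generic, hedged sketch of a fully incremental TD update with no replay buffer, no target network, and no batching. This only illustrates the setting; it is not the AVG algorithm from the paper, and the network shape is made up.

import torch

# Made-up tiny Q-network over a concatenated (state, action) vector; dimensions are illustrative.
q_net = torch.nn.Sequential(
    torch.nn.Linear(4 + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt = torch.optim.SGD(q_net.parameters(), lr=1e-3)
GAMMA = 0.99

def incremental_td_step(s, a, r, s_next, a_next, done):
    """One SGD step from a single transition; nothing is stored afterwards."""
    q_sa = q_net(torch.cat([s, a]))
    with torch.no_grad():  # bootstrap target from the same network (no target net)
        target = r + (1.0 - done) * GAMMA * q_net(torch.cat([s_next, a_next]))
    loss = (q_sa - target).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
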
@Gautham529
Gautham Vasan
11 months
Our NeurIPS paper is now on arXiv: We introduce Action Value Gradient (AVG), a novel incremental deep RL method that learns in real-time, one sample at a time — no batch updates, target networks or a replay buffer! Co-authors @mhmd_elsaye @bellingerc @white_martha @rupammahmood
2
23
94
@MontrealRobots
REAL - Robotics and Embodied AI Lab
11 months
Hey all! We are thrilled to have @chanpyb from @UAlberta for this week's seminar! The talk is titled: "Why can't we use reinforcement learning for image-based robotic manipulation?". See you at 11:30 AM ET! https://t.co/vki05SSZgx #rl #manipulation #imitationLearning
0
4
35
@chanpyb
Bryan Chan
1 year
@XinyiChen2 I think this line of work will lead us to a better understanding of how LLMs work and, in turn, to new ideas for designing training algorithms for LLMs. N/N arXiv link:
arxiv.org
It has recently been demonstrated empirically that in-context learning emerges in transformers when certain distributional properties are present in the training data, but this ability can also...
0
0
1
@chanpyb
Bryan Chan
1 year
@XinyiChen2 Of course, we have also conducted experiments on a synthetic dataset, Omniglot, and fine-tuned an LLM with a small number of prompts to corroborate our theoretical findings. 7/N
1
0
1
@chanpyb
Bryan Chan
1 year
@XinyiChen2 In practice the models don't know the test errors. To bridge this gap, we provide a regret analysis showing that the training sample observed at each iteration can be seen as a test sample, so its loss provides a quantity similar to the test error. 6/N
1
0
0
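
A hedged sketch of the kind of online-to-batch argument this step alludes to (the notation below is assumed for illustration, not taken from the paper): if θ_t is the model at iteration t and x_t the fresh training sample it observes, then by the definition of regret

\[
\frac{1}{T}\sum_{t=1}^{T} \ell(\theta_t; x_t)
\;=\; \min_{\theta}\,\frac{1}{T}\sum_{t=1}^{T} \ell(\theta; x_t)
\;+\; \frac{\mathrm{Regret}_T}{T},
\]

and because θ_t has not yet been trained on x_t, the loss ℓ(θ_t; x_t) behaves like a test loss; sublinear regret then keeps the average training loss close to that of the best fixed model.
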
@chanpyb
Bryan Chan
1 year
@XinyiChen2 When each data point is seen sufficiently many times, the model will eventually perform only IWL, demonstrating the transience of ICL! This characterization also suggests that in some cases ICL is never transient, because IWL remains more erroneous than ICL. 5/N
1
0
0
@chanpyb
Bryan Chan
1 year
@XinyiChen2 With imbalanced datasets, we can expect the model to exhibit ICL on rare classes early on while exhibiting IWL on common classes, showing that a model can perform both ICL and IWL simultaneously. 4/N
1
0
0
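
For intuition, here is a minimal, hypothetical sketch of imbalanced sequence data of the kind described above (a Zipfian class distribution with the query's label recoverable from the context). The class counts and structure are illustrative, not the paper's construction.

import random

random.seed(0)
NUM_CLASSES = 100
# Zipfian class frequencies: a few very common classes, a long tail of rare ones.
WEIGHTS = [1.0 / k for k in range(1, NUM_CLASSES + 1)]

def make_sequence(context_len=8):
    """One training sequence: (input, label) exemplars as context, then a query.
    Rare classes are mostly predictable only from the context (favours ICL);
    common classes recur so often across sequences that their labels can be
    memorized in the weights (favours IWL)."""
    query = random.choices(range(NUM_CLASSES), weights=WEIGHTS, k=1)[0]
    context = [(query, query)]  # guarantee the query's label appears in the context
    context += [(c, c) for c in
                random.choices(range(NUM_CLASSES), weights=WEIGHTS, k=context_len - 1)]
    random.shuffle(context)
    return context, query
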
@chanpyb
Bryan Chan
1 year
@XinyiChen2 Our result suggests that the model will perform ICL or IWL based on their corresponding test errors! In short, the model performs ICL for data points that appear rarely and are predictable from the context, while IWL happens for data points that appear frequently. 3/N
1
0
0
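
A schematic way to read this step, together with the transience claim in 5/N above (the symbols below are assumptions for illustration, not the paper's formal statement): write ε_ICL(x) for the in-context predictor's test error on a data point x and ε_IWL(x, k) for the in-weight predictor's error after x has been seen k times. An error-minimizing model then behaves as

\[
\text{use ICL on } x \quad\Longleftrightarrow\quad \varepsilon_{\mathrm{ICL}}(x) \;<\; \varepsilon_{\mathrm{IWL}}(x, k),
\]

so if ε_IWL(x, k) shrinks as k grows, ICL on x is eventually abandoned (it is transient) unless ε_ICL(x) stays below the limiting in-weight error.
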
@chanpyb
Bryan Chan
1 year
In this work with my collaborators @XinyiChen2, András György, and Dale Schuurmans, we provide a theory to characterize the emergence and transience of ICL through a simplified model. 2/N
1
0
1
@chanpyb
Bryan Chan
1 year
LLMs can leverage context information, i.e., in-context learning (ICL), or memorize solutions, i.e., in-weight learning (IWL), for prediction. But when does each happen? 1/N
1
0
1
@chanpyb
Bryan Chan
1 year
I think this line of work is very promising; there are many theoretical questions left to answer. We also bypassed exploration for now by using demonstrations. Credits to my collaborators Anson Leung and @jabergT. N/N
0
0
1
@chanpyb
Bryan Chan
1 year
Finally, what's cool about this is that we actually compared these algorithms over multiple seeds, which many papers don't do when it comes to real-life robotic experiments! 7/N
1
0
1
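
A small, hedged sketch of the evaluation practice being described here (aggregating over seeds rather than reporting a single run); the numbers below are made up for illustration.

import numpy as np

def summarize(per_seed_scores):
    """Mean and 95% confidence half-width (normal approximation) across seeds."""
    x = np.asarray(per_seed_scores, dtype=float)
    mean = x.mean()
    half_width = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return mean, half_width

# e.g. one algorithm's success rate over 5 seeds (illustrative values only):
print(summarize([0.72, 0.78, 0.75, 0.70, 0.80]))
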
@chanpyb
Bryan Chan
1 year
With this simple change, we can do hybrid RL with just 50 human demonstrations, and the agent achieves around 75% with just 20 minutes of interaction time. With the same amount of data, BC can't even achieve this performance! 6/N
1
0
0
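
A hedged sketch of one common way to set up hybrid RL from a handful of demonstrations: keep the demonstration transitions alongside online data and draw a fixed fraction of every minibatch from them. This is a plausible reading of the tweet, not the paper's exact agent; the class and parameter names are hypothetical.

import random

class HybridReplay:
    """Replay that mixes demonstration transitions with online transitions."""
    def __init__(self, demo_transitions, demo_fraction=0.5):
        self.demos = list(demo_transitions)   # e.g. transitions from ~50 human demos
        self.online = []
        self.demo_fraction = demo_fraction

    def add(self, transition):
        self.online.append(transition)

    def sample(self, batch_size):
        n_demo = min(int(batch_size * self.demo_fraction), len(self.demos))
        batch = random.sample(self.demos, n_demo)
        if self.online:
            batch += random.choices(self.online, k=batch_size - n_demo)
        return batch
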
@chanpyb
Bryan Chan
1 year
Well then, the regularizer is essentially decorrelating the latent representation, which also lets us remove the target network entirely, since the target network was introduced in part to decorrelate consecutive state-action pairs. 5/N
1
0
0
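
One plausible form of a regularizer whose effect is to decorrelate the latent representation, as described above, is an off-diagonal covariance penalty. This is an illustration of the idea only, not necessarily the formulation used in the paper.

import torch

def decorrelation_penalty(z):
    """Penalize off-diagonal entries of the batch covariance of latents z
    (shape: batch x dim), pushing latent dimensions toward being uncorrelated."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return off_diag.pow(2).sum()
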
@chanpyb
Bryan Chan
1 year
This can actually be explained through the learning dynamics of the Q-function under TD learning. Previous works have studied this through the neural tangent kernel and found that the similarity of state-action pairs dictates how all Q-values change after an SGD step. 4/N
1
0
0
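
A hedged sketch of the neural-tangent-kernel observation referenced here (standard linearized-dynamics reasoning; the precise statements are in the prior works): after one semi-gradient TD step on (s, a) with step size α and TD error δ, any other Q-value moves approximately as

\[
\Delta Q_\theta(s', a') \;\approx\; \alpha\, \delta\, \nabla_\theta Q_\theta(s', a')^{\top}\, \nabla_\theta Q_\theta(s, a),
\]

so the gradient inner product, a kernel measuring the similarity of state-action pairs, dictates how strongly every other Q-value is dragged along, including the bootstrap target for the very next, highly similar state-action pair.
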