Bryan Chan Profile
Bryan Chan

@chanpyb

Followers: 185 · Following: 4K · Media: 1 · Statuses: 50

PhD student @rlai_lab. Prev: @GoogleDeepMind, @OcadoTechnology, @kindredai, @UofTCompSci

Edmonton
Joined October 2020
@chanpyb
Bryan Chan
6 months
I will be presenting this paper on how models trade off in-context and in-weight learning at #ICLR2025. Drop by on Saturday and I'll be happy to chat!
@TheOfficialACM
Association for Computing Machinery
8 months
Meet the recipients of the 2024 ACM A.M. Turing Award, Andrew G. Barto and Richard S. Sutton! They are recognized for developing the conceptual and algorithmic foundations of reinforcement learning. Please join us in congratulating the two recipients! https://t.co/GrDfgzW1fL
34
474
2K
@chanpyb
Bryan Chan
9 months
Excited to share that our work on understanding when ICL emerges has been accepted to #ICLR2025! Submission available for preview:
openreview.net
It has recently been demonstrated empirically that in-context learning emerges in transformers when certain distributional properties are present in the training data, but this ability can also...
@chanpyb
Bryan Chan
11 months
Thanks @m_wulfmeier! We were surprised to see that SAC-X is just very robust. One thing that interested us but that we didn't investigate further: learning from examples ended up being more efficient than learning from reward. Let's chat at #NeurIPS2024 if there's a chance!
@m_wulfmeier
Markus Wulfmeier
11 months
Here's a fascinating paper by @domo_mr_roboto's group linking hierarchical reinforcement learning and cheaply-obtainable auxiliary tasks https://t.co/n7dC8ifUNr Better exploration with minimal engineering effort remains a critical challenge (even for RLHF/AIF) - reminiscent of
0
0
6
@mhmd_elsaye
Mohamed Elsayed
11 months
Would you believe that deep RL can work without replay buffers, target networks, or batch updates? Our recent work gets deep RL agents to learn from a continuous stream of data, one sample at a time, without storing any samples. Joint work with @Gautham529 and @rupammahmood.
9
106
629
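
For readers curious what "one sample at a time" looks like in code, here is a generic, hedged sketch of a fully incremental TD update with no replay buffer, no target network, and no batching. This only illustrates the setting; it is not the AVG algorithm from the paper, and the network shape is made up.

import torch

# Made-up tiny Q-network over a concatenated (state, action) vector; dimensions are illustrative.
q_net = torch.nn.Sequential(
    torch.nn.Linear(4 + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt = torch.optim.SGD(q_net.parameters(), lr=1e-3)
GAMMA = 0.99

def incremental_td_step(s, a, r, s_next, a_next, done):
    """One SGD step from a single transition; nothing is stored afterwards."""
    q_sa = q_net(torch.cat([s, a]))
    with torch.no_grad():  # bootstrap target from the same network (no target net)
        target = r + (1.0 - done) * GAMMA * q_net(torch.cat([s_next, a_next]))
    loss = (q_sa - target).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
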
@Gautham529
Gautham Vasan
11 months
Our NeurIPS paper is now on arXiv: We introduce Action Value Gradient (AVG), a novel incremental deep RL method that learns in real-time, one sample at a time — no batch updates, target networks or a replay buffer! Co-authors @mhmd_elsaye @bellingerc @white_martha @rupammahmood
2
23
94
@MontrealRobots
REAL - Robotics and Embodied AI Lab
11 months
Hey all! We are thrilled to have @chanpyb from @UAlberta for this week's seminar! The talk is titled: "Why can't we use reinforcement learning for image-based robotic manipulation?". See you at 11:30 AM ET! https://t.co/vki05SSZgx #rl #manipulation #imitationLearning
0
4
35
@chanpyb
Bryan Chan
1 year
@XinyiChen2 I think this line of work will lead us to a better understanding of how LLMs work and, in turn, to new ideas for designing training algorithms for LLMs. N/N arXiv link:
arxiv.org
It has recently been demonstrated empirically that in-context learning emerges in transformers when certain distributional properties are present in the training data, but this ability can also...
0
0
1
@chanpyb
Bryan Chan
1 year
@XinyiChen2 Of course, we have also conducted experiments on a synthetic dataset, Omniglot, and fine-tuned an LLM with a small number of prompts to corroborate our theoretical findings. 7/N
1
0
1
@chanpyb
Bryan Chan
1 year
@XinyiChen2 In practice the models don't know the test errors. To bridge this gap, we provide a regret analysis showing that the training sample observed at each iteration can be seen as a test sample, so its loss provides a quantity similar to the test error. 6/N
1
0
0
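
A hedged sketch of the kind of online-to-batch argument this step alludes to (the notation below is assumed for illustration, not taken from the paper): if θ_t is the model at iteration t and x_t the fresh training sample it observes, then by the definition of regret

\[
\frac{1}{T}\sum_{t=1}^{T} \ell(\theta_t; x_t)
\;=\; \min_{\theta}\,\frac{1}{T}\sum_{t=1}^{T} \ell(\theta; x_t)
\;+\; \frac{\mathrm{Regret}_T}{T},
\]

and because θ_t has not yet been trained on x_t, the loss ℓ(θ_t; x_t) behaves like a test loss; sublinear regret then keeps the average training loss close to that of the best fixed model.
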
@chanpyb
Bryan Chan
1 year
@XinyiChen2 When each data point is seen sufficiently many times, the model will eventually perform only IWL, demonstrating the transience of ICL! This characterization also suggests that in some cases ICL is never transient, because IWL remains more erroneous than ICL. 5/N
1
0
0
@chanpyb
Bryan Chan
1 year
@XinyiChen2 With imbalanced datasets, we can expect the model to exhibit ICL on rare classes early on while exhibiting IWL on common classes, showing that a model can perform both ICL and IWL simultaneously. 4/N
1
0
0
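
For intuition, here is a minimal, hypothetical sketch of imbalanced sequence data of the kind described above (a Zipfian class distribution with the query's label recoverable from the context). The class counts and structure are illustrative, not the paper's construction.

import random

random.seed(0)
NUM_CLASSES = 100
# Zipfian class frequencies: a few very common classes, a long tail of rare ones.
WEIGHTS = [1.0 / k for k in range(1, NUM_CLASSES + 1)]

def make_sequence(context_len=8):
    """One training sequence: (input, label) exemplars as context, then a query.
    Rare classes are mostly predictable only from the context (favours ICL);
    common classes recur so often across sequences that their labels can be
    memorized in the weights (favours IWL)."""
    query = random.choices(range(NUM_CLASSES), weights=WEIGHTS, k=1)[0]
    context = [(query, query)]  # guarantee the query's label appears in the context
    context += [(c, c) for c in
                random.choices(range(NUM_CLASSES), weights=WEIGHTS, k=context_len - 1)]
    random.shuffle(context)
    return context, query
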
@chanpyb
Bryan Chan
1 year
@XinyiChen2 Our result suggests that the model will perform ICL or IWL based on their corresponding test errors! In short, the model performs ICL for data points that appear rarely and are predictable from the context, while IWL happens for data points that appear frequently. 3/N
1
0
0
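
A schematic way to read this step, together with the transience claim in 5/N above (the symbols below are assumptions for illustration, not the paper's formal statement): write ε_ICL(x) for the in-context predictor's test error on a data point x and ε_IWL(x, k) for the in-weight predictor's error after x has been seen k times. An error-minimizing model then behaves as

\[
\text{use ICL on } x \quad\Longleftrightarrow\quad \varepsilon_{\mathrm{ICL}}(x) \;<\; \varepsilon_{\mathrm{IWL}}(x, k),
\]

so if ε_IWL(x, k) shrinks as k grows, ICL on x is eventually abandoned (it is transient) unless ε_ICL(x) stays below the limiting in-weight error.
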
@chanpyb
Bryan Chan
1 year
In this work with my collaborators @XinyiChen2, András György, and Dale Schuurmans, we provide a theory to characterize the emergence and transience of ICL through a simplified model. 2/N
1
0
1
@chanpyb
Bryan Chan
1 year
LLMs can leverage context information, i.e., in-context learning (ICL), or memorize solutions, i.e., in-weight learning (IWL), for prediction. But when does each happen? 1/N
1
0
1
@chanpyb
Bryan Chan
1 year
I think this line of work is very promising; there are many theoretical questions left to answer. We also bypassed exploration for now by using demonstrations. Credits to my collaborators Anson Leung and @jabergT. N/N
0
0
1
@chanpyb
Bryan Chan
1 year
Finally, what's cool about this is that we actually compared these algorithms over multiple seeds, which many papers don't do when it comes to real-life robotic experiments! 7/N
1
0
1
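
A small, hedged sketch of the evaluation practice being described here (aggregating over seeds rather than reporting a single run); the numbers below are made up for illustration.

import numpy as np

def summarize(per_seed_scores):
    """Mean and 95% confidence half-width (normal approximation) across seeds."""
    x = np.asarray(per_seed_scores, dtype=float)
    mean = x.mean()
    half_width = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return mean, half_width

# e.g. one algorithm's success rate over 5 seeds (illustrative values only):
print(summarize([0.72, 0.78, 0.75, 0.70, 0.80]))
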
@chanpyb
Bryan Chan
1 year
With this simple change, we can do hybrid RL with just 50 human demonstrations, and the agent achieves around 75% with just 20 minutes of interaction time. With the same amount of data, BC can't even achieve this performance! 6/N
1
0
0
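
A hedged sketch of one common way to set up hybrid RL from a handful of demonstrations: keep the demonstration transitions alongside online data and draw a fixed fraction of every minibatch from them. This is a plausible reading of the tweet, not the paper's exact agent; the class and parameter names are hypothetical.

import random

class HybridReplay:
    """Replay that mixes demonstration transitions with online transitions."""
    def __init__(self, demo_transitions, demo_fraction=0.5):
        self.demos = list(demo_transitions)   # e.g. transitions from ~50 human demos
        self.online = []
        self.demo_fraction = demo_fraction

    def add(self, transition):
        self.online.append(transition)

    def sample(self, batch_size):
        n_demo = min(int(batch_size * self.demo_fraction), len(self.demos))
        batch = random.sample(self.demos, n_demo)
        if self.online:
            batch += random.choices(self.online, k=batch_size - n_demo)
        return batch
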
@chanpyb
Bryan Chan
1 year
Well then, the regularizer is essentially decorrelating the latent representation, which also lets us remove the target network entirely, since the target network was introduced in part to decorrelate consecutive state-action pairs. 5/N
1
0
0
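
One plausible form of a regularizer whose effect is to decorrelate the latent representation, as described above, is an off-diagonal covariance penalty. This is an illustration of the idea only, not necessarily the formulation used in the paper.

import torch

def decorrelation_penalty(z):
    """Penalize off-diagonal entries of the batch covariance of latents z
    (shape: batch x dim), pushing latent dimensions toward being uncorrelated."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return off_diag.pow(2).sum()
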
@chanpyb
Bryan Chan
1 year
This can actually be explained through the learning dynamics of the Q-function under TD learning. Previous works have studied this through the neural tangent kernel and found that the similarity of state-action pairs dictates how all Q-values change after an SGD step. 4/N
1
0
0
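
A hedged sketch of the neural-tangent-kernel observation referenced here (standard linearized-dynamics reasoning; the precise statements are in the prior works): after one semi-gradient TD step on (s, a) with step size α and TD error δ, any other Q-value moves approximately as

\[
\Delta Q_\theta(s', a') \;\approx\; \alpha\, \delta\, \nabla_\theta Q_\theta(s', a')^{\top}\, \nabla_\theta Q_\theta(s, a),
\]

so the gradient inner product, a kernel measuring the similarity of state-action pairs, dictates how strongly every other Q-value is dragged along, including the bootstrap target for the very next, highly similar state-action pair.
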