Sebastian Lee Profile
Sebastian Lee (@sebalexlee)
Followers: 76 · Following: 337 · Media: 4 · Statuses: 11

ML PhD Student

London, England · Joined July 2019
Sebastian Lee (@sebalexlee) · 2 years ago
RT @nishpathead: Excited to present “Dynamics of high-dimensional policy learning in RL” at the #ICLR2023 #physics4ml workshop tomorrow! We ha…
0 replies · 7 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(N/N) We find that while replay performs well in the orthogonal and aligned regimes, it fails in the intermediate-similarity regime, where it falls foul of the same dynamics underlying Maslow's hammer, just at shorter timescales, leading to a phenomenon we call catastrophic slowing.
0 replies · 0 retweets · 1 like

Sebastian Lee (@sebalexlee) · 3 years ago
(9/N) In part II of our paper, we investigate how commonly used methods for combating forgetting (EWC, a regularisation-based method by @ja_kirkpatrick et al., and replay, inspired by CLS theory in the brain by McClelland and others) fare across the spectrum of task similarity.
1 reply · 0 retweets · 1 like

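As a concrete reference for the two strategies named above, here is a minimal PyTorch-style sketch, assuming a regression loss; the penalty strength `lam`, the diagonal-Fisher dict `fisher`, the stored task-1 parameters `old_params`, and the replay batch are all illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ewc_penalty(model, fisher, old_params, lam):
    """Quadratic EWC penalty (Kirkpatrick et al., 2017):
    lam/2 * sum_i F_i * (theta_i - theta_i*)^2, where F_i is a diagonal
    Fisher estimate from task 1 and theta* the post-task-1 parameters."""
    penalty = sum(
        (fisher[name] * (p - old_params[name]) ** 2).sum()
        for name, p in model.named_parameters()
    )
    return 0.5 * lam * penalty

def task2_step(model, opt, x2, y2, fisher, old_params, lam=1.0, replay=None):
    """One SGD step on task 2, with an EWC penalty and an optional replay batch."""
    opt.zero_grad()
    loss = F.mse_loss(model(x2), y2)
    loss = loss + ewc_penalty(model, fisher, old_params, lam)
    if replay is not None:            # interleave stored task-1 examples
        xr, yr = replay
        loss = loss + F.mse_loss(model(xr), yr)
    loss.backward()
    opt.step()
    return loss.item()
```

The thread's point is precisely that these mechanisms behave differently across the task-similarity spectrum, so neither line of defence above is a silver bullet.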
Sebastian Lee (@sebalexlee) · 3 years ago
(8/N) We show detailed evidence for this hypothesis both in the ODE limit of the teacher-student setup (a solvable limit where the input dimension is taken to infinity) and in an image task with FashionMNIST data.
1 reply · 0 retweets · 1 like

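For readers unfamiliar with that "solvable limit": it refers to the classic order-parameter analysis of two-layer teacher-student learning (Saad & Solla, 1995). A sketch of the standard definitions, with notation chosen here for illustration rather than taken from the paper:

```latex
% Student weights w_k in R^N, teacher weights v_n in R^N, inputs x ~ N(0, I_N).
\begin{align*}
  Q_{k\ell} = \frac{\mathbf{w}_k \cdot \mathbf{w}_\ell}{N}, \qquad
  R_{kn}    = \frac{\mathbf{w}_k \cdot \mathbf{v}_n}{N}, \qquad
  T_{nm}    = \frac{\mathbf{v}_n \cdot \mathbf{v}_m}{N}
\end{align*}
% For Gaussian inputs, the generalisation error depends on the weights only
% through (Q, R, T); as N -> infinity, the SGD dynamics of Q and R concentrate
% onto deterministic ODEs in the rescaled time alpha = (SGD steps) / N.
```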
Sebastian Lee (@sebalexlee) · 3 years ago
(7/N) For intermediate similarity, there is still a benefit to re-using the specialised node for the second task. But unlike the aligned case, this interferes with the representation needed to continue performing well on the first task. This leads to forgetting!
1 reply · 0 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(6/N) For orthogonal teachers, the student mirrors the hypothetical optimum and learns disjoint representations. Again, this leads to little forgetting, since the initially specialised node is untouched during training on the second task.
1 reply · 0 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(5/N) In practice, SGD dynamics yield different solutions depending on the relationship between the teachers. For highly aligned teachers, the student re-uses the specialised node for the second task. But this is fine, since the representation needed for both tasks is highly similar.
1 reply · 0 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(4/N) Consider two two-layer teachers and one two-layer student (overparameterised with respect to the teachers). Assuming the student has specialised after learning teacher 1, minimal interference is in principle possible when training on teacher 2, provided only the second, disjoint sub-network of the student is used.
1 reply · 0 retweets · 0 likes

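A minimal numerical sketch of this setup, assuming erf activations and online SGD as in the classic teacher-student literature; the dimensions, learning rate, step counts, and similarity knob `gamma` are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
N, M, K = 500, 2, 4   # input dim; teacher hidden units; student hidden units (K > M)

def forward(w, v, x):
    """Two-layer net: y = v . erf(w @ x / sqrt(N))."""
    return v @ erf(w @ x / np.sqrt(N))

# Teacher 1, and a teacher 2 whose weights overlap with teacher 1's by gamma.
w1 = rng.standard_normal((M, N))
gamma = 0.5           # 1.0 ~ aligned teachers, 0.0 ~ (near-)orthogonal
w2 = gamma * w1 + np.sqrt(1 - gamma**2) * rng.standard_normal((M, N))
v_t = np.ones(M)

# Overparameterised student (K > M hidden units).
w_s = rng.standard_normal((K, N)) / np.sqrt(N)
v_s = rng.standard_normal(K)

def sgd_step(w_s, v_s, w_t, lr=0.5):
    """One online-SGD step on a fresh Gaussian input, squared loss."""
    x = rng.standard_normal(N)
    a = w_s @ x / np.sqrt(N)
    h = erf(a)
    err = v_s @ h - forward(w_t, v_t, x)
    dv = err * h                                       # dL/dv_s for L = err^2 / 2
    dw = np.outer(err * v_s * (2 / np.sqrt(np.pi)) * np.exp(-a**2), x / np.sqrt(N))
    return w_s - lr * dw, v_s - lr * dv

def test_error(w_s, v_s, w_t, n=2000):
    xs = rng.standard_normal((n, N))
    preds = erf(xs @ w_s.T / np.sqrt(N)) @ v_s
    return 0.5 * np.mean((preds - erf(xs @ w_t.T / np.sqrt(N)) @ v_t) ** 2)

for _ in range(100_000):                  # task 1: train on teacher 1
    w_s, v_s = sgd_step(w_s, v_s, w1)
err_before = test_error(w_s, v_s, w1)
for _ in range(100_000):                  # task 2: train on teacher 2
    w_s, v_s = sgd_step(w_s, v_s, w2)
print("forgetting on task 1:", test_error(w_s, v_s, w1) - err_before)
```

Sweeping `gamma` between 0 (orthogonal) and 1 (aligned) traces out the task-similarity spectrum the thread discusses.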
Sebastian Lee (@sebalexlee) · 3 years ago
(3/N) I learn a 🔨 is right for a nail. For a highly dissimilar object (e.g. a log), I am likely to seek a new tool/solution. However, a more similar object like a screw may erroneously tempt me into re-using the 🔨. We propose something similar is happening in neural networks.
1 reply · 0 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(2/N) "It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail" - Abraham Maslow, 1966. We borrow intuition from this aphorism to explain recent findings that intermediate task similarity is worst for catastrophic forgetting.
1 reply · 0 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(1/N) I had the pleasure of presenting recent work on continual learning with @stefesseM @ClopathLab @sebastiangoldt @SaxeLab at #LesHouches2022, before Stefano takes it on the next leg of the tour to @icmlconf next week. Paper 📰 ICML Poster #⃣ 1434
1 reply · 8 retweets · 40 likes