Sebastian Lee Profile
Sebastian Lee (@sebalexlee)
Followers: 76 · Following: 337 · Media: 4 · Statuses: 11

ML PhD Student

London, England · Joined July 2019
Sebastian Lee (@sebalexlee) · 2 years ago
RT @nishpathead: Excited to present “Dynamics of high-dimensional policy learning in RL” at the #ICLR2023 #physics4ml workshop tomorrow! We ha…
0 replies · 7 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(N/N) We find that while replay performs well in the orthogonal and aligned regimes, it fails in the intermediate-similarity regime, where it falls foul of the same dynamics underlying Maslow's hammer, just at shorter timescales, leading to a phenomenon we call catastrophic slowing.
0 replies · 0 retweets · 1 like

Sebastian Lee (@sebalexlee) · 3 years ago
(9/N) In part II of our paper, we investigate how commonly used methods for combating forgetting (EWC, a regularisation-based method by @ja_kirkpatrick et al., and replay, inspired by CLS theory in the brain by McClelland and others) fare across the spectrum of task similarity.
1 reply · 0 retweets · 1 like

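As a concrete reference for the two strategies named above, here is a minimal PyTorch-style sketch, assuming a regression loss; the penalty strength `lam`, the diagonal-Fisher dict `fisher`, the stored task-1 parameters `old_params`, and the replay batch are all illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ewc_penalty(model, fisher, old_params, lam):
    """Quadratic EWC penalty (Kirkpatrick et al., 2017):
    lam/2 * sum_i F_i * (theta_i - theta_i*)^2, where F_i is a diagonal
    Fisher estimate from task 1 and theta* the post-task-1 parameters."""
    penalty = sum(
        (fisher[name] * (p - old_params[name]) ** 2).sum()
        for name, p in model.named_parameters()
    )
    return 0.5 * lam * penalty

def task2_step(model, opt, x2, y2, fisher, old_params, lam=1.0, replay=None):
    """One SGD step on task 2, with an EWC penalty and an optional replay batch."""
    opt.zero_grad()
    loss = F.mse_loss(model(x2), y2)
    loss = loss + ewc_penalty(model, fisher, old_params, lam)
    if replay is not None:            # interleave stored task-1 examples
        xr, yr = replay
        loss = loss + F.mse_loss(model(xr), yr)
    loss.backward()
    opt.step()
    return loss.item()
```

The thread's point is precisely that these mechanisms behave differently across the task-similarity spectrum, so neither line of defence above is a silver bullet.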
Sebastian Lee (@sebalexlee) · 3 years ago
(8/N) We show detailed evidence for this hypothesis both in the ODE limit of the teacher-student setup (a solvable limit where the input dimension is taken to infinity) and in an image task with FashionMNIST data.
1 reply · 0 retweets · 1 like

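For readers unfamiliar with that "solvable limit": it refers to the classic order-parameter analysis of two-layer teacher-student learning (Saad & Solla, 1995). A sketch of the standard definitions, with notation chosen here for illustration rather than taken from the paper:

```latex
% Student weights w_k in R^N, teacher weights v_n in R^N, inputs x ~ N(0, I_N).
\begin{align*}
  Q_{k\ell} = \frac{\mathbf{w}_k \cdot \mathbf{w}_\ell}{N}, \qquad
  R_{kn}    = \frac{\mathbf{w}_k \cdot \mathbf{v}_n}{N}, \qquad
  T_{nm}    = \frac{\mathbf{v}_n \cdot \mathbf{v}_m}{N}
\end{align*}
% For Gaussian inputs, the generalisation error depends on the weights only
% through (Q, R, T); as N -> infinity, the SGD dynamics of Q and R concentrate
% onto deterministic ODEs in the rescaled time alpha = (SGD steps) / N.
```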
Sebastian Lee (@sebalexlee) · 3 years ago
(7/N) For intermediate similarity, there is still a benefit to re-using the specialised node for the second task. But unlike the aligned case, this interferes with the representation needed to continue performing well on the first task. This leads to forgetting!
1 reply · 0 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(6/N) For orthogonal teachers, the student mirrors the hypothetical optimum and learns disjoint representations. Again, this leads to little forgetting, since the initially specialised node is untouched during training on the second task.
1 reply · 0 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(5/N) In practice, SGD dynamics yield different solutions depending on the relationship between the teachers. For highly aligned teachers, the student re-uses the specialised node for the second task. But this is fine, since the representation needed for both tasks is highly similar.
1 reply · 0 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(4/N) Consider two two-layer teachers and one two-layer student (overparameterised with respect to the teachers). Assuming the student has specialised after learning teacher 1, minimal interference is in principle possible when training on teacher 2, provided only the second, disjoint sub-network of the student is used.
1 reply · 0 retweets · 0 likes

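A minimal numerical sketch of this setup, assuming erf activations and online SGD as in the classic teacher-student literature; the dimensions, learning rate, step counts, and similarity knob `gamma` are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
N, M, K = 500, 2, 4   # input dim; teacher hidden units; student hidden units (K > M)

def forward(w, v, x):
    """Two-layer net: y = v . erf(w @ x / sqrt(N))."""
    return v @ erf(w @ x / np.sqrt(N))

# Teacher 1, and a teacher 2 whose weights overlap with teacher 1's by gamma.
w1 = rng.standard_normal((M, N))
gamma = 0.5           # 1.0 ~ aligned teachers, 0.0 ~ (near-)orthogonal
w2 = gamma * w1 + np.sqrt(1 - gamma**2) * rng.standard_normal((M, N))
v_t = np.ones(M)

# Overparameterised student (K > M hidden units).
w_s = rng.standard_normal((K, N)) / np.sqrt(N)
v_s = rng.standard_normal(K)

def sgd_step(w_s, v_s, w_t, lr=0.5):
    """One online-SGD step on a fresh Gaussian input, squared loss."""
    x = rng.standard_normal(N)
    a = w_s @ x / np.sqrt(N)
    h = erf(a)
    err = v_s @ h - forward(w_t, v_t, x)
    dv = err * h                                       # dL/dv_s for L = err^2 / 2
    dw = np.outer(err * v_s * (2 / np.sqrt(np.pi)) * np.exp(-a**2), x / np.sqrt(N))
    return w_s - lr * dw, v_s - lr * dv

def test_error(w_s, v_s, w_t, n=2000):
    xs = rng.standard_normal((n, N))
    preds = erf(xs @ w_s.T / np.sqrt(N)) @ v_s
    return 0.5 * np.mean((preds - erf(xs @ w_t.T / np.sqrt(N)) @ v_t) ** 2)

for _ in range(100_000):                  # task 1: train on teacher 1
    w_s, v_s = sgd_step(w_s, v_s, w1)
err_before = test_error(w_s, v_s, w1)
for _ in range(100_000):                  # task 2: train on teacher 2
    w_s, v_s = sgd_step(w_s, v_s, w2)
print("forgetting on task 1:", test_error(w_s, v_s, w1) - err_before)
```

Sweeping `gamma` between 0 (orthogonal) and 1 (aligned) traces out the task-similarity spectrum the thread discusses.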
Sebastian Lee (@sebalexlee) · 3 years ago
(3/N) I learn a 🔨 is right for a nail. For a highly dissimilar object (e.g. a log), I am likely to seek a new tool/solution. However, a more similar object like a screw may erroneously tempt me into re-using the 🔨. We propose something similar is happening in neural networks.
1 reply · 0 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(2/N) "It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail" - Abraham Maslow, 1966. We borrow intuition from this aphorism to explain recent findings that intermediate task similarity is worst for catastrophic forgetting.
1 reply · 0 retweets · 0 likes

Sebastian Lee (@sebalexlee) · 3 years ago
(1/N) I had the pleasure of presenting recent work on continual learning with @stefesseM @ClopathLab @sebastiangoldt @SaxeLab at #LesHouches2022, before Stefano takes it on the next leg of the tour to @icmlconf next week. Paper 📰 ICML Poster #⃣ 1434
1 reply · 8 retweets · 40 likes