
Johannes Oswald
@oswaldjoh
Followers
1K
Following
1K
Media
37
Statuses
233
Research Scientist, Google Research & ETH Zurich alumni
Zürich, Switzerland
Joined May 2017
Super happy and proud to share our novel scalable RNN model - the MesaNet! This work builds upon beautiful ideas of locally optimal test-time training (TTT), and combines ideas of in-context learning, test-time training and mesa-optimization.
Join us tomorrow, we are presenting the MesaNet at the great ASAP seminar!
@oswaldjoh and @ninoscherrer will present MesaNet at the ASAP seminar on Tuesday, June 24 at 2 PM ET! MesaNet is a locally optimal test-time training (TTT) layer that optimizes the key-value reconstruction objective over the entire history. If you're into TTT, don't miss it!
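For concreteness, here is a minimal NumPy sketch of that key-value reconstruction objective as I read it from the thread: a per-step ridge regression from keys to values over the entire history. All names are mine, and this is the naive version with an explicit matrix inverse, not the paper's CG-based kernel.

```python
import numpy as np

def mesa_layer_naive(K, V, Q, lam=1.0):
    """Locally optimal TTT, naive version (my reconstruction).

    At every step t, predict with the linear model W_t that minimizes
        sum_{i<=t} ||W k_i - v_i||^2 + lam * ||W||_F^2
    over the entire history seen so far.
    K, V, Q: (T, d) arrays of keys, values, queries.
    """
    T, d = K.shape
    out = np.zeros_like(Q)
    for t in range(T):
        Kt, Vt = K[:t + 1], V[:t + 1]       # full history up to step t
        G = Kt.T @ Kt + lam * np.eye(d)     # regularized Gram matrix
        W = Vt.T @ Kt @ np.linalg.inv(G)    # closed-form ridge solution
        out[t] = W @ Q[t]                   # read out with the query
    return out
```

The naive solve costs O(d³) per token; as the thread describes, the same solution can instead be approached iteratively with a conjugate gradient (CG) solver.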
Tagging some people who may be interested in this work: @yoonrkim @RogerGrosse @tri_dao @_albertgu @SchmidhuberAI @srush_nlp @ImanolSchlag @heyyalexwang @GarneloMarta @scychan_brains @AndrewLampinen
Special shoutout to @mtavitschlegel and of course to my good friend and long-term scientific hero and mentor João Sacramento - who recently gave a talk about our work.
@OSieberling & @yaschimpf from the Swiss AGI Lab, @kaitlinmaile, @meulemansalex, Rif A. Saurous, @g_lajoie_, @C_Frenkel, Razvan Pascanu, and of course @blaiseaguera who assembled our fantastic Paradigms of Intelligence Team at Google.
This has been a collaboration with so many friends and great scientists, first and foremost @ninoscherrer, and many others: the OGs @SeijinKobayashi, @LucaVersari3, @SonglinYang4 who wrote the fast Triton kernel.
⚠️ Numerical precision: In our paper we use FP32 for activations (incl. related work), incl. the multiplications used in the CG solver. Our Triton code is optimized for GPUs and uses FP16. This might introduce convergence issues, which we are still investigating.
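As a toy illustration of why solver precision can matter (this is my own sketch using NumPy's FP16 arithmetic, not the paper's Triton kernel), one can run the same conjugate gradient solve in FP32 and FP16 and compare residuals:

```python
import numpy as np

def cg(A, b, steps, dtype):
    # Plain conjugate gradient on A x = b, run in the given dtype.
    # Illustration only; the actual kernel is written in Triton.
    A, b = A.astype(dtype), b.astype(dtype)
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    for _ in range(steps):
        Ap = A @ p
        rr = r @ r
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        p = r + ((r @ r) / rr) * p
    return x

rng = np.random.default_rng(0)
d = 64
K = rng.standard_normal((256, d)).astype(np.float32)
A = K.T @ K / 256 + 1e-2 * np.eye(d, dtype=np.float32)  # SPD, ridge-style system
b = rng.standard_normal(d).astype(np.float32)
for dtype in (np.float32, np.float16):
    x = cg(A, b, steps=30, dtype=dtype)
    print(dtype.__name__, np.linalg.norm(A @ x.astype(np.float32) - b))
```

In runs like this the FP16 solve typically stalls at a much larger residual, which is the kind of convergence gap the warning above refers to.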
Dynamic test-time compute: Instead of running a fixed number of CG steps (red), we can stop computation dynamically by introducing an error threshold ε (blue). This stopping criterion naturally leads to more compute on average with increased sequence length.
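A minimal sketch of such an ε-threshold stopping rule, assuming a standard conjugate gradient loop (my own code, not the released kernel):

```python
import numpy as np

def cg_dynamic(A, b, eps=1e-4, max_steps=100):
    # Conjugate gradient with the dynamic stopping rule described above:
    # iterate until the residual norm falls below eps, instead of
    # running a fixed number of steps.
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    steps = 0
    while np.linalg.norm(r) > eps and steps < max_steps:
        Ap = A @ p
        rr = r @ r
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        p = r + ((r @ r) / rr) * p
        steps += 1
    return x, steps  # 'steps' is the dynamically allocated compute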
Language World: We find that MesaNets are strong LMs. They outperform all RNNs and the Transformer baseline w.r.t. PPL benchmarks. We also find that RNNs are quite different LMs: they achieve lower PPL early in the sequence but get worse later on.
Intriguingly, MesaNet dynamically allocates test-time compute, as the CG method has a stopping criterion. This opens up interesting differences to softmax attention & classical RNNs, whose compute increases linearly/stays constant with sequence length.
We modify the Mesa layer, an efficient local test-time optimizer for linear models and the squared error loss! Here, we scale MesaNets to 1B parameters by 1️⃣ introducing parallelizable training, 2️⃣ fixing stability issues when using forget gates.
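Here is a recurrent reading of that layer with a forget gate, written as a sketch under my own assumptions (the names, the placement of the forget gate γ and the regularizer λ are mine, and the paper replaces the explicit solve with CG):

```python
import numpy as np

def mesa_layer_recurrent(K, V, Q, gamma, lam=1.0):
    # Running sufficient statistics replace the explicit history:
    #   S_t = gamma_t * S_{t-1} + k_t k_t^T   (key covariance)
    #   C_t = gamma_t * C_{t-1} + v_t k_t^T   (value-key cross term)
    # Output: o_t = C_t (S_t + lam*I)^{-1} q_t.
    T, d = K.shape
    S = np.zeros((d, d))
    C = np.zeros((d, d))
    out = np.zeros_like(Q)
    for t in range(T):
        S = gamma[t] * S + np.outer(K[t], K[t])
        C = gamma[t] * C + np.outer(V[t], K[t])
        h = np.linalg.solve(S + lam * np.eye(d), Q[t])  # CG in the paper
        out[t] = C @ h
    return out
```

This sequential loop is where the two points above bite: training needs a parallelizable formulation of the same computation, and the forget gate γ has to be handled carefully to keep the solve stable.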
Let's appreciate the beautiful idea of local TTT! Each layer models its current sequence of inputs by minimizing its own layerwise objective function at test time. Until now, efficient TTT of an objective depending on the entire sequence was difficult, but...
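In symbols, the layerwise objective described above can be written as follows; this is my notation, with the forget weights γ and the regularizer λ being assumptions carried over from the rest of the thread:

```latex
W_t^{\ast} = \arg\min_{W} \sum_{i=1}^{t} \gamma_{t,i}\,\lVert W k_i - v_i \rVert_2^2
           + \lambda \lVert W \rVert_F^2,
\qquad o_t = W_t^{\ast} q_t
```

Each layer solves this for its own keys k, values v and queries q at test time, which is what makes the objective depend on the entire sequence seen so far.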
RT @ninoscherrer: Excited to see João Sacramento speaking at the @KempnerInst about our recent line of work on "locally optimal test-time t…