Johannes Oswald

@oswaldjoh

Followers: 1K · Following: 1K · Media: 37 · Statuses: 233

Research Scientist, Google Research & ETH Zurich alumni

Zürich, Switzerland
Joined May 2017
@oswaldjoh
Johannes Oswald
17 days
Super happy and proud to share our novel scalable RNN model - the MesaNet! This work builds upon beautiful ideas of locally optimal test-time training (TTT), and combines ideas of in-context learning, test-time training and mesa-optimization.
@oswaldjoh
Johannes Oswald
10 days
Join us tomorrow, we are presenting the MesaNet at the great ASAP seminar!
@SonglinYang4
Songlin Yang
10 days
@oswaldjoh and @ninoscherrer will present MesaNet at the ASAP seminar on Tuesday, June 24 at 2 PM ET! MesaNet is a locally optimal test-time training (TTT) layer that optimizes the key-value reconstruction objective over the entire history. If you're into TTT, don't miss it!
@oswaldjoh
Johannes Oswald
17 days
Special shoutout to @mtavitschlegel and of course to my good friend and long-term scientific hero and mentor João Sacramento, who recently gave a talk about our work.
@oswaldjoh
Johannes Oswald
17 days
@OSieberling & @yaschimpf from the Swiss AGI Lab, @kaitlinmaile, @meulemansalex, Rif A. Saurous, @g_lajoie_, @C_Frenkel, Razvan Pascanu, and of course @blaiseaguera who assembled our fantastic Paradigms of Intelligence Team at Google.
@oswaldjoh
Johannes Oswald
17 days
This has been a collaboration with so many friends and great scientists, first and foremost @ninoscherrer, and many others: the OGs @SeijinKobayashi, @LucaVersari3, @SonglinYang4 who wrote the fast Triton kernel 🙏🙏🙏, ...
@oswaldjoh
Johannes Oswald
17 days
📝 MesaNet Paper:
📝 Predecessor Paper:
⚙️ Triton Code:
🔬🧪 Colab Tutorial:
@oswaldjoh
Johannes Oswald
17 days
⚠️ Numerical Precision: In our paper we use FP32 for activations (incl. related work), including the multiplications used in the CG solver. Our Triton code is optimized for GPUs and uses FP16. This might introduce convergence issues, which we are still investigating.
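A toy way to probe this kind of precision sensitivity (illustration only; the actual behaviour depends on conditioning and on the kernel's implementation, so this is not a reproduction of the paper's setup): run the same small CG solve in different dtypes and compare the final residuals.

```python
# Toy probe of CG's sensitivity to numerical precision (illustration only).
import numpy as np

def cg_residual(A, b, n_steps, dtype):
    A, b = A.astype(dtype), b.astype(dtype)
    x = np.zeros_like(b); r = b - A @ x; p = r.copy(); rs = r @ r
    for _ in range(n_steps):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    # report the true residual in float64
    A64, b64, x64 = A.astype(np.float64), b.astype(np.float64), x.astype(np.float64)
    return float(np.linalg.norm(b64 - A64 @ x64))

rng = np.random.default_rng(0)
K = rng.normal(size=(64, 16))
A = K.T @ K + 1.0 * np.eye(16)        # SPD system like (S + lam*I)
b = rng.normal(size=16)
for dt in (np.float32, np.float16):
    print(dt.__name__, cg_residual(A, b, n_steps=16, dtype=dt))
```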
@oswaldjoh
Johannes Oswald
17 days
So, while MesaNets offer a powerful way to memorize and learn in-context, they are compute intensive, and the gap to Transformers on global reasoning & recall still looks wide. ⁉️ Should we aim to close the gaps to softmax on these benchmarks, or is this a dead end?
@oswaldjoh
Johannes Oswald
17 days
Dynamic test-time compute. Instead of running a fixed number of CG steps (red), we can stop computation dynamically by introducing an error threshold ε (blue). 🔎 This stopping criterion naturally leads to more compute on average with increased sequence length.
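A minimal sketch of this stopping rule (generic textbook CG with a residual threshold; eps and max_steps are illustrative choices, not the paper's settings):

```python
# Illustrative dynamic-compute variant of CG: iterate only until the residual
# norm drops below eps, and report how many iterations the solve needed.
import numpy as np

def cg_solve_adaptive(A, b, eps=1e-3, max_steps=32):
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    steps = 0
    while np.sqrt(rs) > eps and steps < max_steps:
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
        steps += 1
    return x, steps   # steps varies per solve, i.e. per token
```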
@oswaldjoh
Johannes Oswald
17 days
Therefore, we split benchmarks into 2 groups: (a) global and (b) local. This should reduce noise when reporting aggregated benchmark scores. 🔎 MesaNet outperforms all other linear models on global reasoning and in-context recall benchmarks. However, a gap to Transformers remains.
@oswaldjoh
Johannes Oswald
17 days
Given this finding, we trained sliding-window-attention (SWA) models to provide intuitive Transformer-like baselines with shorter context. 🔎⚠️ Intriguingly, we observe that SWAs with short windows (4-64) outperform Transformers on various benchmarks. See the Appendix for details!
@oswaldjoh
Johannes Oswald
17 days
📝 Language World: We find that MesaNets are strong LMs. They outperform all RNNs and the Transformer baseline w.r.t. PPL benchmarks. 🔎 We find that RNNs are quite different LMs: they achieve lower PPL early-in-the-sequence but get worse later on.
@oswaldjoh
Johannes Oswald
17 days
🧪 Synthetic Benchmarks: MesaNet does great on synthetic benchmarks which have been shown to correlate with language modelling capabilities. First nice sign – now let's see how these results translate into the language domain!
@oswaldjoh
Johannes Oswald
17 days
📈 Intriguingly, MesaNet dynamically allocates test-time compute, as the CG method has a stopping criterion. This opens up interesting differences to softmax attention & classical RNNs, whose compute grows linearly or stays constant with sequence length.
@oswaldjoh
Johannes Oswald
17 days
⁉️ But how do we compute this scary << linsolve >>? We propose to use conjugate gradients (CG). This enables efficient parallelizable training, as the compute-heavy part of CG is ~GLA, coming to the rescue for a fast implementation!
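A minimal sketch of such a solve (generic textbook CG, not the released Triton kernel), assuming the system is the symmetric positive-definite one from the mesa read-out, A = S + lam*I; CG only touches the matrix through matrix-vector products:

```python
# Generic textbook conjugate gradients for a symmetric positive-definite
# system A h = b. Only matrix-vector products with A are needed.
import numpy as np

def cg_solve(A, b, n_steps=8):
    x = np.zeros_like(b)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(n_steps):
        if np.sqrt(rs) < 1e-12:   # already converged
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```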
@oswaldjoh
Johannes Oswald
17 days
... we modify the Mesa Layer, an efficient local test-time optimizer for linear models and the squared error loss! Here, we scale MesaNets to 1B by 1️⃣ introducing parallelizable training, and 2️⃣ fixing stability issues when using forget gates.
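A minimal sketch of how the history-wide objective stays recurrent once forget gates are added (assumed formulation for illustration; gate values, normalization and the regularizer are not the paper's exact choices):

```python
# Assumed formulation for illustration: the history-wide ridge objective only
# needs two recurrent statistics, which a per-step forget gate g_t can decay:
#   S_t = g_t * S_{t-1} + k_t k_t^T     (key covariance,       d x d)
#   C_t = g_t * C_{t-1} + v_t k_t^T     (value-key correlation, d x d)
# The read-out then solves (S_t + lam*I) h = q_t and returns C_t @ h.
import numpy as np

def gated_mesa_layer(keys, values, queries, gates, lam=1.0):
    T, d = keys.shape
    S, C = np.zeros((d, d)), np.zeros((d, d))
    outputs = np.zeros_like(values)
    for t in range(T):
        S = gates[t] * S + np.outer(keys[t], keys[t])
        C = gates[t] * C + np.outer(values[t], keys[t])
        h = np.linalg.solve(S + lam * np.eye(d), queries[t])
        outputs[t] = C @ h
    return outputs
```

With all gates set to 1 this reproduces the full-history ridge solution; gates below 1 down-weight older tokens.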
@oswaldjoh
Johannes Oswald
17 days
Let's appreciate the beautiful idea of local TTT! Each layer models its current sequence of inputs by minimizing its own layerwise objective function at test time. Until now, efficient TTT of an objective depending on the entire sequence was difficult, but ...
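A minimal numpy sketch of this local objective (illustrative names and regularizer, not the paper's exact formulation): at every step the layer re-solves a small ridge regression over all keys/values seen so far and answers the current query with the optimal weights.

```python
# Illustrative sketch of a "locally optimal" linear read-out: at every step t
# we re-solve ridge regression over the whole history k_1..k_t, v_1..v_t and
# answer the current query q_t with the optimal weights W_t.
import numpy as np

def mesa_readout(keys, values, queries, lam=1.0):
    """keys, values, queries: arrays of shape (T, d); returns outputs of shape (T, d)."""
    T, d = keys.shape
    outputs = np.zeros_like(values)
    for t in range(T):
        K, V = keys[: t + 1], values[: t + 1]            # history up to step t
        # W_t = argmin_W  sum_i ||W k_i - v_i||^2 + lam * ||W||^2
        W_t = V.T @ K @ np.linalg.inv(K.T @ K + lam * np.eye(d))
        outputs[t] = W_t @ queries[t]
    return outputs

rng = np.random.default_rng(0)
k, v, q = (rng.normal(size=(8, 4)) for _ in range(3))
print(mesa_readout(k, v, q).shape)   # (8, 4)
```

This brute-force version re-solves a d x d system at every step, which is exactly the computation that the recurrent statistics and the CG solver sketched above make efficient.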
@oswaldjoh
Johannes Oswald
17 days
Softmax alternatives, e.g. Mamba, xLSTM, GLA and DeltaNet, can be motivated by a unifying framework of test-time training: a linear model is learned online in-context. 🚨 Our MesaNet takes this to the extreme of local optimality!
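A toy illustration of this unifying view (one reading of it, not the paper's code): a linear fast-weight model is updated online with one gradient step per token on the key-value reconstruction loss, roughly a delta-rule / DeltaNet-style update; the learning rate and loss here are illustrative.

```python
# Toy illustration of the unifying test-time-training view: a linear
# fast-weight model W is trained online, one gradient step per token,
# on the squared key-value reconstruction error (a delta-rule-style update).
import numpy as np

def online_delta_rule(keys, values, queries, lr=0.5):
    T, d = keys.shape
    W = np.zeros((d, d))
    outputs = np.zeros_like(values)
    for t in range(T):
        k, v = keys[t], values[t]
        W += lr * np.outer(v - W @ k, k)   # one SGD step on ||W k - v||^2
        outputs[t] = W @ queries[t]
    return outputs
```

One cheap update per token is the "online learning" end of the spectrum; MesaNet sits at the other end by solving the same objective to local optimality at every step.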
@oswaldjoh
Johannes Oswald
28 days
RT @ninoscherrer: Excited to see João Sacramento speaking at the @KempnerInst about our recent line of work on "locally optimal test-time t…