Johannes Oswald

@oswaldjoh

Followers: 1K · Following: 1K · Media: 37 · Statuses: 233

Research Scientist, Google Research & ETH Zurich alumni

Zürich, Switzerland
Joined May 2017
@oswaldjoh
Johannes Oswald
17 days
Super happy and proud to share our novel scalable RNN model - the MesaNet! This work builds upon beautiful ideas of locally optimal test-time training (TTT), and combines ideas of in-context learning, test-time training and mesa-optimization.
@oswaldjoh
Johannes Oswald
10 days
Join us tomorrow, we are presenting the MesaNet at the great ASAP seminar!
@SonglinYang4
Songlin Yang
10 days
@oswaldjoh and @ninoscherrer will present MesaNet at the ASAP seminar on Tuesday, June 24 at 2 PM ET! MesaNet is a locally optimal test-time training (TTT) layer that optimizes the key-value reconstruction objective over the entire history. If you're into TTT, don't miss it!
@oswaldjoh
Johannes Oswald
17 days
Special shoutout to @mtavitschlegel and of course to my good friend and long-term scientific hero and mentor João Sacramento, who recently gave a talk about our work.
@oswaldjoh
Johannes Oswald
17 days
@OSieberling & @yaschimpf from the Swiss AGI Lab, @kaitlinmaile, @meulemansalex, Rif A. Saurous, @g_lajoie_, @C_Frenkel, Razvan Pascanu, and of course @blaiseaguera who assembled our fantastic Paradigms of Intelligence Team at Google.
@oswaldjoh
Johannes Oswald
17 days
This has been a collaboration with so many friends and great scientists, first and foremost @ninoscherrer, and many others: the OGs @SeijinKobayashi, @LucaVersari3, @SonglinYang4 who wrote the fast Triton kernel 🙏🙏🙏, ...
@oswaldjoh
Johannes Oswald
17 days
📝 MesaNet Paper:
📝 Predecessor Paper:
⚙️ Triton Code:
🔬🧪 Colab Tutorial:
@oswaldjoh
Johannes Oswald
17 days
⚠️ Numerical Precision: In our paper we use FP32 for activations (incl. related work), including the multiplications used in the CG solver. Our Triton code is optimized for GPUs and uses FP16. This might introduce convergence issues, which we are still investigating.
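A toy way to probe this kind of precision sensitivity (illustration only; the actual behaviour depends on conditioning and on the kernel's implementation, so this is not a reproduction of the paper's setup): run the same small CG solve in different dtypes and compare the final residuals.

```python
# Toy probe of CG's sensitivity to numerical precision (illustration only).
import numpy as np

def cg_residual(A, b, n_steps, dtype):
    A, b = A.astype(dtype), b.astype(dtype)
    x = np.zeros_like(b); r = b - A @ x; p = r.copy(); rs = r @ r
    for _ in range(n_steps):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    # report the true residual in float64
    A64, b64, x64 = A.astype(np.float64), b.astype(np.float64), x.astype(np.float64)
    return float(np.linalg.norm(b64 - A64 @ x64))

rng = np.random.default_rng(0)
K = rng.normal(size=(64, 16))
A = K.T @ K + 1.0 * np.eye(16)        # SPD system like (S + lam*I)
b = rng.normal(size=16)
for dt in (np.float32, np.float16):
    print(dt.__name__, cg_residual(A, b, n_steps=16, dtype=dt))
```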
@oswaldjoh
Johannes Oswald
17 days
So, while MesaNets offer a powerful way to memorize and learn in-context, they are compute intensive, and the gap to Transformers on global reasoning & recall still looks wide. ⁉️ Should we aim to close the gaps to softmax on these benchmarks, or is this a dead end?
@oswaldjoh
Johannes Oswald
17 days
Dynamic test-time compute. Instead of running a fixed number of CG steps (red), we can stop computation dynamically by introducing an error threshold ε (blue). 🔎 This stopping criterion naturally leads to more compute on average with increased sequence length.
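A minimal sketch of this stopping rule (generic textbook CG with a residual threshold; eps and max_steps are illustrative choices, not the paper's settings):

```python
# Illustrative dynamic-compute variant of CG: iterate only until the residual
# norm drops below eps, and report how many iterations the solve needed.
import numpy as np

def cg_solve_adaptive(A, b, eps=1e-3, max_steps=32):
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    steps = 0
    while np.sqrt(rs) > eps and steps < max_steps:
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
        steps += 1
    return x, steps   # steps varies per solve, i.e. per token
```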
@oswaldjoh
Johannes Oswald
17 days
Therefore, we split benchmarks into 2 groups: (a) global and (b) local. This should reduce noise when reporting aggregated benchmark scores. 🔎 MesaNet outperforms all other linear models on global reasoning and in-context recall benchmarks. However, a gap to Transformers remains.
@oswaldjoh
Johannes Oswald
17 days
Given this finding, we trained sliding-window-attention (SWA) models to provide intuitive Transformer-like baselines with shorter context. 🔎⚠️ Intriguingly, we observe that SWAs with short windows (4-64) outperform Transformers on various benchmarks. See the Appendix for details!
@oswaldjoh
Johannes Oswald
17 days
📝 Language World: We find that MesaNets are strong LMs. They outperform all RNNs and the Transformer baseline w.r.t. PPL benchmarks. 🔎 We find that RNNs are quite different LMs: they achieve lower PPL early-in-the-sequence but get worse later on.
@oswaldjoh
Johannes Oswald
17 days
🧪 Synthetic Benchmarks: MesaNet does great on synthetic benchmarks which have been shown to correlate with language modelling capabilities. First nice sign – now let's see how these results translate into the language domain!
@oswaldjoh
Johannes Oswald
17 days
📈 Intriguingly, MesaNet dynamically allocates test-time compute, as the CG method has a stopping criterion. This opens up interesting differences to softmax attention & classical RNNs, whose compute grows linearly or stays constant with sequence length.
@oswaldjoh
Johannes Oswald
17 days
⁉️ But how do we compute this scary << linsolve >>? We propose to use conjugate gradients (CG). This enables efficient parallelizable training, as the compute-heavy part of CG is ~GLA, coming to the rescue for a fast implementation!
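A minimal sketch of such a solve (generic textbook CG, not the released Triton kernel), assuming the system is the symmetric positive-definite one from the mesa read-out, A = S + lam*I; CG only touches the matrix through matrix-vector products:

```python
# Generic textbook conjugate gradients for a symmetric positive-definite
# system A h = b. Only matrix-vector products with A are needed.
import numpy as np

def cg_solve(A, b, n_steps=8):
    x = np.zeros_like(b)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(n_steps):
        if np.sqrt(rs) < 1e-12:   # already converged
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```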
@oswaldjoh
Johannes Oswald
17 days
... we modify the Mesa Layer, an efficient local test-time optimizer for linear models and the squared error loss! Here, we scale MesaNets to 1B by 1️⃣ introducing parallelizable training, and 2️⃣ fixing stability issues when using forget gates.
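A minimal sketch of how the history-wide objective stays recurrent once forget gates are added (assumed formulation for illustration; gate values, normalization and the regularizer are not the paper's exact choices):

```python
# Assumed formulation for illustration: the history-wide ridge objective only
# needs two recurrent statistics, which a per-step forget gate g_t can decay:
#   S_t = g_t * S_{t-1} + k_t k_t^T     (key covariance,       d x d)
#   C_t = g_t * C_{t-1} + v_t k_t^T     (value-key correlation, d x d)
# The read-out then solves (S_t + lam*I) h = q_t and returns C_t @ h.
import numpy as np

def gated_mesa_layer(keys, values, queries, gates, lam=1.0):
    T, d = keys.shape
    S, C = np.zeros((d, d)), np.zeros((d, d))
    outputs = np.zeros_like(values)
    for t in range(T):
        S = gates[t] * S + np.outer(keys[t], keys[t])
        C = gates[t] * C + np.outer(values[t], keys[t])
        h = np.linalg.solve(S + lam * np.eye(d), queries[t])
        outputs[t] = C @ h
    return outputs
```

With all gates set to 1 this reproduces the full-history ridge solution; gates below 1 down-weight older tokens.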
@oswaldjoh
Johannes Oswald
17 days
Let's appreciate the beautiful idea of local TTT! Each layer models its current sequence of inputs by minimizing its own layerwise objective function at test time. Until now, efficient TTT of an objective depending on the entire sequence was difficult, but ...
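A minimal numpy sketch of this local objective (illustrative names and regularizer, not the paper's exact formulation): at every step the layer re-solves a small ridge regression over all keys/values seen so far and answers the current query with the optimal weights.

```python
# Illustrative sketch of a "locally optimal" linear read-out: at every step t
# we re-solve ridge regression over the whole history k_1..k_t, v_1..v_t and
# answer the current query q_t with the optimal weights W_t.
import numpy as np

def mesa_readout(keys, values, queries, lam=1.0):
    """keys, values, queries: arrays of shape (T, d); returns outputs of shape (T, d)."""
    T, d = keys.shape
    outputs = np.zeros_like(values)
    for t in range(T):
        K, V = keys[: t + 1], values[: t + 1]            # history up to step t
        # W_t = argmin_W  sum_i ||W k_i - v_i||^2 + lam * ||W||^2
        W_t = V.T @ K @ np.linalg.inv(K.T @ K + lam * np.eye(d))
        outputs[t] = W_t @ queries[t]
    return outputs

rng = np.random.default_rng(0)
k, v, q = (rng.normal(size=(8, 4)) for _ in range(3))
print(mesa_readout(k, v, q).shape)   # (8, 4)
```

This brute-force version re-solves a d x d system at every step, which is exactly the computation that the recurrent statistics and the CG solver sketched above make efficient.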
@oswaldjoh
Johannes Oswald
17 days
Softmax alternatives, e.g. Mamba, xLSTM, GLA and DeltaNet, can be motivated by a unifying framework of test-time training: a linear model is learned online in-context. 🚨 Our MesaNet takes this to the extreme of local optimality!
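A toy illustration of this unifying view (one reading of it, not the paper's code): a linear fast-weight model is updated online with one gradient step per token on the key-value reconstruction loss, roughly a delta-rule / DeltaNet-style update; the learning rate and loss here are illustrative.

```python
# Toy illustration of the unifying test-time-training view: a linear
# fast-weight model W is trained online, one gradient step per token,
# on the squared key-value reconstruction error (a delta-rule-style update).
import numpy as np

def online_delta_rule(keys, values, queries, lr=0.5):
    T, d = keys.shape
    W = np.zeros((d, d))
    outputs = np.zeros_like(values)
    for t in range(T):
        k, v = keys[t], values[t]
        W += lr * np.outer(v - W @ k, k)   # one SGD step on ||W k - v||^2
        outputs[t] = W @ queries[t]
    return outputs
```

One cheap update per token is the "online learning" end of the spectrum; MesaNet sits at the other end by solving the same objective to local optimality at every step.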
@oswaldjoh
Johannes Oswald
28 days
RT @ninoscherrer: Excited to see João Sacramento speaking at the @KempnerInst about our recent line of work on "locally optimal test-time t…