Julien Siems

@julien_siems

Followers 305 · Following 504 · Media 13 · Statuses 76

PhD student advised by Frank Hutter working on linear RNNs and state-tracking.

Germany
Joined July 2022
@julien_siems
Julien Siems
4 months
1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!
[image]
@julien_siems
Julien Siems
1 month
RT @JakeMRobertson: We present a new approach to causal inference. Pre-trained on synthetic data, Do-PFN opens the door to a new domain: PF….
@julien_siems
Julien Siems
1 month
RT @riccardograzzi: 📖 (1/n) DeltaProduct's theory got an update! 1) For any nₕ>1 (# of Householders), only 3 layers are needed to solve al….
@julien_siems
Julien Siems
1 month
⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank, which sheds light on why DeltaProduct extrapolates better to longer sequences than DeltaNet
- Improved scaling analysis
And more!
[image]
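The "effective rank" inspection can be made concrete. A minimal sketch, assuming the entropy-based effective rank of Roy & Vetterli (2007) computed from the singular values of the recurrent state; whether this matches the paper's exact metric is an assumption:

```python
# Minimal sketch: entropy-based effective rank (Roy & Vetterli, 2007) of a
# linear-RNN state matrix S. Assumption: this is one standard definition and
# may differ from the paper's exact metric.
import numpy as np

def effective_rank(S: np.ndarray) -> float:
    """exp(entropy of the normalized singular values) of S."""
    sv = np.linalg.svd(S, compute_uv=False)
    p = sv / sv.sum()            # normalize singular values to a distribution
    p = p[p > 0]                 # guard against log(0)
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
rank1 = rng.normal(size=(64, 1)) @ rng.normal(size=(1, 64))  # rank-1 state
dense = rng.normal(size=(64, 64))                            # unstructured state
print(effective_rank(rank1))   # ~1: the state uses a single direction
print(effective_rank(dense))   # much larger: many directions in use
```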
@julien_siems
Julien Siems
2 months
RT @behrouz_ali: What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)?….
@julien_siems
Julien Siems
2 months
RT @SonglinYang4: 📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, be….
@julien_siems
Julien Siems
3 months
RT @BlinkDL_AI: RWKV7-G1 "GooseOne" 🪿 1.5B release: pure RNN (attention-free) reasoning model, comparable with Qwen3 1.7B and fully multili….
@julien_siems
Julien Siems
3 months
@leloykun @jyo_pari DeltaProduct is now available in the flash-linear-attention library:
@julien_siems
Julien Siems
3 months
RT @xiaolonw: Test-Time Training (TTT) is now on Video! And not just a 5-second video. We can generate a full 1-min video! TTT module is a….
@julien_siems
Julien Siems
4 months
RT @maxmbeck: Yesterday, we shared the details on our xLSTM 7B architecture. Now, let's go one level deeper🧑‍🔧. We introduce ⚡️Tiled Flash….
@julien_siems
Julien Siems
4 months
RT @riccardograzzi: @julien_siems @leloykun @jyo_pari In our DeltaProduct work we also add a bit of theory to DeltaNet, showing that it can….
@julien_siems
Julien Siems
4 months
9/9 Also take a look at these excellent blog posts:
- by @leloykun
- by @jyo_pari
We also discussed state-tracking in Linear RNNs at the ASAP Seminar—watch our full talk:
@julien_siems
Julien Siems
4 months
8/9 This was a great project with @timurcarstensen, @ZelaArber, @FrankRHutter, @MPontil, and @riccardograzzi. Check out our Oral at the FM-Wild Workshop at @iclr_conf:
@julien_siems
Julien Siems
4 months
7/9 In language modeling tasks, DeltaProduct surpasses DeltaNet across lm-eval-harness benchmarks, with notable gains in length extrapolation performance as we increase nₕ.
[image]
@julien_siems
Julien Siems
4 months
6/9 On modular arithmetic with brackets (a context-free grammar), performance also improves as nₕ increases.
[image]
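For context, a toy sketch of what such a task can look like: bracketed arithmetic expressions over a small modulus whose value the model must predict, so nested partial results have to be tracked. The paper's exact task format may differ; this generator is an illustrative assumption:

```python
# Toy generator for a "modular arithmetic with brackets" style task
# (illustrative; the paper's exact format may differ).
import random

def sample_expr(m: int, depth: int) -> str:
    """Sample a fully bracketed expression over Z_m."""
    if depth == 0:
        return str(random.randrange(m))
    op = random.choice("+-*")
    return f"({sample_expr(m, depth - 1)}{op}{sample_expr(m, depth - 1)})"

random.seed(0)
m = 5
expr = sample_expr(m, depth=3)
print(expr, "=", eval(expr) % m)  # target = expression value mod m
```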
@julien_siems
Julien Siems
4 months
5/9 To improve state-tracking, increasing the number of Householders nₕ is more effective than increasing the number of layers l: l=1, nₕ=2 (top row) yields much better performance than l=2, nₕ=1 (bottom row) on S₃, S₄, A₅; nₕ=4 gets good performance on S₅ (nₕ=1 recovers DeltaNet).
[image]
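The group-theoretic claim can be checked directly: swapping two coordinates is itself a generalized Householder reflection, while a 3-cycle is an even permutation and therefore needs a product of two reflections. That is one way to see why nₕ=2 unlocks groups that a single Householder (DeltaNet) cannot handle. An illustrative numpy check, not the paper's code:

```python
# Illustrative check (not the paper's code): a transposition of e_i and e_j is
# the Householder reflection I - u u^T with u = e_i - e_j (note ||u||^2 = 2),
# and a 3-cycle, being an even permutation, needs a product of two reflections,
# i.e. n_h = 2 Householder factors.
import numpy as np

def swap_reflection(n: int, i: int, j: int) -> np.ndarray:
    u = np.zeros(n)
    u[i], u[j] = 1.0, -1.0
    return np.eye(n) - np.outer(u, u)   # Householder matrix swapping i and j

H01 = swap_reflection(3, 0, 1)
H12 = swap_reflection(3, 1, 2)

cycle = H01 @ H12                       # product of two swaps = 3-cycle
target = np.array([[0., 0., 1.],
                   [1., 0., 0.],
                   [0., 1., 0.]])       # permutation matrix of (0 1 2)
assert np.allclose(cycle, target)
assert np.isclose(np.linalg.det(H01), -1.0)  # a single reflection has det -1,
                                             # so it can never equal a 3-cycle
```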
@julien_siems
Julien Siems
4 months
4/9 Building on this insight, DeltaProduct performs nₕ gradient steps per token (with different per-step keys and values), yielding a state-transition matrix A(xᵢ) as a product of nₕ generalized Householder transforms—interpolating between a rank-1 update and a dense matrix.
[image]
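A naive sequential reference of this recurrence may help: per token, nₕ delta-rule micro-steps, each with its own key, value, and step size, so the per-token transition is the product of nₕ generalized Householder factors. Shapes and names here are illustrative assumptions; the actual implementation uses chunk-parallel kernels:

```python
# Naive sequential sketch of the DeltaProduct recurrence (illustrative names
# and shapes; the real kernels are chunk-parallel). Per token i, n_h micro-steps:
#   S <- S (I - beta_j k_j k_j^T) + beta_j v_j k_j^T
# so A(x_i) is a product of n_h generalized Householder factors.
import torch

def deltaproduct_scan(keys, values, betas):
    """keys, values: (T, n_h, d); betas: (T, n_h) in (0, 2). Returns S: (d, d)."""
    T, n_h, d = keys.shape
    S = torch.zeros(d, d)
    for i in range(T):
        for j in range(n_h):
            k = keys[i, j] / keys[i, j].norm()   # unit-norm key
            v, b = values[i, j], betas[i, j]
            S = S @ (torch.eye(d) - b * torch.outer(k, k)) + b * torch.outer(v, k)
    return S

T, n_h, d = 16, 2, 8
betas = 2 * torch.sigmoid(torch.randn(T, n_h))   # beta in (0, 2): beta near 2
S = deltaproduct_scan(torch.randn(T, n_h, d),    # allows near-exact reflections,
                      torch.randn(T, n_h, d),    # which state tracking needs
                      betas)
```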
@julien_siems
Julien Siems
4 months
3/9 Following @SonglinYang4 et al. (2024), DeltaNet can be seen as performing one gradient descent step per token on an associative recall loss, resulting in a rank-1 state-transition matrix.
[image]
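The identification in 3/9 is a one-line calculation: one gradient step on L(S) = ½‖Sk − v‖² at step size β gives S − β(Sk − v)kᵀ = S(I − βkkᵀ) + βvkᵀ, which is exactly the DeltaNet update. A small numerical check (illustrative):

```python
# Check: one gradient-descent step on the associative-recall loss
# L(S) = 0.5 * ||S k - v||^2 equals the DeltaNet rank-1 update,
# since grad_S L = (S k - v) k^T.
import torch

d = 8
S = torch.randn(d, d)
k, v = torch.randn(d), torch.randn(d)
beta = 0.5

grad_step = S - beta * torch.outer(S @ k - v, k)
delta_rule = S @ (torch.eye(d) - beta * torch.outer(k, k)) + beta * torch.outer(v, k)
assert torch.allclose(grad_step, delta_rule, atol=1e-5)
```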
@julien_siems
Julien Siems
4 months
2/9 Linear RNNs’ expressivity depends on the state-transition matrix structure. Diagonal linear RNNs (Mamba, GLA, mLSTM) only allow token mixing. DeltaNet and RWKV-7 use a rank-1 update enabling token+channel mixing. DeltaProduct enables adjustable higher-rank updates—but how?
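The three transition-matrix families in 2/9 can be contrasted in a few lines (illustrative sketch; shapes and parameter values are assumptions):

```python
# Illustrative contrast of state-transition structures (assumed shapes/values):
import torch

d, n_h = 8, 2

# Diagonal (Mamba / GLA / mLSTM style): channels decay independently,
# so information mixes across tokens only, never across channels.
A_diag = torch.diag(torch.rand(d))

# Generalized Householder (DeltaNet / RWKV-7 style): off-diagonal terms mix
# channels too, but A - I is only rank 1.
k = torch.randn(d); k = k / k.norm()
A_house = torch.eye(d) - 1.5 * torch.outer(k, k)

# DeltaProduct: product of n_h such factors, so A - I can reach rank n_h.
A_prod = torch.eye(d)
for _ in range(n_h):
    k = torch.randn(d); k = k / k.norm()
    A_prod = A_prod @ (torch.eye(d) - 1.5 * torch.outer(k, k))

print(torch.linalg.matrix_rank(A_house - torch.eye(d)))  # 1
print(torch.linalg.matrix_rank(A_prod - torch.eye(d)))   # n_h (generically)
```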
@julien_siems
Julien Siems
4 months
RT @maxmbeck: 📢🔔I am excited to share the details on our optimized xLSTM architecture for our xLSTM 7B model!🚨 We optimized the architectu….