Julien Siems (@julien_siems)
PhD student advised by Frank Hutter working on linear RNNs and state-tracking.
Germany · Joined July 2022
323 Followers · 565 Following · 13 Media · 82 Statuses
1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!
4 replies · 36 reposts · 190 likes
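As a companion to the thread above: a minimal NumPy sketch of the recurrence as described in the DeltaProduct paper, where each token applies nₕ delta-rule updates (each a generalized Householder transform) to the state instead of DeltaNet's single update. Shapes, the β range used in the toy, and all names below are illustrative, not the reference implementation.

```python
import numpy as np

def deltaproduct_step(S, keys, values, betas):
    """One token update: apply n_h delta-rule steps to the state S.

    S:      (d_k, d_v) state matrix carried across tokens
    keys:   (n_h, d_k) unit-norm keys for this token
    values: (n_h, d_v) values for this token
    betas:  (n_h,)     step sizes
    The effective per-token transition matrix is a product of n_h
    generalized Householder matrices (I - beta * k k^T).
    """
    for k, v, beta in zip(keys, values, betas):
        S = S - beta * np.outer(k, k @ S) + beta * np.outer(k, v)
    return S

def deltaproduct_scan(K, V, B, d_v):
    """Run the recurrence over a sequence and return the final state
    (a full layer would also read out q_t @ S_t at every token)."""
    T, n_h, d_k = K.shape
    S = np.zeros((d_k, d_v))
    for t in range(T):
        S = deltaproduct_step(S, K[t], V[t], B[t])
    return S

# toy usage
rng = np.random.default_rng(0)
T, n_h, d_k, d_v = 8, 2, 4, 4
K = rng.normal(size=(T, n_h, d_k))
K /= np.linalg.norm(K, axis=-1, keepdims=True)
V = rng.normal(size=(T, n_h, d_v))
B = rng.uniform(0, 1, size=(T, n_h))
print(deltaproduct_scan(K, V, B, d_v).shape)  # (4, 4)
```

Setting nₕ = 1 recovers a DeltaNet-style update; increasing nₕ buys more expressive (higher-rank) state transitions per token at extra compute, which is the parallelizability/expressivity dial the thread refers to.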
Thrilled to share our new paper, TempoPFN! 🚀 TempoPFN is a new foundation model trained ENTIRELY on synthetic data. Most Time Series models use massive, proprietary real-world datasets. We asked: Can we compete with just a Linear RNN and 100% fake data? (Spoiler: yes)
1 reply · 4 reposts · 8 likes
This paper shows that a simple linear RNN trained on synthetic time series can do strong zero-shot forecasting. It uses a linear block named GatedDeltaProduct that keeps a running state across steps. Training and inference happen in parallel over the full sequence, so no windowing is needed.
1 reply · 5 reposts · 21 likes
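To make "training and inference happen in parallel over the full sequence" concrete: diagonal-gated linear recurrences can be evaluated for all timesteps at once, because the solution unrolls into prefix products and sums. The toy NumPy sketch below shows the idea; it is a simplification (GatedDeltaProduct uses matrix-valued, Householder-product transitions and chunkwise kernels rather than this numerically naive cumprod trick), not TempoPFN's code.

```python
import numpy as np

def linear_recurrence_sequential(a, b):
    """h_t = a_t * h_{t-1} + b_t, computed one step at a time."""
    h = np.zeros_like(b[0])
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.stack(out)

def linear_recurrence_parallel(a, b):
    """Same recurrence, evaluated over the whole sequence at once via
    h_t = sum_{s<=t} (prod_{r=s+1..t} a_r) * b_s."""
    A = np.cumprod(a, axis=0)            # A_t = a_1 * ... * a_t
    return A * np.cumsum(b / A, axis=0)  # (A_t / A_s) * b_s summed over s <= t

T, d = 16, 4
rng = np.random.default_rng(1)
a = rng.uniform(0.5, 1.0, size=(T, d))   # gates
b = rng.normal(size=(T, d))              # inputs
assert np.allclose(linear_recurrence_sequential(a, b),
                   linear_recurrence_parallel(a, b))
```

Because the whole sequence is processed this way, the model keeps one running state across arbitrarily long horizons instead of re-encoding sliding windows.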
Kimi Linear Tech Report just dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: a novel architecture that outperforms full attention with faster speeds and better performance, ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
27 replies · 192 reposts · 1K likes
My thesis, 𝘈 𝘵𝘩𝘦𝘰𝘳𝘺 𝘰𝘧 𝘵𝘩𝘦 𝘤𝘰𝘮𝘱𝘶𝘵𝘢𝘵𝘪𝘰𝘯𝘢𝘭 𝘱𝘰𝘸𝘦𝘳 𝘢𝘯𝘥 𝘭𝘪𝘮𝘪𝘵𝘢𝘵𝘪𝘰𝘯𝘴 𝘰𝘧 𝘭𝘢𝘯𝘨𝘶𝘢𝘨𝘦 𝘮𝘰𝘥𝘦𝘭𝘪𝘯𝘨 𝘢𝘳𝘤𝘩𝘪𝘵𝘦𝘤𝘵𝘶𝘳𝘦𝘴, is now online:
8 replies · 46 reposts · 386 likes
Accepted at NeurIPS 2025, come see us in San Diego to discuss linear RNNs!
⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank sheds light on why DeltaProduct extrapolates better to longer sequences than DeltaNet
- Improved scaling analysis
And more!
1 reply · 14 reposts · 65 likes
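Since "effective rank" is doing real work in the bullet above, here is one standard definition (Roy & Vetterli, 2007): the exponential of the entropy of the normalized singular values, a soft count of how many directions a matrix actually uses. The sketch below is a generic implementation of that definition, not the paper's analysis code, and the paper may use a different variant.

```python
import numpy as np

def effective_rank(S, eps=1e-12):
    """exp(entropy of the normalized singular value distribution)."""
    s = np.linalg.svd(S, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))

rng = np.random.default_rng(0)
rank_one = np.outer(rng.normal(size=64), rng.normal(size=64))  # one dominant direction
spread = rng.normal(size=(64, 64))                             # well-spread spectrum
print(effective_rank(rank_one))  # close to 1
print(effective_rank(spread))    # a sizeable fraction of 64
```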
We present a new approach to causal inference. Pre-trained on synthetic data, Do-PFN opens the door to a new domain: PFNs for causal inference. We are excited to announce our new paper “Do-PFN: In-Context Learning for Causal Effect Estimation” on arXiv! 🔨🔍 A thread:
4 replies · 3 reposts · 38 likes
📖 (1/n) DeltaProduct's theory got an update!
1) For any nₕ>1 (# of Householders), only 3 layers are needed to solve all group word problems (including S5). DeltaNet and RWKV-7 use 4.
2) For any nₕ, Gated DeltaProduct can recognize any regular language
1 reply · 2 reposts · 9 likes
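For context on point 1): a "group word problem" asks a model to output the running product of a stream of group elements, and S₅ is the standard hard case because it is a non-solvable group, which is why a constant number of layers is notable. A tiny, hypothetical data generator for the S₅ version of the task (illustrative only, not the paper's setup):

```python
import itertools
import random

S5 = list(itertools.permutations(range(5)))   # all 120 permutations of {0,...,4}

def compose(p, q):
    """(p ∘ q)(i) = p[q[i]]"""
    return tuple(p[q[i]] for i in range(5))

def sample_word_problem(length, rng=random):
    """Input: token ids of random permutations.
    Target at each position: id of the product of everything seen so far."""
    state = tuple(range(5))                    # identity
    tokens, targets = [], []
    for _ in range(length):
        g = rng.choice(S5)
        state = compose(g, state)              # left-multiply by the new element
        tokens.append(S5.index(g))
        targets.append(S5.index(state))
    return tokens, targets

tokens, targets = sample_word_problem(10)
print(tokens)
print(targets)
```

A model that labels every prefix correctly must carry a group-valued state across the sequence, which is exactly the state-tracking ability the theory results are about.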
⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank sheds light on why DeltaProduct extrapolates better to longer sequences than DeltaNet
- Improved scaling analysis
And more!
1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!
0 replies · 13 reposts · 56 likes
What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers? Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to
25 replies · 139 reposts · 956 likes
📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks https://t.co/nJItUuYKWZ
arxiv.org: The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for...
9 replies · 95 reposts · 551 likes
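As I read the abstract, PaTH replaces RoPE's fixed rotations with data-dependent, identity-plus-rank-one (Householder-like) transforms that are accumulated between the key position and the query position. Below is a naive O(T²·d²) reference sketch under that reading; the actual method uses efficient hardware-aware kernels, and details such as normalization and the parameterization of β and w are omitted or assumed here.

```python
import numpy as np

def path_like_scores(q, k, w, beta):
    """Causal attention logits with accumulated Householder-style transforms.

    q, k: (T, d) queries and keys
    w:    (T, d) unit vectors defining H_t = I - beta_t * w_t w_t^T
    beta: (T,)
    logit[i, j] = q_i^T (H_i H_{i-1} ... H_{j+1}) k_j   for j <= i
    """
    T, d = q.shape
    H = np.stack([np.eye(d) - beta[t] * np.outer(w[t], w[t]) for t in range(T)])
    logits = np.full((T, T), -np.inf)
    for i in range(T):
        logits[i, i] = q[i] @ k[i]     # empty product for j == i
        acc = np.eye(d)
        for j in range(i - 1, -1, -1):
            acc = acc @ H[j + 1]       # extend the accumulated transform toward earlier keys
            logits[i, j] = q[i] @ acc @ k[j]
    return logits
```

The relative transform between two positions depends on the tokens in between, which is what gives the scheme its state-tracking flavor compared with RoPE's fixed, content-independent rotations.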
RWKV7-G1 "GooseOne" 🪿 1.5B release: pure RNN (attention-free) reasoning model, comparable with Qwen3 1.7B and fully multilingual. Chat demo & download on https://t.co/fZ7rmVKsKj Larger G1 training in progress.
RWKV papers on https://t.co/yzEb3mjBf2 : 13 new papers in Mar 2025 🔥 RWKV-7 "Goose" 🪿 is 100% RNN and a meta-in-context learner, efficiently test-time-training its state on the context via in-context gradient descent at every token in parallel.
3 replies · 33 reposts · 177 likes
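The phrase "test-time-training its state on the context via in-context gradient descent at every token" maps onto a simple picture: treat the recurrent state as fast weights and take one SGD step per token on a key-to-value reconstruction loss; that single step is the classic delta rule. The sketch below shows only that view; RWKV-7's actual update adds per-channel decay, gating, and a different parameterization.

```python
import numpy as np

def delta_rule_update(S, k, v, lr):
    """One 'in-context gradient descent' step on the fast-weight state.

    S: (d_v, d_k) state, viewed as a linear map from keys to values.
    One SGD step on 0.5 * ||S k - v||^2:
        grad_S = (S k - v) k^T
        S <- S - lr * grad_S = S (I - lr * k k^T) + lr * v k^T
    """
    err = S @ k - v
    return S - lr * np.outer(err, k)

def read_out(S, q):
    """Query the state the way linear attention does."""
    return S @ q

rng = np.random.default_rng(0)
d_k, d_v = 8, 8
S = np.zeros((d_v, d_k))
k = rng.normal(size=d_k); k /= np.linalg.norm(k)
v = rng.normal(size=d_v)
S = delta_rule_update(S, k, v, lr=1.0)   # unit-norm key and lr = 1: one step stores v exactly
print(np.allclose(read_out(S, k), v))    # True
```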
@leloykun @jyo_pari DeltaProduct is now available in the flash-linear-attention library: https://t.co/n8qMogNGmE
github.com: 🚀 Efficient implementations of state-of-the-art linear attention models - fla-org/flash-linear-attention
1 reply · 10 reposts · 25 likes
Test-Time Training (TTT) is now on Video! And not just a 5-second video. We can generate a full 1-min video! TTT module is an RNN module that provides an explicit and efficient memory mechanism. It models the hidden state of an RNN with a machine learning model, which is updated
32 replies · 173 reposts · 1K likes
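A compact way to read "it models the hidden state of an RNN with a machine learning model": the hidden state is the weights of a small inner model, and the recurrence is one gradient step of that inner model on each incoming token. The toy below uses a linear inner model and a plain reconstruction loss; the actual TTT layers use learned views of the input, mini-batched updates, and MLP variants, so treat this strictly as a sketch of the idea.

```python
import numpy as np

def ttt_linear_step(W, x, lr):
    """One TTT-style step with a linear inner model.

    The 'hidden state' is W, the weights of f_W(x) = W x.
    Self-supervised loss here: 0.5 * ||W x - x||^2 (toy choice).
    """
    grad = np.outer(W @ x - x, x)   # gradient of the reconstruction loss w.r.t. W
    W = W - lr * grad               # updating the state = training the inner model
    return W, W @ x                 # emit the updated model's prediction for this token

d, T = 16, 32
rng = np.random.default_rng(0)
W = np.zeros((d, d))
outputs = []
for x in rng.normal(size=(T, d)):
    W, y = ttt_linear_step(W, x, lr=0.1)
    outputs.append(y)
```

Because the memory is a trainable model rather than a fixed-size cache of tokens, it can in principle compress arbitrarily long context, which is what makes minute-long video generation plausible here.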
Yesterday, we shared the details on our xLSTM 7B architecture. Now, let's go one level deeper 🧑‍🔧 We introduce ⚡️Tiled Flash Linear Attention (TFLA)⚡️, a new kernel algorithm for the mLSTM and other linear attention variants with gating. We find TFLA is really fast! 🧵(1/11)
3 replies · 62 reposts · 348 likes
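For readers who have not seen the baseline that TFLA tiles further: chunkwise-parallel linear attention keeps a running state between chunks and does a small causal matmul inside each chunk. The NumPy sketch below is that standard baseline with no gating and no tiling; TFLA's contribution, as described in the thread, is how the chunk-local work is tiled and fused for the mLSTM's gates, which this sketch does not attempt.

```python
import numpy as np

def chunkwise_linear_attention(Q, K, V, chunk=16):
    """o_t = q_t^T * sum_{s<=t} k_s v_s^T, computed chunk by chunk.

    Inter-chunk contributions go through the running state S = sum k v^T;
    intra-chunk contributions use a masked matmul."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for start in range(0, T, chunk):
        q, k, v = Q[start:start+chunk], K[start:start+chunk], V[start:start+chunk]
        mask = np.tril(np.ones((len(q), len(q))))   # causal mask inside the chunk
        out[start:start+chunk] = q @ S + ((q @ k.T) * mask) @ v
        S = S + k.T @ v                             # carry state to the next chunk
    return out

# sanity check against the naive quadratic computation
rng = np.random.default_rng(0)
T, d = 64, 8
Q, K, V = rng.normal(size=(3, T, d))
naive = (np.tril(np.ones((T, T))) * (Q @ K.T)) @ V
assert np.allclose(chunkwise_linear_attention(Q, K, V), naive)
```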
@julien_siems @leloykun @jyo_pari In our DeltaProduct work we also add a bit of theory to DeltaNet, showing that it can solve Dihedral groups, which are the groups of symmetries of regular polygons, with only two layers. This includes S3 (symmetries of the equilateral triangle).
1 reply · 5 reposts · 22 likes
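To unpack "dihedral groups, which are the groups of symmetries of regular polygons": the sketch below builds D₃ from one rotation and one reflection of the equilateral triangle (written as vertex permutations) and checks that it contains all of S₃, which is why the two-layer DeltaNet result covers S3. Illustrative only.

```python
import itertools

rot = (1, 2, 0)   # rotate the triangle's vertices by 120 degrees
ref = (0, 2, 1)   # reflect across the axis through vertex 0

def compose(p, q):
    return tuple(p[q[i]] for i in range(3))

def generate(gens):
    """Close a set of permutations under composition."""
    elems = {(0, 1, 2)}               # start from the identity
    frontier = set(gens)
    while frontier:
        elems |= frontier
        frontier = {compose(g, e) for g in gens for e in elems} - elems
    return elems

d3 = generate([rot, ref])
print(sorted(d3) == sorted(itertools.permutations(range(3))))  # True: D_3 = S_3
```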
9/9 Also take a look at these excellent blog posts: https://t.co/Pe3Vb0Syp7 by @leloykun
https://t.co/9O9A2CDPU5 by @jyo_pari. We also discussed state-tracking in linear RNNs at the ASAP Seminar; watch our full talk:
2 replies · 6 reposts · 32 likes
8/9 This was a great project with @timurcarstensen, @ZelaArber, @FrankRHutter, @MPontil, and @riccardograzzi. Check out our Oral at the FM-Wild Workshop at @iclr_conf: https://t.co/RKKhnb6Lvk
openreview.net: Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However...
1 reply · 0 reposts · 19 likes
7/9 On language modeling tasks, DeltaProduct surpasses DeltaNet across lm-eval-harness benchmarks, with notable gains in length extrapolation as we increase nₕ.
1 reply · 0 reposts · 12 likes
6/9 On modular arithmetic with brackets, a context-free grammar task, performance also improves as nₕ increases.
1 reply · 0 reposts · 11 likes
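For readers unfamiliar with the benchmark in 6/9: the input is a bracketed arithmetic expression over a small modulus and the target is its value, so nesting makes the language context-free and the model has to track bracket depth as well as partial results. Below is a hypothetical generator in that spirit; the paper's exact grammar, modulus, and sequence lengths may differ.

```python
import random

def sample_expr(depth, m, rng):
    """Recursively build a bracketed expression over Z_m."""
    if depth == 0 or rng.random() < 0.3:
        return str(rng.randrange(m))
    op = rng.choice(["+", "-", "*"])
    return f"({sample_expr(depth - 1, m, rng)}{op}{sample_expr(depth - 1, m, rng)})"

def make_example(m=5, depth=4, seed=None):
    rng = random.Random(seed)
    expr = sample_expr(depth, m, rng)
    return expr, eval(expr) % m   # eval is safe here: we generated the string ourselves

expr, target = make_example(seed=0)
print(expr, "=", target, "(mod 5)")
```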