Julien Siems (@julien_siems)
PhD student advised by Frank Hutter working on linear RNNs and state-tracking.
Germany · Joined July 2022
323 Followers · 565 Following · 13 Media · 82 Statuses
1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!
4 replies · 36 reposts · 190 likes
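As a companion to the thread above: a minimal NumPy sketch of the recurrence as described in the DeltaProduct paper, where each token applies nₕ delta-rule updates (each a generalized Householder transform) to the state instead of DeltaNet's single update. Shapes, the β range used in the toy, and all names below are illustrative, not the reference implementation.

```python
import numpy as np

def deltaproduct_step(S, keys, values, betas):
    """One token update: apply n_h delta-rule steps to the state S.

    S:      (d_k, d_v) state matrix carried across tokens
    keys:   (n_h, d_k) unit-norm keys for this token
    values: (n_h, d_v) values for this token
    betas:  (n_h,)     step sizes
    The effective per-token transition matrix is a product of n_h
    generalized Householder matrices (I - beta * k k^T).
    """
    for k, v, beta in zip(keys, values, betas):
        S = S - beta * np.outer(k, k @ S) + beta * np.outer(k, v)
    return S

def deltaproduct_scan(K, V, B, d_v):
    """Run the recurrence over a sequence and return the final state
    (a full layer would also read out q_t @ S_t at every token)."""
    T, n_h, d_k = K.shape
    S = np.zeros((d_k, d_v))
    for t in range(T):
        S = deltaproduct_step(S, K[t], V[t], B[t])
    return S

# toy usage
rng = np.random.default_rng(0)
T, n_h, d_k, d_v = 8, 2, 4, 4
K = rng.normal(size=(T, n_h, d_k))
K /= np.linalg.norm(K, axis=-1, keepdims=True)
V = rng.normal(size=(T, n_h, d_v))
B = rng.uniform(0, 1, size=(T, n_h))
print(deltaproduct_scan(K, V, B, d_v).shape)  # (4, 4)
```

Setting nₕ = 1 recovers a DeltaNet-style update; increasing nₕ buys more expressive (higher-rank) state transitions per token at extra compute, which is the parallelizability/expressivity dial the thread refers to.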
Thrilled to share our new paper, TempoPFN! 🚀 TempoPFN is a new foundation model trained ENTIRELY on synthetic data. Most Time Series models use massive, proprietary real-world datasets. We asked: Can we compete with just a Linear RNN and 100% fake data? (Spoiler: yes)
1 reply · 4 reposts · 8 likes
This paper shows that a simple linear RNN trained on synthetic time series can do strong zero-shot forecasting. It uses a linear block named GatedDeltaProduct that keeps a running state across steps. Training and inference happen in parallel over the full sequence, so no windowing is needed.
1 reply · 5 reposts · 21 likes
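To make "training and inference happen in parallel over the full sequence" concrete: diagonal-gated linear recurrences can be evaluated for all timesteps at once, because the solution unrolls into prefix products and sums. The toy NumPy sketch below shows the idea; it is a simplification (GatedDeltaProduct uses matrix-valued, Householder-product transitions and chunkwise kernels rather than this numerically naive cumprod trick), not TempoPFN's code.

```python
import numpy as np

def linear_recurrence_sequential(a, b):
    """h_t = a_t * h_{t-1} + b_t, computed one step at a time."""
    h = np.zeros_like(b[0])
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.stack(out)

def linear_recurrence_parallel(a, b):
    """Same recurrence, evaluated over the whole sequence at once via
    h_t = sum_{s<=t} (prod_{r=s+1..t} a_r) * b_s."""
    A = np.cumprod(a, axis=0)            # A_t = a_1 * ... * a_t
    return A * np.cumsum(b / A, axis=0)  # (A_t / A_s) * b_s summed over s <= t

T, d = 16, 4
rng = np.random.default_rng(1)
a = rng.uniform(0.5, 1.0, size=(T, d))   # gates
b = rng.normal(size=(T, d))              # inputs
assert np.allclose(linear_recurrence_sequential(a, b),
                   linear_recurrence_parallel(a, b))
```

Because the whole sequence is processed this way, the model keeps one running state across arbitrarily long horizons instead of re-encoding sliding windows.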
Kimi Linear Tech Report just dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: a novel architecture that outperforms full attention with faster speeds and better performance, ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
27 replies · 192 reposts · 1K likes
My thesis, 𝘈 𝘵𝘩𝘦𝘰𝘳𝘺 𝘰𝘧 𝘵𝘩𝘦 𝘤𝘰𝘮𝘱𝘶𝘵𝘢𝘵𝘪𝘰𝘯𝘢𝘭 𝘱𝘰𝘸𝘦𝘳 𝘢𝘯𝘥 𝘭𝘪𝘮𝘪𝘵𝘢𝘵𝘪𝘰𝘯𝘴 𝘰𝘧 𝘭𝘢𝘯𝘨𝘶𝘢𝘨𝘦 𝘮𝘰𝘥𝘦𝘭𝘪𝘯𝘨 𝘢𝘳𝘤𝘩𝘪𝘵𝘦𝘤𝘵𝘶𝘳𝘦𝘴, is now online:
8 replies · 46 reposts · 386 likes
Accepted at NeurIPS 2025, come see us in San Diego to discuss linear RNNs!
⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank sheds light on why DeltaProduct extrapolates better to longer sequences than DeltaNet
- Improved scaling analysis
And more!
1 reply · 14 reposts · 65 likes
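Since "effective rank" is doing real work in the bullet above, here is one standard definition (Roy & Vetterli, 2007): the exponential of the entropy of the normalized singular values, a soft count of how many directions a matrix actually uses. The sketch below is a generic implementation of that definition, not the paper's analysis code, and the paper may use a different variant.

```python
import numpy as np

def effective_rank(S, eps=1e-12):
    """exp(entropy of the normalized singular value distribution)."""
    s = np.linalg.svd(S, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))

rng = np.random.default_rng(0)
rank_one = np.outer(rng.normal(size=64), rng.normal(size=64))  # one dominant direction
spread = rng.normal(size=(64, 64))                             # well-spread spectrum
print(effective_rank(rank_one))  # close to 1
print(effective_rank(spread))    # a sizeable fraction of 64
```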
We present a new approach to causal inference. Pre-trained on synthetic data, Do-PFN opens the door to a new domain: PFNs for causal inference. We are excited to announce our new paper “Do-PFN: In-Context Learning for Causal Effect Estimation” on arXiv! 🔨🔍 A thread:
4 replies · 3 reposts · 38 likes
📖 (1/n) DeltaProduct's theory got an update!
1) For any nₕ>1 (# of Householders), only 3 layers are needed to solve all group word problems (including S5). DeltaNet and RWKV-7 use 4.
2) For any nₕ, Gated DeltaProduct can recognize any regular language
1 reply · 2 reposts · 9 likes
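For context on point 1): a "group word problem" asks a model to output the running product of a stream of group elements, and S₅ is the standard hard case because it is a non-solvable group, which is why a constant number of layers is notable. A tiny, hypothetical data generator for the S₅ version of the task (illustrative only, not the paper's setup):

```python
import itertools
import random

S5 = list(itertools.permutations(range(5)))   # all 120 permutations of {0,...,4}

def compose(p, q):
    """(p ∘ q)(i) = p[q[i]]"""
    return tuple(p[q[i]] for i in range(5))

def sample_word_problem(length, rng=random):
    """Input: token ids of random permutations.
    Target at each position: id of the product of everything seen so far."""
    state = tuple(range(5))                    # identity
    tokens, targets = [], []
    for _ in range(length):
        g = rng.choice(S5)
        state = compose(g, state)              # left-multiply by the new element
        tokens.append(S5.index(g))
        targets.append(S5.index(state))
    return tokens, targets

tokens, targets = sample_word_problem(10)
print(tokens)
print(targets)
```

A model that labels every prefix correctly must carry a group-valued state across the sequence, which is exactly the state-tracking ability the theory results are about.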
⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank sheds light on why DeltaProduct extrapolates better to longer sequences than DeltaNet
- Improved scaling analysis
And more!
1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!
0 replies · 13 reposts · 56 likes
What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers? Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to
25 replies · 139 reposts · 956 likes
📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks https://t.co/nJItUuYKWZ
arxiv.org: The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for...
9 replies · 95 reposts · 551 likes
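As I read the abstract, PaTH replaces RoPE's fixed rotations with data-dependent, identity-plus-rank-one (Householder-like) transforms that are accumulated between the key position and the query position. Below is a naive O(T²·d²) reference sketch under that reading; the actual method uses efficient hardware-aware kernels, and details such as normalization and the parameterization of β and w are omitted or assumed here.

```python
import numpy as np

def path_like_scores(q, k, w, beta):
    """Causal attention logits with accumulated Householder-style transforms.

    q, k: (T, d) queries and keys
    w:    (T, d) unit vectors defining H_t = I - beta_t * w_t w_t^T
    beta: (T,)
    logit[i, j] = q_i^T (H_i H_{i-1} ... H_{j+1}) k_j   for j <= i
    """
    T, d = q.shape
    H = np.stack([np.eye(d) - beta[t] * np.outer(w[t], w[t]) for t in range(T)])
    logits = np.full((T, T), -np.inf)
    for i in range(T):
        logits[i, i] = q[i] @ k[i]     # empty product for j == i
        acc = np.eye(d)
        for j in range(i - 1, -1, -1):
            acc = acc @ H[j + 1]       # extend the accumulated transform toward earlier keys
            logits[i, j] = q[i] @ acc @ k[j]
    return logits
```

The relative transform between two positions depends on the tokens in between, which is what gives the scheme its state-tracking flavor compared with RoPE's fixed, content-independent rotations.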
RWKV7-G1 "GooseOne" 🪿 1.5B release: pure RNN (attention-free) reasoning model, comparable with Qwen3 1.7B and fully multilingual. Chat demo & download on https://t.co/fZ7rmVKsKj Larger G1 training in progress.
RWKV papers on https://t.co/yzEb3mjBf2 : 13 new papers in Mar 2025 🔥 RWKV-7 "Goose" 🪿 is 100% RNN and a meta-in-context learner, efficiently test-time-training its state on the context via in-context gradient descent at every token in parallel.
3 replies · 33 reposts · 177 likes
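The phrase "test-time-training its state on the context via in-context gradient descent at every token" maps onto a simple picture: treat the recurrent state as fast weights and take one SGD step per token on a key-to-value reconstruction loss; that single step is the classic delta rule. The sketch below shows only that view; RWKV-7's actual update adds per-channel decay, gating, and a different parameterization.

```python
import numpy as np

def delta_rule_update(S, k, v, lr):
    """One 'in-context gradient descent' step on the fast-weight state.

    S: (d_v, d_k) state, viewed as a linear map from keys to values.
    One SGD step on 0.5 * ||S k - v||^2:
        grad_S = (S k - v) k^T
        S <- S - lr * grad_S = S (I - lr * k k^T) + lr * v k^T
    """
    err = S @ k - v
    return S - lr * np.outer(err, k)

def read_out(S, q):
    """Query the state the way linear attention does."""
    return S @ q

rng = np.random.default_rng(0)
d_k, d_v = 8, 8
S = np.zeros((d_v, d_k))
k = rng.normal(size=d_k); k /= np.linalg.norm(k)
v = rng.normal(size=d_v)
S = delta_rule_update(S, k, v, lr=1.0)   # unit-norm key and lr = 1: one step stores v exactly
print(np.allclose(read_out(S, k), v))    # True
```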
@leloykun @jyo_pari DeltaProduct is now available in the flash-linear-attention library: https://t.co/n8qMogNGmE
github.com: 🚀 Efficient implementations of state-of-the-art linear attention models - fla-org/flash-linear-attention
1 reply · 10 reposts · 25 likes
Test-Time Training (TTT) is now on Video! And not just a 5-second video. We can generate a full 1-min video! TTT module is an RNN module that provides an explicit and efficient memory mechanism. It models the hidden state of an RNN with a machine learning model, which is updated
32 replies · 173 reposts · 1K likes
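A compact way to read "it models the hidden state of an RNN with a machine learning model": the hidden state is the weights of a small inner model, and the recurrence is one gradient step of that inner model on each incoming token. The toy below uses a linear inner model and a plain reconstruction loss; the actual TTT layers use learned views of the input, mini-batched updates, and MLP variants, so treat this strictly as a sketch of the idea.

```python
import numpy as np

def ttt_linear_step(W, x, lr):
    """One TTT-style step with a linear inner model.

    The 'hidden state' is W, the weights of f_W(x) = W x.
    Self-supervised loss here: 0.5 * ||W x - x||^2 (toy choice).
    """
    grad = np.outer(W @ x - x, x)   # gradient of the reconstruction loss w.r.t. W
    W = W - lr * grad               # updating the state = training the inner model
    return W, W @ x                 # emit the updated model's prediction for this token

d, T = 16, 32
rng = np.random.default_rng(0)
W = np.zeros((d, d))
outputs = []
for x in rng.normal(size=(T, d)):
    W, y = ttt_linear_step(W, x, lr=0.1)
    outputs.append(y)
```

Because the memory is a trainable model rather than a fixed-size cache of tokens, it can in principle compress arbitrarily long context, which is what makes minute-long video generation plausible here.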
Yesterday, we shared the details on our xLSTM 7B architecture. Now, let's go one level deeper 🧑‍🔧 We introduce ⚡️Tiled Flash Linear Attention (TFLA)⚡️, a new kernel algorithm for the mLSTM and other linear attention variants with gating. We find TFLA is really fast! 🧵(1/11)
3 replies · 62 reposts · 348 likes
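For readers who have not seen the baseline that TFLA tiles further: chunkwise-parallel linear attention keeps a running state between chunks and does a small causal matmul inside each chunk. The NumPy sketch below is that standard baseline with no gating and no tiling; TFLA's contribution, as described in the thread, is how the chunk-local work is tiled and fused for the mLSTM's gates, which this sketch does not attempt.

```python
import numpy as np

def chunkwise_linear_attention(Q, K, V, chunk=16):
    """o_t = q_t^T * sum_{s<=t} k_s v_s^T, computed chunk by chunk.

    Inter-chunk contributions go through the running state S = sum k v^T;
    intra-chunk contributions use a masked matmul."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for start in range(0, T, chunk):
        q, k, v = Q[start:start+chunk], K[start:start+chunk], V[start:start+chunk]
        mask = np.tril(np.ones((len(q), len(q))))   # causal mask inside the chunk
        out[start:start+chunk] = q @ S + ((q @ k.T) * mask) @ v
        S = S + k.T @ v                             # carry state to the next chunk
    return out

# sanity check against the naive quadratic computation
rng = np.random.default_rng(0)
T, d = 64, 8
Q, K, V = rng.normal(size=(3, T, d))
naive = (np.tril(np.ones((T, T))) * (Q @ K.T)) @ V
assert np.allclose(chunkwise_linear_attention(Q, K, V), naive)
```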
@julien_siems @leloykun @jyo_pari In our DeltaProduct work we also add a bit of theory to DeltaNet, showing that it can solve Dihedral groups, which are the groups of symmetries of regular polygons, with only two layers. This includes S3 (symmetries of the equilateral triangle).
1 reply · 5 reposts · 22 likes
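To unpack "dihedral groups, which are the groups of symmetries of regular polygons": the sketch below builds D₃ from one rotation and one reflection of the equilateral triangle (written as vertex permutations) and checks that it contains all of S₃, which is why the two-layer DeltaNet result covers S3. Illustrative only.

```python
import itertools

rot = (1, 2, 0)   # rotate the triangle's vertices by 120 degrees
ref = (0, 2, 1)   # reflect across the axis through vertex 0

def compose(p, q):
    return tuple(p[q[i]] for i in range(3))

def generate(gens):
    """Close a set of permutations under composition."""
    elems = {(0, 1, 2)}               # start from the identity
    frontier = set(gens)
    while frontier:
        elems |= frontier
        frontier = {compose(g, e) for g in gens for e in elems} - elems
    return elems

d3 = generate([rot, ref])
print(sorted(d3) == sorted(itertools.permutations(range(3))))  # True: D_3 = S_3
```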
9/9 Also take a look at these excellent blog posts: https://t.co/Pe3Vb0Syp7 by @leloykun
https://t.co/9O9A2CDPU5 by @jyo_pari. We also discussed state-tracking in linear RNNs at the ASAP Seminar; watch our full talk:
2 replies · 6 reposts · 32 likes
8/9 This was a great project with @timurcarstensen, @ZelaArber, @FrankRHutter, @MPontil, and @riccardograzzi. Check out our Oral at the FM-Wild Workshop at @iclr_conf: https://t.co/RKKhnb6Lvk
openreview.net: Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However...
1 reply · 0 reposts · 19 likes
7/9 On language modeling tasks, DeltaProduct surpasses DeltaNet across lm-eval-harness benchmarks, with notable gains in length extrapolation as we increase nₕ.
1 reply · 0 reposts · 12 likes
6/9 On modular arithmetic with brackets, a context-free grammar task, performance also improves as nₕ increases.
1 reply · 0 reposts · 11 likes
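For readers unfamiliar with the benchmark in 6/9: the input is a bracketed arithmetic expression over a small modulus and the target is its value, so nesting makes the language context-free and the model has to track bracket depth as well as partial results. Below is a hypothetical generator in that spirit; the paper's exact grammar, modulus, and sequence lengths may differ.

```python
import random

def sample_expr(depth, m, rng):
    """Recursively build a bracketed expression over Z_m."""
    if depth == 0 or rng.random() < 0.3:
        return str(rng.randrange(m))
    op = rng.choice(["+", "-", "*"])
    return f"({sample_expr(depth - 1, m, rng)}{op}{sample_expr(depth - 1, m, rng)})"

def make_example(m=5, depth=4, seed=None):
    rng = random.Random(seed)
    expr = sample_expr(depth, m, rng)
    return expr, eval(expr) % m   # eval is safe here: we generated the string ourselves

expr, target = make_example(seed=0)
print(expr, "=", target, "(mod 5)")
```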