Julien Siems

@julien_siems

Followers: 323 · Following: 565 · Media: 13 · Statuses: 82

PhD student advised by Frank Hutter working on linear RNNs and state-tracking.

Germany
Joined July 2022
@julien_siems
Julien Siems
8 months
1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!
4
36
190
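For readers who want the mechanics behind this thread, here is a minimal NumPy sketch of a DeltaProduct-style recurrent update, written in the usual delta-rule notation (state S, keys k, values v, step sizes β). Names, shapes, and parameter values are illustrative assumptions, not the authors' code; the point is only that nₕ micro-steps per token make the state-transition matrix a product of generalized Householder factors, which is where the extra expressivity comes from.

```python
# Illustrative sketch only (not the paper's implementation): a DeltaProduct-style token
# update in recurrent form, i.e. n_h delta-rule micro-steps applied to the state per token.
import numpy as np

def deltaproduct_step(S, keys, values, betas):
    """Apply n_h = len(keys) delta-rule micro-steps to the state S (d_v x d_k)."""
    for k, v, beta in zip(keys, values, betas):
        k = k / (np.linalg.norm(k) + 1e-8)        # assume L2-normalized keys
        # delta rule: erase what is stored under k, then write v under k (strength beta)
        S = S @ (np.eye(len(k)) - beta * np.outer(k, k)) + beta * np.outer(v, k)
    return S

rng = np.random.default_rng(0)
S = deltaproduct_step(np.zeros((4, 4)),
                      keys=rng.normal(size=(2, 4)),
                      values=rng.normal(size=(2, 4)),
                      betas=[1.0, 1.0])           # n_h = 2 micro-steps for one token

# The per-token transition matrix is a product of generalized Householder factors.
# With beta close to 2 each factor is (nearly) a reflection, and a product of two
# reflections is a rotation: its eigenvalues leave the real line, which a single
# delta-rule step (n_h = 1, i.e. DeltaNet) cannot produce.
k1, k2 = (x / np.linalg.norm(x) for x in rng.normal(size=(2, 4)))
A = (np.eye(4) - 2.0 * np.outer(k1, k1)) @ (np.eye(4) - 2.0 * np.outer(k2, k2))
print(np.round(np.linalg.eigvals(A), 3))          # a complex pair on the unit circle
```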
@vlad_moroshan
Vladyslav Moroshan
6 days
Thrilled to share our new paper, TempoPFN! 🚀 TempoPFN is a new foundation model trained ENTIRELY on synthetic data. Most Time Series models use massive, proprietary real-world datasets. We asked: Can we compete with just a Linear RNN and 100% fake data? (Spoiler: yes)
1
4
8
@rohanpaul_ai
Rohan Paul
8 days
This paper shows that a simple linear RNN trained on synthetic time series can do strong zero-shot forecasting. It uses a linear block named GatedDeltaProduct that keeps a running state across steps. Training and inference happen in parallel over the full sequence, so no windowing is needed…
1
5
21
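A toy reading of the "running state, no windowing" point: a gated delta-rule state carried over the entire series, with a forget gate α and write strength β. The projections and gate values below are random stand-ins for learned components, so treat this as a sketch of the idea rather than TempoPFN's actual GatedDeltaProduct block.

```python
# Toy sketch (not TempoPFN's code): one recurrent state flows over the full history,
# so forecasting needs no sliding window. Wk, Wv, Wq are random stand-ins for learned
# projections; alpha is a forget gate, beta a write strength.
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wk, Wv, Wq = (rng.normal(size=(d, 1)) for _ in range(3))

def gated_delta_step(S, x, alpha=0.99, beta=0.9):
    k = Wk @ x
    k = k / (np.linalg.norm(k) + 1e-8)
    v = Wv @ x
    # decay old memory, erase what was stored under k, then write v under k
    return alpha * S @ (np.eye(d) - beta * np.outer(k, k)) + beta * np.outer(v, k)

series = np.sin(np.linspace(0, 12, 200)).reshape(-1, 1)    # toy univariate series
S = np.zeros((d, d))
for t in range(len(series)):
    S = gated_delta_step(S, series[t].reshape(1, 1))       # state carried across all steps
q = Wq @ series[-1].reshape(1, 1)
q = q / np.linalg.norm(q)
print((S @ q).ravel()[:3])   # in a real model this read-out would feed a prediction head
```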
@Kimi_Moonshot
Kimi.ai
12 days
Kimi Linear Tech Report has dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: a novel architecture that outperforms full attention with faster speeds and better performance, ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi…
huggingface.co
27
192
1K
@lambdaviking
William Merrill
1 month
My thesis, 𝘈 𝘵𝘩𝘦𝘰𝘳𝘺 𝘰𝘧 𝘵𝘩𝘦 𝘤𝘰𝘮𝘱𝘶𝘵𝘢𝘵𝘪𝘰𝘯𝘢𝘭 𝘱𝘰𝘸𝘦𝘳 𝘢𝘯𝘥 𝘭𝘪𝘮𝘪𝘵𝘢𝘵𝘪𝘰𝘯𝘴 𝘰𝘧 𝘭𝘢𝘯𝘨𝘶𝘢𝘨𝘦 𝘮𝘰𝘥𝘦𝘭𝘪𝘯𝘨 𝘢𝘳𝘤𝘩𝘪𝘵𝘦𝘤𝘵𝘶𝘳𝘦𝘴, is now online:
8
46
386
@julien_siems
Julien Siems
2 months
Accepted at NeurIPS 2025! Come see us in San Diego to discuss linear RNNs!
@julien_siems
Julien Siems
5 months
⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank sheds light on why DeltaProduct extrapolates better to longer sequences than DeltaNet
- Improved scaling analysis
And more!
1
14
65
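Since the update mentions the hidden state's effective rank, here is one common way to make that quantity concrete, assuming the entropy-based definition of Roy & Vetterli; the paper may use a different estimator, so this is only to illustrate what is being measured.

```python
# Hedged sketch: effective rank as the exponential of the entropy of the normalized
# singular values (one standard definition, not necessarily the paper's exact estimator).
import numpy as np

def effective_rank(S, eps=1e-12):
    sv = np.linalg.svd(S, compute_uv=False)
    p = sv / (sv.sum() + eps)                  # normalize singular values
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
dense = rng.normal(size=(16, 16))                            # spread-out spectrum
rank1 = np.outer(rng.normal(size=16), rng.normal(size=16))   # a rank-1 state
print(effective_rank(dense), effective_rank(rank1))          # clearly > 1 vs. ~1
```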
@JustinLin610
Junyang Lin
2 months
34
48
588
@JakeMRobertson
Jake Robertson
5 months
We present a new approach to causal inference. Pre-trained on synthetic data, Do-PFN opens the door to a new domain: PFNs for causal inference. We are excited to announce our new paper “Do-PFN: In-Context Learning for Causal Effect Estimation” on arXiv! 🔨🔍 A thread:
4
3
38
@riccardograzzi
Riccardo Grazzi
5 months
📖 (1/n) DeltaProduct's theory got an update!
1) For any nₕ > 1 (# of Householders), only 3 layers are needed to solve all group word problems (including S5). DeltaNet and RWKV-7 use 4.
2) For any nₕ, Gated DeltaProduct can recognize any regular language.
1
2
9
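For context on what "solving S5" means here: the group word problem asks a model to track the running product of a sequence of group elements, a canonical state-tracking benchmark. A toy data generator for the S5 version (illustrative only, not the paper's code):

```python
# Toy generator for the S5 group word problem: input is a sequence of permutations of
# 5 elements, and the target at each position is the composition of the prefix so far.
from itertools import permutations
import random

S5 = list(permutations(range(5)))            # all 120 elements of S5

def compose(p, q):
    """Apply q first, then p (one common convention)."""
    return tuple(p[q[i]] for i in range(5))

def sample_word_problem(length, seed=0):
    rng = random.Random(seed)
    word = [rng.choice(S5) for _ in range(length)]
    state, targets = tuple(range(5)), []
    for g in word:
        state = compose(g, state)            # the running product the model must track
        targets.append(state)
    return word, targets

word, targets = sample_word_problem(8)
print(targets[-1])                           # the permutation the whole word evaluates to
```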
@julien_siems
Julien Siems
5 months
⚡DeltaProduct update with new results:
- Characterization of DeltaProduct’s state-tracking ability
- Inspection of the hidden state’s effective rank sheds light on why DeltaProduct extrapolates better to longer sequences than DeltaNet
- Improved scaling analysis
And more!
@julien_siems
Julien Siems
8 months
1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!
0
13
56
@behrouz_ali
Ali Behrouz
6 months
What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers? Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to…
25
139
956
@SonglinYang4
Songlin Yang
6 months
📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks https://t.co/nJItUuYKWZ
arxiv.org
The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for...
9
95
551
@BlinkDL_AI
BlinkDL
7 months
RWKV7-G1 "GooseOne" 🪿 1.5B release: pure RNN (attention-free) reasoning model, comparable with Qwen3 1.7B and fully multilingual. Chat demo & download on https://t.co/fZ7rmVKsKj Larger G1 training in progress.
@BlinkDL_AI
BlinkDL
7 months
RWKV papers on https://t.co/yzEb3mjBf2 : 13 new papers in Mar 2025 🔥 RWKV-7 "Goose" 🪿 is 100% RNN and a meta-in-context learner, efficiently test-time-training its state on the context via in-context gradient descent at every token in parallel.
3
33
177
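The "in-context gradient descent" framing can be made precise for the plain delta rule: one state update is exactly one SGD step on a per-token reconstruction loss. The check below uses the generic fast-weight view with illustrative names; RWKV-7's actual update adds decay and gating on top of this.

```python
# Hedged sketch: a delta-rule state update equals one SGD step on
# L(S) = 1/2 * ||S k_t - v_t||^2 (generic fast-weight view, not RWKV-7's full update).
import numpy as np

rng = np.random.default_rng(0)
d = 6
S = rng.normal(size=(d, d))
k = rng.normal(size=d); k /= np.linalg.norm(k)
v = rng.normal(size=d)
beta = 0.7                                      # learning rate / write strength

# Gradient of 1/2 * ||S k - v||^2 w.r.t. S is (S k - v) k^T; one SGD step:
S_gd = S - beta * np.outer(S @ k - v, k)
# Classic delta-rule form of the same update:
S_delta = S @ (np.eye(d) - beta * np.outer(k, k)) + beta * np.outer(v, k)

print(np.allclose(S_gd, S_delta))               # True: the two forms coincide
```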
@xiaolonw
Xiaolong Wang
7 months
Test-Time Training (TTT) is now on Video! And not just a 5-second video. We can generate a full 1-min video! The TTT module is an RNN module that provides an explicit and efficient memory mechanism. It models the hidden state of an RNN with a machine learning model, which is updated…
32
173
1K
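A bare-bones reading of the TTT idea described above, assuming a 2-layer MLP as the hidden state and a simple reconstruction loss; the actual TTT papers use specific self-supervised views, normalization, and mini-batched updates, so treat this purely as a toy.

```python
# Toy sketch (not the paper's code): the hidden state is the weight set of a small MLP,
# and "updating the state" means one gradient step on a self-supervised loss per token.
# Loss, corruption, and learning rate here are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
d, h, lr = 8, 16, 0.3
W1 = rng.normal(size=(h, d)) * 0.1               # the "hidden state" is (W1, W2)
W2 = rng.normal(size=(d, h)) * 0.1

def mlp(x, W1, W2):
    return W2 @ np.tanh(W1 @ x)

def ttt_update(x, W1, W2):
    x_view = x + 0.1 * rng.normal(size=d)        # corrupted view; target is x itself
    pre = W1 @ x_view
    err = mlp(x_view, W1, W2) - x                # gradient of 1/2 * ||output - x||^2
    gW2 = np.outer(err, np.tanh(pre))
    gW1 = np.outer((W2.T @ err) * (1.0 - np.tanh(pre) ** 2), x_view)
    return W1 - lr * gW1, W2 - lr * gW2          # one test-time training step

out = None
for x in rng.normal(size=(32, d)):               # scan a toy token sequence
    W1, W2 = ttt_update(x, W1, W2)
    out = mlp(x, W1, W2)                         # output computed with the updated state
print(out[:3])
```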
@maxmbeck
Maximilian Beck
8 months
Yesterday, we shared the details on our xLSTM 7B architecture. Now, let's go one level deeper🧑‍🔧 We introduce ⚡️Tiled Flash Linear Attention (TFLA), ⚡️ A new kernel algorithm for the mLSTM and other Linear Attention variants with Gating. We find TFLA is really fast! 🧵(1/11)
3
62
348
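For intuition on what a "tiled" linear-attention kernel computes, here is the standard chunkwise-parallel form of plain (ungated) linear attention in NumPy: within a chunk, outputs come from a masked QKᵀ matmul; across chunks, a running state carries the prefix. TFLA itself additionally handles the mLSTM's gates and fuses everything into GPU kernels, so this shows only the surrounding math, not the kernel.

```python
# Hedged sketch of chunkwise ("tiled") linear attention: intra-chunk attention matmul
# plus an inter-chunk running state. Plain linear attention only; no gates, no kernels.
import numpy as np

def chunkwise_linear_attention(Q, K, V, chunk=16):
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))                  # running prefix state: sum of k v^T
    O = np.zeros_like(V)
    for s in range(0, T, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        mask = np.tril(np.ones((len(q), len(q))))  # causal mask inside the chunk
        O[s:s+chunk] = (q @ k.T * mask) @ v + q @ S   # intra-chunk + inter-chunk parts
        S = S + k.T @ v                            # fold this chunk into the state
    return O

# Sanity check against the fully recurrent form: o_t = q_t @ (sum_{j<=t} k_j v_j^T)
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 64, 8))
ref = np.stack([Q[t] @ (K[:t+1].T @ V[:t+1]) for t in range(64)])
print(np.allclose(chunkwise_linear_attention(Q, K, V), ref))   # True
```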
@riccardograzzi
Riccardo Grazzi
8 months
@julien_siems @leloykun @jyo_pari In our DeltaProduct work we also add a bit of theory to DeltaNet, showing that it can solve Dihedral groups, which are the groups of symmetries of regular polygons, with only two layers. This includes S3 (symmetries of the equilateral triangle).
1
5
22
@julien_siems
Julien Siems
8 months
9/9 Also take a look at these excellent blog posts: https://t.co/Pe3Vb0Syp7 by @leloykun https://t.co/9O9A2CDPU5 by @jyo_pari We also discussed state-tracking in Linear RNNs at the ASAP Seminar—watch our full talk:
2
6
32
@julien_siems
Julien Siems
8 months
7/9 In language modeling tasks, DeltaProduct surpasses DeltaNet across lm-eval-harness benchmarks, with notable gains in length extrapolation performance as we increase nₕ.
1
0
12
@julien_siems
Julien Siems
8 months
6/9 Also on modular arithmetic with brackets, a context-free grammar task, performance improves as nₕ increases.
1
0
11
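To make the task in 6/9 concrete, here is one way "modular arithmetic with brackets" can be set up as a context-free task; the grammar, modulus, and depth below are arbitrary illustrative choices, and the paper's exact setup may differ.

```python
# Toy generator for a bracketed modular-arithmetic task: expressions over Z_m with
# +, -, * and nested parentheses, evaluated mod m. Illustrative only.
import random

def gen_expr(depth, m, rng):
    if depth == 0:
        return str(rng.randrange(m))
    left, right = gen_expr(depth - 1, m, rng), gen_expr(depth - 1, m, rng)
    return f"({left} {rng.choice('+-*')} {right})"

rng = random.Random(0)
m = 5
expr = gen_expr(3, m, rng)
print(expr, "=", eval(expr) % m)   # the target the model must predict, reduced mod m
```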