Kamran Chitsaz
@KChitsaz
Followers
147
Following
2K
Media
8
Statuses
120
Machine Learning Researcher @Mila_Quebec, MSc in Electrical Engineering at @polymtl
Montreal, QC
Joined March 2021
Long reasoning without the quadratic tax: The Markovian Thinker makes LLMs reason in chunks with a bounded state: linear compute, constant memory, and it keeps scaling beyond the training limit. 1/6
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2x accuracy 🧵
2
17
37
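A rough sketch of the chunked, bounded-state generation loop the thread above describes. This is an illustration of the Markovian Thinking idea only, not the authors' released Delethink code; the chunk size, carryover length, and the `generate` / `is_final_answer` helpers are assumptions.

```python
# Illustrative sketch of Markovian (chunked) thinking at inference time.
# Assumptions: a `generate(prompt, max_new_tokens)` helper exists, and the
# model has been trained (e.g., with Delethink-style RL) to continue its
# reasoning from a short carryover instead of the full chain of thought.

CHUNK_SIZE = 8192    # thinking tokens generated per chunk (assumed)
CARRYOVER = 512      # tail of the previous chunk kept in the state (assumed)
MAX_CHUNKS = 12      # 12 * 8K = 96K total thought budget


def markovian_think(query: str, generate, is_final_answer) -> str:
    state = query                      # bounded textual state, O(1) memory
    trace = []                         # full reasoning kept only for logging
    for _ in range(MAX_CHUNKS):
        chunk = generate(state, max_new_tokens=CHUNK_SIZE)
        trace.append(chunk)
        if is_final_answer(chunk):     # model emitted its answer
            break
        # Delete the long context; keep only the query plus a short carryover,
        # so per-chunk attention cost is bounded.
        state = query + chunk[-CARRYOVER:]
    return "".join(trace)
```

Because each chunk attends over at most the query plus a fixed carryover plus one chunk of new tokens, total compute grows linearly with the number of chunks instead of quadratically with the full reasoning length, and memory stays constant.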
We show a phase transition for optimal data curation: for strong models, concentrating on difficult samples drives further improvement (LIMO). In contrast, weaker models benefit from the conventional "More is More" regime, where broad data exposure is essential to learn core capabilities.
1/n "Less is More" (s1, etc.) vs "More is More", which mantra is correct for the training/fine-tuning large LLMs? In our recent preprint, we reconcile both of these. They correspond to different parts of a complex phase diagram
0
8
13
Nearly 3 years after our NeurIPS paper, SOTA architectures are now adopting NoPE. Kimi Linear uses NoPE for all full-attention layers (not a RoPE hybrid).
The brilliant Kimi Linear paper: a hybrid attention architecture that beats full attention while cutting the key-value cache by up to 75% and delivering up to 6x faster decoding at 1M context. Full attention is
7
34
371
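For readers unfamiliar with NoPE: it simply means the full-attention layers apply standard scaled dot-product attention without rotating queries and keys by position (no RoPE, no learned positional embedding). A minimal sketch with assumed tensor shapes, not Kimi Linear's actual implementation:

```python
import torch
import torch.nn.functional as F


def nope_attention(q, k, v, causal=True):
    """Scaled dot-product attention with no positional encoding applied.

    q, k, v: (batch, heads, seq_len, head_dim). A RoPE layer would rotate
    q and k by position before this call; a NoPE layer skips that step and
    relies on the causal mask (and token content) for order information.
    """
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


# Toy usage with random tensors (shapes are arbitrary).
q = k = v = torch.randn(1, 8, 16, 64)
out = nope_attention(q, k, v)          # (1, 8, 16, 64)
```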
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. Predict a learned
10
46
216
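To make the idea concrete, here is a minimal sketch of what a "predict a summary of the future" auxiliary objective could look like. The mean-pooled summary, the MSE loss, and the `horizon` parameter are assumptions for illustration; the FSP paper predicts a learned summary representation rather than this simple pooling.

```python
import torch
import torch.nn.functional as F


def future_summary_loss(hidden, token_embeds, horizon=8):
    """Auxiliary loss: each hidden state must match a summary of the
    embeddings of the next `horizon` ground-truth tokens.

    hidden:       (batch, seq, dim) hidden states from the LM.
    token_embeds: (batch, seq, dim) embeddings of the ground-truth tokens
                  (e.g., the LM's input embeddings), detached as targets.
    The mean-pooled summary and MSE loss are illustrative assumptions.
    """
    B, T, D = hidden.shape
    assert T > horizon, "sequence must be longer than the summary horizon"
    # Summary of the next `horizon` token embeddings for each position t.
    summaries = torch.stack(
        [token_embeds[:, t + 1 : t + 1 + horizon].mean(dim=1) for t in range(T - horizon)],
        dim=1,
    )                                    # (B, T - horizon, D)
    preds = hidden[:, : T - horizon]     # align predictions with summaries
    return F.mse_loss(preds, summaries.detach())
```

In practice a term like this would be added with some weight to the usual next-token cross-entropy; the sketch only shows the shape of the objective.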
My prediction is that next-token prediction loss will not stand the test of time, and the next frontier models will need richer loss functions. In this paper, we take a step towards that, shifting from predicting a single token to predicting a summary of the future.
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. Predict a learned
0
14
31
I am recruiting several graduate students (both MSc and PhD level) for Fall 2026 @ChandarLab! The application deadline is December 01. Please apply through the @Mila_Quebec supervision request process here: https://t.co/4UfkKPMRHn. More details about the recruitment process
9
159
581
I can't attend #ICCV 2025 in Honolulu, Hawaii, but my amazing teammates will be there! Please stop by our poster tomorrow, 21 Oct (#438), to learn about TAPNext, a general, ViT-like architecture with SOTA point-tracking quality! Links: website: https://t.co/FILiZsVgQa
1
3
7
Alleviating long context issues:
- Iterative Amortized Inference (IAI) refines solutions step-by-step over mini-batches, just like stochastic optimization.
- IAI merges:
  - the scalability of stochastic optimization (SGD);
  - the expressivity of forward-pass amortization (ICL in LLMs).
Meta on meta: thrilled to share our work on Meta-learning… at Meta! We make two major contributions: 1) a unified framework revealing insights into various amortizations; 2) greedy belief-state updates to handle long context lengths.
1
8
20
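A minimal sketch of the mini-batch refinement pattern described above. The `update` network, the belief-state type, and the toy usage are assumptions, not the paper's implementation.

```python
# Illustrative sketch of iterative amortized inference over mini-batches.
# Assumption: `update` is an amortized inference network (e.g., a transformer)
# that maps (current belief state, a mini-batch of context) -> refined state.

from typing import Callable, Iterable, TypeVar

State = TypeVar("State")
Batch = TypeVar("Batch")


def iterative_amortized_inference(
    init_state: State,
    minibatches: Iterable[Batch],
    update: Callable[[State, Batch], State],
) -> State:
    """Refine a belief state one mini-batch at a time, like SGD refines
    parameters: each step is a cheap forward pass (amortization), but the
    context never has to fit in a single window (scalability)."""
    state = init_state
    for batch in minibatches:
        state = update(state, batch)   # greedy, in-context belief update
    return state


# Toy usage: the "belief state" is a running (sum, count); each mini-batch of
# numbers greedily refines it, and the final state summarizes all batches.
state = iterative_amortized_inference(
    init_state=(0.0, 0),
    minibatches=[[1.0, 2.0], [3.0], [4.0, 5.0]],
    update=lambda s, b: (s[0] + sum(b), s[1] + len(b)),
)
mean = state[0] / state[1]             # 3.0
```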
Mila's annual supervision request process is now open to receive MSc and PhD applications for Fall 2026 admission! For more information, visit https://t.co/r01eLcY1P4
3
64
120
Mila is the only academic institute in the world with 1500+ AI researchers in one place! Want to join the greatest concentration of AI talent? Apply now!
Mila's annual supervision request process is now open to receive MSc and PhD applications for Fall 2026 admission! For more information, visit https://t.co/r01eLcY1P4
1
2
33
Markovian Thinking by @Mila_Quebec & @Microsoft lets LLMs reason with a fixed-size state: compute stays the same no matter how long the reasoning chain gets. This makes RL linear-cost and memory-constant. The team's Delethink RL setup trains models to be Markovian Thinkers,
5
28
165
Thanks for sharing, @TheTuringPost We propose Markovian Thinking as a new paradigm, and Delethink as a simple, concrete instantiation enabling constant-memory, linear-compute reasoning that keeps improving beyond training limits.
Markovian Thinking by @Mila_Quebec & @Microsoft lets LLMs reason with a fixed-size state: compute stays the same no matter how long the reasoning chain gets. This makes RL linear-cost and memory-constant. The team's Delethink RL setup trains models to be Markovian Thinkers,
0
1
11
We RL-train a CoT model to cope with restricted context (a textual state) and obtain scalable long CoTs (no quadratic cost) + a puzzling test-time scaling (TTS) behavior where the model actually uses more tokens for harder problems. Kudos to @a_kazemnejad @MAghajohari @KChitsaz who see depth behind
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2x accuracy 🧵
0
8
23
Nice paper! Make the context for reasoning local and train an RL model under that truncation. This way the model "Markovianifies" and makes use of its context efficiently!
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2x accuracy 🧵
0
3
12
A very nice read. Fixed chunks make ultra-long reasoning feasible. Very nice visualizations too! Congrats to the authors!
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2x accuracy 🧵
0
1
19
Linear-time thinking, not quadratic. A recipe to scale RL reasoning linearly -- both at inference time and in training. Works with existing models out of the box, or one can adapt them efficiently to be native linear-time thinkers.
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2x accuracy 🧵
1
18
76
Very cool work from my lab-mate @KChitsaz and collaborators! The Markovian Thinker decouples context length from reasoning length, making long-form reasoning linear-compute. The "Delethink" environment is such a clever idea! https://t.co/WLHMLG3wjw
arxiv.org
Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the...
Long reasoning without the quadratic tax: The Markovian Thinker makes LLMs reason in chunks with a bounded state: linear compute, constant memory, and it keeps scaling beyond the training limit. 1/6
0
2
4
New Paper Alert! A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens 🚩 What do you think are some major limitations in current safety training approaches? We think it's in their design: they rely on completely changing the model's distribution by
1
25
50
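A minimal sketch of how a reserved red-flag token could be used at inference time, assuming the model has been trained to emit such a token inside harmful continuations. The token string and the handling policy are hypothetical illustrations, not the paper's exact recipe.

```python
# Assumption: the model has a reserved special token it was trained to emit
# when its own continuation turns harmful, so detection reduces to checking
# for that token rather than relying on the model refusing outright.

RED_FLAG_TOKEN = "<red_flag>"   # hypothetical reserved special token


def moderate(generated_text: str) -> str:
    if RED_FLAG_TOKEN in generated_text:
        # The generation flagged itself as harmful; intervene here.
        return "I can't help with that."
    return generated_text
```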
Huge thanks to my co-first authors @MAghajohari and @a_kazemnejad and our supervisors @apsarathchandar, @murefil, @AaronCourville, and @sivareddyg. arXiv: https://t.co/EccG6iI1wY Code: https://t.co/pRuUFlQD7V Models:
huggingface.co
0
0
4
Even SOTA models show Markovian Thinking zero-shot: GPT-OSS-120B and Qwen3-30B-A3B recover/track LongCoT with no special prompting/training. Lots of in-distribution positives at init, so RL with Delethink is primed to scale. 5/6
1
0
5