Kamran Chitsaz
@KChitsaz
Followers
147
Following
2K
Media
8
Statuses
120
Machine Learning Researcher @Mila_Quebec, MSc in Electrical Engineering at @polymtl
Montreal, QC
Joined March 2021
Long reasoning without the quadratic tax: The Markovian Thinker makes LLMs reason in chunks with a bounded state: linear compute, constant memory, and it keeps scaling beyond the training limit. 1/6
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2x accuracy 🧵
2
17
37
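A rough sketch of the chunked, bounded-state generation loop the thread above describes. This is an illustration of the Markovian Thinking idea only, not the authors' released Delethink code; the chunk size, carryover length, and the `generate` / `is_final_answer` helpers are assumptions.

```python
# Illustrative sketch of Markovian (chunked) thinking at inference time.
# Assumptions: a `generate(prompt, max_new_tokens)` helper exists, and the
# model has been trained (e.g., with Delethink-style RL) to continue its
# reasoning from a short carryover instead of the full chain of thought.

CHUNK_SIZE = 8192    # thinking tokens generated per chunk (assumed)
CARRYOVER = 512      # tail of the previous chunk kept in the state (assumed)
MAX_CHUNKS = 12      # 12 * 8K = 96K total thought budget


def markovian_think(query: str, generate, is_final_answer) -> str:
    state = query                      # bounded textual state, O(1) memory
    trace = []                         # full reasoning kept only for logging
    for _ in range(MAX_CHUNKS):
        chunk = generate(state, max_new_tokens=CHUNK_SIZE)
        trace.append(chunk)
        if is_final_answer(chunk):     # model emitted its answer
            break
        # Delete the long context; keep only the query plus a short carryover,
        # so per-chunk attention cost is bounded.
        state = query + chunk[-CARRYOVER:]
    return "".join(trace)
```

Because each chunk attends over at most the query plus a fixed carryover plus one chunk of new tokens, total compute grows linearly with the number of chunks instead of quadratically with the full reasoning length, and memory stays constant.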
We show a phase transition for optimal data curation: for strong models, concentrating on difficult samples drives further improvement (LIMO). In contrast, weaker models benefit from the conventional "More is More" regime, where broad data exposure is essential to learn core capabilities.
1/n "Less is More" (s1, etc.) vs "More is More", which mantra is correct for the training/fine-tuning large LLMs? In our recent preprint, we reconcile both of these. They correspond to different parts of a complex phase diagram
0
8
13
Nearly 3 years after our NeurIPS paper, SOTA architectures are now adopting NoPE. Kimi Linear uses NoPE for all full-attention layers (not a RoPE hybrid).
The brilliant Kimi Linear paper: a hybrid attention architecture that beats full attention while cutting the key-value cache by up to 75% and delivering up to 6x faster decoding at 1M context. Full attention is
7
34
371
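For readers unfamiliar with NoPE: it simply means the full-attention layers apply standard scaled dot-product attention without rotating queries and keys by position (no RoPE, no learned positional embedding). A minimal sketch with assumed tensor shapes, not Kimi Linear's actual implementation:

```python
import torch
import torch.nn.functional as F


def nope_attention(q, k, v, causal=True):
    """Scaled dot-product attention with no positional encoding applied.

    q, k, v: (batch, heads, seq_len, head_dim). A RoPE layer would rotate
    q and k by position before this call; a NoPE layer skips that step and
    relies on the causal mask (and token content) for order information.
    """
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


# Toy usage with random tensors (shapes are arbitrary).
q = k = v = torch.randn(1, 8, 16, 64)
out = nope_attention(q, k, v)          # (1, 8, 16, 64)
```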
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. Predict a learned
10
46
216
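To make the idea concrete, here is a minimal sketch of what a "predict a summary of the future" auxiliary objective could look like. The mean-pooled summary, the MSE loss, and the `horizon` parameter are assumptions for illustration; the FSP paper predicts a learned summary representation rather than this simple pooling.

```python
import torch
import torch.nn.functional as F


def future_summary_loss(hidden, token_embeds, horizon=8):
    """Auxiliary loss: each hidden state must match a summary of the
    embeddings of the next `horizon` ground-truth tokens.

    hidden:       (batch, seq, dim) hidden states from the LM.
    token_embeds: (batch, seq, dim) embeddings of the ground-truth tokens
                  (e.g., the LM's input embeddings), detached as targets.
    The mean-pooled summary and MSE loss are illustrative assumptions.
    """
    B, T, D = hidden.shape
    assert T > horizon, "sequence must be longer than the summary horizon"
    # Summary of the next `horizon` token embeddings for each position t.
    summaries = torch.stack(
        [token_embeds[:, t + 1 : t + 1 + horizon].mean(dim=1) for t in range(T - horizon)],
        dim=1,
    )                                    # (B, T - horizon, D)
    preds = hidden[:, : T - horizon]     # align predictions with summaries
    return F.mse_loss(preds, summaries.detach())
```

In practice a term like this would be added with some weight to the usual next-token cross-entropy; the sketch only shows the shape of the objective.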
My prediction is that next-token prediction loss will not stand the test of time, and the next frontier models will need richer loss functions. In this paper, we take a step towards that, shifting from predicting a single token to predicting a summary of the future.
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. Predict a learned
0
14
31
I am recruiting several graduate students (both MSc and PhD level) for Fall 2026 @ChandarLab! The application deadline is December 01. Please apply through the @Mila_Quebec supervision request process here: https://t.co/4UfkKPMRHn. More details about the recruitment process
9
159
581
I can't attend #ICCV 2025 in Honolulu, Hawaii, but my amazing teammates will be there! Please stop by our poster tomorrow, 21 Oct (#438), to learn about TAPNext, a general, ViT-like architecture with SOTA point-tracking quality! Links: website: https://t.co/FILiZsVgQa
1
3
7
Alleviating long context issues:
- Iterative Amortized Inference (IAI) refines solutions step-by-step over mini-batches, just like stochastic optimization.
- IAI merges:
  - the scalability of stochastic optimization (SGD);
  - the expressivity of forward-pass amortization (ICL in LLMs).
Meta on meta: thrilled to share our work on Meta-learning… at Meta! We make two major contributions: 1) a unified framework revealing insights into various amortizations; 2) greedy belief-state updates to handle long context lengths.
1
8
20
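A minimal sketch of the mini-batch refinement pattern described above. The `update` network, the belief-state type, and the toy usage are assumptions, not the paper's implementation.

```python
# Illustrative sketch of iterative amortized inference over mini-batches.
# Assumption: `update` is an amortized inference network (e.g., a transformer)
# that maps (current belief state, a mini-batch of context) -> refined state.

from typing import Callable, Iterable, TypeVar

State = TypeVar("State")
Batch = TypeVar("Batch")


def iterative_amortized_inference(
    init_state: State,
    minibatches: Iterable[Batch],
    update: Callable[[State, Batch], State],
) -> State:
    """Refine a belief state one mini-batch at a time, like SGD refines
    parameters: each step is a cheap forward pass (amortization), but the
    context never has to fit in a single window (scalability)."""
    state = init_state
    for batch in minibatches:
        state = update(state, batch)   # greedy, in-context belief update
    return state


# Toy usage: the "belief state" is a running (sum, count); each mini-batch of
# numbers greedily refines it, and the final state summarizes all batches.
state = iterative_amortized_inference(
    init_state=(0.0, 0),
    minibatches=[[1.0, 2.0], [3.0], [4.0, 5.0]],
    update=lambda s, b: (s[0] + sum(b), s[1] + len(b)),
)
mean = state[0] / state[1]             # 3.0
```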
Mila's annual supervision request process is now open to receive MSc and PhD applications for Fall 2026 admission! For more information, visit https://t.co/r01eLcY1P4
3
64
120
Mila is the only academic institute in the world with 1500+ AI researchers in one place! Want to join the greatest concentration of AI talent? Apply now!
Mila's annual supervision request process is now open to receive MSc and PhD applications for Fall 2026 admission! For more information, visit https://t.co/r01eLcY1P4
1
2
33
Markovian Thinking by @Mila_Quebec & @Microsoft lets LLMs reason with a fixed-size state: compute stays the same no matter how long the reasoning chain gets. This makes RL linear-cost and memory-constant. The team's Delethink RL setup trains models to be Markovian Thinkers,
5
28
165
Thanks for sharing, @TheTuringPost We propose Markovian Thinking as a new paradigm, and Delethink as a simple, concrete instantiation enabling constant-memory, linear-compute reasoning that keeps improving beyond training limits.
Markovian Thinking by @Mila_Quebec & @Microsoft lets LLMs reason with a fixed-size state: compute stays the same no matter how long the reasoning chain gets. This makes RL linear-cost and memory-constant. The team's Delethink RL setup trains models to be Markovian Thinkers,
0
1
11
We RL-train a CoT model to cope with restricted context (a textual state) and obtain scalable long CoTs (no quadratic cost) + a puzzling test-time scaling (TTS) behavior where the model actually uses more tokens for harder problems. Kudos to @a_kazemnejad @MAghajohari @KChitsaz who see depth behind
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2x accuracy 🧵
0
8
23
Nice paper! Make the context for reasoning local and train an RL model under that truncation. This way the model "Markovianifies" and makes use of its context efficiently!
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2x accuracy 🧵
0
3
12
A very nice read. Fixed chunks make ultra-long reasoning feasible. Very nice visualizations too! Congrats to the authors!
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2x accuracy 🧵
0
1
19
Linear-time thinking, not quadratic. A recipe to scale RL reasoning linearly -- both at inference time and in training. Works with existing models out of the box, or one can adapt them efficiently to be native linear-time thinkers.
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2x accuracy 🧵
1
18
76
Very cool work from my lab-mate @KChitsaz and collaborators! The Markovian Thinker decouples context length from reasoning length, making long-form reasoning linear-compute. The "Delethink" environment is such a clever idea! https://t.co/WLHMLG3wjw
arxiv.org
Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the...
Long reasoning without the quadratic tax: The Markovian Thinker makes LLMs reason in chunks with a bounded state: linear compute, constant memory, and it keeps scaling beyond the training limit. 1/6
0
2
4
New Paper Alert! A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens 🚩 What do you think are some major limitations in current safety training approaches? We think it's in their design: they rely on completely changing the model's distribution by
1
25
50
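A minimal sketch of how a reserved red-flag token could be used at inference time, assuming the model has been trained to emit such a token inside harmful continuations. The token string and the handling policy are hypothetical illustrations, not the paper's exact recipe.

```python
# Assumption: the model has a reserved special token it was trained to emit
# when its own continuation turns harmful, so detection reduces to checking
# for that token rather than relying on the model refusing outright.

RED_FLAG_TOKEN = "<red_flag>"   # hypothetical reserved special token


def moderate(generated_text: str) -> str:
    if RED_FLAG_TOKEN in generated_text:
        # The generation flagged itself as harmful; intervene here.
        return "I can't help with that."
    return generated_text
```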
Huge thanks to my co-first authors @MAghajohari and @a_kazemnejad and our supervisors @apsarathchandar, @murefil, @AaronCourville, and @sivareddyg. arXiv: https://t.co/EccG6iI1wY Code: https://t.co/pRuUFlQD7V Models:
huggingface.co
0
0
4
Even SOTA models show Markovian Thinking zero-shot: GPT-OSS-120B and Qwen3-30B-A3B recover/track LongCoT with no special prompting/training. Lots of in-distribution positives at init, so RL with Delethink is primed to scale. 5/6
1
0
5