Kamran Chitsaz Profile
Kamran Chitsaz

@KChitsaz

Followers
147
Following
2K
Media
8
Statuses
120

Machine Learning Researcher @Mila_Quebec, MSc in Electrical Engineering at @polymtl

Montreal, QC
Joined March 2021
@KChitsaz
Kamran Chitsaz
1 month
Long reasoning without the quadratic tax: The Markovian Thinker makes LLMs reason in chunks with a bounded state → linear compute, constant memory, and it keeps scaling beyond the training limit. 1/6
@MAghajohari
Milad Aghajohari
1 month
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory; architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2X accuracy 🧵
2
17
37
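The linear-vs-quadratic claim in this thread can be illustrated with a toy attention-cost model. A minimal sketch only: the chunk and carryover sizes below are illustrative assumptions, not the paper's actual Delethink settings.

```python
# Toy cost model: how many tokens each new token attends to.
# LongCoT: token t attends to all t earlier tokens -> O(n^2) total work.
# Chunked, Markovian-style: the context resets every `chunk` tokens and only
# a fixed `carry` of state survives -> each token attends to at most
# chunk + carry tokens, so total work grows as O(n).

def longcot_cost(n: int) -> int:
    return sum(t for t in range(1, n + 1))  # 1 + 2 + ... + n = n(n+1)/2

def chunked_cost(n: int, chunk: int = 8192, carry: int = 512) -> int:
    total = 0
    for t in range(1, n + 1):
        pos = (t - 1) % chunk                 # position inside current chunk
        total += min(t, pos + 1 + carry)      # bounded attention window
    return total

n = 96_000  # e.g. a 96K thought budget
ratio = longcot_cost(n) / chunked_cost(n)
print(f"quadratic/chunked cost ratio at n={n}: {ratio:.1f}x")
```

With these toy numbers the bounded-state variant does roughly an order of magnitude less attention work at 96K tokens, and the gap keeps widening linearly as the thought budget grows.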
@mpezeshki91
Mohammad Pezeshki
2 days
We show a phase transition for optimal data curation: for strong models, concentrating on difficult samples drives further improvement (LIMO). In contrast, weaker models benefit from the conventional "More is More" regime, where broad data exposure is essential to learn core capabilities
@dohmatobelvis
Elvis Dohmatob
3 days
1/n "Less is More" (s1, etc.) vs "More is More": which mantra is correct for training/fine-tuning large LLMs? In our recent preprint, we reconcile both of these. They correspond to different parts of a complex phase diagram
0
8
13
@a_kazemnejad
Amirhossein Kazemnejad
6 days
After nearly 3 years since our NeurIPS paper, SOTA architectures are now adopting NoPE. Kimi Linear uses NoPE for all full-attention layers (not a RoPE hybrid).
@rohanpaul_ai
Rohan Paul
7 days
The brilliant Kimi Linear paper. It's a hybrid attention that beats full attention while cutting the key-value cache by up to 75% and delivering up to 6x faster decoding at 1M context. Full attention is
7
34
371
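The cache numbers in this tweet can be sanity-checked with back-of-the-envelope arithmetic. The layer/head/dim figures below are hypothetical, chosen only to show the shape of the calculation, not Kimi Linear's real configuration.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: 2 tensors (K and V) per layer, fp16/bf16 = 2 bytes."""
    return 2 * seq_len * layers * kv_heads * head_dim * bytes_per_elem

# Hypothetical dense model: every layer keeps full-attention KV state.
full = kv_cache_bytes(seq_len=1_000_000, layers=48, kv_heads=8, head_dim=128)

# Hybrid with full attention in 1 of every 4 layers: the linear-attention
# layers keep O(1) state (ignored here), so the cache shrinks by ~75%.
hybrid = kv_cache_bytes(seq_len=1_000_000, layers=12, kv_heads=8, head_dim=128)

print(f"full: {full / 2**30:.0f} GiB, hybrid: {hybrid / 2**30:.0f} GiB")
```

At a 1M-token context this toy setup needs ~183 GiB of KV cache under full attention versus ~46 GiB for the hybrid, which is where both the memory savings and much of the decoding speedup come from.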
@divyat09
Divyat Mahajan
11 days
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌 Predict a learned
10
46
216
@mpezeshki91
Mohammad Pezeshki
10 days
My prediction is that next-token prediction loss will not stand the test of time, and the next frontier models will need richer loss functions. In this paper, we take a step towards that, shifting from predicting a single token to predicting a summary of the future.
@divyat09
Divyat Mahajan
11 days
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌 Predict a learned
0
14
31
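The FSP idea quoted above (predicting a summary of future embeddings rather than only the next token) can be sketched as an auxiliary regression target. The paper learns its summary; this toy version just mean-pools a fixed horizon of future embeddings to convey the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim, horizon = 16, 32, 4
embeddings = rng.normal(size=(seq_len, dim))  # per-token embeddings

def future_summary(t: int) -> np.ndarray:
    """Toy summary target: mean of the next `horizon` token embeddings."""
    return embeddings[t + 1 : t + 1 + horizon].mean(axis=0)

def fsp_loss(pred: np.ndarray, t: int) -> float:
    """Auxiliary regression loss against the future-summary target,
    used alongside (not instead of) next-token prediction."""
    return float(((pred - future_summary(t)) ** 2).mean())

perfect = fsp_loss(future_summary(3), 3)   # exact prediction -> zero loss
blind = fsp_loss(np.zeros(dim), 3)         # uninformed prediction -> positive
print(perfect, blind)
```

Because the target is a pooled statistic of several future positions, a model that merely memorizes the next token cannot drive this loss to zero, which is the sense in which it discourages shortcut learning.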
@apsarathchandar
Sarath Chandar
16 days
I am recruiting several graduate students (both MSc and PhD level) for Fall 2026 @ChandarLab! The application deadline is December 01. Please apply through the @Mila_Quebec supervision request process here: https://t.co/4UfkKPMRHn. More details about the recruitment process
9
159
581
@artemZholus
Artem Zholus
19 days
I can't attend #ICCV 2025 in Honolulu, Hawaii but my amazing teammates will be there! Please stop by our poster tomorrow 21 Oct (#438) to learn about TAPNext, a general, ViT-like architecture with SOTA point tracking quality! Links: 🌐 website: https://t.co/FILiZsVgQa
1
3
7
@mpezeshki91
Mohammad Pezeshki
19 days
Alleviating long context issues: Iterative Amortized Inference (IAI) refines solutions step-by-step over mini-batches, just like stochastic optimization. IAI merges: - Scalability of stochastic opt. (SGD). - Expressivity of forward-pass amortization (ICL in LLMs).
@sarthmit
Sarthak Mittal
22 days
Meta on meta: thrilled to share our work on Meta-learning… at Meta! 🔥🧠 We make two major contributions: 1️⃣ Unified framework revealing insights into various amortizations 🧠 2️⃣ Greedy belief-state updates to handle long context-lengths 🚀
1
8
20
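The SGD analogy in the IAI tweet above can be made concrete with a miniature loop: a running solution refined over mini-batches of the "context". IAI itself replaces the hand-written gradient step with a learned amortized update, so this sketch conveys only the iterative structure, not the method.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))   # the full "context"
w_true = rng.normal(size=5)
y = X @ w_true                   # noiseless targets for a least-squares task

w = np.zeros(5)                  # running solution, refined step by step
for step in range(2000):
    idx = rng.integers(0, len(X), size=32)        # a mini-batch of context
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / 32  # local improvement signal
    w -= 0.05 * grad

print(np.max(np.abs(w - w_true)))  # converges toward the full-batch solution
```

The appeal is that each step touches only a small slice of the context (scalable, like SGD) while the running state carries everything learned so far (expressive, like in-context learning in a forward pass).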
@Mila_Quebec
Mila - Institut québécois d'IA
25 days
Mila's annual supervision request process is now open to receive MSc and PhD applications for Fall 2026 admission! For more information, visit https://t.co/r01eLcY1P4
3
64
120
@apsarathchandar
Sarath Chandar
25 days
Mila is the only academic institute in the world with 1500+ AI researchers in one place! Want to join the greatest concentration of AI talent? Apply now!
@Mila_Quebec
Mila - Institut québécois d'IA
25 days
Mila's annual supervision request process is now open to receive MSc and PhD applications for Fall 2026 admission! For more information, visit https://t.co/r01eLcY1P4
1
2
33
@TheTuringPost
TuringPost
29 days
Markovian Thinking by @Mila_Quebec & @Microsoft lets LLMs reason with a fixed-size state – compute stays the same no matter how long the reasoning chain gets. This makes RL linear-cost and memory-constant. The team's Delethink RL setup trains models to be Markovian Thinkers,
5
28
165
@KChitsaz
Kamran Chitsaz
29 days
Thanks for sharing, @TheTuringPost! We propose Markovian Thinking as a new paradigm, and Delethink as a simple, concrete instantiation enabling constant-memory, linear-compute reasoning that keeps improving beyond training limits.
@TheTuringPost
TuringPost
29 days
Markovian Thinking by @Mila_Quebec & @Microsoft lets LLMs reason with a fixed-size state – compute stays the same no matter how long the reasoning chain gets. This makes RL linear-cost and memory-constant. The team's Delethink RL setup trains models to be Markovian Thinkers,
0
1
11
@murefil
Alessandro Sordoni
30 days
We RL-train a CoT model to cope with restricted context (a textual state) and obtain scalable long CoTs (no quadratic cost) + a puzzling TTS behavior where the model actually uses more tokens for harder problems. Kudos to @a_kazemnejad @MAghajohari @KChitsaz who see depth behind
@MAghajohari
Milad Aghajohari
1 month
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory; architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2X accuracy 🧵
0
8
23
@artemZholus
Artem Zholus
1 month
Nice paper! Make the context for reasoning local and train an RL model with such truncation. This way the model "Markovianifies" and makes use of its context efficiently!
@MAghajohari
Milad Aghajohari
1 month
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory; architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2X accuracy 🧵
0
3
12
@mpezeshki91
Mohammad Pezeshki
30 days
A very nice read. Fixed chunks make ultra-long reasoning feasible. Very nice visualizations too! Congrats to the authors!
@MAghajohari
Milad Aghajohari
1 month
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory; architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2X accuracy 🧵
0
1
19
@sivareddyg
Siva Reddy
1 month
Linear time thinking, not quadratic 🚀🚀. A recipe to scale RL reasoning linearly -- both inference time and training. Works with existing models out of the box or one can adapt them efficiently to be native linear time thinkers.
@MAghajohari
Milad Aghajohari
1 month
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales O(n) compute, not O(n^2), with O(1) memory; architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2X accuracy 🧵
1
18
76
@DavideBald42296
Davide Baldelli
1 month
Very cool work from my lab-mate @KChitsaz and collaborators! The Markovian Thinker decouples context length from reasoning length, making long-form reasoning linear-compute. The "Delethink" environment is such a clever idea 👍 https://t.co/WLHMLG3wjw
arxiv.org
Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the...
@KChitsaz
Kamran Chitsaz
1 month
Long reasoning without the quadratic tax: The Markovian Thinker makes LLMs reason in chunks with a bounded state → linear compute, constant memory, and it keeps scaling beyond the training limit. 1/6
0
2
4
@HeurtelDepeiges
David Heurtel-Depeiges
1 month
Check out this collab between students and professors at Mila and Microsoft.
@KChitsaz
Kamran Chitsaz
1 month
Long reasoning without the quadratic tax: The Markovian Thinker makes LLMs reason in chunks with a bounded state → linear compute, constant memory, and it keeps scaling beyond the training limit. 1/6
0
1
4
@mhrnz_m
Mehrnaz Mofakhami
1 month
📃 New Paper Alert! ✨ A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens 🚩 What do you think are some major limitations in current safety training approaches? ➡️ We think it's in their design: they rely on completely changing the model's distribution by
1
25
50
@KChitsaz
Kamran Chitsaz
1 month
Huge thanks to my co-first authors @MAghajohari and @a_kazemnejad and our supervisors @apsarathchandar, @murefil, @AaronCourville, and @sivareddyg. Arxiv: https://t.co/EccG6iI1wY Code: https://t.co/pRuUFlQD7V Models:
huggingface.co
0
0
4
@KChitsaz
Kamran Chitsaz
1 month
Even SOTA models show Markovian Thinking zero-shot: GPT-OSS-120B and Qwen3-30B-A3B recover/track LongCoT with no special prompting/training. Lots of in-distribution positives at init, so RL with Delethink is primed to scale. 5/6
1
0
5