Piotr Nawrot Profile
Piotr Nawrot

@p_nawrot

Followers: 9K · Following: 960 · Media: 33 · Statuses: 401

LLM Efficiency PhD @ Edinburgh | 🥇🥈 @ Flunkyball Polish Championships | 🥇 @ Jerry Hunter Pub's Bowling Tournament | 50000 🏆 & Legendary II @ Brawl Stars

Warsaw
Joined July 2014
@p_nawrot
Piotr Nawrot
7 months
Sparse attention is one of the most promising strategies to unlock long-context processing and long-generation reasoning in LLMs. We performed the most comprehensive study on training-free sparse attention to date. Here is what we found:
13
115
661
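The tweet above refers to training-free sparse attention, where a pretrained model's attention is restricted at inference time to a subset of keys per query. Below is a minimal illustrative sketch of one such pattern, a local causal window combined with per-query top-k key selection; it is not one of the specific methods evaluated in the paper, and the window size and top-k budget are arbitrary assumptions.

```python
import torch

def topk_sparse_attention(q, k, v, window=4, topk=8):
    """Toy training-free sparse attention for a single head: each query
    attends only to a local causal window plus its top-k scoring keys."""
    T, d = q.shape
    scores = (q @ k.T) / d ** 0.5                            # (T, T) logits
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))

    # Always keep a local window around each query position.
    idx = torch.arange(T)
    local = (idx[:, None] - idx[None, :]).abs() < window

    # Additionally keep the k highest-scoring (causal) keys per query.
    kth = scores.topk(min(topk, T), dim=-1).values[:, -1:]   # per-row threshold
    keep = (local | (scores >= kth)) & causal

    return torch.softmax(scores.masked_fill(~keep, float("-inf")), dim=-1) @ v

# Toy usage: random single-head sequence of 64 tokens.
q, k, v = (torch.randn(64, 32) for _ in range(3))
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([64, 32])
```

In real implementations the savings come from never materializing the masked entries at all; the dense mask here is only for clarity.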
@rizavelioglu
Rıza Velioğlu
10 days
@zpysky1125 (MiniMax M2 lead) drops truth bombs in their latest tech blog: > There’s no free lunch. When you reduce attention complexity, you pay a price. The question is, where? At scale, it hit hard: obvious weaknesses in complex, multi-hop reasoning. - @giffmana been
@zpysky1125
Pengyu Zhao
12 days
MiniMax M2 Tech Blog 3: Why Did M2 End Up as a Full Attention Model? On behalf of pre-training lead Haohai Sun. (https://t.co/WH4xOD9KrT) I. Introduction As the lead of MiniMax-M2 pretrain, I've been getting many queries from the community on "Why did you turn back the clock
1
3
10
@p_nawrot
Piotr Nawrot
11 days
> From the perspective of a year ago, a hybrid of Lightning Attention and Full Attention looked just as good as pure full attention. > Did we find a free lunch? Not quite. > The price became clear at larger scales: the model showed obvious weaknesses in complex, multi-hop
@zpysky1125
Pengyu Zhao
12 days
MiniMax M2 Tech Blog 3: Why Did M2 End Up as a Full Attention Model? On behalf of pre-training lead Haohai Sun. (https://t.co/WH4xOD9KrT) I. Introduction As the lead of MiniMax-M2 pretrain, I've been getting many queries from the community on "Why did you turn back the clock
2
21
103
@PontiEdoardo
Edoardo Ponti
2 months
I am looking for a 2-year 𝗽𝗼𝘀𝘁𝗱𝗼𝗰 to work on efficient foundation models at @InfAtEd and @EPCCed! This is part of the @ARIA_research funding for Scaling Compute: AI at 1/1000th the cost
1
17
30
@ospanbatyr
Osman Batur İnce
2 months
Multimodal models typically need millions of examples from each modality paired with text for training. With SEMI 🌓, we integrate new low-resource modalities into LLMs with as few as 32 samples — including satellite images, galaxies, sensors, and molecules. (1/6)
3
40
213
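For context, a common recipe for this kind of few-sample modality integration (stated here as a generic pattern, not necessarily SEMI's method) is to freeze a pretrained modality encoder and the LLM, and train only a small projector that maps encoder features into the LLM's input-embedding space; with so few trainable parameters, a few dozen paired samples can be enough. A hypothetical sketch, where the dimensions, the encoder, and the MSE stand-in loss are all assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a frozen modality encoder with 512-d features and an
# LLM with 4096-d token embeddings; 32 paired training examples.
ENC_DIM, LLM_DIM, N_SAMPLES = 512, 4096, 32

class ModalityProjector(nn.Module):
    """Small trainable bridge from modality features to LLM embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(ENC_DIM, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM)
        )

    def forward(self, feats):        # (batch, ENC_DIM) frozen-encoder features
        return self.proj(feats)      # (batch, LLM_DIM), fed to the LLM as soft tokens

projector = ModalityProjector()                     # the only trainable part
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

# Stand-ins for 32 (modality feature, supervision target) pairs; in a real
# setup the loss would come through the frozen LLM's language-modelling head.
feats, targets = torch.randn(N_SAMPLES, ENC_DIM), torch.randn(N_SAMPLES, LLM_DIM)

for _ in range(100):
    loss = nn.functional.mse_loss(projector(feats), targets)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
print(float(loss))
```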
@PontiEdoardo
Edoardo Ponti
2 months
I've been awarded a Starting Grant from @ERC_Research! As part of AToM-FM ⚛️, I'll study efficient architectures for foundation models with end-to-end tokenisation and adaptive+permanent memory
Building a greener, more democratic AI
@ERC_Research
European Research Council (ERC)
2 months
📣 The ERC Starting Grant call results are out! Find out which early-career researchers will receive funding, what they will be investigating, where they will be based... plus lots of other #ERCStG facts & figures for 2025! ➡️ https://t.co/cGctMhcJos 🇪🇺 #HorizonEurope
14
18
142
@fchollet
François Chollet
3 months
I'll take the other side of this bet...
@davidpattersonx
David Scott Patterson
3 months
By 2030, all jobs will be replaced by AI and robots. Easily. The US labor force is about 170 million workers. About 80 million of those jobs include hands-on work. Automated systems can work four shifts a week. Replacing all physical labor would require about 20 million
125
215
3K
@NiJinjie
Jinjie Ni
3 months
Token crisis: solved. ✅ We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs. Findings: > DLMs beat AR when tokens are limited, with >3× data potential. > A 1B DLM trained on just 1B tokens
42
237
2K
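For readers comparing the two setups in the tweet above, the core difference between the training objectives can be summarized as follows, as a generic sketch in standard notation rather than the paper's exact formulation: the AR model predicts each token from its left context in a fixed order, while a masked-diffusion LM corrupts a random subset of positions and learns to reconstruct them, so each token is trained under many corruption levels and the same data can be reused across epochs more aggressively.

```latex
% Generic objectives (sketch); M is a random set of masked positions.
\mathcal{L}_{\mathrm{AR}}(\theta)  = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
\qquad
\mathcal{L}_{\mathrm{DLM}}(\theta) = -\,\mathbb{E}_{M}\!\left[\sum_{i \in M} \log p_\theta\!\left(x_i \mid x_{\setminus M}\right)\right]
```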
@2prime_PKU
Yiping Lu
4 months
Anyone know Adam?
267
446
5K
@s_scardapane
Simone Scardapane
4 months
*The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs* by @p_nawrot @PontiEdoardo @cheeesio @seb_ruder
They study sparse attention techniques at scale, comparing to small dense models at the same compute budget. https://t.co/8dt7ceWhMe
5
32
204
@p_nawrot
Piotr Nawrot
4 months
+1
@PMinervini
Pasquale Minervini
4 months
not sure this is a good idea -- puppies need 𝐚 𝐥𝐨𝐭 of sleep and being handled by many strangers can be stressful (this is also why puppy yoga is banned in some countries on animal welfare grounds)
0
2
6
@PontiEdoardo
Edoardo Ponti
4 months
Thanks for acknowledging Dynamic Token Pooling as a predecessor to H-Net, @_albertgu! We had some decent ideas in that paper (e2e and entropy-based tokenisation), but it surprises me that it took 2 years (an eternity in NLP) to find the right recipe and scale better than BPE
@_albertgu
Albert Gu
4 months
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
1
11
87
@PontiEdoardo
Edoardo Ponti
4 months
If you are at @icmlconf make sure to attend @AdrianLancucki’s invited talk on our inference-time *hyper*-scaling paper (and more!) at the tokenization workshop this Friday https://t.co/GhCIAtu4re
@PontiEdoardo
Edoardo Ponti
5 months
🚀 By *learning* to compress the KV cache in Transformer LLMs, we can generate more tokens for the same compute budget. This unlocks *inference-time hyper-scaling*
For the same runtime or memory load, we can boost LLM accuracy by pushing reasoning even further!
0
3
21
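A rough illustration of the KV-cache compression idea in the quoted tweet: if the cache of past keys and values is pooled down by some factor, attention over the compressed cache costs proportionally less memory and compute, so the same budget covers more generated tokens. The mean-pooling below is only a placeholder for the *learned* compression the tweet describes; all sizes are arbitrary assumptions.

```python
import torch

def compress_kv(keys, values, factor=4):
    """Placeholder for a learned KV-cache compressor: mean-pool every
    `factor` consecutive (key, value) pairs into a single cache slot."""
    T, d = keys.shape
    T_trim = (T // factor) * factor                       # drop the ragged tail
    k_c = keys[:T_trim].reshape(-1, factor, d).mean(1)    # (T//factor, d)
    v_c = values[:T_trim].reshape(-1, factor, d).mean(1)
    return k_c, v_c

# With a fixed budget of cache slots, a 4x-compressed cache keeps context
# for roughly 4x more generated tokens at the same memory load.
k, v = torch.randn(4096, 64), torch.randn(4096, 64)
k_c, v_c = compress_kv(k, v, factor=4)
print(k.shape[0], "->", k_c.shape[0], "cache slots")      # 4096 -> 1024
```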
@tokshop2025
Tokenization Workshop (TokShop) @ICML2025
4 months
The TokShop schedule is now live! Join us at #ICML2025 for invited talks, poster sessions, and a panel on the future of tokenization. https://t.co/UCdWdobEgh #Tokenization #LLM #NLProc
0
5
6
@sukjun_hwang
Sukjun (June) Hwang
4 months
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
98
745
5K
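A toy illustration of the dynamic-chunking idea referenced above: instead of a fixed external tokenizer, the model scores each byte position as a potential chunk boundary and groups the bytes between predicted boundaries into variable-length units. This is a simplified sketch (a threshold on a learned boundary score), not H-Net's actual mechanism; the scorer and threshold are assumptions.

```python
import torch
import torch.nn as nn

class BoundaryScorer(nn.Module):
    """Tiny stand-in for a learned boundary predictor over byte embeddings."""
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Embedding(256, dim)   # raw bytes -> vectors
        self.score = nn.Linear(dim, 1)        # per-position boundary logit

    def forward(self, byte_ids):              # (T,) int64 in [0, 255]
        return self.score(self.embed(byte_ids)).squeeze(-1)  # (T,) logits

def dynamic_chunks(byte_ids, scorer, threshold=0.5):
    """Split a byte sequence wherever the predicted boundary prob > threshold."""
    probs = torch.sigmoid(scorer(byte_ids))
    cuts = (probs > threshold).nonzero().flatten().tolist()
    chunks, start = [], 0
    for c in cuts:
        chunks.append(byte_ids[start:c + 1])  # chunk ends at the boundary byte
        start = c + 1
    if start < len(byte_ids):
        chunks.append(byte_ids[start:])
    return chunks

text = "tokenization is just chunking".encode()
byte_ids = torch.tensor(list(text), dtype=torch.long)
chunks = dynamic_chunks(byte_ids, BoundaryScorer())   # untrained, so cuts are arbitrary
print([bytes(c.tolist()).decode() for c in chunks])
```

In the end-to-end setting, the boundary scorer is trained jointly with the language model so that chunk granularity adapts to the data rather than being fixed in advance.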
@_albertgu
Albert Gu
4 months
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
@sukjun_hwang
Sukjun (June) Hwang
4 months
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
61
189
1K
@lucalp__
Luca Perić
5 months
The Bitter Lesson is coming for Tokenization
The Byte Latent Transformer (BLT) showed the possibility of finding additional scaling laws related to removing tokenization, but the topic seemed to get little proper coverage...
3
8
31
@ori_press
Ori Press
4 months
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
6
59
160
@BeidiChen
Beidi Chen
5 months
This is cool!!!
@p_nawrot
Piotr Nawrot
5 months
We built sparse-frontier — a clean abstraction that lets you focus on your custom sparse attention implementation while automatically inheriting vLLM’s optimizations and model support. As a PhD student, I've learned that sometimes the bottleneck in research isn't ideas — it's
2
5
32
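The kind of abstraction the quoted tweet describes might look roughly like this: a small interface where the researcher only implements how keys are selected per query, while the surrounding attention plumbing (and, in the real project, the vLLM integration and model support) is inherited. This is a hypothetical sketch of the design idea, not the actual sparse-frontier API; class and method names are made up.

```python
import torch
from abc import ABC, abstractmethod

class SparseAttention(ABC):
    """Hypothetical interface: subclasses only decide *which* keys each
    query may attend to; the dense math and masking live in the base class."""

    @abstractmethod
    def keep_mask(self, scores: torch.Tensor) -> torch.Tensor:
        """Return a boolean (T, T) mask of allowed query-key pairs."""

    def __call__(self, q, k, v):
        scores = (q @ k.T) / q.shape[-1] ** 0.5
        scores = scores.masked_fill(~self.keep_mask(scores), float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

class LocalWindow(SparseAttention):
    """Example pattern: causal attention restricted to a sliding window."""
    def __init__(self, window=8):
        self.window = window

    def keep_mask(self, scores):
        T = scores.shape[0]
        i = torch.arange(T)
        dist = i[:, None] - i[None, :]
        return (dist >= 0) & (dist < self.window)

q, k, v = (torch.randn(32, 16) for _ in range(3))
print(LocalWindow(window=4)(q, k, v).shape)  # torch.Size([32, 16])
```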
@iofu728
Huiqiang Jiang
5 months
A very good abstraction of sparse attention in vLLM!🥳
@p_nawrot
Piotr Nawrot
5 months
We built sparse-frontier — a clean abstraction that lets you focus on your custom sparse attention implementation while automatically inheriting vLLM’s optimizations and model support. As a PhD student, I've learned that sometimes the bottleneck in research isn't ideas — it's
1
2
7