Piotr Nawrot
@p_nawrot
9K Followers · 960 Following · 33 Media · 401 Statuses
LLM Efficiency PhD @ Edinburgh | 🥇🥈 @ Flunkyball Polish Championships | 🥇 @ Jerry Hunter Pub's Bowling Tournament | 50000 🏆 & Legendary II @ Brawl Stars
Warsaw
Joined July 2014
Sparse attention is one of the most promising strategies to unlock long-context processing and long-generation reasoning in LLMs. We performed the most comprehensive study on training-free sparse attention to date. Here is what we found:
13 replies · 115 reposts · 661 likes
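Not from the paper itself; a minimal sketch of one common training-free pattern that studies like this evaluate: each query keeps a few global "sink" tokens plus a local window of recent tokens, with everything else masked at inference time. The function name, window size, and sink count below are illustrative.

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, window=128, n_sink=4):
    """Training-free sparse attention: each query attends only to a few
    global 'sink' tokens at the start of the sequence plus a local window
    of recent tokens. Shapes: (batch, heads, seq, head_dim)."""
    seq = q.shape[-2]
    i = torch.arange(seq).unsqueeze(1)        # query positions
    j = torch.arange(seq).unsqueeze(0)        # key positions
    keep = (j <= i) & (((i - j) < window) | (j < n_sink))   # causal & (local | sink)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~keep, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: this materialises the dense score matrix, so it only shows the
# masking pattern; real implementations exploit the sparsity for speed.
q = k = v = torch.randn(1, 8, 1024, 64)
print(sparse_attention(q, k, v).shape)  # torch.Size([1, 8, 1024, 64])
```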
@zpysky1125 (MiniMax M2 lead) drops truth bombs in their latest tech blog:
> There’s no free lunch. When you reduce attention complexity, you pay a price. The question is, where?
At scale, it hit hard: obvious weaknesses in complex, multi-hop reasoning. - @giffmana been
MiniMax M2 Tech Blog 3: Why Did M2 End Up as a Full Attention Model? On behalf of pre-training lead Haohai Sun. (https://t.co/WH4xOD9KrT) I. Introduction As the lead of MiniMax-M2 pretrain, I've been getting many queries from the community on "Why did you turn back the clock
1 reply · 3 reposts · 10 likes
Link to our study - https://t.co/onhzRqGVcv
DM if you'd like to discuss this theory further or if you have any comments : )
arxiv.org
Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain...
0 replies · 2 reposts · 11 likes
> From the perspective of a year ago, a hybrid of Lightning Attention and Full Attention looked just as good as pure full attention.
> Did we find a free lunch? Not quite.
> The price became clear at larger scales: the model showed obvious weaknesses in complex, multi-hop
[Quoted post: the same MiniMax M2 Tech Blog 3 announcement as above]
2 replies · 21 reposts · 103 likes
I am looking for a 2-year 𝗽𝗼𝘀𝘁𝗱𝗼𝗰 to work on efficient foundation models at @InfAtEd and @EPCCed! This is part of the @ARIA_research funding for Scaling Compute: AI at 1/1000th the cost
1 reply · 17 reposts · 30 likes
Multimodal models typically need millions of examples from each modality paired with text for training. With SEMI 🌓, we integrate new low-resource modalities into LLMs with as few as 32 samples — including satellite images, galaxies, sensors, and molecules. (1/6)
3 replies · 40 reposts · 213 likes
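I haven't reproduced SEMI's actual recipe; below is a hedged sketch of the general approach such work builds on: encode the new modality with a frozen pretrained encoder and learn only a small projection into the LLM's embedding space from the handful of paired samples. All module names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Hypothetical few-shot adapter: maps frozen encoder features for a new
    modality (e.g. satellite images) into the LLM's embedding space as a short
    sequence of soft tokens. Only this module would be trained."""
    def __init__(self, enc_dim=512, llm_dim=4096, n_tokens=8):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim * n_tokens)
        self.n_tokens, self.llm_dim = n_tokens, llm_dim

    def forward(self, enc_features):                     # (batch, enc_dim)
        x = self.proj(enc_features)                      # (batch, llm_dim * n_tokens)
        return x.view(-1, self.n_tokens, self.llm_dim)   # soft prompt tokens

# With ~32 paired (modality, text) samples, one would prepend these soft tokens
# to the text embeddings and train only the adapter with the usual LM loss,
# keeping both the modality encoder and the LLM frozen.
adapter = ModalityAdapter()
print(adapter(torch.randn(4, 512)).shape)  # torch.Size([4, 8, 4096])
```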
I've been awarded a Starting Grant from @ERC_Research! As part of AToM-FM ⚛️, I'll study efficient architectures for foundation models with end-to-end tokenisation and adaptive+permanent memory.
Building a greener, more democratic AI.
📣 The ERC Starting Grant call results are out! Find out which early-career researchers will receive funding, what they will be investigating, where they will be based... plus lots of other #ERCStG facts & figures for 2025! ➡️ https://t.co/cGctMhcJos 🇪🇺 #HorizonEurope
14 replies · 18 reposts · 142 likes
I'll take the other side of this bet...
By 2030, all jobs will be replaced by AI and robots. Easily. The US labor force is about 170 million workers. About 80 million of those jobs include hands-on work. Automated systems can work four shifts a week. Replacing all physical labor would require about 20 million
125 replies · 215 reposts · 3K likes
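For what it's worth, a quick check of the arithmetic the quoted post seems to imply (assuming, as the post does, that one automated system covers about four workers' worth of shifts; that premise is the quoted author's, not mine):

```python
labor_force = 170e6        # total US workers, per the quoted post
hands_on = 80e6            # jobs described as hands-on
shifts_per_machine = 4     # the post's assumption: one machine ~ four workers' shifts

print(f"hands-on share of labor force: {hands_on / labor_force:.0%}")   # ~47%
print(f"machines implied: {hands_on / shifts_per_machine / 1e6:.0f}M")  # ~20M
```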
Token crisis: solved. ✅ We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs. Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens
42 replies · 237 reposts · 2K likes
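Not the paper's code; a toy sketch of the difference in training objective the comparison is about: AR models predict the next token from a causal prefix, while masked-diffusion LMs corrupt a random fraction of tokens and reconstruct them from the full context. The ToyLM stand-in and all hyperparameters are made up.

```python
import torch
import torch.nn.functional as F

class ToyLM(torch.nn.Module):
    """Stand-in for a Transformer: embedding + linear head, just so both
    losses below run end to end."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, x):
        return self.head(self.emb(x))

def ar_loss(model, tokens):
    """Autoregressive objective: predict token t+1 from the prefix up to t."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.flatten(0, 1), tokens[:, 1:].flatten())

def masked_diffusion_loss(model, tokens, mask_id):
    """Simplified masked-diffusion objective: corrupt a random fraction of
    positions with a mask token and reconstruct them from the full sequence."""
    ratio = torch.rand(tokens.shape[0], 1).clamp(min=0.15)        # noise level per sample
    corrupt = torch.rand(tokens.shape, dtype=torch.float) < ratio
    noisy = torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)
    return F.cross_entropy(logits[corrupt], tokens[corrupt])

tokens = torch.randint(0, 999, (2, 32))
model = ToyLM()
print(ar_loss(model, tokens).item(), masked_diffusion_loss(model, tokens, mask_id=999).item())
```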
*The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs* by @p_nawrot @PontiEdoardo @cheeesio @seb_ruder
They study sparse attention techniques at scale, comparing to small dense models at the same compute budget. https://t.co/8dt7ceWhMe
5 replies · 32 reposts · 204 likes
Thanks for acknowledging Dynamic Token Pooling as a predecessor to H-Net, @_albertgu! We had some decent ideas in that paper (e2e and entropy-based tokenisation), but it surprises me that it took 2 years (an eternity in NLP) to find the right recipe and scale better than BPE
Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence. Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
1 reply · 11 reposts · 87 likes
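Neither the Dynamic Token Pooling nor the H-Net code; a pure-Python toy of the entropy-based boundary idea the two share: place a chunk boundary wherever a small next-character predictor becomes uncertain (its entropy spikes). Real systems learn the predictor end to end inside the model; the bigram model and threshold here are just for illustration.

```python
import math
from collections import Counter, defaultdict

def boundaries_by_entropy(text, threshold=2.0):
    """Toy entropy-based chunking: a bigram character model estimates the
    entropy of the next character; a boundary is placed after positions where
    that entropy exceeds `threshold` bits (uncertainty ~ end of a unit)."""
    # fit next-character counts on the text itself (a real system uses a learned model)
    nxt = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        nxt[a][b] += 1

    def entropy(ch):
        counts = nxt[ch]
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values()) if total else 0.0

    return [i for i, ch in enumerate(text[:-1]) if entropy(ch) > threshold]

text = "the cat sat on the mat and the cat ran"
cuts = boundaries_by_entropy(text)
starts = [0] + [c + 1 for c in cuts]
ends = [c + 1 for c in cuts] + [len(text)]
print([text[a:b] for a, b in zip(starts, ends)])  # chunks split at high-entropy positions
```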
If you are at @icmlconf make sure to attend @AdrianLancucki’s invited talk on our inference-time *hyper*-scaling paper (and more!) at the tokenization workshop this Friday https://t.co/GhCIAtu4re
🚀 By *learning* to compress the KV cache in Transformer LLMs, we can generate more tokens for the same compute budget. This unlocks *inference-time hyper-scaling* For the same runtime or memory load, we can boost LLM accuracy by pushing reasoning even further!
0 replies · 3 reposts · 21 likes
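Not the hyper-scaling paper's actual compressor; a rough sketch of the general idea behind learned KV-cache compression: merge each block of cached key/value vectors into a single slot with learned weights, so the same memory and compute budget covers more generated tokens. The module and block size below are placeholders.

```python
import torch
import torch.nn as nn

class KVCompressor(nn.Module):
    """Stand-in for a learned KV-cache compressor: every `block` cached
    positions are merged into one key/value slot via a learned weighted
    average, shrinking the cache roughly block-fold."""
    def __init__(self, head_dim=64, block=4):
        super().__init__()
        self.block = block
        self.scorer = nn.Linear(head_dim, 1)   # learns which positions matter

    def forward(self, k, v):                   # (batch, heads, seq, head_dim)
        b, h, s, d = k.shape
        s = s - s % self.block                 # drop the ragged tail for simplicity
        k, v = k[..., :s, :], v[..., :s, :]
        k = k.view(b, h, s // self.block, self.block, d)
        v = v.view(b, h, s // self.block, self.block, d)
        w = torch.softmax(self.scorer(k), dim=-2)    # weights within each block
        return (w * k).sum(-2), (w * v).sum(-2)      # compressed K and V

comp = KVCompressor()
k = v = torch.randn(1, 8, 1024, 64)
ck, cv = comp(k, v)
print(ck.shape)  # torch.Size([1, 8, 256, 64]) -> 4x fewer cached positions
```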
The TokShop schedule is now live! Join us at #ICML2025 for invited talks, poster sessions, and a panel on the future of tokenization. https://t.co/UCdWdobEgh
#Tokenization #LLM #NLProc
0 replies · 5 reposts · 6 likes
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
98 replies · 745 reposts · 5K likes
[Retweet: @_albertgu's "Tokenization is just a special case of chunking" post quoted above, together with the H-Net announcement above]
61 replies · 189 reposts · 1K likes
The Bitter Lesson is coming for Tokenization.
The Byte Latent Transformer (BLT) showed the possibility of finding additional scaling laws related to removing tokenization, but the topic seemed to get little proper coverage...
3 replies · 8 reposts · 31 likes
Do language models have algorithmic creativity? To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!🧵⬇️
6 replies · 59 reposts · 160 likes
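A hedged sketch of the kind of measurement such a benchmark implies (the real AlgoTune harness is certainly more careful about timing and validation): run a candidate implementation against a reference on the same input, check the output is still valid, and report the speedup. The zlib "task" below is only an illustration.

```python
import time
import zlib

def reference_solve(data: bytes) -> bytes:
    return zlib.compress(data, level=9)    # baseline implementation

def candidate_solve(data: bytes) -> bytes:
    return zlib.compress(data, level=6)    # "optimised" candidate (trades ratio for speed)

def speedup(data: bytes) -> float:
    t0 = time.perf_counter()
    reference_solve(data)
    t_ref = time.perf_counter() - t0

    t0 = time.perf_counter()
    out = candidate_solve(data)
    t_cand = time.perf_counter() - t0

    assert zlib.decompress(out) == data    # output must still be valid
    return t_ref / t_cand                  # >1 means faster than the reference

print(f"speedup: {speedup(b'example payload ' * 50_000):.2f}x")
```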
A very good abstraction of sparse attention in vLLM!🥳
We built sparse-frontier — a clean abstraction that lets you focus on your custom sparse attention implementation while automatically inheriting vLLM’s optimizations and model support. As a PhD student, I've learned that sometimes the bottleneck in research isn't ideas — it's
1 reply · 2 reposts · 7 likes
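I don't know the repo's real API, so this is purely a hypothetical sketch of what a "bring your own sparsity, inherit the rest" abstraction can look like: the framework owns the attention plumbing and the researcher registers only the selection logic. Every name below is made up.

```python
import torch

# Hypothetical registry: the researcher supplies a function that decides which
# score entries to keep; the framework owns everything else (batching, paging,
# kernel dispatch in a real engine).
PATTERNS = {}

def register_pattern(name):
    def deco(fn):
        PATTERNS[name] = fn
        return fn
    return deco

@register_pattern("topk_per_query")
def topk_per_query(scores: torch.Tensor, top: int = 64) -> torch.Tensor:
    """User code: keep only the `top` highest-scoring keys per query
    (ignoring causality for brevity)."""
    idx = scores.topk(min(top, scores.shape[-1]), dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, idx, torch.ones_like(idx, dtype=torch.bool))
    return mask

def framework_attention(q, k, v, pattern_name, **kw):
    """Framework code: compute scores, ask the registered pattern what to keep,
    finish the softmax. A real system would fuse this into sparse kernels."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    keep = PATTERNS[pattern_name](scores, **kw)
    scores = scores.masked_fill(~keep, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 512, 64)
print(framework_attention(q, k, v, "topk_per_query", top=32).shape)
```

Keeping the user-facing surface this small is presumably the point: the researcher swaps patterns without touching the serving stack.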