Sander Land
@magikarp_tokens
1K Followers · 381 Following · 34 Media · 149 Statuses

Breaking all the models with weird tokens

ម្បី᥀$PostalCodesNL / Oslo
Joined March 2024
@magikarp_tokens
Sander Land
2 months
🔠 UTF-8 was never meant for language models. Yet every major tokenizer still uses it, creating unfair "byte premiums". Why should your native script cost more to tokenize? It's time for a change. 🧵👇
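A quick way to see the premium (a minimal Python sketch; the example words are mine, not from the thread):

```python
# UTF-8 cost per character depends on the script: ASCII takes 1 byte,
# Cyrillic takes 2, most Indic and East Asian scripts take 3, emoji 4.
# Byte-level tokenizers start from this biased representation.
for word in ["hello", "привет", "안녕하세요", "ខ្មែរ"]:
    chars, nbytes = len(word), len(word.encode("utf-8"))
    print(f"{word!r}: {chars} characters -> {nbytes} UTF-8 bytes")
```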
@magikarp_tokens
Sander Land
9 days
Had a fantastic time at the Tokenization workshop, and I'm really grateful for the recognition of our work with a best paper award.
@tokshop2025
Tokenization Workshop (TokShop) @ICML2025
9 days
🏆 Announcing our Best Paper Awards!
🥇 Winner: "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization"
🥈 Runner-up: "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression"
Congrats! 🎉
@magikarp_tokens
Sander Land
9 days
RT @AiEleuther: Congratulations to @linguist_cat and @magikarp_tokens on winning the best paper award at the #ICML2025 Tokenizer Workshop!….
@magikarp_tokens
Sander Land
9 days
RT @soldni: most controversial statement so far from @alisawuffles: "tokenization research is not as cool". **very vocal disagreements fro….
@magikarp_tokens
Sander Land
10 days
RT @Cohere_Labs: We’re excited to share that work from our @Cohere colleague @magikarp_tokens, “BPE Stays on SCRIPT: Structured Encoding fo….
@magikarp_tokens
Sander Land
12 days
RT @tokshop2025: 🎤 Meet our expert panelists! Join Albert Gu, Alisa Liu, Kris Cao, Sander Land, and Yuval Pinter as they discuss the Future….
@magikarp_tokens
Sander Land
19 days
SCRIPT-BPE coming to ICML next week!
@Cohere_Labs
Cohere Labs
19 days
We’re excited to share that two recent works from @Cohere and Cohere Labs will be published at workshops next week at @icmlconf in Vancouver! 🇨🇦
🎉 Congrats to all researchers with work presented!
@simon_ycl, @cliangyu_, Sara Ahmadian, @mziizm, @magikarp_tokens, @linguist_cat
@magikarp_tokens
Sander Land
26 days
Read more here:
@magikarp_tokens
Sander Land
26 days
Why do language models start by converting text to bytes? 🤔
UTF-8 solved a 1992 storage problem. LLMs have different needs.
🧵 New post explaining how we can do better: Beyond Bytes ⮕
Fun fact: GPT-4o tokenizes that arrow as [b' \xe2', b'\xae', b'\x95\n\n'] 🤖💥
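The byte split behind that fun fact is easy to reproduce in plain Python (a small sketch; the token boundaries in the comment are from the GPT-4o example above):

```python
arrow = "⮕"  # U+2B95 RIGHTWARDS BLACK ARROW
print(arrow.encode("utf-8"))  # b'\xe2\xae\x95': one character, three bytes
# A byte-level BPE is free to merge these bytes with neighbouring
# whitespace, scattering one character across three tokens:
# [b' \xe2', b'\xae', b'\x95\n\n'] in the GPT-4o example above.
```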
@magikarp_tokens
Sander Land
1 month
RT @AiEleuther: We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members….
@magikarp_tokens
Sander Land
2 months
RT @saumyamalik44: I’m thrilled to share RewardBench 2 📊— We created a new multi-domain reward model evaluation that is substantially harde….
@magikarp_tokens
Sander Land
2 months
RT @linguist_cat: In other words:
(graphic design by @magikarp_tokens)
@magikarp_tokens
Sander Land
2 months
RT @linguist_cat: Sander and I have been working on a new encoding scheme for tokenization which mitigates variable length byte sequences f….
@magikarp_tokens
Sander Land
2 months
5/ SCRIPT can also be used to eliminate complex regex pretokenization. 😵‍💫
Current tokenizers use giant regular expressions to break up text, and many of them have unexpected edge cases. SCRIPT gives you a simple alternative: split at the points where the encoding block changes. A toy version of this splitter is sketched below.
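A minimal sketch of block-change splitting; the block table and the `block_of`/`pretokenize` helpers are illustrative, not the paper's actual encoding blocks:

```python
# Toy block table: (name, first code point, last code point).
BLOCKS = [
    ("latin", 0x0000, 0x024F),
    ("hangul", 0xAC00, 0xD7A3),
    ("cjk", 0x4E00, 0x9FFF),
]

def block_of(ch: str) -> str:
    # Map a character to the block containing its code point.
    cp = ord(ch)
    for name, lo, hi in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"

def pretokenize(text: str) -> list[str]:
    """Split text wherever the encoding block changes."""
    pieces: list[str] = []
    for ch in text:
        if pieces and block_of(pieces[-1][-1]) == block_of(ch):
            pieces[-1] += ch  # same block: extend the current run
        else:
            pieces.append(ch)  # block changed: start a new piece
    return pieces

print(pretokenize("hello안녕world"))  # ['hello', '안녕', 'world']
```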
@magikarp_tokens
Sander Land
2 months
4/ 🚧 A simple character-boundary merge check prevents this 🚧
💯 Zero weird tokens, even in tricky scripts
🚀 Faster training with fewer pairs to track
🎯 No loss in quality; compression is even slightly better!
@magikarp_tokens
Sander Land
2 months
3/ We also show that a simple character-boundary check improves both regular byte-based BPE and SCRIPT-BPE 🤯
BPE does not respect character boundaries by default, resulting in tokens that risk broken outputs. One bad cross-character merge can create a domino effect.
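One way such a check could look for byte-level BPE (a sketch of the idea, not the paper's exact rule): allow a merge only if its seam lands on a character boundary, or if the merged token still sits inside a single character. In UTF-8, continuation bytes are exactly those of the form 0b10xxxxxx, which makes the test cheap:

```python
def is_cont(b: int) -> bool:
    # UTF-8 continuation bytes have the form 0b10xxxxxx.
    return (b & 0xC0) == 0x80

def merge_allowed(left: bytes, right: bytes) -> bool:
    """Sketch of a character-boundary merge check for byte-level BPE."""
    if not is_cont(right[0]):
        return True  # the seam falls on a character boundary
    merged = left + right
    # Otherwise, only allow it if the merged token stays inside one
    # (possibly still incomplete) character: a lead byte followed
    # exclusively by continuation bytes.
    return not is_cont(merged[0]) and all(is_cont(b) for b in merged[1:])

print(merge_allowed(b"\xe2", b"\xae"))   # True: builds up one character
print(merge_allowed(b" \xe2", b"\xae"))  # False: seam splits a character
print(merge_allowed(b"he", b"llo"))      # True: plain ASCII boundary
```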
@magikarp_tokens
Sander Land
2 months
2/ Introducing SCRIPT: an encoding that treats all characters equally, designed for modern multilingual LLMs. Instead of bytes, each character becomes:
📦 a block token (e.g. Korean letters)
🔢 an index token
SCRIPT-BPE then optimizes for the languages you care about, without a biased starting point.
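In spirit (a toy sketch; the block names, ranges, and the `encode_char` helper are illustrative, not the paper's vocabulary), every character costs exactly two tokens, regardless of script:

```python
# Toy version of the SCRIPT idea: each character becomes a fixed
# (block, index) pair instead of a variable 1-4 byte sequence.
BLOCKS = [
    ("latin", 0x0000, 0x024F),
    ("hangul", 0xAC00, 0xD7A3),
]

def encode_char(ch: str) -> tuple[str, int]:
    cp = ord(ch)
    for name, lo, hi in BLOCKS:
        if lo <= cp <= hi:
            return (name, cp - lo)  # block token + index within block
    return ("other", cp)

print(encode_char("a"))   # ('latin', 97)
print(encode_char("한"))  # ('hangul', 10588)
```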