
Alisa Liu @ COLM 🦙
@alisawuffles
Followers 3K · Following 3K · Media 26 · Statuses 383
final-year PhD student at @uwcse @uwnlp | on the job market!
Joined November 2019
We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretrained at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
94 · 328 · 3K
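The core idea is easy to picture with a toy sketch (my own illustration, not the SuperBPE training algorithm): once the vocabulary is allowed to contain entries that cross whitespace, the same text encodes to far fewer tokens. The greedy encoder and tiny vocabularies below are hypothetical.

```python
# Toy illustration (not the actual SuperBPE algorithm): a greedy longest-match
# encoder over a hypothetical vocabulary, run once without and once with
# "superword" entries that span multiple words.

def greedy_encode(text: str, vocab: set) -> list:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        match = next(
            (text[i:j] for j in range(len(text), i, -1) if text[i:j] in vocab),
            text[i],  # fall back to a single character
        )
        tokens.append(match)
        i += len(match)
    return tokens

subword_vocab = {"By", " the", " way", "by", "the", "way"}
superword_vocab = subword_vocab | {"By the way"}  # multi-word ("superword") token

print(greedy_encode("By the way", subword_vocab))    # ['By', ' the', ' way'] -> 3 tokens
print(greedy_encode("By the way", superword_vocab))  # ['By the way']         -> 1 token
```

Fewer tokens per string is where the inference-time efficiency comes from: the model processes and generates shorter sequences for the same text.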
Update! I confirm that a 10x larger model trained with SuperBPE also achieves the same train loss, while val loss is now even slightly lower. So I don't see any reason not to use SuperBPE by default now (apart from some small nuances during MQA evaluation).
Today I'm publishing my first blog post: Tokenization from first principles. I built a Byte-level BPE tokenizer with Rust pre-tokenization and achieved encoding speed on par with huggingface tokenizers. I show ideas and algorithms including nuances of implementation, such as
2 · 3 · 24
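As a rough companion to the post (a minimal sketch of my own, not the author's Rust-backed implementation), byte-level BPE encoding boils down to starting from raw UTF-8 bytes and repeatedly applying the highest-priority learned merge. The merge table below is a tiny hypothetical example.

```python
# Minimal byte-level BPE encoding sketch. Illustrative only: the blog post's
# implementation uses Rust pre-tokenization and is far more optimized.

def bpe_encode(text: str, merge_ranks: dict) -> list:
    """Start from raw UTF-8 bytes and repeatedly apply the lowest-rank merge."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    while len(tokens) > 1:
        # Rank every adjacent pair that appears in the merge table.
        ranked = [(merge_ranks[(a, b)], i)
                  for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
                  if (a, b) in merge_ranks]
        if not ranked:
            break
        _, i = min(ranked)                 # best (lowest-rank) merge wins
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

# Tiny hypothetical merge table: lower rank = higher priority.
ranks = {(b"l", b"o"): 0, (b"h", b"e"): 1, (b"he", b"l"): 2}
print(bpe_encode("hello", ranks))  # [b'hel', b'lo']
```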
Our team at Meta FAIR is hiring a PhD research intern for 2026. The topics broadly involve multimodal generative AI (e.g., video/image generation in addition to text), with flexible approaches across architecture/data/algorithms. Please apply via the link below, and feel free to
3 · 43 · 255
olmo 2 poster at 11am. 100% merch sale, everything must go (don’t make me travel back to seattle with swag)
3 · 4 · 61
The run with the SuperBPE tokenizer achieves the same val loss as BPE.
Code: https://t.co/VAOoudSkvW
Blog post: https://t.co/wxmAjq8Nxa
I had a lot of fun covering all the details and running experiments. Stay tuned for more!
1 · 3 · 24
Super happy to be at COLM!!🦙 It's been so fun to see familiar faces & make new friends. @JonathanHayase and I will be presenting SuperBPE TODAY (Wednesday) in the 🕓11AM poster session, come say hi! 📎 https://t.co/qU2gZUvZal
1 · 4 · 94
Today I'm publishing my first blog post: Tokenization from first principles. I built a Byte-level BPE tokenizer with Rust pre-tokenization and achieved encoding speed on par with huggingface tokenizers. I show ideas and algorithms including nuances of implementation, such as
9 · 28 · 334
@s_zhengbr went all the way, quantifying the effect of using different tokenizations on benchmarks, identifying tasks (such as char counting) where modifying the input tokenization *improves* performance, and shedding light on the source of these gains. Read more in the 📄!
arxiv.org
Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer...
0 · 1 · 2
It began from a 🤯🤯 observation: when giving LMs text tokenized at the *character* level, their generations seemed virtually unaffected, even though these token sequences are provably never seen in training! Suggests: functional char-level understanding, and tokenization as test-time control.
Can an LM that has only ever seen the word “cat” tokenized as ␣cat understand the token sequence [␣, c, a, t]? In our NeurIPS spotlight ⭐, we show that the answer is surprisingly YES, and in fact, you can even modify the tokenization at inference time for performance gains!🧵
6 · 8 · 74
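For concreteness, here is one way to build the kind of non-canonical, character-level tokenization these two papers study (a sketch using the Hugging Face transformers API; "gpt2" is just a convenient stand-in tokenizer, not the models from the papers):

```python
# Sketch: produce a non-canonical (character-level) tokenization of a string.
# "gpt2" is only a stand-in; exact token strings and ids depend on the tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = " cat"

canonical = tok.encode(text)              # the usual, single-token encoding
char_level = [tid for ch in text
              for tid in tok.encode(ch)]  # one (or more) tokens per character

print(tok.convert_ids_to_tokens(canonical))   # ['Ġcat']
print(tok.convert_ids_to_tokens(char_level))  # ['Ġ', 'c', 'a', 't']
```

Decoding either sequence yields the same string; only the segmentation the model sees differs.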
🚨What if solving a problem correctly isn't enough, because the WAY you reason about it for your audience matters just as much⁉️ We introduce ✨personalized reasoning✨: proactively asking for user preferences and adapting HOW models think. Frontier models are not doing well at this!🧵
2 · 44 · 206
@alisawuffles @moondream_ai thank you for the great work on SuperBPE! we upsampled our finetuning corpus so the savings are often higher on the downstream tasks customers care about. also reduces training cost to reach the same model performance, which is important for a smaller company like us :)
0 · 2 · 4
Super excited to see @moondream_ai's newest model use SuperBPE!! We did a little bit of analysis — using SuperBPE reduced their seqlen by 21% on average and made the token frequency distribution more uniform, meaning fewer hyper-frequent & hyper-rare tokens!
Excited to release a preview of Moondream 3. A 9B param, 2B active MoE vision language model that makes no compromises, offering state-of-the-art visual reasoning while still retaining an efficient and deployment-friendly form factor.
8 · 18 · 188
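A rough sketch of the kind of analysis mentioned above (my own illustration; the encode callables are placeholders for whichever two tokenizers you load): average sequence-length reduction, plus normalized entropy of the token frequency distribution as a crude uniformity measure.

```python
# Sketch of the analysis above: sequence-length reduction between two tokenizers,
# and token-frequency uniformity measured as normalized entropy (closer to 1 means
# fewer hyper-frequent and hyper-rare tokens). The encode callables are placeholders.
import math
from collections import Counter
from typing import Callable, Iterable, List

def seqlen_reduction(texts: Iterable[str],
                     encode_a: Callable[[str], List[int]],
                     encode_b: Callable[[str], List[int]]) -> float:
    """Fraction by which encode_b shortens the corpus relative to encode_a."""
    texts = list(texts)
    len_a = sum(len(encode_a(t)) for t in texts)
    len_b = sum(len(encode_b(t)) for t in texts)
    return 1.0 - len_b / len_a

def normalized_entropy(texts: Iterable[str],
                       encode: Callable[[str], List[int]]) -> float:
    """Entropy of the empirical token distribution, normalized to [0, 1]."""
    counts = Counter(tok for t in texts for tok in encode(t))
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts)) if len(counts) > 1 else 0.0
```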
Can an LM that has only ever seen the word “cat” tokenized as ␣cat understand the token sequence [␣, c, a, t]? In our NeurIPS spotlight ⭐, we show that the answer is surprisingly YES, and in fact, you can even modify the tokenization at inference time for performance gains!🧵
5 · 15 · 82
Catherine really eloquently demystifies the tensions between tokenizer-based and "tokenizer-free" language modeling, and how public disdain for tokenization is stunting progress we could make together. Highly recommend this read!!
0 · 1 · 9
Every LM needs a way of encoding data, and any choice of encoding is a design choice. When using bytes, you borrow choices from the makers of UTF-8, and there’s generally no reason to believe that the most common encoding on the internet is also the best one for language modeling.
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
2 · 8 · 91
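The UTF-8 point is easy to see directly (a quick illustration of my own, not from the post): the byte "vocabulary" you inherit spends very different numbers of tokens per character depending on the script.

```python
# "Byte-level" input still inherits UTF-8's design choices: different scripts
# cost very different numbers of byte tokens per character.
for text in ["cat", "café", "токен", "猫", "🐱"]:
    n_bytes = len(text.encode("utf-8"))
    print(f"{text!r}: {len(text)} chars -> {n_bytes} UTF-8 byte tokens")

# Output:
#   'cat': 3 chars -> 3 UTF-8 byte tokens
#   'café': 4 chars -> 5 UTF-8 byte tokens
#   'токен': 5 chars -> 10 UTF-8 byte tokens
#   '猫': 1 chars -> 3 UTF-8 byte tokens
#   '🐱': 1 chars -> 4 UTF-8 byte tokens
```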
Great blog post walking through tokenization vs "tokenizer-free" approaches, arguing that there isn't really such a thing as "tokenizer-free": even using UTF-8 bytes inherits choices made by other people (the Unicode Consortium), and it's not clear these are sensible for LLMs.
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
18 · 61 · 675
Yes. I think overly heuristic or restrictive tokenization is a problem, but "tokenization" as such is not your enemy. It's pretty much the central design element of the Transformer in the first place.
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
1 · 1 · 12
SuperBPE (https://t.co/rqLXu0bVG6) adopted by Moondream 3, the latest open-source VLM from @moondreamai! Exciting to see the amazing collaboration by @alisawuffles and @jonathanhayase continue to make an impact.
These are some highlights, but there's lots more to talk about. We extended the context length from 2K to 32K tokens. We're using a SuperBPE tokenizer, so our tokens are better than your tokens. We've done some things to make the weights more adaptable when you finetune. Etc. etc.
10 · 7 · 90
These are some highlights, but there's lots more to talk about. We extended the context length from 2K to 32K tokens. We're using a SuperBPE tokenizer, so our tokens are better than your tokens. We've done some things to make the weights more adaptable when you finetune. Etc. etc.
3 · 3 · 125