
Alisa Liu @ COLM 🦙
@alisawuffles
Followers 3K · Following 3K · Media 26 · Statuses 383
final-year PhD student at @uwcse @uwnlp | on the job market!
Joined November 2019
We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretrained at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
94 · 328 · 3K
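The core idea is easy to picture with a toy sketch (my own illustration, not the SuperBPE training algorithm): once the vocabulary is allowed to contain entries that cross whitespace, the same text encodes to far fewer tokens. The greedy encoder and tiny vocabularies below are hypothetical.

```python
# Toy illustration (not the actual SuperBPE algorithm): a greedy longest-match
# encoder over a hypothetical vocabulary, run once without and once with
# "superword" entries that span multiple words.

def greedy_encode(text: str, vocab: set) -> list:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        match = next(
            (text[i:j] for j in range(len(text), i, -1) if text[i:j] in vocab),
            text[i],  # fall back to a single character
        )
        tokens.append(match)
        i += len(match)
    return tokens

subword_vocab = {"By", " the", " way", "by", "the", "way"}
superword_vocab = subword_vocab | {"By the way"}  # multi-word ("superword") token

print(greedy_encode("By the way", subword_vocab))    # ['By', ' the', ' way'] -> 3 tokens
print(greedy_encode("By the way", superword_vocab))  # ['By the way']         -> 1 token
```

Fewer tokens per string is where the inference-time efficiency comes from: the model processes and generates shorter sequences for the same text.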
Update! I confirm that a 10x larger model trained with SuperBPE also achieves the same train loss, while val loss is now even slightly lower. So I don't see any reason not to use SuperBPE by default now (apart from some small nuances during MQA evaluation).
Today I'm publishing my first blog post: Tokenization from first principles. I built a Byte-level BPE tokenizer with Rust pre-tokenization and achieved encoding speed on par with huggingface tokenizers. I show ideas and algorithms including nuances of implementation, such as
2 · 3 · 24
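As a rough companion to the post (a minimal sketch of my own, not the author's Rust-backed implementation), byte-level BPE encoding boils down to starting from raw UTF-8 bytes and repeatedly applying the highest-priority learned merge. The merge table below is a tiny hypothetical example.

```python
# Minimal byte-level BPE encoding sketch. Illustrative only: the blog post's
# implementation uses Rust pre-tokenization and is far more optimized.

def bpe_encode(text: str, merge_ranks: dict) -> list:
    """Start from raw UTF-8 bytes and repeatedly apply the lowest-rank merge."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    while len(tokens) > 1:
        # Rank every adjacent pair that appears in the merge table.
        ranked = [(merge_ranks[(a, b)], i)
                  for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
                  if (a, b) in merge_ranks]
        if not ranked:
            break
        _, i = min(ranked)                 # best (lowest-rank) merge wins
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

# Tiny hypothetical merge table: lower rank = higher priority.
ranks = {(b"l", b"o"): 0, (b"h", b"e"): 1, (b"he", b"l"): 2}
print(bpe_encode("hello", ranks))  # [b'hel', b'lo']
```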
Our team at Meta FAIR is hiring a PhD research intern for 2026. The topics broadly involve multimodal generative AI (e.g., video/image generation in addition to text), with flexible approaches across architecture/data/algorithms. Please apply via the link below, and feel free to
3 · 43 · 255
olmo 2 poster at 11am. 100% merch sale, everything must go (don’t make me travel back to seattle with swag)
3 · 4 · 61
The run with the SuperBPE tokenizer achieves the same val loss as BPE.
Code: https://t.co/VAOoudSkvW
Blog post: https://t.co/wxmAjq8Nxa
I had a lot of fun covering all the details and running experiments. Stay tuned for more!
1 · 3 · 24
Super happy to be at COLM!!🦙 It's been so fun to see familiar faces & make new friends. @JonathanHayase and I will be presenting SuperBPE TODAY (Wednesday) in the 🕓11AM poster session, come say hi! 📎 https://t.co/qU2gZUvZal
1 · 4 · 94
Today I'm publishing my first blog post: Tokenization from first principles. I built a Byte-level BPE tokenizer with Rust pre-tokenization and achieved encoding speed on par with huggingface tokenizers. I show ideas and algorithms including nuances of implementation, such as
9 · 28 · 334
@s_zhengbr went all the way, quantifying the effect of using different tokenizations on benchmarks, identifying tasks (such as char counting) where modifying the input tokenization *improves* performance, and shedding light on the source of these gains. Read more in the 📄!
arxiv.org
Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer...
0 · 1 · 2
It began from a 🤯🤯 observation: when giving LMs text tokenized at the *character* level, their generations seemed virtually unaffected, even though these token sequences are provably never seen in training! Suggests: functional char-level understanding, and tokenization as test-time control.
Can an LM that has only ever seen the word “cat” tokenized as ␣cat understand the token sequence [␣, c, a, t]? In our NeurIPS spotlight ⭐, we show that the answer is surprisingly YES, and in fact, you can even modify the tokenization at inference time for performance gains!🧵
6 · 8 · 74
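For concreteness, here is one way to build the kind of non-canonical, character-level tokenization these two papers study (a sketch using the Hugging Face transformers API; "gpt2" is just a convenient stand-in tokenizer, not the models from the papers):

```python
# Sketch: produce a non-canonical (character-level) tokenization of a string.
# "gpt2" is only a stand-in; exact token strings and ids depend on the tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = " cat"

canonical = tok.encode(text)              # the usual, single-token encoding
char_level = [tid for ch in text
              for tid in tok.encode(ch)]  # one (or more) tokens per character

print(tok.convert_ids_to_tokens(canonical))   # ['Ġcat']
print(tok.convert_ids_to_tokens(char_level))  # ['Ġ', 'c', 'a', 't']
```

Decoding either sequence yields the same string; only the segmentation the model sees differs.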
🚨What if solving a problem correctly isn't enough, because the WAY you reason about it for your audience matters just as much⁉️ We introduce ✨personalized reasoning✨: proactively asking for user preferences and adapting HOW models think. Frontier models are not doing well at this!🧵
2 · 44 · 206
@alisawuffles @moondream_ai thank you for the great work on SuperBPE! we upsampled our finetuning corpus so the savings are often higher on the downstream tasks customers care about. also reduces training cost to reach the same model performance, which is important for a smaller company like us :)
0 · 2 · 4
Super excited to see @moondream_ai's newest model use SuperBPE!! We did a little bit of analysis — using SuperBPE reduced their seqlen by 21% on average and made the token frequency distribution more uniform, meaning fewer hyper-frequent & hyper-rare tokens!
Excited to release a preview of Moondream 3. A 9B param, 2B active MoE vision language model that makes no compromises, offering state-of-the-art visual reasoning while still retaining an efficient and deployment-friendly form factor.
8 · 18 · 188
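A rough sketch of the kind of analysis mentioned above (my own illustration; the encode callables are placeholders for whichever two tokenizers you load): average sequence-length reduction, plus normalized entropy of the token frequency distribution as a crude uniformity measure.

```python
# Sketch of the analysis above: sequence-length reduction between two tokenizers,
# and token-frequency uniformity measured as normalized entropy (closer to 1 means
# fewer hyper-frequent and hyper-rare tokens). The encode callables are placeholders.
import math
from collections import Counter
from typing import Callable, Iterable, List

def seqlen_reduction(texts: Iterable[str],
                     encode_a: Callable[[str], List[int]],
                     encode_b: Callable[[str], List[int]]) -> float:
    """Fraction by which encode_b shortens the corpus relative to encode_a."""
    texts = list(texts)
    len_a = sum(len(encode_a(t)) for t in texts)
    len_b = sum(len(encode_b(t)) for t in texts)
    return 1.0 - len_b / len_a

def normalized_entropy(texts: Iterable[str],
                       encode: Callable[[str], List[int]]) -> float:
    """Entropy of the empirical token distribution, normalized to [0, 1]."""
    counts = Counter(tok for t in texts for tok in encode(t))
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts)) if len(counts) > 1 else 0.0
```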
Can an LM that has only ever seen the word “cat” tokenized as ␣cat understand the token sequence [␣, c, a, t]? In our NeurIPS spotlight ⭐, we show that the answer is surprisingly YES, and in fact, you can even modify the tokenization at inference time for performance gains!🧵
5 · 15 · 82
Catherine really eloquently demystifies the tensions between tokenizer-based and "tokenizer-free" language modeling, and how public disdain for tokenization is stunting progress we could make together. Highly recommend this read!!
0 · 1 · 9
Every LM needs a way of encoding data, and any choice of encoding is a design choice. When using bytes, you borrow choices from the makers of UTF-8, and there’s generally no reason to believe that the most common encoding on the internet is also the best one for language modeling.
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
2 · 8 · 91
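The UTF-8 point is easy to see directly (a quick illustration of my own, not from the post): the byte "vocabulary" you inherit spends very different numbers of tokens per character depending on the script.

```python
# "Byte-level" input still inherits UTF-8's design choices: different scripts
# cost very different numbers of byte tokens per character.
for text in ["cat", "café", "токен", "猫", "🐱"]:
    n_bytes = len(text.encode("utf-8"))
    print(f"{text!r}: {len(text)} chars -> {n_bytes} UTF-8 byte tokens")

# Output:
#   'cat': 3 chars -> 3 UTF-8 byte tokens
#   'café': 4 chars -> 5 UTF-8 byte tokens
#   'токен': 5 chars -> 10 UTF-8 byte tokens
#   '猫': 1 chars -> 3 UTF-8 byte tokens
#   '🐱': 1 chars -> 4 UTF-8 byte tokens
```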
Great blog post walking through tokenization vs "tokenizer-free" approaches, arguing that there isn't really such a thing as "tokenizer-free": even using UTF-8 bytes inherits choices made by other people (the Unicode Consortium), and it's not clear these are sensible for LLMs.
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
18 · 61 · 675
Yes. I think overly heuristic or restrictive tokenization is a problem, but "tokenization" as such is not your enemy. It's pretty much the central design element of the Transformer in the first place.
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
1 · 1 · 12
SuperBPE (https://t.co/rqLXu0bVG6) adopted by Moondream 3, the latest open-source VLM from @moondreamai! Exciting to see the amazing collaboration by @alisawuffles and @jonathanhayase continue to make an impact.
These are some highlights, but there's lots more to talk about. We extended the context length from 2K to 32K tokens. We're using a SuperBPE tokenizer, so our tokens are better than your tokens. We've done some things to make the weights more adaptable when you finetune. Etc. etc.
10 · 7 · 90
These are some highlights, but there's lots more to talk about. We extended the context length from 2K to 32K tokens. We're using a SuperBPE tokenizer, so our tokens are better than your tokens. We've done some things to make the weights more adaptable when you finetune. Etc. etc.
3 · 3 · 125