Brian Zheng
@s_zhengbr
Undergraduate at UW CSE. Interested in Natural Language/Music Processing. https://t.co/axy3O3tvfk
Seattle
Joined October 2024
Can a LM that has only ever seen the word “cat” tokenized as ␣cat, understand the token sequence [␣, c, a, t]? In our NeurIPS spotlight ⭐, we show that the answer is surprisingly YES, and in fact, you can even modify the tokenization at inference-time for performance gains!🧵
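For concreteness, here is a minimal sketch (not the paper's code) of what "modifying the tokenization at inference time" can look like with Hugging Face transformers. The model name is a placeholder, and the "Ġ" spelling of the leading space assumes a GPT-2-style byte-level vocabulary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B"  # placeholder; any byte-level-BPE causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = " cat"
# Canonical encoding: the single token sequence the model saw in training.
canonical = tok(text, add_special_tokens=False)["input_ids"]
# Character-level encoding of the same string; "Ġ" is the byte-level
# spelling of the leading space, and single characters are always in the vocab.
alternative = tok.convert_tokens_to_ids(["Ġ", "c", "a", "t"])

with torch.no_grad():
    for ids in (canonical, alternative):
        logits = model(torch.tensor([ids])).logits
        # logits[0, -1] is the next-token distribution; the question is
        # whether it stays sensible under the non-canonical encoding.
        print(tok.decode(logits[0, -1].argmax().item()))
```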
Paper: https://t.co/zoblxa5SSa Code: https://t.co/otYHURqBf8 This work would not have been possible without these amazing collaborators: @alisawuffles @orevaahia @JonathanHayase @YejinChoinka @nlpnoah
Finally, we study the source of this robustness and the factors that contribute to it. See our paper for more.
We find that alternative tokenizations lead to better performance on several benchmarks that intuitively require character-level understanding, such as constructing acronyms or counting characters. This suggests that better tokenizations can be found entirely at inference time.
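As a hedged illustration of how a character-level input for such a benchmark might be built (the prompt wording, model name, and the choice to re-tokenize only the query word are ours, not the paper's exact setup):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")  # placeholder model

def char_level_ids(text):
    """Encode `text` one character at a time instead of canonically."""
    ids = []
    for ch in text:
        ids.extend(tok(ch, add_special_tokens=False)["input_ids"])
    return ids

# Only the word being inspected is re-tokenized at character level;
# the rest of the prompt keeps its canonical encoding.
prefix = tok('How many "r"s are in "', add_special_tokens=False)["input_ids"]
suffix = tok('"? Answer:', add_special_tokens=False)["input_ids"]
input_ids = prefix + char_level_ids("strawberry") + suffix
```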
We find that the finer the tokenization (and the farther it is from the canonical one), the less performance is retained.
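One way to picture "finer and farther from canonical" is a knob that splits some fraction of the canonical tokens into valid sub-tokens; a sketch under that assumption (not the paper's procedure), where repeated splitting would drive the sequence toward character level:

```python
import random
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")  # placeholder model

def split_once(token_str, vocab):
    """Return one valid two-way split of a token string, or None."""
    for i in range(1, len(token_str)):
        if token_str[:i] in vocab and token_str[i:] in vocab:
            return [token_str[:i], token_str[i:]]
    return None

def segment(text, p, seed=0):
    """p = 0 reproduces the canonical tokenization; larger p splits more
    tokens, moving the sequence finer and farther from canonical."""
    rng = random.Random(seed)
    vocab = tok.get_vocab()
    pieces = []
    for t in tok.tokenize(text):
        s = split_once(t, vocab) if rng.random() < p else None
        pieces.extend(s if s is not None else [t])
    return tok.convert_tokens_to_ids(pieces)
```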
Yet we find that LMs remain robust to random or character-level segmentations of input text when evaluated on diverse benchmarks: with random segmentation, Qwen retains on average 93.4% of its original performance.
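A simple (and deliberately naive) way to sample one such random segmentation, assuming a byte-level vocabulary so that a matching piece always exists; the paper's sampler may differ:

```python
import random
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")  # placeholder model

def random_segmentation(text, max_len=8, seed=0):
    """Walk left to right, choosing a random vocabulary token that matches
    a prefix of the remaining text. In a byte-level vocabulary every
    single character is a token, so a match always exists."""
    rng = random.Random(seed)
    vocab = tok.get_vocab()
    surface = "".join(tok.tokenize(text))  # byte-level spelling, e.g. " cat" -> "Ġcat"
    pieces, i = [], 0
    while i < len(surface):
        options = [surface[i:j]
                   for j in range(i + 1, min(i + max_len, len(surface)) + 1)
                   if surface[i:j] in vocab]
        pick = rng.choice(options)
        pieces.append(pick)
        i += len(pick)
    return tok.convert_tokens_to_ids(pieces)
```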
Modern tokenizers map text to tokens deterministically, so LMs see only a single tokenization of any string during training: there is exactly one canonical token sequence representing any particular string.
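This is easy to see with any byte-level BPE tokenizer; a small demonstration (the model choice is ours):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any byte-level BPE tokenizer

text = " cat"
print(tok.tokenize(text))  # ['Ġcat'] every time: encoding is deterministic

canonical = tok(text, add_special_tokens=False)["input_ids"]
alternative = tok.convert_tokens_to_ids(["Ġ", "c", "a", "t"])

# Both sequences decode to the identical string, but training only ever
# shows the model `canonical`.
assert tok.decode(canonical) == tok.decode(alternative) == " cat"
```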