Brian Zheng Profile
Brian Zheng

@s_zhengbr

Followers 49 · Following 7 · Media 7 · Statuses 11

Undergraduate at UW CSE. Interested in Natural Language/Music Processing. https://t.co/axy3O3tvfk

Seattle
Joined October 2024
@s_zhengbr
Brian Zheng
2 months
Can a LM that has only ever seen the word “cat” tokenized as ␣cat, understand the token sequence [␣, c, a, t]? In our NeurIPS spotlight ⭐, we show that the answer is surprisingly YES, and in fact, you can even modify the tokenization at inference-time for performance gains!🧵
5
16
82
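For concreteness, here is a minimal sketch of the two tokenizations in question, assuming GPT-2's byte-level BPE via Hugging Face transformers (not necessarily the models used in the paper):

```python
# A minimal sketch, assuming GPT-2's byte-level BPE (not necessarily the
# models used in the paper). 'Ġ' is the byte-level symbol for a leading space.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Canonical tokenization: " cat" is a single token.
canonical_ids = tok.encode(" cat", add_special_tokens=False)
print(tok.convert_ids_to_tokens(canonical_ids))        # ['Ġcat']

# Non-canonical, character-level tokenization of the same string.
char_ids = tok.convert_tokens_to_ids(["Ġ", "c", "a", "t"])
print(tok.convert_ids_to_tokens(char_ids))             # ['Ġ', 'c', 'a', 't']

# Both sequences decode to exactly the same text, but the model only ever
# saw the canonical one during training.
assert tok.decode(canonical_ids) == tok.decode(char_ids) == " cat"
```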
@s_zhengbr
Brian Zheng
2 months
Finally, we study the source of robustness and investigate the contributing factors to this phenomenon. See our paper for more.
1
0
4
@s_zhengbr
Brian Zheng
2 months
We find that alternative tokenizations lead to better performance on several benchmarks that intuitively require character-level understanding, such as constructing acronyms or counting characters. This suggests that better tokenizations can be found entirely at inference time.
2
0
9
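As a rough illustration of inference-time re-tokenization, here is a hedged sketch assuming GPT-2's byte-level BPE and a hypothetical letter-counting prompt (the paper's exact prompting setup may differ): only the word the question asks about is segmented into characters, while the rest of the prompt is encoded canonically.

```python
# A minimal sketch, assuming GPT-2's byte-level BPE and a hypothetical
# letter-counting prompt; the paper's exact setup may differ.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def char_level_ids(word: str) -> list[int]:
    """One token per ASCII character; 'Ġ' is the byte-level symbol for a leading space."""
    return tok.convert_tokens_to_ids(["Ġ"] + list(word))

# Encode the prompt canonically, but split the target word into characters.
prefix = 'How many times does the letter "r" appear in the word'
ids = (tok.encode(prefix, add_special_tokens=False)
       + char_level_ids("strawberry")
       + tok.encode("?", add_special_tokens=False))

# The decoded prompt is unchanged; only the segmentation fed to the model differs.
print(tok.decode(ids))
```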
@s_zhengbr
Brian Zheng
2 months
We find that the finer the tokenization (and the farther it is from the canonical segmentation), the less performance is retained.
1
0
9
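One simple way to make "finer" concrete is the token-count ratio relative to the canonical encoding; a hedged sketch follows (the paper may quantify granularity differently):

```python
# A minimal sketch of one possible "fineness" measure: how many times more
# tokens a segmentation uses than the canonical encoding. The paper may
# measure granularity differently.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def fineness(alt_ids: list[int], text: str) -> float:
    """1.0 means canonical granularity; larger values mean finer segmentations."""
    return len(alt_ids) / len(tok.encode(text, add_special_tokens=False))

text = "tokenization"
canonical_ids = tok.encode(text, add_special_tokens=False)
char_ids = tok.convert_tokens_to_ids(list(text))   # one token per character
print(fineness(canonical_ids, text))                # 1.0
print(fineness(char_ids, text))                     # several times finer
```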
@s_zhengbr
Brian Zheng
2 months
Yet, we find that LMs remain robust to random or character-level segmentations of input text when evaluated on diverse benchmarks: they retain much of their original performance, with 93.4% average performance retention on Qwen under random segmentation.
1
0
8
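Here is a sketch of one way such a random segmentation can be built, assuming a byte-level BPE like GPT-2's that can encode arbitrary substrings (the paper's sampling procedure may differ): cut the string at random character positions and encode each chunk separately.

```python
# A minimal sketch, assuming a byte-level BPE (GPT-2's here) that can encode
# arbitrary substrings; the paper's exact sampling procedure may differ.
import random
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def random_segmentation(text: str, p_split: float = 0.3, seed: int = 0) -> list[int]:
    """Cut `text` at random character positions and encode each chunk separately.
    The concatenation is a valid but generally non-canonical token sequence."""
    rng = random.Random(seed)
    chunks, start = [], 0
    for i in range(1, len(text)):
        if rng.random() < p_split:
            chunks.append(text[start:i])
            start = i
    chunks.append(text[start:])
    ids = []
    for chunk in chunks:
        ids.extend(tok.encode(chunk, add_special_tokens=False))
    return ids

text = "The quick brown fox jumps over the lazy dog."
canonical_ids = tok.encode(text, add_special_tokens=False)
random_ids = random_segmentation(text)

# Same string, different (usually longer) token sequence.
assert tok.decode(random_ids) == tok.decode(canonical_ids) == text
print(len(canonical_ids), "canonical tokens vs", len(random_ids), "after random segmentation")
```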
@s_zhengbr
Brian Zheng
2 months
Modern tokenizers map text to tokens deterministically, so LMs only ever see a single tokenization of any string during training. This means there is exactly one canonical token sequence representing any particular string.
2
0
7
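A sketch of this point, assuming GPT-2's byte-level BPE: encode() always returns the same canonical sequence, even though many other token sequences over the same vocabulary decode to the identical string.

```python
# A minimal sketch, assuming GPT-2's byte-level BPE. Tokenization is
# deterministic and returns one canonical sequence, yet many other
# vocab-valid sequences decode to the same string.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
vocab = tok.get_vocab()   # token string -> id

def segmentations(token_text: str) -> list[list[str]]:
    """All ways to split `token_text` into pieces that each exist in the vocab."""
    if not token_text:
        return [[]]
    result = []
    for i in range(1, len(token_text) + 1):
        piece = token_text[:i]
        if piece in vocab:
            for rest in segmentations(token_text[i:]):
                result.append([piece] + rest)
    return result

# The canonical tokenization of " cat" is the single token 'Ġcat' ...
print(tok.tokenize(" cat"))       # ['Ġcat'], every time
# ... but the same string has other valid segmentations over the vocabulary,
# e.g. ['Ġ', 'cat'], ['Ġc', 'at'], ['Ġ', 'c', 'a', 't'], ...
print(segmentations("Ġcat"))
```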