Brian Zheng
@s_zhengbr
Undergraduate at UW CSE. Interested in Natural Language/Music Processing. https://t.co/axy3O3tvfk
Seattle
Joined October 2024
Can a LM that has only ever seen the word “cat” tokenized as ␣cat, understand the token sequence [␣, c, a, t]? In our NeurIPS spotlight ⭐, we show that the answer is surprisingly YES, and in fact, you can even modify the tokenization at inference-time for performance gains!🧵
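For concreteness, here is a minimal sketch (not the paper's code) of what "modifying the tokenization at inference time" can look like with Hugging Face transformers. The model name is a placeholder, and the "Ġ" spelling of the leading space assumes a GPT-2-style byte-level vocabulary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B"  # placeholder; any byte-level-BPE causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = " cat"
# Canonical encoding: the single token sequence the model saw in training.
canonical = tok(text, add_special_tokens=False)["input_ids"]
# Character-level encoding of the same string; "Ġ" is the byte-level
# spelling of the leading space, and single characters are always in the vocab.
alternative = tok.convert_tokens_to_ids(["Ġ", "c", "a", "t"])

with torch.no_grad():
    for ids in (canonical, alternative):
        logits = model(torch.tensor([ids])).logits
        # logits[0, -1] is the next-token distribution; the question is
        # whether it stays sensible under the non-canonical encoding.
        print(tok.decode(logits[0, -1].argmax().item()))
```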
Paper: https://t.co/zoblxa5SSa Code: https://t.co/otYHURqBf8 This work would not have been possible without these amazing collaborators: @alisawuffles @orevaahia @JonathanHayase @YejinChoinka @nlpnoah
Finally, we study the source of this robustness and the factors that contribute to it. See our paper for more.
We find that alternative tokenizations lead to better performance on several benchmarks that intuitively require character-level understanding, such as constructing acronyms or counting characters. This suggests that better tokenizations can be found entirely at inference time.
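As a hedged illustration of how a character-level input for such a benchmark might be built (the prompt wording, model name, and the choice to re-tokenize only the query word are ours, not the paper's exact setup):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")  # placeholder model

def char_level_ids(text):
    """Encode `text` one character at a time instead of canonically."""
    ids = []
    for ch in text:
        ids.extend(tok(ch, add_special_tokens=False)["input_ids"])
    return ids

# Only the word being inspected is re-tokenized at character level;
# the rest of the prompt keeps its canonical encoding.
prefix = tok('How many "r"s are in "', add_special_tokens=False)["input_ids"]
suffix = tok('"? Answer:', add_special_tokens=False)["input_ids"]
input_ids = prefix + char_level_ids("strawberry") + suffix
```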
We find that the finer the tokenization (and the farther it is from the canonical one), the less performance is retained.
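One way to picture "finer and farther from canonical" is a knob that splits some fraction of the canonical tokens into valid sub-tokens; a sketch under that assumption (not the paper's procedure), where repeated splitting would drive the sequence toward character level:

```python
import random
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")  # placeholder model

def split_once(token_str, vocab):
    """Return one valid two-way split of a token string, or None."""
    for i in range(1, len(token_str)):
        if token_str[:i] in vocab and token_str[i:] in vocab:
            return [token_str[:i], token_str[i:]]
    return None

def segment(text, p, seed=0):
    """p = 0 reproduces the canonical tokenization; larger p splits more
    tokens, moving the sequence finer and farther from canonical."""
    rng = random.Random(seed)
    vocab = tok.get_vocab()
    pieces = []
    for t in tok.tokenize(text):
        s = split_once(t, vocab) if rng.random() < p else None
        pieces.extend(s if s is not None else [t])
    return tok.convert_tokens_to_ids(pieces)
```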
Yet we find that LMs remain robust to random or character-level segmentations of input text when evaluated on diverse benchmarks: with random segmentation, Qwen retains on average 93.4% of its original performance.
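A simple (and deliberately naive) way to sample one such random segmentation, assuming a byte-level vocabulary so that a matching piece always exists; the paper's sampler may differ:

```python
import random
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")  # placeholder model

def random_segmentation(text, max_len=8, seed=0):
    """Walk left to right, choosing a random vocabulary token that matches
    a prefix of the remaining text. In a byte-level vocabulary every
    single character is a token, so a match always exists."""
    rng = random.Random(seed)
    vocab = tok.get_vocab()
    surface = "".join(tok.tokenize(text))  # byte-level spelling, e.g. " cat" -> "Ġcat"
    pieces, i = [], 0
    while i < len(surface):
        options = [surface[i:j]
                   for j in range(i + 1, min(i + max_len, len(surface)) + 1)
                   if surface[i:j] in vocab]
        pick = rng.choice(options)
        pieces.append(pick)
        i += len(pick)
    return tok.convert_tokens_to_ids(pieces)
```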
Modern tokenizers map text to tokens deterministically, so LMs see only a single tokenization of any string during training: there is exactly one canonical token sequence representing any particular string.
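This is easy to see with any byte-level BPE tokenizer; a small demonstration (the model choice is ours):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any byte-level BPE tokenizer

text = " cat"
print(tok.tokenize(text))  # ['Ġcat'] every time: encoding is deterministic

canonical = tok(text, add_special_tokens=False)["input_ids"]
alternative = tok.convert_tokens_to_ids(["Ġ", "c", "a", "t"])

# Both sequences decode to the identical string, but training only ever
# shows the model `canonical`.
assert tok.decode(canonical) == tok.decode(alternative) == " cat"
```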