Renato Lui Geh Profile
Renato Lui Geh

@renatogeh

Followers: 36
Following: 17
Media: 4
Statuses: 6

PhD student at University of California, Los Angeles.

Joined September 2024
@renatogeh
Renato Lui Geh
4 months
RT @zileishao: What happens if we tokenize cat as [ca, t] rather than [cat]? LLMs are trained on just one tokenization per word, but they…
0
3
0
@renatogeh
Renato Lui Geh
10 months
Read the full paper for more details! This work is a collaboration with @HonghuaZhang2, @KareemYousrii, @benjiewang_cs and @guyvdb. 5/5.
0
0
4
@renatogeh
Renato Lui Geh
10 months
We find that there exists significant signal in non-canonical tokenizations. In fact, by computing a mixture of canonical and non-canonical tokenizations, we are able to achieve a consistent boost in accuracy for Q&A benchmarks! 4/5
1
0
4
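One way to picture such a mixture: blend the canonical tokenization's answer score with the average score of non-canonical tokenizations of the same text. The scores and the weighting below are made up for illustration and are not the paper's actual method:

```python
# Made-up answer scores per tokenization of the same text (illustrative
# only; in the paper these would come from querying an LLM).
SCORES = {
    ("cat",): 0.70,       # canonical tokenization
    ("ca", "t"): 0.55,    # non-canonical
    ("c", "at"): 0.60,    # non-canonical
}

def mixture_score(canonical, others, w=0.5):
    """Blend the canonical tokenization's score with the mean score of
    non-canonical tokenizations (hypothetical weighting scheme)."""
    noncanonical = sum(SCORES[t] for t in others) / len(others)
    return w * SCORES[canonical] + (1 - w) * noncanonical

combined = mixture_score(("cat",), [("ca", "t"), ("c", "at")])
```

Even when each non-canonical tokenization scores lower on its own, averaging them in can shift which answer a benchmark question selects, which is the intuition behind the reported accuracy boost.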
@renatogeh
Renato Lui Geh
10 months
Unfortunately, computing the marginal probability of text over all tokenizations is also computationally hard. Instead, we approximate the marginal and, in doing so, find evidence of signal in non-canonical tokenizations! 3/5
1
0
3
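The hardness here is specific to autoregressive models. For a toy unigram model the marginal over all tokenizations is exactly computable by dynamic programming, which makes the quantity being approximated easy to see. The vocabulary and probabilities below are invented for illustration, not taken from the paper:

```python
from functools import lru_cache

# Toy unigram token probabilities (made-up numbers, not from the paper).
VOCAB = {"cat": 0.20, "ca": 0.05, "at": 0.05, "c": 0.10, "a": 0.30, "t": 0.30}

def marginal(text):
    """Sum the probability of `text` over every tokenization.
    Tractable here only because the toy model is unigram; for an
    autoregressive LLM this marginal is computationally hard."""
    @lru_cache(maxsize=None)
    def from_pos(i):
        if i == len(text):
            return 1.0
        total = 0.0
        for j in range(i + 1, len(text) + 1):
            piece = text[i:j]
            if piece in VOCAB:
                total += VOCAB[piece] * from_pos(j)
        return total
    return from_pos(0)
```

For "cat" this sums over all four segmentations, so the marginal (0.229 here) exceeds the canonical tokenization's probability alone (0.20); the gap is exactly the probability mass carried by non-canonical tokenizations.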
@renatogeh
Renato Lui Geh
10 months
We show that, for autoregressive models, computing the most likely tokenization is computationally hard. Despite this, we experimentally find that the canonical tokenization is overwhelmingly more likely than other tokenizations. But this is not always true! 2/5
1
0
3
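The search space in question can be sketched with a toy unigram vocabulary: enumerate every segmentation of a string into vocabulary tokens and score each one. The vocabulary and probabilities below are made up for illustration, and a unigram product is only a stand-in for an autoregressive model's sequence probability:

```python
import math

# Toy unigram vocabulary with made-up probabilities (not from the paper).
VOCAB = {"cat": 0.20, "ca": 0.05, "at": 0.05, "c": 0.10, "a": 0.30, "t": 0.30}

def tokenizations(text):
    """Yield every way to split `text` into vocabulary tokens."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in VOCAB:
            for tail in tokenizations(text[i:]):
                yield [head] + tail

def prob(tokens):
    # Unigram stand-in for an autoregressive model's sequence probability.
    return math.prod(VOCAB[t] for t in tokens)

all_toks = list(tokenizations("cat"))  # 4 tokenizations of "cat"
best = max(all_toks, key=prob)         # the canonical [cat] wins here
```

Even for this three-character string there are four tokenizations, and the count grows exponentially with length, which is why brute-force search for the most likely tokenization does not scale.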
@renatogeh
Renato Lui Geh
10 months
Where is the signal in LLM tokenization space? Does it only come from the canonical (default) tokenization? The answer is no! By looking at other ways to tokenize the same text, we get a consistent boost to LLM performance! 1/5
1
8
34