Yeonju Ro (@j777ro)

Followers: 54 · Following: 30 · Media: 4 · Statuses: 10

UT Austin PhD Student @UTCompSci @utnslab @VITAGroupUT

Austin, Texas
Joined July 2022
Yeonju Ro @j777ro · 8 months
(7/end) For more details, come to our poster on Friday, 11 AM (local time) at Poster Session 5! Find me there! 🎉 Work done with great collaborators @ccccrs_0908, @peihao_wang, @BabakEht, and @adityaakella from @Qualcomm, @VITAGroupUT, and @utnslab.
Yeonju Ro @j777ro · 8 months
(6/n) Read-ME outperforms other models of similar scale across multiple tasks. It achieves a high expert cache hit ratio and reduced average/tail latency compared to SoTA serving platforms. What makes this work unique? It’s a co-design of the model and the serving system for batched inference!
Yeonju Ro @j777ro · 8 months
(5/n) How is this possible? We found that layer-wise routers in pre-trained MoEs are largely redundant. Routing decisions between adjacent layers are highly correlated, and the final layer’s decision is almost deterministic, given the previous layer’s decision.
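The correlation claim in (5/n) can be checked with a simple statistic: if the conditional entropy H(route at layer l+1 | route at layer l) is near zero, the next layer's routing decision is essentially determined by the previous one. The sketch below is not the paper's code; the routing traces, the 8-expert setup, and the 5% re-route rate are made-up inputs used only to illustrate the measurement.

```python
# Minimal sketch: estimate how predictable layer (l+1)'s routing is from layer l's,
# using synthetic per-token top-1 expert indices.
import numpy as np

def conditional_entropy(prev_routes: np.ndarray, next_routes: np.ndarray) -> float:
    """H(next | prev) in bits, estimated from per-token top-1 expert ids."""
    joint, prev_counts = {}, {}
    for p, n in zip(prev_routes, next_routes):
        joint[(p, n)] = joint.get((p, n), 0) + 1
        prev_counts[p] = prev_counts.get(p, 0) + 1
    total = len(prev_routes)
    h = 0.0
    for (p, n), c in joint.items():
        p_joint = c / total            # P(prev = p, next = n)
        p_cond = c / prev_counts[p]    # P(next = n | prev = p)
        h -= p_joint * np.log2(p_cond)
    return h

# Synthetic traces: layer l+1 mostly copies layer l's decision (only 5% of tokens re-route).
rng = np.random.default_rng(0)
prev = rng.integers(0, 8, size=10_000)
flip = rng.random(10_000) < 0.05
nxt = np.where(flip, rng.integers(0, 8, 10_000), prev)
print(f"H(next | prev) = {conditional_entropy(prev, nxt):.3f} bits "
      f"(vs. {np.log2(8):.3f} bits if routing were independent)")
```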
Yeonju Ro @j777ro · 8 months
(4/n) Why do we need a decoupled router? In current MoE setups, batching happens before expert selection, leading to the activation of too many experts at each layer and increased latency. Want to save memory by evicting experts? How do you predict which ones won’t be used? Is…
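To make the "too many experts per layer" point in (4/n) concrete, here is a small, purely illustrative simulation (uniform top-2 routing over 8 experts, numbers chosen arbitrarily, not the paper's measurements): even modest batches touch nearly every expert, which is why eviction buys little when routing is only known layer by layer.

```python
# Minimal sketch: with layer-wise top-k routing, a batch activates the union of
# every token's experts, so most experts must stay resident in memory.
import numpy as np

NUM_EXPERTS, TOP_K = 8, 2
rng = np.random.default_rng(0)

for batch_size in (1, 4, 16, 64):
    activated = set()
    for _ in range(batch_size):
        # Each token independently picks its top-k experts (uniform here, for illustration).
        activated.update(rng.choice(NUM_EXPERTS, size=TOP_K, replace=False))
    print(f"batch={batch_size:3d}: {len(activated)}/{NUM_EXPERTS} experts must be resident")
```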
Yeonju Ro @j777ro · 8 months
(3/n) With Read-ME, you don’t have to worry anymore! 🚀 We decouple router layers into a single, independent router separate from the backbone LLM. This enables pre-computing expert selection, making batching and caching much easier!
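A rough sketch of what the decoupling in (3/n) enables, using hypothetical names and a plain linear layer as a stand-in for the actual router: expert choices are computed for the whole batch before the backbone runs, so a server can group tokens per expert and prefetch only the experts it will actually use.

```python
# Minimal sketch (assumed shapes/names, not the paper's implementation):
# pre-compute expert routes with a router that is separate from the backbone.
import torch

NUM_EXPERTS, HIDDEN = 8, 64
router = torch.nn.Linear(HIDDEN, NUM_EXPERTS)   # stand-in for the decoupled router

def precompute_routes(token_states: torch.Tensor) -> torch.Tensor:
    """Top-1 expert id per token, computed before the backbone runs."""
    with torch.no_grad():
        return router(token_states).argmax(dim=-1)

tokens = torch.randn(32, HIDDEN)                # a batch of 32 token representations
routes = precompute_routes(tokens)

# Group token indices by expert: batching, caching, and prefetch decisions can be made
# ahead of time instead of waiting for a gate inside every layer.
groups = {e: (routes == e).nonzero(as_tuple=True)[0] for e in range(NUM_EXPERTS)}
needed = [e for e, idx in groups.items() if len(idx) > 0]
print("experts to prefetch:", needed)
for e in needed:
    expert_batch = tokens[groups[e]]            # contiguous per-expert batch
    # expert_batch would be fed to expert e's FFN here
```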
Yeonju Ro @j777ro · 8 months
(2/n) In this work, we refactor pre-trained LLMs as Router-decoupled MoEs for efficient inference. Traditional layer-wise gating disrupts batching, memory management, caching, and prefetching. Why? Because you must wait for the gating layer to decide which expert to activate.
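For contrast with the decoupled sketch above, this is roughly what the conventional layer-wise gating criticized in (2/n) looks like, in simplified form (hypothetical module, top-1 routing, arbitrary sizes): the gate lives inside each layer's forward pass, so which experts a token needs only becomes known once execution reaches that layer.

```python
# Minimal sketch of conventional layer-wise MoE gating: expert selection happens
# inside every layer, which blocks ahead-of-time batching, caching, and prefetching.
import torch

NUM_EXPERTS, HIDDEN = 8, 64

class LayerWiseMoE(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.gate = torch.nn.Linear(HIDDEN, NUM_EXPERTS)
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Expert selection happens *here*, per token, during the forward pass.
        expert_ids = self.gate(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for e in range(NUM_EXPERTS):
            mask = expert_ids == e
            if mask.any():                      # this expert must already be in memory
                out[mask] = self.experts[e](x[mask])
        return out

x = torch.randn(16, HIDDEN)
for layer in [LayerWiseMoE() for _ in range(4)]:
    x = layer(x)                                # routing at layer i is unknown until layer i runs
```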
Yeonju Ro @j777ro · 8 months
(1/n) Do you think token batching in MoE is inefficient? Are you looking for ways to transform pre-trained LLMs into MoEs? Then you should check out Read-ME at NeurIPS'24! 📖
arxiv.org
The proliferation of large language models (LLMs) has led to the adoption of Mixture-of-Experts (MoE) architectures that dynamically leverage specialized subnetworks for improved efficiency and...
Yeonju Ro @j777ro · 10 months
RT @adityaakella: 🚀PhD applicants: Want to revolutionize OS design? Join @UT to build LDOS—the next-gen learned OS—and work on cutting-edge….