Yeonju Ro (@j777ro)

Followers: 54 · Following: 30 · Media: 4 · Statuses: 10

UT Austin PhD Student @UTCompSci @utnslab @VITAGroupUT

Austin, Texas
Joined July 2022
Yeonju Ro @j777ro · 8 months
(7/end) For more details, come to our poster on Friday, 11 AM (local time) at Poster Session 5! Find me there! 🎉 Work done with great collaborators @ccccrs_0908, @peihao_wang, @BabakEht, and @adityaakella from @Qualcomm, @VITAGroupUT, and @utnslab.
Yeonju Ro @j777ro · 8 months
(6/n) Read-ME outperforms other models of similar scale across multiple tasks. It achieves a high expert cache hit ratio and reduced average/tail latency compared to SoTA serving platforms. What makes this work unique? It’s a co-design of the model and the serving system for batched inference!
Yeonju Ro @j777ro · 8 months
(5/n) How is this possible? We found that layer-wise routers in pre-trained MoEs are largely redundant. Routing decisions between adjacent layers are highly correlated, and the final layer’s decision is almost deterministic, given the previous layer’s decision.
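The correlation claim in (5/n) can be checked with a simple statistic: if the conditional entropy H(route at layer l+1 | route at layer l) is near zero, the next layer's routing decision is essentially determined by the previous one. The sketch below is not the paper's code; the routing traces, the 8-expert setup, and the 5% re-route rate are made-up inputs used only to illustrate the measurement.

```python
# Minimal sketch: estimate how predictable layer (l+1)'s routing is from layer l's,
# using synthetic per-token top-1 expert indices.
import numpy as np

def conditional_entropy(prev_routes: np.ndarray, next_routes: np.ndarray) -> float:
    """H(next | prev) in bits, estimated from per-token top-1 expert ids."""
    joint, prev_counts = {}, {}
    for p, n in zip(prev_routes, next_routes):
        joint[(p, n)] = joint.get((p, n), 0) + 1
        prev_counts[p] = prev_counts.get(p, 0) + 1
    total = len(prev_routes)
    h = 0.0
    for (p, n), c in joint.items():
        p_joint = c / total            # P(prev = p, next = n)
        p_cond = c / prev_counts[p]    # P(next = n | prev = p)
        h -= p_joint * np.log2(p_cond)
    return h

# Synthetic traces: layer l+1 mostly copies layer l's decision (only 5% of tokens re-route).
rng = np.random.default_rng(0)
prev = rng.integers(0, 8, size=10_000)
flip = rng.random(10_000) < 0.05
nxt = np.where(flip, rng.integers(0, 8, 10_000), prev)
print(f"H(next | prev) = {conditional_entropy(prev, nxt):.3f} bits "
      f"(vs. {np.log2(8):.3f} bits if routing were independent)")
```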
Yeonju Ro @j777ro · 8 months
(4/n) Why do we need a decoupled router? In current MoE setups, batching happens before expert selection, leading to the activation of too many experts at each layer and increased latency. Want to save memory by evicting experts? How do you predict which ones won’t be used? Is…
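To make the "too many experts per layer" point in (4/n) concrete, here is a small, purely illustrative simulation (uniform top-2 routing over 8 experts, numbers chosen arbitrarily, not the paper's measurements): even modest batches touch nearly every expert, which is why eviction buys little when routing is only known layer by layer.

```python
# Minimal sketch: with layer-wise top-k routing, a batch activates the union of
# every token's experts, so most experts must stay resident in memory.
import numpy as np

NUM_EXPERTS, TOP_K = 8, 2
rng = np.random.default_rng(0)

for batch_size in (1, 4, 16, 64):
    activated = set()
    for _ in range(batch_size):
        # Each token independently picks its top-k experts (uniform here, for illustration).
        activated.update(rng.choice(NUM_EXPERTS, size=TOP_K, replace=False))
    print(f"batch={batch_size:3d}: {len(activated)}/{NUM_EXPERTS} experts must be resident")
```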
Yeonju Ro @j777ro · 8 months
(3/n) With Read-ME, you don’t have to worry anymore! 🚀 We decouple router layers into a single, independent router separate from the backbone LLM. This enables pre-computing expert selection, making batching and caching much easier!
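A rough sketch of what the decoupling in (3/n) enables, using hypothetical names and a plain linear layer as a stand-in for the actual router: expert choices are computed for the whole batch before the backbone runs, so a server can group tokens per expert and prefetch only the experts it will actually use.

```python
# Minimal sketch (assumed shapes/names, not the paper's implementation):
# pre-compute expert routes with a router that is separate from the backbone.
import torch

NUM_EXPERTS, HIDDEN = 8, 64
router = torch.nn.Linear(HIDDEN, NUM_EXPERTS)   # stand-in for the decoupled router

def precompute_routes(token_states: torch.Tensor) -> torch.Tensor:
    """Top-1 expert id per token, computed before the backbone runs."""
    with torch.no_grad():
        return router(token_states).argmax(dim=-1)

tokens = torch.randn(32, HIDDEN)                # a batch of 32 token representations
routes = precompute_routes(tokens)

# Group token indices by expert: batching, caching, and prefetch decisions can be made
# ahead of time instead of waiting for a gate inside every layer.
groups = {e: (routes == e).nonzero(as_tuple=True)[0] for e in range(NUM_EXPERTS)}
needed = [e for e, idx in groups.items() if len(idx) > 0]
print("experts to prefetch:", needed)
for e in needed:
    expert_batch = tokens[groups[e]]            # contiguous per-expert batch
    # expert_batch would be fed to expert e's FFN here
```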
Yeonju Ro @j777ro · 8 months
(2/n) In this work, we refactor pre-trained LLMs as Router-decoupled MoEs for efficient inference. Traditional layer-wise gating disrupts batching, memory management, caching, and prefetching. Why? Because you must wait for the gating layer to decide which expert to activate.
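For contrast with the decoupled sketch above, this is roughly what the conventional layer-wise gating criticized in (2/n) looks like, in simplified form (hypothetical module, top-1 routing, arbitrary sizes): the gate lives inside each layer's forward pass, so which experts a token needs only becomes known once execution reaches that layer.

```python
# Minimal sketch of conventional layer-wise MoE gating: expert selection happens
# inside every layer, which blocks ahead-of-time batching, caching, and prefetching.
import torch

NUM_EXPERTS, HIDDEN = 8, 64

class LayerWiseMoE(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.gate = torch.nn.Linear(HIDDEN, NUM_EXPERTS)
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Expert selection happens *here*, per token, during the forward pass.
        expert_ids = self.gate(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for e in range(NUM_EXPERTS):
            mask = expert_ids == e
            if mask.any():                      # this expert must already be in memory
                out[mask] = self.experts[e](x[mask])
        return out

x = torch.randn(16, HIDDEN)
for layer in [LayerWiseMoE() for _ in range(4)]:
    x = layer(x)                                # routing at layer i is unknown until layer i runs
```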
Yeonju Ro @j777ro · 8 months
(1/n) Do you think token batching in MoE is inefficient? Are you looking for ways to transform pre-trained LLMs into MoEs? Then you should check out Read-ME at NeurIPS'24! 📖
arxiv.org
The proliferation of large language models (LLMs) has led to the adoption of Mixture-of-Experts (MoE) architectures that dynamically leverage specialized subnetworks for improved efficiency and...
Yeonju Ro @j777ro · 10 months
RT @adityaakella: 🚀PhD applicants: Want to revolutionize OS design? Join @UT to build LDOS—the next-gen learned OS—and work on cutting-edge….