Vima Gupta
@vima_gupta
Followers
54
Following
54
Media
6
Statuses
11
PhD student at Georgia Tech @gt_computing. Previously research intern @MSFTResearch, @CerebrasSystem, @arm
Joined November 2022
1/7 🧵 MoEs: a tale of expectation vs. reality. Marketing: "Only compute the expert parameters you need!" Reality: batch 16 requests → ALL experts activate. At serving time (vLLM/TGI), arithmetic intensity: AI ≈ (num_tokens * top_k) / total_experts. In simpler terms: your decode …
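A quick back-of-envelope check of that formula, as a sketch: the top_k=2, 64-expert shape below is an assumption for illustration, not any particular model.

```python
# Back-of-envelope arithmetic intensity for a batched MoE decode step,
# following the thread's formula: AI ~ (num_tokens * top_k) / total_experts.

def moe_decode_arithmetic_intensity(num_tokens: int, top_k: int, total_experts: int) -> float:
    """Expected tokens routed per expert -- a proxy for FLOPs per byte of
    expert weights loaded, since each activated expert's weights must be
    read from memory however few tokens it serves."""
    return num_tokens * top_k / total_experts

# Decode generates one token per request, so num_tokens == batch size.
for batch in (1, 4, 16, 64):
    ai = moe_decode_arithmetic_intensity(num_tokens=batch, top_k=2, total_experts=64)
    print(f"batch={batch:3d}  tokens per expert ~ {ai:.2f}")
```

At small batches the ratio sits far below one token per expert: each activated expert's weights are streamed from memory to serve almost no compute, which is the memory-bound decode the thread describes.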
As a Chinese-born professor teaching hundreds of US and international students each semester, I'm saddened by remarks stereotyping Chinese students. Talent and integrity cross borders. Let's uphold respect and fairness that will enrich future generations and our broader society.
Probably I'm a little late to this, but for all those applying for PhD positions during this cycle: I have seen PhD applicants stressing a lot about getting into a "great" school rather than getting an opportunity to work with a supportive and empathetic advisor. Believe me, …
7/7 Paper: https://t.co/a9Z0pc7fLd Code dropping soon! Work done at @gtcomputing with awesome collaborators Kartik Sinha, @anandpiyer and Ada Gavrilovska!!
6/7 We present Lynx 🐼: dynamically picks experts during decode = 1.5x speedup. Preserves critical routing decisions while optimizing for latency.
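A minimal sketch of the idea as I read this tweet, namely shrinking how many experts are actually computed at decode while keeping the highest-ranked routing choices. This is not Lynx's actual algorithm (the paper in 7/7 is the source); the shapes and toy FFNs are made up.

```python
# Toy MoE layer: at decode, compute only the top `keep_at_decode` of the
# router's top_k choices, preserving the highest-ranked (most critical) experts.
import numpy as np

rng = np.random.default_rng(0)
D, NUM_EXPERTS, TOP_K = 32, 8, 2
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((NUM_EXPERTS, D)) / np.sqrt(D)

def moe_forward(x: np.ndarray, decode: bool, keep_at_decode: int = 1) -> np.ndarray:
    logits = router @ x
    order = np.argsort(logits)[::-1][:TOP_K]      # routing decision, best first
    k = keep_at_decode if decode else TOP_K       # shrink k only during decode
    chosen = order[:k]
    w = np.exp(logits[chosen] - logits[chosen].max())
    w /= w.sum()                                  # renormalize gate weights
    return sum(wi * np.tanh(experts[e] @ x) for wi, e in zip(w, chosen))

x = rng.standard_normal(D)
full = moe_forward(x, decode=False)
lean = moe_forward(x, decode=True)
print("relative output delta:", np.linalg.norm(full - lean) / np.linalg.norm(full))
```

Fewer experts touched per decode step means fewer expert weights streamed from memory, which is where a decode-time speedup like the quoted 1.5x would come from.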
5/7 Another cool finding about how MoEs work: there's a strong hierarchy in expert selection. Your first-choice expert does most of the heavy lifting. This pattern shows up across different MoE models. Makes you wonder if this is a fundamental property of mixture of experts …
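One way you might quantify that hierarchy: softmax over the selected top_k router logits (the Mixtral-style renormalization) and look at how much mixture weight the first choice carries. The logits below are synthetic stand-ins; with a real model you would hook the router and collect actual per-token logits.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_gate_shares(logits: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Softmax over the selected top_k logits, sorted so index 0 is the
    first-choice expert's share of the mixture weight."""
    top = np.sort(logits)[::-1][:top_k]
    w = np.exp(top - top.max())
    return w / w.sum()

# Synthetic router logits for an 8-expert layer, purely illustrative.
shares = np.stack([topk_gate_shares(1.5 * rng.standard_normal(8)) for _ in range(10_000)])
print("mean gate weight, 1st choice:", round(float(shares[:, 0].mean()), 3))
print("mean gate weight, 2nd choice:", round(float(shares[:, 1].mean()), 3))
```

If the first-choice share sits well above 1/top_k, the top-1 expert is doing "most of the heavy lifting", matching the hierarchy the thread observes.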
4/7 Diving deeper revealed something fascinating: MoEs behave completely differently in prefill vs decode! Prefill: touch the expert routing and the model screams. Decode: surprisingly chill about expert selection.
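One intuition for why this might hold (my own reading, not a claim from the thread): a prefill-time routing error contaminates context that every later token reads, while a late decode error only touches the tail. A toy recurrence, not a transformer, just to make that asymmetry concrete:

```python
import numpy as np

# Fixed toy mixing weights; the recurrence stands in for "each step reads
# everything that came before", not for real attention.
W = np.random.default_rng(42).standard_normal((16, 16)) / 4.0

def rollout(steps: int, perturb_at: int | None = None) -> np.ndarray:
    rng = np.random.default_rng(0)                         # deterministic "prompt"
    noise = np.random.default_rng(1).standard_normal(16)   # fixed error vector
    ctx = [rng.standard_normal(16)]
    for t in range(steps):
        h = np.tanh(W @ np.mean(ctx, axis=0))              # reads the whole context
        if t == perturb_at:
            h = h + 0.5 * noise                            # stand-in for a routing error
        ctx.append(h)
    return np.stack(ctx)

base = rollout(32)
early = rollout(32, perturb_at=0)     # prefill-like: error enters the context early
late = rollout(32, perturb_at=30)     # decode-like: error lands near the end
print("total drift, early error:", np.linalg.norm(early - base))
print("total drift, late error: ", np.linalg.norm(late - base))
```

The early perturbation propagates into every subsequent state, while the late one barely registers, which is consistent with prefill being fragile and decode being "chill".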
3/7 The root cause? Batch-level dynamics. Left: training ensures uniform expert use. Right: production shows extreme expert-activation skew. This isn't a bug. It's emergent behavior from batching diverse requests.
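A quick simulation of those batch-level dynamics: individual tokens are sparse (top_k of all experts), but the union across a batch touches nearly everything, and an uneven token mix skews per-expert load. The Mixtral-like 8-expert/top-2 shape and the Zipf-ish popularity are assumptions standing in for production traffic.

```python
import random
from collections import Counter

random.seed(0)
NUM_EXPERTS, TOP_K = 8, 2                                  # Mixtral-like shape, assumed
weights = [1.0 / (e + 1) for e in range(NUM_EXPERTS)]      # skewed expert popularity

def route_token() -> list[int]:
    """Sample TOP_K distinct experts according to the skewed popularity."""
    w, chosen = weights[:], []
    for _ in range(TOP_K):
        e = random.choices(range(NUM_EXPERTS), weights=w)[0]
        chosen.append(e)
        w[e] = 0.0                                         # without replacement
    return chosen

for batch in (1, 4, 16, 64):
    loads = Counter(e for _ in range(batch) for e in route_token())
    print(f"batch={batch:3d}  experts touched={len(loads)}/{NUM_EXPERTS}  "
          f"hottest expert load={max(loads.values()):3d}")
```

Even modest batches activate essentially every expert while a few experts carry most of the tokens: per-token sparsity, batch-level density, plus skew.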
2/7 The deeper systems problem is fascinating. Look at decode vs prefill phase behavior. Prefill: flat latency as experts ↑ (compute masks memory). Decode: linear latency scaling (pure memory pain). Each token generation loads ALL experts from GPU memory, leading to high latency.
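A roofline-style sketch of why that happens: prefill has enough tokens that compute hides the weight traffic, while decode is dominated by streaming expert weights. All hardware numbers and the per-expert parameter count are made-up round figures for illustration.

```python
BYTES_PER_PARAM = 2           # fp16
PARAMS_PER_EXPERT = 50e6      # hypothetical expert FFN size
MEM_BW = 2e12                 # 2 TB/s HBM, illustrative
PEAK_FLOPS = 300e12           # 300 TFLOP/s, illustrative
TOP_K = 2                     # experts computed per token, assumed

def layer_latency(num_tokens: int, experts_touched: int) -> float:
    # Weight traffic: every touched expert is streamed from HBM once per step.
    mem_time = experts_touched * PARAMS_PER_EXPERT * BYTES_PER_PARAM / MEM_BW
    # Compute: ~2 FLOPs per parameter, per routed token, per chosen expert.
    compute_time = num_tokens * TOP_K * 2 * PARAMS_PER_EXPERT / PEAK_FLOPS
    return max(mem_time, compute_time)   # whichever resource dominates

for n_exp in (8, 16, 32, 64):
    prefill = layer_latency(num_tokens=8192, experts_touched=n_exp)  # long prompt
    decode = layer_latency(num_tokens=16, experts_touched=n_exp)     # 16 reqs, 1 token each
    print(f"experts={n_exp:2d}  prefill={prefill*1e3:6.2f} ms  decode={decode*1e3:6.2f} ms")
```

Prefill stays pinned at the compute term (flat as experts grow); decode tracks the memory term and scales linearly with experts touched, matching the tweet's two curves.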
🚀 Introducing Metron: Redefining LLM Serving Benchmarks! 🚀 Tired of misleading metrics for LLM performance? Our new paper introduces a holistic framework that captures what really matters - the user experience! 🧠🔬 https://t.co/Q02Fj0IUKa
#LLM #AI #Benchmark
github.com: project-etalon/etalon · LLM Serving Performance Evaluation Harness
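For context, the conventional per-request metrics such a framework starts from can be computed from token arrival timestamps alone. This sketch (timeline numbers hypothetical; see the paper for Metron's actual user-experience metrics) shows why averages mislead: a mid-stream stall vanishes in the mean but is exactly what the user feels.

```python
def serving_metrics(request_start: float, token_times: list[float]) -> dict:
    """Basic per-request serving metrics from token arrival timestamps."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - request_start,            # time to first token
        "mean_inter_token_s": sum(gaps) / len(gaps),
        "max_inter_token_s": max(gaps),                      # exposes stalls
        "tokens_per_s": len(token_times) / (token_times[-1] - request_start),
    }

# Hypothetical timeline: first token at 0.4 s, then one every 50 ms,
# with a single 300 ms stall midway through the stream.
times = [0.4 + 0.05 * i for i in range(40)]
times[20:] = [t + 0.30 for t in times[20:]]
print(serving_metrics(0.0, times))
```

Throughput and the mean gap look healthy while the max gap reveals the stall, which is the gap between aggregate benchmarks and perceived responsiveness that the tweet is pointing at.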