Vima Gupta Profile
Vima Gupta

@vima_gupta

Followers
54
Following
54
Media
6
Statuses
11

PhD Student at Georgia Tech @gt_computing. Research Intern previously at @MSFTResearch, @CerebrasSystem, @arm

Joined November 2022
@vima_gupta
Vima Gupta
1 year
1/7 🧵 MoEs: A tale of expectation vs reality. Marketing: "Only compute the expert parameters you need!" Reality: batch 16 requests → ALL experts activate. At serving time (vLLM/TGI), arithmetic intensity: AI ≈ (num_tokens * top_k) / total_experts. In simpler terms: Your decode
4
7
32
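The arithmetic-intensity point above can be sketched numerically. The numbers below (8 experts, top-2 routing, a uniform stand-in for the learned router) are illustrative assumptions, not measurements from the thread:

```python
import random

def arithmetic_intensity(num_tokens, top_k=2, total_experts=8):
    # AI ≈ (num_tokens * top_k) / total_experts, as in the tweet:
    # expert FLOPs performed per unit of expert weights streamed from memory.
    return num_tokens * top_k / total_experts

def activated_experts(num_tokens, top_k=2, total_experts=8, seed=0):
    # Stand-in for the learned router: each token picks its top_k experts
    # uniformly at random (an assumption; real routers are learned).
    rng = random.Random(seed)
    hit = set()
    for _ in range(num_tokens):
        hit.update(rng.sample(range(total_experts), top_k))
    return len(hit)

print(arithmetic_intensity(1))    # 0.25: single-token decode barely reuses weights
print(arithmetic_intensity(16))   # 4.0 with a batch of 16
print(activated_experts(16))      # with 16 diverse tokens, usually every expert is hit
```

Even though each token only computes through 2 of 8 experts, a modest batch touches essentially all of them, so the "sparse" memory saving largely disappears at serving time.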
@humphrey_shi
Humphrey Shi
1 year
As a Chinese-born professor teaching hundreds of US and international students each semester, I’m saddened by remarks stereotyping Chinese students. Talent and integrity cross borders. Let’s uphold respect and fairness that will enrich future generations and our broader society.
11
11
303
@mohit__30
Mohit Chandra
1 year
Probably I'm a little late to this, but for all those applying for PhD positions during this cycle: I have seen PhD applicants stressing a lot about getting into a "great" school rather than getting an opportunity to work with a supportive and empathetic advisor. Believe me,
2
7
80
@vima_gupta
Vima Gupta
1 year
7/7 Paper: https://t.co/a9Z0pc7fLd Code dropping soon! Work done at @gtcomputing with awesome collaborators Kartik Sinha, @anandpiyer and Ada Gavrilovska!!
0
0
6
@vima_gupta
Vima Gupta
1 year
6/7 We present Lynx 😼 Dynamically picks experts during decode = 1.5x speedup. Preserves critical routing decisions while optimizing for latency.
1
0
9
@vima_gupta
Vima Gupta
1 year
5/7 Another cool finding about how MoEs work: there's a strong hierarchy in expert selection. Your first-choice expert does most of the heavy lifting. This pattern shows up across different MoE models. Makes you wonder if this is a fundamental property of mixtures of experts.
1
0
2
@vima_gupta
Vima Gupta
1 year
4/7 Diving deeper revealed something fascinating: MoEs behave completely differently in prefill vs decode! Prefill: touch the expert routing and the model screams. Decode: surprisingly chill about expert selection.
1
0
3
@vima_gupta
Vima Gupta
1 year
3/7 The root cause? Batch-level dynamics. Left: training ensures uniform expert use. Right: production shows extreme expert activation skew. This isn't a bug. It's emergent behavior from batching diverse requests 📊
1
0
3
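The batching effect described above can be illustrated with a toy simulation. The router behavior here is an assumption for illustration (each request leans on a small, request-specific subset of experts), not the actual learned router from the paper:

```python
import random
from collections import Counter

def batch_expert_counts(num_requests, tokens_per_request=32,
                        total_experts=8, seed=0):
    # Marginally (over many requests) expert usage is roughly uniform,
    # mimicking the training-time load-balancing objective; but any one
    # batch mixes requests with different expert preferences.
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_requests):
        # Assumed per-request preference: two "favourite" experts.
        favourites = rng.sample(range(total_experts), 2)
        weights = [4.0 if e in favourites else 0.5
                   for e in range(total_experts)]
        for _ in range(tokens_per_request):
            counts[rng.choices(range(total_experts), weights=weights)[0]] += 1
    return counts

counts = batch_expert_counts(4)
print(sorted(counts.values(), reverse=True))  # skewed: a few hot experts dominate
```

Aggregated over a single batch, a handful of experts soak up most of the tokens while others sit nearly idle, even though no single expert is favored in expectation.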
@vima_gupta
Vima Gupta
1 year
2/7 The deeper systems problem is fascinating. Look at decode vs prefill phase behavior. Prefill: flat latency as experts ↑ (compute masks memory). Decode: linear latency scaling (pure memory pain). Each token generation loads ALL experts from GPU memory, leading to high latency.
1
0
3
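A back-of-envelope roofline model reproduces the asymmetry described above. All hardware numbers and model sizes below are assumed round figures, not measurements from the thread:

```python
PEAK_FLOPS = 300e12     # assumed GPU peak throughput (FLOP/s)
BANDWIDTH = 2e12        # assumed HBM bandwidth (bytes/s)
EXPERT_PARAMS = 0.15e9  # assumed parameters per expert
BYTES_PER_PARAM = 2     # fp16 weights
TOP_K = 2

def layer_time(num_tokens, total_experts):
    # Compute: each token does ~2 FLOPs per parameter through its top-k experts.
    flops = num_tokens * TOP_K * EXPERT_PARAMS * 2
    # Memory: with enough diverse tokens in the batch, every expert's
    # weights must be streamed in (the ALL-experts effect from the thread).
    bytes_moved = total_experts * EXPERT_PARAMS * BYTES_PER_PARAM
    # Roofline: whichever of compute or memory dominates sets the time.
    return max(flops / PEAK_FLOPS, bytes_moved / BANDWIDTH)

# Decode (1 token/step): the memory term dominates, so time grows
# ~linearly with the number of experts. Prefill (thousands of tokens):
# the compute term dominates, so time stays ~flat as experts increase.
for experts in (4, 8, 16):
    print(experts, layer_time(1, experts), layer_time(4096, experts))
```

Under these assumptions, doubling the expert count doubles decode-step time but leaves prefill time unchanged, matching the flat-vs-linear scaling in the tweet.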
@agrawalamey12
Amey Agrawal
2 years
🚀 Introducing Metron: Redefining LLM Serving Benchmarks! 📊 Tired of misleading metrics for LLM performance? Our new paper introduces a holistic framework that captures what really matters - the user experience! 🧠💬 https://t.co/Q02Fj0IUKa #LLM #AI #Benchmark
github.com: LLM Serving Performance Evaluation Harness (project-etalon/etalon)
2
15
34
@agrawalamey12
Amey Agrawal
2 years
1/ LLM inference systems are like high-performance engines ⚙️: complex, powerful, and full of intricate settings. Efficiently deploying them to maximize GPU performance is a challenge typically tackled by experts at orgs like @OpenAI and @AIatMeta 🚀. 🧵
1
13
39