Vima Gupta
@vima_gupta
Followers
54
Following
54
Media
6
Statuses
11
PhD student at Georgia Tech @gt_computing. Previously research intern @MSFTResearch, @CerebrasSystem, @arm
Joined November 2022
1/7 🧵 MoEs: a tale of expectation vs. reality. Marketing: "Only compute the expert parameters you need!" Reality: batch 16 requests → ALL experts activate. At serving time (vLLM/TGI), arithmetic intensity: AI ≈ (num_tokens * top_k) / total_experts. In simpler terms: your decode …
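A quick back-of-envelope check of that formula, as a sketch: the top_k=2, 64-expert shape below is an assumption for illustration, not any particular model.

```python
# Back-of-envelope arithmetic intensity for a batched MoE decode step,
# following the thread's formula: AI ~ (num_tokens * top_k) / total_experts.

def moe_decode_arithmetic_intensity(num_tokens: int, top_k: int, total_experts: int) -> float:
    """Expected tokens routed per expert -- a proxy for FLOPs per byte of
    expert weights loaded, since each activated expert's weights must be
    read from memory however few tokens it serves."""
    return num_tokens * top_k / total_experts

# Decode generates one token per request, so num_tokens == batch size.
for batch in (1, 4, 16, 64):
    ai = moe_decode_arithmetic_intensity(num_tokens=batch, top_k=2, total_experts=64)
    print(f"batch={batch:3d}  tokens per expert ~ {ai:.2f}")
```

At small batches the ratio sits far below one token per expert: each activated expert's weights are streamed from memory to serve almost no compute, which is the memory-bound decode the thread describes.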
As a Chinese-born professor teaching hundreds of US and international students each semester, I'm saddened by remarks stereotyping Chinese students. Talent and integrity cross borders. Let's uphold respect and fairness that will enrich future generations and our broader society.
Probably I'm a little late to this, but for all those applying for PhD positions during this cycle: I have seen PhD applicants stressing a lot about getting into a "great" school rather than getting an opportunity to work with a supportive and empathetic advisor. Believe me, …
7/7 Paper: https://t.co/a9Z0pc7fLd Code dropping soon! Work done at @gtcomputing with awesome collaborators Kartik Sinha, @anandpiyer and Ada Gavrilovska!!
6/7 We present Lynx 🐼: dynamically picks experts during decode = 1.5x speedup. Preserves critical routing decisions while optimizing for latency.
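A minimal sketch of the idea as I read this tweet, namely shrinking how many experts are actually computed at decode while keeping the highest-ranked routing choices. This is not Lynx's actual algorithm (the paper in 7/7 is the source); the shapes and toy FFNs are made up.

```python
# Toy MoE layer: at decode, compute only the top `keep_at_decode` of the
# router's top_k choices, preserving the highest-ranked (most critical) experts.
import numpy as np

rng = np.random.default_rng(0)
D, NUM_EXPERTS, TOP_K = 32, 8, 2
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((NUM_EXPERTS, D)) / np.sqrt(D)

def moe_forward(x: np.ndarray, decode: bool, keep_at_decode: int = 1) -> np.ndarray:
    logits = router @ x
    order = np.argsort(logits)[::-1][:TOP_K]      # routing decision, best first
    k = keep_at_decode if decode else TOP_K       # shrink k only during decode
    chosen = order[:k]
    w = np.exp(logits[chosen] - logits[chosen].max())
    w /= w.sum()                                  # renormalize gate weights
    return sum(wi * np.tanh(experts[e] @ x) for wi, e in zip(w, chosen))

x = rng.standard_normal(D)
full = moe_forward(x, decode=False)
lean = moe_forward(x, decode=True)
print("relative output delta:", np.linalg.norm(full - lean) / np.linalg.norm(full))
```

Fewer experts touched per decode step means fewer expert weights streamed from memory, which is where a decode-time speedup like the quoted 1.5x would come from.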
5/7 Another cool finding about how MoEs work: there's a strong hierarchy in expert selection. Your first-choice expert does most of the heavy lifting. This pattern shows up across different MoE models. Makes you wonder if this is a fundamental property of mixture of experts …
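One way you might quantify that hierarchy: softmax over the selected top_k router logits (the Mixtral-style renormalization) and look at how much mixture weight the first choice carries. The logits below are synthetic stand-ins; with a real model you would hook the router and collect actual per-token logits.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_gate_shares(logits: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Softmax over the selected top_k logits, sorted so index 0 is the
    first-choice expert's share of the mixture weight."""
    top = np.sort(logits)[::-1][:top_k]
    w = np.exp(top - top.max())
    return w / w.sum()

# Synthetic router logits for an 8-expert layer, purely illustrative.
shares = np.stack([topk_gate_shares(1.5 * rng.standard_normal(8)) for _ in range(10_000)])
print("mean gate weight, 1st choice:", round(float(shares[:, 0].mean()), 3))
print("mean gate weight, 2nd choice:", round(float(shares[:, 1].mean()), 3))
```

If the first-choice share sits well above 1/top_k, the top-1 expert is doing "most of the heavy lifting", matching the hierarchy the thread observes.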
4/7 Diving deeper revealed something fascinating: MoEs behave completely differently in prefill vs decode! Prefill: touch the expert routing and the model screams. Decode: surprisingly chill about expert selection.
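One intuition for why this might hold (my own reading, not a claim from the thread): a prefill-time routing error contaminates context that every later token reads, while a late decode error only touches the tail. A toy recurrence, not a transformer, just to make that asymmetry concrete:

```python
import numpy as np

# Fixed toy mixing weights; the recurrence stands in for "each step reads
# everything that came before", not for real attention.
W = np.random.default_rng(42).standard_normal((16, 16)) / 4.0

def rollout(steps: int, perturb_at: int | None = None) -> np.ndarray:
    rng = np.random.default_rng(0)                         # deterministic "prompt"
    noise = np.random.default_rng(1).standard_normal(16)   # fixed error vector
    ctx = [rng.standard_normal(16)]
    for t in range(steps):
        h = np.tanh(W @ np.mean(ctx, axis=0))              # reads the whole context
        if t == perturb_at:
            h = h + 0.5 * noise                            # stand-in for a routing error
        ctx.append(h)
    return np.stack(ctx)

base = rollout(32)
early = rollout(32, perturb_at=0)     # prefill-like: error enters the context early
late = rollout(32, perturb_at=30)     # decode-like: error lands near the end
print("total drift, early error:", np.linalg.norm(early - base))
print("total drift, late error: ", np.linalg.norm(late - base))
```

The early perturbation propagates into every subsequent state, while the late one barely registers, which is consistent with prefill being fragile and decode being "chill".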
3/7 The root cause? Batch-level dynamics. Left: training ensures uniform expert use. Right: production shows extreme expert-activation skew. This isn't a bug. It's emergent behavior from batching diverse requests.
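A quick simulation of those batch-level dynamics: individual tokens are sparse (top_k of all experts), but the union across a batch touches nearly everything, and an uneven token mix skews per-expert load. The Mixtral-like 8-expert/top-2 shape and the Zipf-ish popularity are assumptions standing in for production traffic.

```python
import random
from collections import Counter

random.seed(0)
NUM_EXPERTS, TOP_K = 8, 2                                  # Mixtral-like shape, assumed
weights = [1.0 / (e + 1) for e in range(NUM_EXPERTS)]      # skewed expert popularity

def route_token() -> list[int]:
    """Sample TOP_K distinct experts according to the skewed popularity."""
    w, chosen = weights[:], []
    for _ in range(TOP_K):
        e = random.choices(range(NUM_EXPERTS), weights=w)[0]
        chosen.append(e)
        w[e] = 0.0                                         # without replacement
    return chosen

for batch in (1, 4, 16, 64):
    loads = Counter(e for _ in range(batch) for e in route_token())
    print(f"batch={batch:3d}  experts touched={len(loads)}/{NUM_EXPERTS}  "
          f"hottest expert load={max(loads.values()):3d}")
```

Even modest batches activate essentially every expert while a few experts carry most of the tokens: per-token sparsity, batch-level density, plus skew.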
2/7 The deeper systems problem is fascinating. Look at decode vs prefill phase behavior. Prefill: flat latency as experts ↑ (compute masks memory). Decode: linear latency scaling (pure memory pain). Each token generation loads ALL experts from GPU memory, leading to high latency.
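A roofline-style sketch of why that happens: prefill has enough tokens that compute hides the weight traffic, while decode is dominated by streaming expert weights. All hardware numbers and the per-expert parameter count are made-up round figures for illustration.

```python
BYTES_PER_PARAM = 2           # fp16
PARAMS_PER_EXPERT = 50e6      # hypothetical expert FFN size
MEM_BW = 2e12                 # 2 TB/s HBM, illustrative
PEAK_FLOPS = 300e12           # 300 TFLOP/s, illustrative
TOP_K = 2                     # experts computed per token, assumed

def layer_latency(num_tokens: int, experts_touched: int) -> float:
    # Weight traffic: every touched expert is streamed from HBM once per step.
    mem_time = experts_touched * PARAMS_PER_EXPERT * BYTES_PER_PARAM / MEM_BW
    # Compute: ~2 FLOPs per parameter, per routed token, per chosen expert.
    compute_time = num_tokens * TOP_K * 2 * PARAMS_PER_EXPERT / PEAK_FLOPS
    return max(mem_time, compute_time)   # whichever resource dominates

for n_exp in (8, 16, 32, 64):
    prefill = layer_latency(num_tokens=8192, experts_touched=n_exp)  # long prompt
    decode = layer_latency(num_tokens=16, experts_touched=n_exp)     # 16 reqs, 1 token each
    print(f"experts={n_exp:2d}  prefill={prefill*1e3:6.2f} ms  decode={decode*1e3:6.2f} ms")
```

Prefill stays pinned at the compute term (flat as experts grow); decode tracks the memory term and scales linearly with experts touched, matching the tweet's two curves.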
🚀 Introducing Metron: Redefining LLM Serving Benchmarks! 🚀 Tired of misleading metrics for LLM performance? Our new paper introduces a holistic framework that captures what really matters - the user experience! 🧠🔬 https://t.co/Q02Fj0IUKa
#LLM #AI #Benchmark
github.com: project-etalon/etalon · LLM Serving Performance Evaluation Harness
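For context, the conventional per-request metrics such a framework starts from can be computed from token arrival timestamps alone. This sketch (timeline numbers hypothetical; see the paper for Metron's actual user-experience metrics) shows why averages mislead: a mid-stream stall vanishes in the mean but is exactly what the user feels.

```python
def serving_metrics(request_start: float, token_times: list[float]) -> dict:
    """Basic per-request serving metrics from token arrival timestamps."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - request_start,            # time to first token
        "mean_inter_token_s": sum(gaps) / len(gaps),
        "max_inter_token_s": max(gaps),                      # exposes stalls
        "tokens_per_s": len(token_times) / (token_times[-1] - request_start),
    }

# Hypothetical timeline: first token at 0.4 s, then one every 50 ms,
# with a single 300 ms stall midway through the stream.
times = [0.4 + 0.05 * i for i in range(40)]
times[20:] = [t + 0.30 for t in times[20:]]
print(serving_metrics(0.0, times))
```

Throughput and the mean gap look healthy while the max gap reveals the stall, which is the gap between aggregate benchmarks and perceived responsiveness that the tweet is pointing at.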