Aditya Tomar
@adityastomar_
Followers 307 · Following 362 · Media 12 · Statuses 23
undergrad @berkeley_ai | research intern @TogetherCompute | prev-@Livermore_Lab
Joined July 2024
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10–12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
Replies 26 · Reposts 90 · Likes 667
After spending hours debugging a kernel, TIL that TF32 is actually 19 bits. Triton’s dot function will automatically cast inputs to TF32 for devices that have tensor cores (can be disabled by passing allow_tf32=False). The amount of data moved is still equivalent to 32 bits, but
Replies 0 · Reposts 0 · Likes 8
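A minimal sketch of the behavior described in the tweet above, assuming Triton's tl.dot with the allow_tf32 flag (newer Triton releases expose this as input_precision instead); block size and the assumption that M, N, K are multiples of BLOCK are illustrative, not anyone's production kernel:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K, BLOCK: tl.constexpr):
    # Each program computes one BLOCK x BLOCK tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK + tl.arange(0, BLOCK)
    offs_n = pid_n * BLOCK + tl.arange(0, BLOCK)
    acc = tl.zeros((BLOCK, BLOCK), dtype=tl.float32)
    for k in range(0, K, BLOCK):
        offs_k = k + tl.arange(0, BLOCK)
        a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
        b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
        # Without allow_tf32=False, fp32 inputs are silently routed through TF32
        # (only 19 of the 32 bits carry precision) on tensor-core GPUs.
        acc += tl.dot(a, b, allow_tf32=False)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)

# Illustrative launch; assumes M, N, K are multiples of BLOCK.
M = N = K = 256
a = torch.randn(M, K, device="cuda", dtype=torch.float32)
b = torch.randn(K, N, device="cuda", dtype=torch.float32)
c = torch.empty(M, N, device="cuda", dtype=torch.float32)
matmul_kernel[(M // 64, N // 64)](a, b, c, M, N, K, BLOCK=64)
```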
Feel like I'm taking crazy pills. We are just back at step one. Don’t store KV cache, just recompute it.
Replies 29 · Reposts 23 · Likes 541
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Replies 5 · Reposts 19 · Likes 138
📝 Summary: • KV cache is the real bottleneck for LLM inference • Compute keeps scaling faster than memory • So → rematerialize KV instead of caching it • XQuant: forward-looking, compute-for-memory tradeoff → faster inference + better scaling. • For ultra-low bit
Replies 5 · Reposts 0 · Likes 36
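To put a rough number on the "KV cache is the real bottleneck" claim, a back-of-the-envelope estimate using Llama-2-7B's published shape (32 layers, 32 KV heads, head dim 128, standard multi-head attention); these are my illustrative numbers, not figures from the thread:

```python
# Rough FP16 KV-cache size for Llama-2-7B (illustrative, not from the paper).
n_layers, n_kv_heads, head_dim = 32, 32, 128   # Llama-2-7B uses MHA, no GQA
seq_len, bytes_per_elem = 4096, 2              # FP16
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")   # ~2.0 GiB at 4k context
# Scale seq_len to 100k and this grows to roughly 50 GiB per sequence, far beyond
# the ~13 GB of FP16 weights, which is why trading compute for memory pays off.
```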
📊 Results: • 2–4 bit quantization → XQuant consistently beats KVQuant & KIVI for the same memory budget • XQuant-CL (2-bit) retains near-FP16 accuracy while pushing memory savings to the extreme. • 12.5x memory reduction with only 0.1 PPL degradation and 10x memory reduction with
Replies 1 · Reposts 1 · Likes 29
⚙️Extension to GQA: • Many modern LLMs use Grouped Query Attention (GQA). • Challenge: naively caching X costs 2x more memory than the GQA KV cache. • Our fix → apply SVD offline, project X into a smaller latent space, then quantize. • Bonus observation: Latent X is even easier to
Replies 1 · Reposts 0 · Likes 23
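A rough sketch of the GQA fix as described above. The names (W_k, W_v, rank r) are hypothetical, and I am assuming the latent basis comes from an SVD of the concatenated KV projection weights; the paper's exact construction may differ:

```python
import torch

def build_latent_projection(W_k, W_v, r):
    # Offline: SVD of the concatenated KV projection (d_model x 2*d_kv),
    # keeping the top-r left singular vectors as the latent basis.
    W = torch.cat([W_k, W_v], dim=1)                 # (d_model, 2*d_kv)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    P = U[:, :r]                                     # (d_model, r): maps X -> latent
    W_latent = P.T @ W                               # (r, 2*d_kv): maps latent -> K,V
    return P, W_latent

def cache_and_rematerialize(X, P, W_latent, d_kv):
    # Cache (and, in the real method, quantize) the low-rank latent instead of X.
    Z = X @ P                                        # (seq, r): this is what gets cached
    KV = Z @ W_latent                                # (seq, 2*d_kv): rematerialized on the fly
    return KV[:, :d_kv], KV[:, d_kv:]                # K, V
```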
⚙️Method: We propose XQuant-CL, which quantizes the deltas between the X embeddings of successive layers. These deltas are extremely easy to quantize, pushing the state-of-the-art in ultra-low bit precision quantization. 🧵 [4/7]
Replies 1 · Reposts 1 · Likes 25
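A toy illustration of the cross-layer delta idea, with a simple per-tensor symmetric quantizer standing in for whatever scheme the paper actually uses:

```python
import torch

def quantize(t, bits=2):
    # Toy per-tensor symmetric uniform quantizer; stand-in for the real scheme.
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(t / scale), -qmax - 1, qmax), scale

def dequantize(q, scale):
    return q * scale

def compress_cross_layer(xs, bits=2):
    # xs: list of per-layer X tensors. Cache X_0 once, then only quantized deltas
    # against the running reconstruction so error does not accumulate across layers.
    q0 = quantize(xs[0], bits)
    cache, recon = [q0], dequantize(*q0)
    for x in xs[1:]:
        qd = quantize(x - recon, bits)
        cache.append(qd)
        recon = recon + dequantize(*qd)
    return cache

def decompress_cross_layer(cache):
    x = dequantize(*cache[0])
    out = [x]
    for q, s in cache[1:]:
        x = x + dequantize(q, s)      # X_l is rebuilt as X_{l-1} plus its delta
        out.append(x)
    return out
```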
💡Surprising insight: input embeddings X across layers are very similar (unlike KV). Across all models, X embeddings are remarkably similar from one layer to the next compared to the KV cache activations across layers. We attribute this property to the
Replies 3 · Reposts 5 · Likes 57
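The kind of quick check behind an observation like this (my own sketch, using mean cosine similarity between consecutive layers; not necessarily the paper's metric, and per_layer_X / per_layer_K are hypothetical lists of captured activations):

```python
import torch
import torch.nn.functional as F

def mean_consecutive_cosine(per_layer):
    # per_layer: list of (..., d) tensors, one per layer, all the same shape.
    sims = []
    for a, b in zip(per_layer[:-1], per_layer[1:]):
        sims.append(F.cosine_similarity(a.reshape(-1, a.shape[-1]),
                                        b.reshape(-1, b.shape[-1]), dim=-1).mean())
    return torch.stack(sims).mean()

# Compare mean_consecutive_cosine(per_layer_X) against the same metric on
# per-layer K or V activations: X tends to change far less from layer to layer.
```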
⚙️Method: Instead of caching Key/Value activations, we cache the layer input X. • X is ½ the size of KV → 2x memory savings • Recompute KV on-the-fly → higher FLOPs but can leverage underutilized compute units (since inference is usually memory-bound). • Result: less memory
Replies 2 · Reposts 3 · Likes 51
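A minimal single-head sketch of that compute-for-memory trade (my own illustration, with generic W_q/W_k/W_v projections; positional embeddings and the quantization of the cached X are omitted):

```python
import torch

class XCacheAttention(torch.nn.Module):
    # Keep only the layer input X (one tensor) instead of a K and V cache
    # (two tensors), and rematerialize K, V at attention time.
    def __init__(self, d_model, d_kv):
        super().__init__()
        self.W_q = torch.nn.Linear(d_model, d_kv, bias=False)
        self.W_k = torch.nn.Linear(d_model, d_kv, bias=False)
        self.W_v = torch.nn.Linear(d_model, d_kv, bias=False)
        self.x_cache = None            # in XQuant this tensor would be stored quantized

    def step(self, x_t):               # x_t: (batch, 1, d_model), one decode step
        self.x_cache = x_t if self.x_cache is None else torch.cat([self.x_cache, x_t], dim=1)
        q = self.W_q(x_t)
        k = self.W_k(self.x_cache)     # extra GEMM per step: FLOPs traded for memory
        v = self.W_v(self.x_cache)     # extra GEMM per step: FLOPs traded for memory
        scores = q @ k.transpose(-1, -2) / (k.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v   # (batch, 1, d_kv)
```

The extra GEMMs are cheap precisely because decode is memory-bound: the tensor cores are otherwise idle while the GPU waits on cache reads.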
🚨Come check out our poster at #ICML2025! QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache 📍 East Exhibition Hall A-B — #E-2608 🗓️ Poster Session 5 | Thu, Jul 17 | 🕓 11:00 AM –1:30 PM TLDR: Use a quantized version of the same model as its own draft
🚀 Fast and accurate Speculative Decoding for Long Context? 🔎Problem: 🔹Standard speculative decoding struggles with long-context generation, as current draft models are pretty weak for long context 🔹Finding the right draft model is tricky, as compatibility varies across
Replies 0 · Reposts 6 · Likes 37
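A simplified sketch of the self-speculative loop described in the poster tweet: greedy draft-and-verify with a quantized-KV copy of the same model as drafter. `quantized_model` and `full_model` are placeholder callables returning logits, the acceptance rule here is plain prefix matching rather than the paper's, and batch size 1 is assumed:

```python
import torch

@torch.no_grad()
def self_speculative_decode(full_model, quantized_model, tokens, n_new, gamma=4):
    # Sketch only: the cheap (quantized-KV) copy proposes gamma tokens greedily,
    # then the full-precision copy checks them in a single batched forward pass.
    tokens = tokens.clone()
    while n_new > 0:
        draft = tokens
        for _ in range(min(gamma, n_new)):                       # draft phase
            logits = quantized_model(draft)[:, -1]
            draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=1)
        proposed = draft[:, tokens.shape[1]:]
        target = full_model(draft)[:, tokens.shape[1] - 1:-1].argmax(-1)  # verify phase
        match = (proposed == target).int().cumprod(dim=1)
        n_accept = int(match.sum())                               # longest agreeing prefix
        # keep accepted drafts plus the verifier's correction at the first mismatch
        accepted = torch.cat([proposed[:, :n_accept], target[:, n_accept:n_accept + 1]], dim=1)
        tokens = torch.cat([tokens, accepted], dim=1)
        n_new -= accepted.shape[1]
    return tokens
```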
Accepted to ICML 2025!
Check out our new paper! Special thanks to @tiwarishabh16, @HaochengXiUCB, and @coleman_hooper1 for their close guidance and willingness to answer my unintelligent questions 😁. Find more details on our analysis of LLM performance bottlenecks in the following threads. 🧵 [1/6]
Replies 0 · Reposts 0 · Likes 3
How to optimize memory-bound decoding: ✅ Quantize weights → Speeds up short-context inference ✅ Quantize KV cache → Boosts long-context efficiency
Replies 0 · Reposts 0 · Likes 1
We find that prefill is entirely compute-bound & decode is entirely memory-bound! We focus on optimizing decode. Looking at the aggregate figure: for short contexts, loading model weights is the primary bottleneck; for long contexts, loading the KV cache is. 🧵 [6/6]
Replies 1 · Reposts 0 · Likes 1
But how do we categorize operations as compute- or memory-bound? We use a roofline model for Llama-2-7B FP16 inference on an NVIDIA A6000 GPU. We define the ridge point as peak compute performance (FLOP/s) divided by peak memory bandwidth (GB/s), which has the same units as AI! 🧵 [5/6]
Replies 1 · Reposts 0 · Likes 1
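Worked numbers for the ridge-point definition above. The peak figures are approximate RTX A6000 datasheet values I am assuming for illustration, not measurements from the paper:

```python
# Roofline ridge point = peak compute / peak memory bandwidth (FLOP per byte).
peak_fp16_flops = 155e12      # ~155 TFLOP/s dense FP16 tensor-core throughput (assumed)
peak_mem_bw     = 768e9       # ~768 GB/s GDDR6 bandwidth (assumed)
ridge_point = peak_fp16_flops / peak_mem_bw
print(f"ridge point ≈ {ridge_point:.0f} FLOP/byte")   # ≈ 202
# Operations whose arithmetic intensity falls below this are memory-bound on such
# a GPU, which is where small-batch decode lands; prefill sits well above it.
```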
We perform an asymptotic analysis of AI for operations during both prefill and decoding. Key takeaway: the aggregate AI of prefill scales with sequence length, which can be > 100k. OTOH, the aggregate AI of decoding only scales with batch size, which is < 32. 🧵 [4/6]
Replies 1 · Reposts 0 · Likes 1
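A quick way to see the scaling claim for the linear (GEMM) part of the model, assuming FP16 weights and activations; attention behaves differently and is omitted here, and the hidden size is just Llama-2-7B's for concreteness:

```python
def gemm_arithmetic_intensity(m, k, n, bytes_per_elem=2):
    # (m x k) @ (k x n): FLOPs vs. bytes moved for inputs, weights, and outputs.
    flops = 2 * m * k * n
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic

d = 4096                                                 # hidden size (Llama-2-7B)
print(gemm_arithmetic_intensity(m=1,    k=d, n=d))       # decode, batch 1:  AI ≈ 1
print(gemm_arithmetic_intensity(m=32,   k=d, n=d))       # decode, batch 32: AI ≈ 32
print(gemm_arithmetic_intensity(m=4096, k=d, n=d))       # prefill, 4k tokens: AI ≈ 1365
# Decode AI tracks the batch size; prefill AI grows with the number of tokens
# processed per pass, putting it far to the compute-bound side of the ridge.
```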
For a finer-grained analysis, we break down the major operations in the Transformer into two categories: linear and attention. The aggregate operations of a Transformer include linear, attention, and non-linear operations (FFN activation functions, softmax, layernorm). 🧵 [3/6]
Replies 1 · Reposts 0 · Likes 1