Aditya Tomar

@adityastomar_

Followers 307 · Following 362 · Media 12 · Statuses 23

undergrad @berkeley_ai | research intern @TogetherCompute | prev-@Livermore_Lab

Joined July 2024
@adityastomar_
Aditya Tomar
3 months
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10–12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
26
90
667
@adityastomar_
Aditya Tomar
7 days
After spending hours debugging a kernel, TIL that TF32 is actually 19 bits. Triton’s dot function will automatically cast inputs to TF32 for devices that have tensor cores (can be disabled by passing allow_tf32=False). The amount of data moved is still equivalent to 32 bits, but
0
0
8
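To make the fix above concrete, here is a minimal Triton matmul sketch (the kernel name and block sizes are mine, and it assumes M, N, K are multiples of the block sizes): tl.dot is told to keep full FP32 precision instead of silently downcasting its inputs to TF32 on tensor-core GPUs.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def matmul_fp32_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                           BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
        pid_m = tl.program_id(0)
        pid_n = tl.program_id(1)
        offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
        acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
        for k in range(0, K, BLOCK_K):
            offs_k = k + tl.arange(0, BLOCK_K)
            a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
            b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
            # allow_tf32=False keeps the full 23-bit FP32 mantissa; by default
            # tl.dot rounds inputs to TF32 (10-bit mantissa) on tensor-core GPUs.
            acc += tl.dot(a, b, allow_tf32=False)
        tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)

    a = torch.randn(256, 256, device="cuda")
    b = torch.randn(256, 256, device="cuda")
    c = torch.empty(256, 256, device="cuda")
    matmul_fp32_kernel[(4, 4)](a, b, c, 256, 256, 256, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)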
@dylan522p
Dylan Patel
3 months
Feel like I'm taking crazy pills. We are just back at step one. Don’t store KV cache, just recompute it.
@adityastomar_
Aditya Tomar
3 months
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10–12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
29
23
541
@_akhaliq
AK
3 months
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
5
19
138
@adityastomar_
Aditya Tomar
3 months
📝 Summary: • KV cache is the real bottleneck for LLM inference • Compute keeps scaling faster than memory • So → rematerialize KV instead of caching it • XQuant: forward-looking, compute-for-memory tradeoff → faster inference + better scaling. • For ultra-low bit
5
0
36
@adityastomar_
Aditya Tomar
3 months
📊 Results: • 2–4 bit quantization → XQuant consistently beats KVQuant & KIVI for the same memory budget • XQuant-CL (2-bit) retains near-FP16 accuracy while pushing memory savings to the extreme. • 12.5x memory reduction with only 0.1 PPL drop and 10x memory reduction with
1
1
29
@adityastomar_
Aditya Tomar
3 months
⚙️Extension to GQA: • Many modern LLMs use Grouped Query Attention (GQA). • Challenge: naive X caching costs 2x more memory. • Our fix → apply SVD decomposition offline, project X into a smaller latent space, then quantize. • Bonus observation: Latent X is even easier to
1
0
23
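A rough sketch of the offline SVD step described in that tweet (the rank, calibration data, and toy quantizer are placeholder choices, not the paper's implementation): fit a low-rank basis for X offline, then cache the quantized latent and lift it back up before rematerializing K and V.

    import torch

    def fit_projection(x_calib, rank):
        # Offline: SVD on calibration activations to get a rank-r basis.
        # x_calib: [num_tokens, d_model]
        _, _, vh = torch.linalg.svd(x_calib, full_matrices=False)
        return vh[:rank].T                       # [d_model, rank]

    def quant_dequant(x, bits=4):
        # Toy symmetric uniform quantizer (stand-in for the paper's scheme).
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        return (x / scale).round().clamp(-qmax - 1, qmax) * scale

    d_model, rank = 512, 128
    x_calib = torch.randn(4096, d_model)         # synthetic calibration activations
    proj = fit_projection(x_calib, rank)

    # Online: cache the quantized latent (rank values per token instead of d_model),
    # then project back before recomputing K and V from it.
    x_new = torch.randn(1, 16, d_model)
    latent = quant_dequant(x_new @ proj)         # [1, 16, rank] -> what gets cached
    x_approx = latent @ proj.T                   # approximate X for KV rematerialization
    # Random data has no low-rank structure, so the error here is large; per the
    # observation above, real latent X is far more compressible.
    print((x_new - x_approx).norm() / x_new.norm())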
@adityastomar_
Aditya Tomar
3 months
⚙️Method: We propose XQuant-CL, which quantizes the deltas between the X embeddings of successive layers. These deltas are extremely easy to quantize, pushing the state-of-the-art in ultra-low bit precision quantization. 🧵 [4/7]
1
1
25
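A toy illustration of the cross-layer-delta idea (the uniform quantizer, shapes, and drift magnitude are invented for this sketch): because X changes little from layer to layer, the deltas have a small dynamic range and tolerate very low-bit quantization.

    import torch

    def quant_dequant(x, bits):
        # Toy symmetric uniform quantizer (placeholder for the paper's scheme).
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        return (x / scale).round().clamp(-qmax - 1, qmax) * scale

    def compress_cross_layer(x_per_layer, bits=2):
        """Store layer 0's X plus low-bit deltas to the previous layer."""
        base = x_per_layer[0]
        deltas = [quant_dequant(x_per_layer[l] - x_per_layer[l - 1], bits)
                  for l in range(1, len(x_per_layer))]
        return base, deltas

    def reconstruct(base, deltas):
        xs, x = [base], base
        for d in deltas:
            x = x + d            # accumulate quantized deltas layer by layer
            xs.append(x)
        return xs

    # Fake embeddings that drift slowly across layers, mimicking the observation.
    layers = [torch.randn(1, 8, 64)]
    for _ in range(11):
        layers.append(layers[-1] + 0.05 * torch.randn(1, 8, 64))
    base, deltas = compress_cross_layer(layers, bits=2)
    recon = reconstruct(base, deltas)
    err = max((a - b).abs().max().item() for a, b in zip(layers, recon))
    print(f"max reconstruction error: {err:.4f}")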
@adityastomar_
Aditya Tomar
3 months
💡Surprising insight: input embeddings X across layers are very similar (unlike KV). Across all models, we found that X embeddings across layers are remarkably similar when compared to the similarities of KV cache activations across layers. We attribute this property to the
3
5
57
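One quick way to eyeball this claim on a Hugging Face checkpoint (a measurement sketch with an arbitrary small model and prompt, not the paper's setup): compare the cosine similarity of hidden states across successive layers.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # small stand-in; the paper studies larger LLMs
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)

    # hidden_states[0] is the embedding output and hidden_states[l] is the output of
    # block l (i.e. the input X to block l+1), so adjacent entries show how similar
    # X is across successive layers.
    hs = out.hidden_states
    for l in range(1, len(hs)):
        sim = torch.nn.functional.cosine_similarity(hs[l - 1], hs[l], dim=-1).mean()
        print(f"layer {l - 1} -> {l}: mean cosine similarity {sim.item():.3f}")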
@adityastomar_
Aditya Tomar
3 months
⚙️Method: Instead of caching Key/Value activations, we cache the layer input X. • X is ½ the size of KV → 2x memory savings • Recompute KV on-the-fly → higher FLOPs but can leverage underutilized compute units (since inference is usually memory-bound). • Result: less memory
2
3
51
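A rough sketch of the rematerialization idea in that tweet (the toy quantizer, class name, and shapes are placeholders, not the paper's implementation): cache quantized X instead of K and V, and recompute K and V with two extra GEMMs when attention needs them.

    import torch

    def fake_quant(x, bits=4):
        # Toy per-tensor uniform quantizer (stand-in for the paper's scheme).
        scale = x.abs().max().clamp(min=1e-8) / (2 ** (bits - 1) - 1)
        return (x / scale).round().clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

    class XCache:
        """Cache X (one tensor per token) instead of K and V (two tensors per token)."""
        def __init__(self):
            self.x = []  # list of [batch, 1, d_model] tensors

        def append(self, x_t, bits=4):
            self.x.append(fake_quant(x_t, bits))

        def rematerialize(self, w_k, w_v):
            # Recompute K, V with two extra GEMMs: trades FLOPs (plentiful at decode
            # time, which is memory-bound) for reduced memory traffic.
            x = torch.cat(self.x, dim=1)          # [batch, seq, d_model]
            return x @ w_k, x @ w_v               # [batch, seq, d_model] each

    # Usage sketch
    d = 64
    w_k, w_v = torch.randn(d, d), torch.randn(d, d)
    cache = XCache()
    for _ in range(8):                            # 8 decode steps
        cache.append(torch.randn(1, 1, d))
    k, v = cache.rematerialize(w_k, w_v)
    print(k.shape, v.shape)                       # torch.Size([1, 8, 64]) each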
@rish2k1
Rishabh Tiwari
4 months
🚨Come check out our poster at #ICML2025! QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache 📍 East Exhibition Hall A-B — #E-2608 🗓️ Poster Session 5 | Thu, Jul 17 | 🕓 11:00 AM –1:30 PM TLDR: Use a quantized version of the same model as its own draft
@rish2k1
Rishabh Tiwari
9 months
🚀 Fast and accurate Speculative Decoding for Long Context? 🔎Problem: 🔹Standard speculative decoding struggles with long-context generation, as current draft models are pretty weak for long context 🔹Finding the right draft model is tricky, as compatibility varies across
0
6
37
@adityastomar_
Aditya Tomar
6 months
Accepted to ICML 2025!
@adityastomar_
Aditya Tomar
9 months
Check out our new paper! Special thanks to @tiwarishabh16, @HaochengXiUCB, and @coleman_hooper1 for their close guidance and willingness to answer my unintelligent questions 😁. Find more details on our analysis of LLM performance bottlenecks in the following threads. 🧵 [1/6]
0
0
3
@adityastomar_
Aditya Tomar
9 months
Solutions for memory-bound decoding: ✅ Quantize weights → speeds up short-context inference ✅ Quantize KV cache → boosts long-context efficiency
0
0
1
@adityastomar_
Aditya Tomar
9 months
We find that prefill is entirely compute-bound & decode is entirely memory-bound! We focus on optimizing decode. Looking at the aggregate figure: for short contexts, loading model weights is the primary bottleneck; for long contexts, loading the KV cache is the primary bottleneck. 🧵 [6/6]
1
0
1
@adityastomar_
Aditya Tomar
9 months
But how do we categorize operations as compute- or memory-bound? We use a roofline model for Llama-2-7B FP16 inference on an NVIDIA A6000 GPU. We define the ridge point as peak compute performance (FLOP/s) / peak memory bandwidth (GB/s) <- same units as AI! 🧵 [5/6]
1
0
1
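Plugging in commonly cited A6000 numbers (approximate spec values, not figures from the paper) gives a feel for where the ridge point lands:

    # Roofline ridge point = peak compute / peak memory bandwidth (FLOPs per byte).
    # Approximate RTX A6000 specs; exact figures depend on clocks and data type.
    peak_fp16_flops = 155e12      # ~155 TFLOP/s FP16 tensor-core throughput (assumed)
    peak_mem_bw     = 768e9       # ~768 GB/s GDDR6 bandwidth (assumed)

    ridge = peak_fp16_flops / peak_mem_bw
    print(f"ridge point ~= {ridge:.0f} FLOPs/byte")
    # Ops with arithmetic intensity below roughly 200 FLOPs/byte are memory-bound
    # on this GPU; ops above it are compute-bound.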
@adityastomar_
Aditya Tomar
9 months
We perform asymptotic analysis of AI for operations during both prefill and decoding. Key Takeaway: Aggregate AI of prefill scales with sequence length, which can be > 100k. OTOH, aggregate AI of decoding scales only with batch size, which is < 32. 🧵 [4/6]
1
0
1
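A back-of-the-envelope version of that takeaway for a single FP16 linear layer (dimensions are illustrative): in prefill the weight load is amortized over the whole sequence, while in decode it is amortized only over the batch.

    def linear_ai(batch, seq, d_in, d_out, bytes_per_el=2):
        """Arithmetic intensity (FLOPs / bytes moved) of one FP16 linear layer."""
        flops = 2 * batch * seq * d_in * d_out                        # GEMM FLOPs
        mops = bytes_per_el * (d_in * d_out                           # weights
                               + batch * seq * (d_in + d_out))        # activations in/out
        return flops / mops

    d = 4096
    print("prefill, seq=4096, batch=1 :", round(linear_ai(1, 4096, d, d)))
    print("decode,  seq=1,    batch=8 :", round(linear_ai(8, 1, d, d)))
    # Prefill AI grows with sequence length (compute-bound); decode AI grows only
    # with batch size and stays tiny (memory-bound), matching the takeaway above.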
@adityastomar_
Aditya Tomar
9 months
For a finer-grained analysis, we break down the major operations in the Transformer into two categories: linear and attention. The aggregate operations of a Transformer include linear, attention, and non-linear operations (FFN activation functions, softmax, layernorm). 🧵 [3/6]
1
0
1
@adityastomar_
Aditya Tomar
9 months
We use arithmetic intensity (AI) to study LLM inference bottlenecks. AI = #FLOPs / #MOPs. Operations with high AI are compute-bound (they benefit from reducing computational complexity) & those with low AI are memory-bound (they benefit from optimizing memory load-store operations). 🧵 [2/6]
1
0
1
@adityastomar_
Aditya Tomar
9 months
Check out our new paper! Special thanks to @tiwarishabh16, @HaochengXiUCB, and @coleman_hooper1 for their close guidance and willingness to answer my unintelligent questions 😁. Find more details on our analysis of LLM performance bottlenecks in the following threads. 🧵 [1/6]
@rish2k1
Rishabh Tiwari
9 months
🚀 Fast and accurate Speculative Decoding for Long Context? 🔎Problem: 🔹Standard speculative decoding struggles with long-context generation, as current draft models are pretty weak for long context 🔹Finding the right draft model is tricky, as compatibility varies across
1
0
6