Aditya Tomar
@adityastomar_
Followers 307 · Following 362 · Media 12 · Statuses 23
undergrad @berkeley_ai | research intern @TogetherCompute | prev-@Livermore_Lab
Joined July 2024
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10–12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
Replies 26 · Reposts 90 · Likes 667
After spending hours debugging a kernel, TIL that TF32 is actually 19 bits. Triton’s dot function will automatically cast inputs to TF32 for devices that have tensor cores (can be disabled by passing allow_tf32=False). The amount of data moved is still equivalent to 32 bits, but
Replies 0 · Reposts 0 · Likes 8
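A minimal sketch of the behavior described in the tweet above, assuming Triton's tl.dot with the allow_tf32 flag (newer Triton releases expose this as input_precision instead); block size and the assumption that M, N, K are multiples of BLOCK are illustrative, not anyone's production kernel:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K, BLOCK: tl.constexpr):
    # Each program computes one BLOCK x BLOCK tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK + tl.arange(0, BLOCK)
    offs_n = pid_n * BLOCK + tl.arange(0, BLOCK)
    acc = tl.zeros((BLOCK, BLOCK), dtype=tl.float32)
    for k in range(0, K, BLOCK):
        offs_k = k + tl.arange(0, BLOCK)
        a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
        b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
        # Without allow_tf32=False, fp32 inputs are silently routed through TF32
        # (only 19 of the 32 bits carry precision) on tensor-core GPUs.
        acc += tl.dot(a, b, allow_tf32=False)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)

# Illustrative launch; assumes M, N, K are multiples of BLOCK.
M = N = K = 256
a = torch.randn(M, K, device="cuda", dtype=torch.float32)
b = torch.randn(K, N, device="cuda", dtype=torch.float32)
c = torch.empty(M, N, device="cuda", dtype=torch.float32)
matmul_kernel[(M // 64, N // 64)](a, b, c, M, N, K, BLOCK=64)
```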
Feel like I'm taking crazy pills. We are just back at step one. Don’t store KV cache, just recompute it.
Replies 29 · Reposts 23 · Likes 541
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Replies 5 · Reposts 19 · Likes 138
📝 Summary: • KV cache is the real bottleneck for LLM inference • Compute keeps scaling faster than memory • So → rematerialize KV instead of caching it • XQuant: forward-looking, compute-for-memory tradeoff → faster inference + better scaling. • For ultra-low bit
Replies 5 · Reposts 0 · Likes 36
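To put a rough number on the "KV cache is the real bottleneck" claim, a back-of-the-envelope estimate using Llama-2-7B's published shape (32 layers, 32 KV heads, head dim 128, standard multi-head attention); these are my illustrative numbers, not figures from the thread:

```python
# Rough FP16 KV-cache size for Llama-2-7B (illustrative, not from the paper).
n_layers, n_kv_heads, head_dim = 32, 32, 128   # Llama-2-7B uses MHA, no GQA
seq_len, bytes_per_elem = 4096, 2              # FP16
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")   # ~2.0 GiB at 4k context
# Scale seq_len to 100k and this grows to roughly 50 GiB per sequence, far beyond
# the ~13 GB of FP16 weights, which is why trading compute for memory pays off.
```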
📊 Results: • 2–4 bit quantization → XQuant consistently beats KVQuant & KIVI for the same memory budget • XQuant-CL (2-bit) retains near-FP16 accuracy while pushing memory savings to the extreme. • 12.5x memory reduction with only 0.1 PPL degradation and 10x memory reduction with
Replies 1 · Reposts 1 · Likes 29
⚙️Extension to GQA: • Many modern LLMs use Grouped Query Attention (GQA). • Challenge: naively caching X costs 2x more memory than the GQA KV cache. • Our fix → apply SVD offline, project X into a smaller latent space, then quantize. • Bonus observation: Latent X is even easier to
Replies 1 · Reposts 0 · Likes 23
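A rough sketch of the GQA fix as described above. The names (W_k, W_v, rank r) are hypothetical, and I am assuming the latent basis comes from an SVD of the concatenated KV projection weights; the paper's exact construction may differ:

```python
import torch

def build_latent_projection(W_k, W_v, r):
    # Offline: SVD of the concatenated KV projection (d_model x 2*d_kv),
    # keeping the top-r left singular vectors as the latent basis.
    W = torch.cat([W_k, W_v], dim=1)                 # (d_model, 2*d_kv)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    P = U[:, :r]                                     # (d_model, r): maps X -> latent
    W_latent = P.T @ W                               # (r, 2*d_kv): maps latent -> K,V
    return P, W_latent

def cache_and_rematerialize(X, P, W_latent, d_kv):
    # Cache (and, in the real method, quantize) the low-rank latent instead of X.
    Z = X @ P                                        # (seq, r): this is what gets cached
    KV = Z @ W_latent                                # (seq, 2*d_kv): rematerialized on the fly
    return KV[:, :d_kv], KV[:, d_kv:]                # K, V
```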
⚙️Method: We propose XQuant-CL, which quantizes the deltas between the X embeddings of successive layers. These deltas are extremely easy to quantize, pushing the state-of-the-art in ultra-low bit precision quantization. 🧵 [4/7]
Replies 1 · Reposts 1 · Likes 25
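A toy illustration of the cross-layer delta idea, with a simple per-tensor symmetric quantizer standing in for whatever scheme the paper actually uses:

```python
import torch

def quantize(t, bits=2):
    # Toy per-tensor symmetric uniform quantizer; stand-in for the real scheme.
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(t / scale), -qmax - 1, qmax), scale

def dequantize(q, scale):
    return q * scale

def compress_cross_layer(xs, bits=2):
    # xs: list of per-layer X tensors. Cache X_0 once, then only quantized deltas
    # against the running reconstruction so error does not accumulate across layers.
    q0 = quantize(xs[0], bits)
    cache, recon = [q0], dequantize(*q0)
    for x in xs[1:]:
        qd = quantize(x - recon, bits)
        cache.append(qd)
        recon = recon + dequantize(*qd)
    return cache

def decompress_cross_layer(cache):
    x = dequantize(*cache[0])
    out = [x]
    for q, s in cache[1:]:
        x = x + dequantize(q, s)      # X_l is rebuilt as X_{l-1} plus its delta
        out.append(x)
    return out
```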
💡Surprising insight: input embeddings X across layers are very similar (unlike KV). Across all models, X embeddings are remarkably similar from one layer to the next compared to the KV cache activations across layers. We attribute this property to the
Replies 3 · Reposts 5 · Likes 57
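The kind of quick check behind an observation like this (my own sketch, using mean cosine similarity between consecutive layers; not necessarily the paper's metric, and per_layer_X / per_layer_K are hypothetical lists of captured activations):

```python
import torch
import torch.nn.functional as F

def mean_consecutive_cosine(per_layer):
    # per_layer: list of (..., d) tensors, one per layer, all the same shape.
    sims = []
    for a, b in zip(per_layer[:-1], per_layer[1:]):
        sims.append(F.cosine_similarity(a.reshape(-1, a.shape[-1]),
                                        b.reshape(-1, b.shape[-1]), dim=-1).mean())
    return torch.stack(sims).mean()

# Compare mean_consecutive_cosine(per_layer_X) against the same metric on
# per-layer K or V activations: X tends to change far less from layer to layer.
```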
⚙️Method: Instead of caching Key/Value activations, we cache the layer input X. • X is ½ the size of KV → 2x memory savings • Recompute KV on-the-fly → higher FLOPs but can leverage underutilized compute units (since inference is usually memory-bound). • Result: less memory
Replies 2 · Reposts 3 · Likes 51
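A minimal single-head sketch of that compute-for-memory trade (my own illustration, with generic W_q/W_k/W_v projections; positional embeddings and the quantization of the cached X are omitted):

```python
import torch

class XCacheAttention(torch.nn.Module):
    # Keep only the layer input X (one tensor) instead of a K and V cache
    # (two tensors), and rematerialize K, V at attention time.
    def __init__(self, d_model, d_kv):
        super().__init__()
        self.W_q = torch.nn.Linear(d_model, d_kv, bias=False)
        self.W_k = torch.nn.Linear(d_model, d_kv, bias=False)
        self.W_v = torch.nn.Linear(d_model, d_kv, bias=False)
        self.x_cache = None            # in XQuant this tensor would be stored quantized

    def step(self, x_t):               # x_t: (batch, 1, d_model), one decode step
        self.x_cache = x_t if self.x_cache is None else torch.cat([self.x_cache, x_t], dim=1)
        q = self.W_q(x_t)
        k = self.W_k(self.x_cache)     # extra GEMM per step: FLOPs traded for memory
        v = self.W_v(self.x_cache)     # extra GEMM per step: FLOPs traded for memory
        scores = q @ k.transpose(-1, -2) / (k.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v   # (batch, 1, d_kv)
```

The extra GEMMs are cheap precisely because decode is memory-bound: the tensor cores are otherwise idle while the GPU waits on cache reads.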
🚨Come check out our poster at #ICML2025! QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache 📍 East Exhibition Hall A-B — #E-2608 🗓️ Poster Session 5 | Thu, Jul 17 | 🕓 11:00 AM –1:30 PM TLDR: Use a quantized version of the same model as its own draft
🚀 Fast and accurate Speculative Decoding for Long Context? 🔎Problem: 🔹Standard speculative decoding struggles with long-context generation, as current draft models are pretty weak for long context 🔹Finding the right draft model is tricky, as compatibility varies across
Replies 0 · Reposts 6 · Likes 37
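A simplified sketch of the self-speculative loop described in the poster tweet: greedy draft-and-verify with a quantized-KV copy of the same model as drafter. `quantized_model` and `full_model` are placeholder callables returning logits, the acceptance rule here is plain prefix matching rather than the paper's, and batch size 1 is assumed:

```python
import torch

@torch.no_grad()
def self_speculative_decode(full_model, quantized_model, tokens, n_new, gamma=4):
    # Sketch only: the cheap (quantized-KV) copy proposes gamma tokens greedily,
    # then the full-precision copy checks them in a single batched forward pass.
    tokens = tokens.clone()
    while n_new > 0:
        draft = tokens
        for _ in range(min(gamma, n_new)):                       # draft phase
            logits = quantized_model(draft)[:, -1]
            draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=1)
        proposed = draft[:, tokens.shape[1]:]
        target = full_model(draft)[:, tokens.shape[1] - 1:-1].argmax(-1)  # verify phase
        match = (proposed == target).int().cumprod(dim=1)
        n_accept = int(match.sum())                               # longest agreeing prefix
        # keep accepted drafts plus the verifier's correction at the first mismatch
        accepted = torch.cat([proposed[:, :n_accept], target[:, n_accept:n_accept + 1]], dim=1)
        tokens = torch.cat([tokens, accepted], dim=1)
        n_new -= accepted.shape[1]
    return tokens
```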
Accepted to ICML 2025!
Check out our new paper! Special thanks to @tiwarishabh16, @HaochengXiUCB, and @coleman_hooper1 for their close guidance and willingness to answer my unintelligent questions 😁. Find more details on our analysis of LLM performance bottlenecks in the following threads. 🧵 [1/6]
Replies 0 · Reposts 0 · Likes 3
How to optimize memory-bound decoding: ✅ Quantize weights → Speeds up short-context inference ✅ Quantize KV cache → Boosts long-context efficiency
Replies 0 · Reposts 0 · Likes 1
We find that prefill is entirely compute-bound & decode is entirely memory-bound! We focus on optimizing decode. Looking at the aggregate figure: for short contexts, loading model weights is the primary bottleneck; for long contexts, loading the KV cache is. 🧵 [6/6]
Replies 1 · Reposts 0 · Likes 1
But how do we categorize operations as compute- or memory-bound? We use a roofline model for Llama-2-7B FP16 inference on an NVIDIA A6000 GPU. We define the ridge point as peak compute performance (FLOP/s) divided by peak memory bandwidth (GB/s), which has the same units as AI! 🧵 [5/6]
Replies 1 · Reposts 0 · Likes 1
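Worked numbers for the ridge-point definition above. The peak figures are approximate RTX A6000 datasheet values I am assuming for illustration, not measurements from the paper:

```python
# Roofline ridge point = peak compute / peak memory bandwidth (FLOP per byte).
peak_fp16_flops = 155e12      # ~155 TFLOP/s dense FP16 tensor-core throughput (assumed)
peak_mem_bw     = 768e9       # ~768 GB/s GDDR6 bandwidth (assumed)
ridge_point = peak_fp16_flops / peak_mem_bw
print(f"ridge point ≈ {ridge_point:.0f} FLOP/byte")   # ≈ 202
# Operations whose arithmetic intensity falls below this are memory-bound on such
# a GPU, which is where small-batch decode lands; prefill sits well above it.
```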
We perform an asymptotic analysis of AI for operations during both prefill and decoding. Key takeaway: the aggregate AI of prefill scales with sequence length, which can be > 100k. OTOH, the aggregate AI of decoding only scales with batch size, which is < 32. 🧵 [4/6]
Replies 1 · Reposts 0 · Likes 1
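A quick way to see the scaling claim for the linear (GEMM) part of the model, assuming FP16 weights and activations; attention behaves differently and is omitted here, and the hidden size is just Llama-2-7B's for concreteness:

```python
def gemm_arithmetic_intensity(m, k, n, bytes_per_elem=2):
    # (m x k) @ (k x n): FLOPs vs. bytes moved for inputs, weights, and outputs.
    flops = 2 * m * k * n
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic

d = 4096                                                 # hidden size (Llama-2-7B)
print(gemm_arithmetic_intensity(m=1,    k=d, n=d))       # decode, batch 1:  AI ≈ 1
print(gemm_arithmetic_intensity(m=32,   k=d, n=d))       # decode, batch 32: AI ≈ 32
print(gemm_arithmetic_intensity(m=4096, k=d, n=d))       # prefill, 4k tokens: AI ≈ 1365
# Decode AI tracks the batch size; prefill AI grows with the number of tokens
# processed per pass, putting it far to the compute-bound side of the ridge.
```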
For a finer-grained analysis, we break down the major operations in the Transformer into two categories: linear and attention. The aggregate operations of a Transformer include linear, attention, and non-linear operations (FFN activation functions, softmax, layernorm). 🧵 [3/6]
Replies 1 · Reposts 0 · Likes 1