Feel like I'm taking crazy pills. We are just back at step one. Don’t store KV cache, just recompute it.
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference!
• 10–12.5x memory savings vs. FP16
• Near-zero accuracy loss
• Beats
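A minimal sketch of the rematerialization idea as announced (names, shapes, and the quantizer below are my own assumptions, not the paper's API): instead of storing K and V per token, store the quantized layer input X and recompute K = XW_K and V = XW_V on the fly at decode time, trading extra matmuls for far less memory traffic.

```python
import numpy as np

d_model = 4096                                                # assumed hidden size (Llama-8B-like, MHA-style)
W_k = np.random.randn(d_model, d_model).astype(np.float16)    # hypothetical K projection weights
W_v = np.random.randn(d_model, d_model).astype(np.float16)    # hypothetical V projection weights

def quantize_int8(x):
    """Symmetric per-token int8 quantization (illustrative, not necessarily XQuant's scheme)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    return np.round(x / scale).astype(np.int8), scale.astype(np.float16)

def dequantize(x_q, scale):
    return x_q.astype(np.float16) * scale

# Prefill: cache the quantized layer input X instead of K and V
# (one vector per token instead of two, and int8 instead of FP16).
X = np.random.randn(512, d_model).astype(np.float16)          # 512 cached tokens
X_q, X_scale = quantize_int8(X)

# Decode: rematerialize K and V from the cached inputs whenever attention needs them,
# spending spare compute (two extra matmuls) to avoid streaming a large FP16 KV cache.
X_hat = dequantize(X_q, X_scale)
K = X_hat @ W_k
V = X_hat @ W_v
```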
Replies
@dylan522p Well it's really not square one. It's much more similar to MLA than to not having a KV cache.
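For comparison, a rough sketch of the MLA-style scheme this reply points at (dimensions and weight names are assumed): MLA also avoids storing full K/V by caching a small per-token latent and up-projecting it at attention time, so "cache something small, recompute K/V" is the shared pattern.

```python
import numpy as np

d_model, d_latent = 4096, 512        # assumed sizes; MLA caches one small latent per token
W_down = np.random.randn(d_model, d_latent).astype(np.float16)   # joint compression of the K/V input
W_uk = np.random.randn(d_latent, d_model).astype(np.float16)     # up-projection to K
W_uv = np.random.randn(d_latent, d_model).astype(np.float16)     # up-projection to V

X = np.random.randn(512, d_model).astype(np.float16)             # 512 tokens of layer input

# Cache only the compressed latent C, not K and V (16x smaller than K+V at these sizes).
C = X @ W_down

# At attention time, rebuild K and V from the cached latent -- cache something small,
# recompute the rest, the same spirit as rematerializing from cached inputs.
K = C @ W_uk
V = C @ W_uv
```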
@dylan522p One of the oldest tensions in the software optimization business. I’m old enough to remember when it made sense to use lookup tables to accelerate squaring an 8-bit difference. (Bonus points if you used a pointer into the middle to save a few instructions)
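For readers who didn't live through that era, a toy Python rendering of the trick (the original would have been C or assembly; the offset below stands in for the "pointer into the middle"):

```python
# Squares of every possible signed 8-bit difference, d in [-255, 255].
SQ = [d * d for d in range(-255, 256)]
MID = 255  # offset into the middle of the table, so a signed difference indexes directly

def squared_diff(a: int, b: int) -> int:
    """Square of (a - b) for 8-bit a, b: one table lookup, no abs() and no multiply."""
    return SQ[(a - b) + MID]

assert squared_diff(200, 17) == 183 * 183
```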
@dylan522p Most software engineers never realize that simple compute is much cheaper than memory access.
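A back-of-the-envelope way to see it, using rough H100 SXM spec-sheet numbers (my figures, not from the thread):

```python
# Rough H100 SXM spec-sheet numbers (assumed).
flops_bf16 = 989e12   # dense BF16 tensor-core throughput, FLOP/s
hbm_bw = 3.35e12      # HBM3 bandwidth, bytes/s

ridge = flops_bf16 / hbm_bw
print(f"~{ridge:.0f} FLOPs needed per byte read to stay compute-bound")   # ~295

# Small-batch decode does on the order of 1 FLOP per byte of FP16 weights/KV it reads,
# far below that ridge: the GPU mostly waits on memory, which is the "underutilized
# compute" that recomputation schemes try to put to work.
```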
@dylan522p They cache the inputs instead, which have cross-layer similarity and compress better than K/V. Not a crazy idea.
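One way the cross-layer similarity could be exploited (my illustration of the claim, not necessarily the paper's exact scheme): since each layer's input is the previous one plus a residual update, you can cache the first layer's input plus small quantized deltas.

```python
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    return np.round(x / scale).astype(np.int8), scale.astype(np.float16)

def dequantize(x_q, scale):
    return x_q.astype(np.float16) * scale

# Toy residual stream: each layer's input is the previous one plus a small update,
# so consecutive layers' inputs are highly similar.
n_layers, tokens, d_model = 4, 256, 1024
rng = np.random.default_rng(0)
X = [rng.standard_normal((tokens, d_model)).astype(np.float16)]
for _ in range(n_layers - 1):
    X.append((X[-1] + 0.05 * rng.standard_normal((tokens, d_model))).astype(np.float16))

# Cache layer 0's input directly, and only the small, easy-to-quantize deltas afterwards.
cache = [quantize_int8(X[0])] + [quantize_int8(X[i] - X[i - 1]) for i in range(1, n_layers)]

# Any layer's input is recovered by summing dequantized deltas before recomputing K/V.
X2_hat = sum(dequantize(q, s) for q, s in cache[:3])
print(float(np.abs(X2_hat - X[2]).max()))   # small reconstruction error
```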
@dylan522p @adityastomar_ Pretty much what DeepSeek FlashMLA is doing. They offer two FlashMLA implementations: compute-optimized and memory-optimized. https://t.co/378yx3sxRT
@dylan522p Great research! However, I have a question: a big advantage of storing the KV cache in settings like GQA and MHA is that KV cache reads can be parallelized across GPUs when using tensor parallelism. Can this technique also parallelize reads across GPUs?
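For context on the premise, a toy illustration of standard tensor parallelism (my own sketch, not the paper's): the KV heads, and hence the KV cache, are sharded across ranks, so each GPU only reads its own slice.

```python
import numpy as np

tp_ranks, kv_heads, head_dim, tokens = 4, 8, 128, 4096   # assumed shapes
K_cache = np.random.randn(tokens, kv_heads, head_dim).astype(np.float16)

# Standard tensor parallelism shards the KV heads, so each rank only holds and reads
# its own slice of the KV cache; the memory-bound reads are spread over all ranks' HBM.
shards = np.split(K_cache, tp_ranks, axis=1)
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard.shape[1]} KV heads, {shard.nbytes / 2**20:.0f} MiB of K")
```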
@dylan522p Recomputing instead of storing the KV cache feels like circling back, but it underscores a bigger truth: inference efficiency is still wide open. The tradeoff between memory savings and compute cycles will define the next generation of LLM infra. At @InferXai we see this tension
@dylan522p They still store as many vectors as storing just K. The 10x is reached because it quantizes more nicely. For a Llama 8B model I calculate the K/V projection matrices take about 16 MB, whereas an H100 has about 50 MB of L2, so you can keep a whole layer's worth while you stream in the layer's input.
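The arithmetic, spelled out under the usual Llama-3-8B GQA shapes (assumed dims: hidden 4096, 8 KV heads of dim 128, FP16 weights):

```python
# Assumed Llama-3-8B GQA shapes: hidden 4096, 8 KV heads of dim 128, FP16 weights.
hidden, kv_heads, head_dim, bytes_per_param = 4096, 8, 128, 2

w_k = hidden * kv_heads * head_dim * bytes_per_param   # K projection weights: ~8 MiB
w_v = hidden * kv_heads * head_dim * bytes_per_param   # V projection weights: ~8 MiB
print(f"{(w_k + w_v) / 2**20:.0f} MiB per layer")      # ~16 MiB, vs ~50 MB of L2 on an H100
```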
@dylan522p Have you seen the recent Cartridges paper? https://t.co/jpCnwkp9Mm That seems more elegant, though it addresses only prefill.
@dylan522p If you can't saturate your compute and would prefer trading off bandwidth for compute, I can see this totally making sense. Similar to checkpointing during training, IIRC?
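The training-side analogue mentioned here, as a minimal PyTorch sketch: activation checkpointing drops intermediate activations in the forward pass and recomputes them during backward, the same store-vs-recompute trade.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
x = torch.randn(8, 1024, requires_grad=True)

# Forward pass without storing the block's intermediate activations; they are
# recomputed during backward, trading extra FLOPs for lower peak memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```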
@dylan522p @grok explain how this impacts potential demand for NVIDIA chips vs. AMD chips and customer advice chips. Does this mean the older generation of chips (Hopper) can be just as useful?
@dylan522p Remat/checkpointing is a common technique when you are running in a system with constrained memory. What’s the issue?