@dylan522p
Dylan Patel
4 months
Feel like I'm taking crazy pills. We are just back at step one. Don’t store KV cache, just recompute it.
@adityastomar_
Aditya Tomar
4 months
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10–12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
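For readers who don't want to open the paper: the idea being argued over is to stop caching K and V per token and instead cache a (quantized) copy of the layer input X, rematerializing K = XW_K and V = XW_V with extra matmuls at decode time. A minimal numpy sketch of that idea, with toy shapes and a toy int8 quantizer; XQuant's actual quantization scheme and kernels are not shown in this thread.

```python
import numpy as np

d_model, d_kv = 64, 16                     # toy sizes, not a real model's
rng = np.random.default_rng(0)
W_k = rng.standard_normal((d_model, d_kv)).astype(np.float32)
W_v = rng.standard_normal((d_model, d_kv)).astype(np.float32)

def quantize(x):
    """Per-row symmetric int8 quantization (purely illustrative)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    return np.rint(x / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x_cache = []                               # one quantized vector per token instead of K and V

def append_token(x_t):
    """Cache the quantized layer input for a newly decoded token."""
    x_cache.append(quantize(x_t[None, :]))

def rematerialize_kv():
    """Recompute K and V for all cached tokens with two extra matmuls."""
    X = np.concatenate([dequantize(q, s) for q, s in x_cache], axis=0)
    return X @ W_k, X @ W_v

for _ in range(5):                         # pretend we decoded 5 tokens
    append_token(rng.standard_normal(d_model).astype(np.float32))
K, V = rematerialize_kv()
print(K.shape, V.shape)                    # (5, 16) (5, 16)
```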

Replies

@tszzl
roon
4 months
@dylan522p and they wrote a whole paper holy shit
@giffmana
Lucas Beyer (bl16)
4 months
@dylan522p Well it's really not square one. It's much more similar to MLA than to not having a kv cache.
@yacineMTB
kache
4 months
@dylan522p it's so simple why didn't anyone else think of it
@ellev3n11
Federico Cassano
4 months
@dylan522p nah this is more like bootleg MLA latent caching
@CUDAHandbook
Nicholas Wilt
4 months
@dylan522p One of the oldest tensions in the software optimization business. I’m old enough to remember when it made sense to use lookup tables to accelerate squaring an 8-bit difference. (Bonus points if you used a pointer into the middle to save a few instructions)
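For anyone too young for the reference: the trick is to precompute every possible squared 8-bit difference once and replace the multiply with a table lookup. A rough Python reconstruction; the original would have been C, with a pointer biased into the middle of the table, which the +255 offset below stands in for.

```python
# Precompute (a - b)^2 for every possible difference of two 8-bit values, so the
# inner loop does a table lookup instead of a multiply.
SQ_DIFF = [d * d for d in range(-255, 256)]    # 511 entries, SQ_DIFF[255] == 0

def squared_diff(a: int, b: int) -> int:
    """Squared difference of two values in 0..255 via the lookup table."""
    return SQ_DIFF[(a - b) + 255]

assert squared_diff(200, 55) == (200 - 55) ** 2
assert squared_diff(3, 250) == (3 - 250) ** 2
```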
@IamEmily2050
Emily
4 months
@dylan522p Why did he not mention DeepSeek's work in the paper 🤔
@AccBalanced
b/acc, context platform engineer
4 months
@dylan522p Yeah, who needs full context or accuracy. That’s privileged, bougie AI
@acsmif
colin
4 months
@dylan522p Dread it. Run from it. The bitter lesson still arrives.
@suwakopro
Suwako — e/acc
4 months
@dylan522p most software engineers never realize that simple compute is much cheaper than memory access
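A rough sense of scale behind that claim, using approximate public spec-sheet numbers for an H100 SXM (around 990 dense BF16 TFLOPS and 3.35 TB/s of HBM bandwidth); exact figures vary by SKU and precision.

```python
# Approximate H100 SXM spec numbers; treat as order-of-magnitude only.
compute_flops_per_s = 990e12       # ~dense BF16 throughput
hbm_bytes_per_s = 3.35e12          # ~HBM3 bandwidth
flops_per_byte = compute_flops_per_s / hbm_bytes_per_s
print(f"~{flops_per_byte:.0f} FLOPs of headroom per byte read from HBM")
# ~300: every byte of KV cache you avoid reading leaves room for hundreds of FLOPs,
# which is the arithmetic behind "recompute it instead of storing it".
```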
@EmphyrioLives
Emphyrio
4 months
@dylan522p they cache the inputs instead, which have cross layer similarity and compress better than KV. Not a crazy idea.
@mosicr
Ranko Mosic
4 months
@dylan522p @adityastomar_ Pretty much what DeepSeek FlashMLA is doing: they offer two FlashMLA implementations, compute-optimized and memory-optimized https://t.co/378yx3sxRT
@shashank_r12
Shashank Rajput
4 months
@dylan522p Great research! However, I have a question: a big advantage of storing the KV cache in settings like GQA and MHA is that the KV cache reads can be parallelized across GPUs when using tensor parallelism. Can this technique also parallelize the reads across GPUs?
@PMV_InferX
Prashanth (Manohar) Velidandi
4 months
@dylan522p Recomputing instead of storing the KV cache feels like circling back, but it underscores a bigger truth: inference efficiency is still wide open. The tradeoff between memory savings and compute cycles will define the next generation of LLM infra. At @InferXai we see this tension
@AxcanNathan
Nathan Axcan
4 months
@dylan522p They still store as many vectors as storing just K. The 10x is reached because it quantizes more nicely. For a Llama 8B model I calculate the KV projection matrices take about 16 MB, whereas an H100 has about 50 MB of L2, so you can keep a whole layer's worth while you stream in the layer's input
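His 16 MB figure roughly checks out under assumed Llama-3-8B-style dimensions (hidden size 4096, GQA with 8 KV heads of head dim 128, FP16 weights); those dimensions are an assumption on my part, not stated in the tweet.

```python
# Assumed Llama-3-8B-style dims: hidden 4096, GQA with 8 KV heads of head dim 128,
# FP16 (2 bytes/param). Size of W_K and W_V together:
hidden, kv_heads, head_dim, bytes_per_param = 4096, 8, 128, 2
kv_width = kv_heads * head_dim                          # 1024
wk_plus_wv_bytes = 2 * hidden * kv_width * bytes_per_param
print(wk_plus_wv_bytes / 2**20, "MiB")                  # 16.0 MiB, vs ~50 MB of L2 on an H100
```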
@global__void
rohan
4 months
@dylan522p they store the inputs; this is basically a middle ground
@hedgedworld
hedgedworld
4 months
@dylan522p memory read cost is higher than tensor compute cost... 🤣
@sandyasm
sandya mannarswamy
4 months
@dylan522p have you seen the recent Cartridges paper https://t.co/jpCnwkp9Mm? That seems more elegant, though it only addresses prefill.
@thegreattrade1
alpha
4 months
@dylan522p ur reading gpt-5 written slop u can tell by the heavy use of →
@sog_on_bird_app
MrDee@SOG🫡
4 months
@dylan522p @grok explain to me like I'm a kid
@KreizJordy
Jordy
4 months
@dylan522p I thought this was common practice at this point…?
@s14joshi
Siddharth Joshi
4 months
@dylan522p If you can't saturate your compute and would prefer trading off bandwidth for compute, I can see this totally making sense. Similar to checkpointing during training, IIRC?
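The analogy here is to activation (gradient) checkpointing, which also drops intermediate tensors and recomputes them during the backward pass to save memory. A minimal PyTorch illustration of that training-side technique; it is unrelated to XQuant's own implementation.

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64)
)
x = torch.randn(8, 64, requires_grad=True)

# Activations inside `layer` are not kept; they are recomputed during backward,
# trading extra forward compute for lower peak memory.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)    # torch.Size([8, 64])
```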
@threefiddie
dogsbeforehumans
4 months
@dylan522p @grok explain how this impacts potential demand for Nvidia chips vs AMD chips vs custom ASICs. Does this mean the older generation of chips (Hoppers) can be just as useful?
@ergoladdie
silverlightwa
4 months
@dylan522p Bro, this was my reaction to this paper lmao
@ytcommenter2015
YT Commenter 2015
4 months
@dylan522p Remat/checkpointing is a common technique when you are running in a system with constrained memory. What’s the issue?