Feel like I'm taking crazy pills. We are just back at step one. Don’t store KV cache, just recompute it.
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference!
• 10–12.5x memory savings vs. FP16
• Near-zero accuracy loss
• Beats
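A minimal sketch of the rematerialization idea as announced (names, shapes, and the quantizer below are my own assumptions, not the paper's API): instead of storing K and V per token, store the quantized layer input X and recompute K = XW_K and V = XW_V on the fly at decode time, trading extra matmuls for far less memory traffic.

```python
import numpy as np

d_model = 4096                                                # assumed hidden size (Llama-8B-like, MHA-style)
W_k = np.random.randn(d_model, d_model).astype(np.float16)    # hypothetical K projection weights
W_v = np.random.randn(d_model, d_model).astype(np.float16)    # hypothetical V projection weights

def quantize_int8(x):
    """Symmetric per-token int8 quantization (illustrative, not necessarily XQuant's scheme)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    return np.round(x / scale).astype(np.int8), scale.astype(np.float16)

def dequantize(x_q, scale):
    return x_q.astype(np.float16) * scale

# Prefill: cache the quantized layer input X instead of K and V
# (one vector per token instead of two, and int8 instead of FP16).
X = np.random.randn(512, d_model).astype(np.float16)          # 512 cached tokens
X_q, X_scale = quantize_int8(X)

# Decode: rematerialize K and V from the cached inputs whenever attention needs them,
# spending spare compute (two extra matmuls) to avoid streaming a large FP16 KV cache.
X_hat = dequantize(X_q, X_scale)
K = X_hat @ W_k
V = X_hat @ W_v
```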
Replies
@dylan522p Well it's really not square one. It's much more similar to MLA than to not having a KV cache.
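For comparison, a rough sketch of the MLA-style scheme this reply points at (dimensions and weight names are assumed): MLA also avoids storing full K/V by caching a small per-token latent and up-projecting it at attention time, so "cache something small, recompute K/V" is the shared pattern.

```python
import numpy as np

d_model, d_latent = 4096, 512        # assumed sizes; MLA caches one small latent per token
W_down = np.random.randn(d_model, d_latent).astype(np.float16)   # joint compression of the K/V input
W_uk = np.random.randn(d_latent, d_model).astype(np.float16)     # up-projection to K
W_uv = np.random.randn(d_latent, d_model).astype(np.float16)     # up-projection to V

X = np.random.randn(512, d_model).astype(np.float16)             # 512 tokens of layer input

# Cache only the compressed latent C, not K and V (16x smaller than K+V at these sizes).
C = X @ W_down

# At attention time, rebuild K and V from the cached latent -- cache something small,
# recompute the rest, the same spirit as rematerializing from cached inputs.
K = C @ W_uk
V = C @ W_uv
```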
@dylan522p One of the oldest tensions in the software optimization business. I’m old enough to remember when it made sense to use lookup tables to accelerate squaring an 8-bit difference. (Bonus points if you used a pointer into the middle to save a few instructions)
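For readers who didn't live through that era, a toy Python rendering of the trick (the original would have been C or assembly; the offset below stands in for the "pointer into the middle"):

```python
# Squares of every possible signed 8-bit difference, d in [-255, 255].
SQ = [d * d for d in range(-255, 256)]
MID = 255  # offset into the middle of the table, so a signed difference indexes directly

def squared_diff(a: int, b: int) -> int:
    """Square of (a - b) for 8-bit a, b: one table lookup, no abs() and no multiply."""
    return SQ[(a - b) + MID]

assert squared_diff(200, 17) == 183 * 183
```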
@dylan522p Most software engineers never realize that simple compute is much cheaper than memory access.
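A back-of-the-envelope way to see it, using rough H100 SXM spec-sheet numbers (my figures, not from the thread):

```python
# Rough H100 SXM spec-sheet numbers (assumed).
flops_bf16 = 989e12   # dense BF16 tensor-core throughput, FLOP/s
hbm_bw = 3.35e12      # HBM3 bandwidth, bytes/s

ridge = flops_bf16 / hbm_bw
print(f"~{ridge:.0f} FLOPs needed per byte read to stay compute-bound")   # ~295

# Small-batch decode does on the order of 1 FLOP per byte of FP16 weights/KV it reads,
# far below that ridge: the GPU mostly waits on memory, which is the "underutilized
# compute" that recomputation schemes try to put to work.
```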
@dylan522p They cache the inputs instead, which have cross-layer similarity and compress better than K/V. Not a crazy idea.
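One way the cross-layer similarity could be exploited (my illustration of the claim, not necessarily the paper's exact scheme): since each layer's input is the previous one plus a residual update, you can cache the first layer's input plus small quantized deltas.

```python
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    return np.round(x / scale).astype(np.int8), scale.astype(np.float16)

def dequantize(x_q, scale):
    return x_q.astype(np.float16) * scale

# Toy residual stream: each layer's input is the previous one plus a small update,
# so consecutive layers' inputs are highly similar.
n_layers, tokens, d_model = 4, 256, 1024
rng = np.random.default_rng(0)
X = [rng.standard_normal((tokens, d_model)).astype(np.float16)]
for _ in range(n_layers - 1):
    X.append((X[-1] + 0.05 * rng.standard_normal((tokens, d_model))).astype(np.float16))

# Cache layer 0's input directly, and only the small, easy-to-quantize deltas afterwards.
cache = [quantize_int8(X[0])] + [quantize_int8(X[i] - X[i - 1]) for i in range(1, n_layers)]

# Any layer's input is recovered by summing dequantized deltas before recomputing K/V.
X2_hat = sum(dequantize(q, s) for q, s in cache[:3])
print(float(np.abs(X2_hat - X[2]).max()))   # small reconstruction error
```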
@dylan522p @adityastomar_ Pretty much what DeepSeek FlashMLA is doing. They offer two FlashMLA implementations: compute-optimized and memory-optimized. https://t.co/378yx3sxRT
@dylan522p Great research! However, I have a question: a big advantage of storing the KV cache in settings like GQA and MHA is that KV cache reads can be parallelized across GPUs when using tensor parallelism. Can this technique also parallelize reads across GPUs?
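For context on the premise, a toy illustration of standard tensor parallelism (my own sketch, not the paper's): the KV heads, and hence the KV cache, are sharded across ranks, so each GPU only reads its own slice.

```python
import numpy as np

tp_ranks, kv_heads, head_dim, tokens = 4, 8, 128, 4096   # assumed shapes
K_cache = np.random.randn(tokens, kv_heads, head_dim).astype(np.float16)

# Standard tensor parallelism shards the KV heads, so each rank only holds and reads
# its own slice of the KV cache; the memory-bound reads are spread over all ranks' HBM.
shards = np.split(K_cache, tp_ranks, axis=1)
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard.shape[1]} KV heads, {shard.nbytes / 2**20:.0f} MiB of K")
```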
@dylan522p Recomputing instead of storing the KV cache feels like circling back, but it underscores a bigger truth: inference efficiency is still wide open. The tradeoff between memory savings and compute cycles will define the next generation of LLM infra. At @InferXai we see this tension
@dylan522p They still store as many vectors as storing just K. The 10x is reached because it quantizes more nicely. For a Llama 8B model I calculate the K/V projection matrices take about 16 MB, whereas an H100 has about 50 MB of L2, so you can keep a whole layer's worth while you stream in the layer's input.
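The arithmetic, spelled out under the usual Llama-3-8B GQA shapes (assumed dims: hidden 4096, 8 KV heads of dim 128, FP16 weights):

```python
# Assumed Llama-3-8B GQA shapes: hidden 4096, 8 KV heads of dim 128, FP16 weights.
hidden, kv_heads, head_dim, bytes_per_param = 4096, 8, 128, 2

w_k = hidden * kv_heads * head_dim * bytes_per_param   # K projection weights: ~8 MiB
w_v = hidden * kv_heads * head_dim * bytes_per_param   # V projection weights: ~8 MiB
print(f"{(w_k + w_v) / 2**20:.0f} MiB per layer")      # ~16 MiB, vs ~50 MB of L2 on an H100
```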
@dylan522p Have you seen the recent Cartridges paper? https://t.co/jpCnwkp9Mm That seems more elegant, though it addresses only prefill.
@dylan522p If you can't saturate your compute and would prefer trading off bandwidth for compute, I can see this totally making sense. Similar to checkpointing during training, IIRC?
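The training-side analogue mentioned here, as a minimal PyTorch sketch: activation checkpointing drops intermediate activations in the forward pass and recomputes them during backward, the same store-vs-recompute trade.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
x = torch.randn(8, 1024, requires_grad=True)

# Forward pass without storing the block's intermediate activations; they are
# recomputed during backward, trading extra FLOPs for lower peak memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```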
@dylan522p @grok explain how this impacts potential demand for NVIDIA chips vs. AMD chips and customer advice chips. Does this mean the older generation of chips (Hopper) can be just as useful?
@dylan522p Remat/checkpointing is a common technique when you are running in a system with constrained memory. What’s the issue?