Explore tweets tagged as #KVcache
@_m0se_
OpenMOSE
25 days
RWKV-Qwen3-30B Hybrid runs on just 4GB VRAM 32k ctx😀 8bit kvcache cpu-moe-offload 39/48
0
0
4
@karminski3
karminski-牙医
5 months
LMCache has shipped another new optimization that can make RAG 3x faster. The approach is extremely simple but effective: in a RAG system, the data retrieved from a database or an API gets reused within a task, yet the GPU cache is flushed and recomputed every single time. So this new optimization, called CacheBlend, caches all of the KV Cache, even if the content behind the KVCache appears in
6
14
110
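A minimal sketch of the chunk-reuse idea described in the tweet above; `compute_kv` and the dict-based store are hypothetical stand-ins, and the real CacheBlend additionally repairs cross-chunk attention, which this sketch ignores:

```python
import hashlib

# Hypothetical chunk-level KV reuse for a RAG pipeline. compute_kv() stands
# in for a real prefill call that returns per-layer K/V tensors for a chunk.
kv_store = {}  # content hash -> cached KV tensors

def chunk_key(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def kv_for_chunk(text: str, compute_kv):
    """Return cached KV for a retrieved chunk, computing it only once."""
    key = chunk_key(text)
    if key not in kv_store:
        kv_store[key] = compute_kv(text)   # expensive prefill, done once
    return kv_store[key]                   # later requests reuse the result

def prefill(chunks, question, compute_kv):
    # Every previously seen chunk hits the cache; only new text is prefilled.
    reused = [kv_for_chunk(c, compute_kv) for c in chunks]
    fresh = compute_kv(question)           # the query itself is always new
    return reused, fresh
```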
@_m0se_
OpenMOSE
1 month
My RWKV journey continues. hxa079+ Qwen3-32B Hybrid RWKV: 18 tps on AMD W7900 (48GB), extremely low KVCache (12.5%), only 8 layers of full attention, 100k passkey, gsm8k 87.11%, 8x more multi-batches than the original model. Targeting RTX 4090 (24GB) with ctx 65k!
1
2
6
@sarbjeetjohal
Sarbjeet Johal
1 month
Growing size of #KVCache (Key-Value Cache) has become a challenge in the memory space. After morning sessions at #DF25, I am now at #OCPSummit25 in San Jose attending a session on Heterogeneous Memory Opportunity with #AgenticAI and Memory Centric Computing by Jinin So -
3
2
8
@hikettei
:hikettei🌙
7 months
I think humanity should build a new intermediate representation to replace ONNX. If someone just created a standard that can also express KVCache (or control flow) and sampling in particular, porting LLMs to edge devices would get dramatically easier!!!!!!
1
12
73
@PliopsLtd
Pliops
15 days
⚡️ GenAI is entering its context-driven era, and Pliops is leading the way. We are making RAG, Vector Search, and KV Cache Offload not just possible, but practical at scale. Find out more 👉 https://t.co/8opccTHXUj #GenAI #RAG #VectorSearch #KVCache #LightningAI #AIInfrastructure
0
2
2
@airshaped
rt machine 🇺🇦
1 year
with the new koboldcpp 4-bit kvcache quantization i can actually run llama 3 8b on my 4gb gpu! (granted, only in IQ3_S 3.66bpw)
1
1
5
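Back-of-envelope numbers for why 4-bit cache quantization matters on a 4 GB card, assuming Llama-3-8B-like shapes (32 layers, 8 grouped KV heads, head dim 128); this is just the standard per-token formula, not koboldcpp's actual accounting:

```python
# KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem
def kv_cache_bytes(ctx, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2.0):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem

ctx = 8192
print(kv_cache_bytes(ctx, bytes_per_elem=2.0) / 2**30)  # fp16  -> 1.0  GiB
print(kv_cache_bytes(ctx, bytes_per_elem=0.5) / 2**30)  # 4-bit -> 0.25 GiB
```

On a 4 GB GPU that difference is roughly the headroom between fitting the model plus context and not.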
@XingangP
Xingang Pan
3 months
Introducing 𝗦𝗧𝗿𝗲𝗮𝗺𝟯𝗥, a new 3D geometric foundation model for efficient 3D reconstruction from streaming input. Similar to LLMs, STream3R uses causal attention during training and KVCache at inference. No need to worry about post-alignment or reconstructing from scratch.
@GROS17121524
Yushi LAN
3 months
🔥Streaming-based 3D/4D Foundation Model🔥 We present STream3R, which reformulates dense 3D/4D reconstruction into a sequential registration task with **causal attention**. - Projects: https://t.co/zrLlvxJ0FJ - Code: https://t.co/ONYaJDrjhF - Model:
5
58
320
@YouJiacheng
You Jiacheng
2 months
BTW, They released a deep dive on FP8 KVCache of main MLA. https://t.co/1jse6UM6rS so, actually ≈1/5 compared to FP8 dense MLA.
@YouJiacheng
You Jiacheng
2 months
As expected, NSA is not compatible with MLA, so DeepSeek chose another method: use a smaller (d=128) attention (w/o value) as the indexer. Asymptotic cost ratio = 128/576. In addition, indexer uses FP8 while main MLA uses 16-bit, so = 64/576 = 1/9.
1
34
244
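For readers following the arithmetic: a sketch of how the ratios above fall out, on the assumption that 576 is MLA's cached width per token (512 latent dims plus 64 RoPE dims) and 128 is the indexer's key width:

```python
# Relative per-token read cost of the indexer vs. the main MLA path
# (assumed widths: MLA cache 512 + 64 = 576, indexer keys 128).
indexer_fp8 = 128 * 1         # indexer keys in FP8 (1 byte per element)
mla_16bit   = 576 * 2         # main MLA cache in 16-bit (2 bytes per element)
mla_fp8     = 576 * 1         # main MLA cache in FP8

print(indexer_fp8 / mla_16bit)  # 64/576  = 1/9
print(indexer_fp8 / mla_fp8)    # 128/576 ≈ 0.22, the "≈1/5" above
```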
@karminski3
karminski-牙医
7 months
Looks like a lot of people have been pulling all-nighters on this: KTransformer now supports running Qwen3! A 4th-gen Xeon Platinum + 4090 running Qwen3-235B-A22B reaches 13.8 token/s for a single request, and a combined 24.4 token/s with 4 requests in parallel. Link: https://t.co/6BNAm6DSMK
3
27
138
@ayakaneko
Neko · 絢香猫 [email protected]
5 months
Hot take of the day: personally, I like Dynamo's orchestration, llm-d's ModelService, AIBrix's gateway architecture, and vLLM's KVCache / PagedAttention. Someone should find a god-tier team to knead them all together.
@ayakaneko
Neko · 絢香猫 [email protected]
5 months
I've dissected the code of every component of llm-d https://t.co/le0dJPecf7, the newest and hottest distributed LLM inference infra. Hard to put into words; it feels like far too much is missing... It also leans heavily on that new Inference Extension of the Gateway API for routing (without doing it completely), and in places its architecture and resource design may be less thorough than Bento / Dynamo.
0
2
25
@karminski3
karminski-牙医
6 months
Learned a new little trick: when chatting with an AI, especially across repeated turns in one long-lived session and when compute is very tight, keep dynamic content out of the prompt whenever possible. Things like the current timestamp will invalidate the KVCache, because they make the token sequence different every time and force the model to recompute the entire sequence's
5
10
102
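A small illustration of the tip above, with a hypothetical message-building helper: keep volatile values such as timestamps out of the shared prefix so the serving stack's cached prefix keeps matching from turn to turn.

```python
import datetime

SYSTEM = "You are a helpful assistant."  # static -> identical tokens every turn

def build_messages(history, user_msg, include_timestamp=False):
    """Assemble a chat request; the timestamp branch shows what to avoid."""
    system = SYSTEM
    if include_timestamp:
        # Anti-pattern: a per-request timestamp changes the very first tokens,
        # so no cached prefix can be reused and the whole context is
        # re-prefilled on every turn.
        system = f"{SYSTEM} Current time: {datetime.datetime.now().isoformat()}"
    return [{"role": "system", "content": system}, *history,
            {"role": "user", "content": user_msg}]
```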
@t0kenl1mit
vincent
3 months
Looking at some examples in torch and tinygrad, think I have the KVCache and MultiHeadAttention in tinygrad written out. Now to work on the transformer layers and stitch it all together
1
0
4
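For reference, a toy single-head decode-time KV cache in PyTorch; a sketch of the general pattern, not the tinygrad code the tweet refers to:

```python
import math
import torch

class KVCache:
    """Toy single-head cache: append this step's K/V, attend over all past."""
    def __init__(self):
        self.k = None  # (1, t, d)
        self.v = None  # (1, t, d)

    def update(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        return self.k, self.v

def decode_step(q, k_new, v_new, cache):
    # q, k_new, v_new: (1, 1, d) for the single new token
    k, v = cache.update(k_new, v_new)
    att = (q @ k.transpose(1, 2)) / math.sqrt(q.shape[-1])  # (1, 1, t)
    return att.softmax(dim=-1) @ v                          # (1, 1, d)

cache, d = KVCache(), 64
for _ in range(4):  # each decode step reuses all previously cached K/V
    out = decode_step(torch.randn(1, 1, d), torch.randn(1, 1, d),
                      torch.randn(1, 1, d), cache)
print(out.shape, cache.k.shape)  # (1, 1, 64) and (1, 4, 64)
```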
@9hills
九原客
2 years
SGLang, the new inference framework from lmsys. The backend mainly introduces a new KVCache mechanism to improve speed, while the frontend introduces a mechanism similar to Microsoft's Guidance to better control LLM output.
@arena
lmarena.ai
2 years
We are thrilled to introduce SGLang, our next-generation interface and runtime for LLM inference! It greatly improves the execution and programming efficiency of complex LLM programs by co-designing the front-end language and back-end runtime. On the backend, we propose
6
12
51
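The backend mechanism alluded to above is SGLang's RadixAttention prefix cache; here is a much-simplified sketch of the prefix-matching idea, using a plain dict over token prefixes instead of a radix tree:

```python
class PrefixKVCache:
    """Toy stand-in for a radix-tree prefix cache keyed on token sequences."""
    def __init__(self):
        self.store = {}  # tuple of token ids -> cached KV handle

    def insert(self, tokens, kv_handle):
        self.store[tuple(tokens)] = kv_handle

    def longest_prefix(self, tokens):
        """Return (matched_len, kv_handle) for the longest cached prefix."""
        for n in range(len(tokens), 0, -1):
            hit = self.store.get(tuple(tokens[:n]))
            if hit is not None:
                return n, hit
        return 0, None

cache = PrefixKVCache()
cache.insert([1, 2, 3, 4], "kv-of-shared-system-prompt")
matched, kv = cache.longest_prefix([1, 2, 3, 4, 9, 9])
print(matched, kv)  # 4: only the two new tokens still need prefill
```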
@pshishodiaa
Prashant Shishodia
6 months
vllm is extremely focused on KVCache rather than improving latencies. like partitioning a model across multiple GPUs can also improve latencies significantly but they just give away all the new memory to KVCache. is there any other framework that treats latency as a first class
0
0
2
@CVEnew
CVE
3 days
CVE-2025-60455 Unsafe Deserialization vulnerability in Modular Max Serve before 25.6, specifically when the "--experimental-enable-kvcache-agent" feature is used allowing attackers …
0
0
0
@hjc4869
David Huang
8 months
Current llama.cpp performance/efficiency on the 9070 XT is shown in the figure (FA enabled, q8_0 kvcache quantization). ROCm does not officially support RDNA4 yet; you need the dev-branch rocWMMA/hipBLASLt and modifications to llama.cpp. Overall efficiency is clearly better than RDNA3 already, but there is still room to improve. Given that standalone hipBLASLt benchmarks are not great either at the moment, let me put it diplomatically: the future looks promising.
1
1
28
@lmcache
LMCache Lab
7 months
🚀𝗠𝗼𝗼𝗻𝗰𝗮𝗸𝗲 X 𝗟𝗠𝗖𝗮𝗰𝗵𝗲: KV Cache-centric Language Model Serving 🚀 We're thrilled to announce a strategic collaboration between LMCache and Mooncake to pioneer a KVCache-centric Large Language Model (LLM) serving system! This partnership is set to redefine
1
10
22
@AntoineGrondin
Antoine
5 months
👀 Large Model Inference on Heterogeneous Chips using vLLM, by @kerthcet (@daocloud_io)
Challenges for LLM inference with heterogeneous clusters:
- variety of chips
- high-end GPUs sold out
- perf/cost tradeoffs
The AI gateway needs to be aware of:
- LLM-wise routing, KVCache, LoRA
2
3
7