Explore tweets tagged as #KVcache
RWKV-Qwen3-30B Hybrid runs on just 4GB VRAM at 32k ctx 😀 (8-bit kvcache, CPU MoE offload 39/48)
LMCache has shipped another new optimization that speeds up RAG by 3x. The approach is extremely simple but effective: in a RAG system, the data retrieved from a database or an API gets reused within a task, yet the GPU cache is flushed and recomputed from scratch every time. So this new optimization, called CacheBlend, keeps the whole KV Cache around, so that even when the content behind the KVCache appears in
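A minimal sketch of the idea described above, not the actual LMCache / CacheBlend API; `chunk_kv_store`, `compute_kv`, and `blend` are hypothetical stand-ins. The point is simply that KV tensors are cached per retrieved chunk, keyed by the chunk's content, so a chunk that shows up again in a later request skips prefill:

```python
import hashlib

# Hypothetical per-chunk KV store for a RAG pipeline (illustration only).
# compute_kv() stands in for a real prefill call that returns KV tensors
# for one retrieved chunk; blend() stands in for CacheBlend-style fixing
# of the few cross-chunk attention positions.
chunk_kv_store = {}

def chunk_key(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def kv_for_chunk(text: str, compute_kv):
    key = chunk_key(text)
    if key not in chunk_kv_store:            # first request: pay prefill once
        chunk_kv_store[key] = compute_kv(text)
    return chunk_kv_store[key]               # later requests: reuse cached KV

def build_context_kv(retrieved_chunks, compute_kv, blend):
    # Chunks reused across tasks hit the cache instead of being recomputed.
    return blend([kv_for_chunk(c, compute_kv) for c in retrieved_chunks])
```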
My RWKV journey continues. hxa079 + Qwen3-32B Hybrid RWKV: 18 tps on AMD W7900 (48GB), extremely low KVCache (12.5%), only 8 layers of full attention, 100k passkey, GSM8K 87.11%, 8x more multi-batches than the original model. Targeting RTX 4090 (24GB) with ctx 65k!
The growing size of the #KVCache (Key-Value Cache) has become a challenge in the memory space. After the morning sessions at #DF25, I am now at #OCPSummit25 in San Jose, attending a session on Heterogeneous Memory Opportunity with #AgenticAI and Memory Centric Computing by Jinin So -
I think humanity should build a new intermediate representation to replace ONNX. If someone just made a spec that can also express KVCache (or control flow) and sampling, porting LLMs to edge devices would get dramatically easier!!!!!!
⚡️ GenAI is entering its context-driven era, and Pliops is leading the way. We are making RAG, Vector Search, and KV Cache Offload not just possible, but practical at scale. Find out more 👉 https://t.co/8opccTHXUj
#GenAI #RAG #VectorSearch #KVCache #LightningAI #AIInfrastructure
With the new koboldcpp 4-bit kvcache quantization I can actually run Llama 3 8B on my 4GB GPU! (granted, only in IQ3_S 3.66bpw)
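Rough back-of-the-envelope numbers for why KV cache quantization matters here, assuming Llama 3 8B's published config (32 layers, 8 KV heads via GQA, head dim 128) and ignoring the small overhead of quantization scales:

```python
# Approximate KV cache footprint for Llama 3 8B at different precisions.
layers, kv_heads, head_dim = 32, 8, 128      # assumed model config (GQA)
ctx = 8192                                   # tokens kept in the cache

elems_per_token = 2 * layers * kv_heads * head_dim   # K and V
for name, bytes_per_elem in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gib = elems_per_token * ctx * bytes_per_elem / 2**30
    print(f"{name}: ~{gib:.2f} GiB for {ctx} tokens")
# fp16 needs ~1.0 GiB; 4-bit cuts that to ~0.25 GiB, which is the
# difference-maker when the weights already fill most of a 4GB card.
```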
Introducing 𝗦𝗧𝗿𝗲𝗮𝗺𝟯𝗥, a new 3D geometric foundation model for efficient 3D reconstruction from streaming input. Similar to LLMs, STream3R uses causal attention during training and a KVCache at inference. No need to worry about post-alignment or reconstructing from scratch.
🔥Streaming-based 3D/4D Foundation Model🔥 We present STream3R, which reformulates dense 3D/4D reconstruction into a sequential registration task with **causal attention**. - Projects: https://t.co/zrLlvxJ0FJ - Code: https://t.co/ONYaJDrjhF - Model:
BTW, they released a deep dive on the FP8 KVCache of the main MLA. https://t.co/1jse6UM6rS So it's actually ≈1/5 compared to FP8 dense MLA.
As expected, NSA is not compatible with MLA, so DeepSeek chose another method: use a smaller (d=128) attention (w/o value) as the indexer. Asymptotic cost ratio = 128/576. In addition, the indexer uses FP8 while the main MLA uses 16-bit, so the ratio = 64/576 = 1/9.
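A quick check of the arithmetic in these two posts, under my reading that 576 is MLA's compressed KV width (512 latent + 64 RoPE dims) and that the FP8 vs. 16-bit difference is weighted as bytes per element:

```python
# Cost ratios of the d=128 indexer vs. the d=576 main MLA path,
# weighting each side by its assumed bytes per element.
d_indexer, d_mla = 128, 576        # 576 = 512 latent + 64 RoPE dims

fp8_vs_16bit = (d_indexer * 1) / (d_mla * 2)   # FP8 indexer, 16-bit main MLA
fp8_vs_fp8   = (d_indexer * 1) / (d_mla * 1)   # both sides FP8

print(fp8_vs_16bit)   # 0.111... -> the "64/576 = 1/9" figure
print(fp8_vs_fp8)     # 0.222... -> ~1/4.5, rounded to the "≈1/5" figure
```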
Looks like plenty of people have been burning the midnight oil: KTransformers now supports running Qwen3! On a 4th-gen Xeon Platinum + a 4090, Qwen3-235B-A22B reaches 13.8 token/s for a single request, and 24.4 token/s in total with 4 parallel requests. Link: https://t.co/6BNAm6DSMK
Today's hot take: speaking purely for myself, I like Dynamo's orchestration, llm-d's ModelService, AIBrix's gateway architecture, and vLLM's KVCache / PagedAttention. Somebody should find a god-tier team to knead them all together.
I've torn through the code of every component of llm-d https://t.co/le0dJPecf7, the newest, hottest distributed LLM inference infra. Hard to put into words; it feels like far too much is still missing... It also leans heavily on the Gateway API's new Inference Extension for routing (yet doesn't go all the way), and in some places its architecture and resource design may be less well thought out than Bento / Dynamo.
Picked up a new little trick: when talking to an AI, especially across many turns that reuse the same context in one long-lived session, and your compute is very tight, avoid dynamic content in the prompt whenever you can, such as the current timestamp, because it invalidates the KVCache. It makes the generated token sequence differ every time, forcing the model to recompute the entire sequence's
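A toy illustration of the effect, with hypothetical prompts and a whitespace "tokenizer" standing in for real prefix/KV caching in a serving stack: any token that differs near the start of the prompt ends the shared prefix, and everything after it has to be prefilled again.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    # Length of the longest common token prefix of two prompts;
    # only this part of the KV cache can be reused between requests.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prompt(system: str, question: str) -> list[str]:
    return (system + "\n" + question).split()    # toy whitespace "tokenizer"

static_system = "You are a helpful assistant."
t1, t2 = "2024-05-01T10:00:00", "2024-05-01T10:00:07"   # same session, 7s apart

# Timestamp in the system prompt: the prefix diverges almost immediately.
p1 = prompt(f"Now: {t1}. {static_system}", "Q1")
p2 = prompt(f"Now: {t2}. {static_system}", "Q2")
print(shared_prefix_len(p1, p2))   # 1 token shared -> cache mostly invalidated

# Static system prompt: every system token is shared and its KV is reusable.
p3 = prompt(static_system, "Q1")
p4 = prompt(static_system, "Q2")
print(shared_prefix_len(p3, p4))   # all system-prompt tokens shared -> cache hit
```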
Looking at some examples in torch and tinygrad, think I have the KVCache and MultiHeadAttention in tinygrad written out. Now to work on the transformer layers and stitch it all together
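Not the author's tinygrad code, but a minimal PyTorch sketch of the same two pieces, assuming a preallocated-buffer cache layout (one common convention among several):

```python
import torch
import torch.nn.functional as F

class KVCache:
    """Preallocated per-layer cache of past keys/values for autoregressive decoding."""
    def __init__(self, batch, n_heads, max_len, head_dim, dtype=torch.float32):
        self.k = torch.zeros(batch, n_heads, max_len, head_dim, dtype=dtype)
        self.v = torch.zeros(batch, n_heads, max_len, head_dim, dtype=dtype)
        self.len = 0

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, n_heads, new_tokens, head_dim)
        t = k_new.shape[2]
        self.k[:, :, self.len:self.len + t] = k_new
        self.v[:, :, self.len:self.len + t] = v_new
        self.len += t
        return self.k[:, :, :self.len], self.v[:, :, :self.len]

def attend(q, cache, k_new, v_new):
    # Single-token decode step: q attends over every position cached so far.
    # (A multi-token prefill would additionally need a causal mask.)
    k, v = cache.update(k_new, v_new)
    return F.scaled_dot_product_attention(q, k, v)

# Usage: one new token per call, tensors shaped (batch, heads, 1, head_dim).
cache = KVCache(batch=1, n_heads=8, max_len=256, head_dim=64)
q = k = v = torch.randn(1, 8, 1, 64)
out = attend(q, cache, k, v)          # shape: (1, 8, 1, 64)
```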
SGLang, the new inference framework from lmsys. On the backend it mainly introduces a new KVCache mechanism to improve speed; on the frontend it adds a mechanism similar to Microsoft's Guidance for finer control over LLM output.
We are thrilled to introduce SGLang, our next-generation interface and runtime for LLM inference! It greatly improves the execution and programming efficiency of complex LLM programs by co-designing the front-end language and back-end runtime. On the backend, we propose
vLLM is extremely focused on KVCache rather than on improving latencies. For example, partitioning a model across multiple GPUs can also improve latencies significantly, but they just give all the new memory away to KVCache. Is there any other framework that treats latency as a first class
CVE-2025-60455: Unsafe Deserialization vulnerability in Modular Max Serve before 25.6, specifically when the "--experimental-enable-kvcache-agent" feature is used, allowing attackers …
Current llama.cpp performance/efficiency on the 9070XT is shown in the chart (with FA enabled and q8_0 kvcache quantization). ROCm doesn't officially support RDNA4 yet, so you need the dev branches of rocWMMA/hipBLASLt plus modifications to llama.cpp. You can see that while overall efficiency is clearly better than RDNA3, there's still room for improvement. Given that standalone hipBLASLt benchmarks also aren't great right now, let me put it diplomatically: the future looks promising.
🚀 𝗠𝗼𝗼𝗻𝗰𝗮𝗸𝗲 X 𝗟𝗠𝗖𝗮𝗰𝗵𝗲: KV Cache-centric Language Model Serving 🚀 We're thrilled to announce a strategic collaboration between LMCache and Mooncake to pioneer a KVCache-centric Large Language Model (LLM) serving system! This partnership is set to redefine
👀 Large Model Inference on Heterogeneous chips using vLLM, by @kerthcet (@daocloud_io). Challenges for LLM inference with heterogeneous clusters: a variety of chips, high-end GPUs sold out, perf/cost tradeoffs. The AI gateway needs to be aware of LLM-wise routing, KVCache, LoRA