Jiaming Tang
@jmtang42
Followers: 481 · Following: 665 · Media: 14 · Statuses: 26
Ph.D. student @MIT. I am interested in MLSys & Algo.
Cambridge, MA
Joined January 2023
On RTX5090, VLASH can reduce the control latency from ~530 ms to ~30 ms, achieving up to a 17× control latency reduction compared to synchronous inference. On RTX4090 and RTX5070, we can achieve ~15× and ~9× latency reduction, respectively. This low-latency control is essential
We also add a simple trick to make robots move even faster: “quantize” robot actions for speed. VLAs are trained on very fine-grained teleop data, so they output tiny action steps that are often more precise than necessary. VLASH groups every q fine-grained actions into one
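The grouping step might look like the minimal sketch below, assuming the VLA outputs additive delta actions; the function name and shapes are illustrative, not the released VLASH code.

```python
import numpy as np

def group_actions(action_chunk: np.ndarray, q: int) -> np.ndarray:
    """Merge every q consecutive delta actions into one coarser action.

    action_chunk: (T, D) array of fine-grained delta commands predicted by the VLA.
    Returns an array of shape (ceil(T / q), D) where each row is the sum of q
    consecutive deltas, so the trajectory endpoint is unchanged but the robot
    executes 1/q as many steps. (Assumes additive delta actions; absolute poses
    would instead keep the last action of each group.)
    """
    T, D = action_chunk.shape
    pad = (-T) % q                              # pad so T is divisible by q
    padded = np.vstack([action_chunk, np.zeros((pad, D))])
    return padded.reshape(-1, q, D).sum(axis=1)

# Example: 16 fine-grained steps grouped with q=4 -> 4 coarser steps
chunk = np.random.randn(16, 7) * 0.01           # e.g., 7-DoF delta actions
coarse = group_actions(chunk, q=4)
print(coarse.shape)                             # (4, 7)
```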
Our solution: future-state-aware async inference. During asynchronous inference, we know which actions will run during the delay, so we can roll the robot state forward to its future state and feed the current observation and future state into the VLA, instead of the stale state
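A minimal sketch of this idea, assuming additive delta actions and a placeholder policy/robot interface (`vla.predict`, `robot.get_state`, etc. are illustrative, not the actual VLASH API):

```python
import threading
import queue

def rollout_state(state, queued_actions):
    """Roll the robot state forward through the actions that will already have
    executed by the time the next prediction arrives (assumes additive deltas)."""
    for a in queued_actions:
        state = state + a
    return state

def run_async(vla, robot):
    """Control thread drains the queue while an inference thread refills it,
    conditioning the VLA on the rolled-forward (future) state."""
    action_queue = queue.Queue()

    def inference_loop():
        while True:
            obs = robot.get_observation()           # fresh camera observation
            queued = list(action_queue.queue)       # snapshot of not-yet-executed actions
            future_state = rollout_state(robot.get_state(), queued)
            # Feed the *future* state instead of the stale current state.
            for action in vla.predict(obs, future_state):
                action_queue.put(action)

    threading.Thread(target=inference_loop, daemon=True).start()
    while True:
        robot.apply(action_queue.get())             # executes while inference runs
```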
Asynchronous inference is a promising way to address these problems: Instead of waiting for the actions to finish, the robot can execute the actions while simultaneously performing inference for the next actions. However, naive asynchronous inference doesn’t fix it: the model
Problem: Today’s VLAs typically run with synchronous inference:
1️⃣ run the model to generate actions,
2️⃣ execute the actions,
3️⃣ then think again.
This causes:
• slow reactions: the robot can only respond after finishing action execution + another model inference
• jittery,
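For reference, the synchronous loop above boils down to something like this sketch (`vla` and `robot` are placeholder interfaces), which is why reaction latency is roughly chunk-execution time plus inference time:

```python
def synchronous_loop(vla, robot):
    """Synchronous inference: think, execute the whole chunk, then think again."""
    while True:
        obs, state = robot.get_observation(), robot.get_state()
        actions = vla.predict(obs, state)   # 1) generate an action chunk (robot is idle)
        for a in actions:                   # 2) execute every action in the chunk
            robot.apply(a)
        # 3) only now can the model think again, so the robot reacts only after
        #    t_execute_chunk + t_inference has elapsed.
```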
Even large VLAs can play ping-pong in real time! 🏓⚡️
In practice, VLAs struggle with fast, dynamic tasks:
• slow reactions, jittery actions.
• demos often shown at 5-10× speed to look “smooth”.
We introduce VLASH:
• future-state-aware asynchronous inference with >30Hz
Introducing DuoAttention: Our new framework slashes both memory and latency for long-context LLMs without sacrificing performance!
By applying a full KV cache only to critical heads, we achieve:
⚡ 2.55× memory reduction
⚡ 2.18× decoding speedup
⚡ 3.3M tokens on a single A100 GPU
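A rough sketch of the per-head idea, assuming non-critical heads keep only a few initial “sink” tokens plus a recent window (the criticality mask and the sink/recent sizes are illustrative, not the released configuration):

```python
import torch

def prune_kv_per_head(keys, values, is_critical, sink=16, recent=256):
    """Per-head KV-cache pruning in the spirit of DuoAttention (sketch only).

    keys, values: (num_heads, seq_len, head_dim) cached tensors for one layer.
    is_critical:  (num_heads,) bool mask; critical ("retrieval") heads keep the
                  full cache, the others keep only `sink` initial tokens plus the
                  `recent` most recent tokens. Returns per-head lists because the
                  heads now hold caches of different lengths.
    """
    pruned_k, pruned_v = [], []
    seq_len = keys.shape[1]
    for h in range(keys.shape[0]):
        if is_critical[h] or seq_len <= sink + recent:
            idx = torch.arange(seq_len)                      # full cache
        else:
            idx = torch.cat([torch.arange(sink),             # attention sinks
                             torch.arange(seq_len - recent, seq_len)])  # recent window
        pruned_k.append(keys[h, idx])
        pruned_v.append(values[h, idx])
    return pruned_k, pruned_v
```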
📄Paper: https://t.co/flH9rhwO2D
💻Code: https://t.co/yEx1UreXOf
🌍Website:
This research was done during my summer internship at @MIT, with amazing collaborators including @ylzhao_dreamer, Kan Zhu, @Guangxuan_Xiao, @bariskasikci and my advisor @songhan_mit!
Efficiency Comparison: For all sequence lengths, Quest significantly outperforms FlashInfer. Increasing the sequence length only slightly changes Quest's latency. Quest speeds up e2e inference by 2.23× at a sequence length of 30k with a 2k token budget and 4-bit AWQ weight quantization.
Accuracy on LongBench: We evaluate LongChat-7b-v1.5-32k across a wide range of long-context datasets. Quest with a budget of 2k tokens achieves performance comparable to a full KV cache, while other baselines still exhibit a notable gap from full-cache performance.
Accuracy on Needle-in-a-Haystack: (i) 10k-length passkey retrieval test on LongChat-7b-v1.5-32k. (ii) 100k-length passkey retrieval test on Yarn-Llama-2-7b-128k. Quest achieves nearly perfect accuracy with KV cache budgets of 64 and 1024 tokens, respectively (about 1% of the total length).
However, to select pages efficiently, we need an approximate attention score that follows this insight. We found that an upper bound on the attention weights within a page can be used to approximate the highest attention score in that page.
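A minimal sketch of this query-aware upper bound, assuming each page stores the element-wise min and max of its keys (shapes and function names are illustrative; a real implementation would fuse this into custom kernels):

```python
import torch

def page_criticality(q, key_min, key_max):
    """Query-aware upper bound on the attention score within each KV page.

    q:       (head_dim,) current query vector for one head.
    key_min: (num_pages, head_dim) element-wise minimum of the keys in each page.
    key_max: (num_pages, head_dim) element-wise maximum of the keys in each page.
    For each channel, q_i * k_i is maximized by key_max when q_i >= 0 and by
    key_min when q_i < 0, so summing the channel-wise maxima upper-bounds the
    dot product q·k over every token in the page.
    """
    upper = torch.maximum(q * key_max, q * key_min)   # (num_pages, head_dim)
    return upper.sum(dim=-1)                          # (num_pages,)

def select_pages(q, key_min, key_max, top_k):
    """Pick the top_k pages with the highest upper-bound scores; only these
    pages' KV entries are then loaded for exact attention."""
    scores = page_criticality(q, key_min, key_max)
    return torch.topk(scores, k=min(top_k, scores.numel())).indices
```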
Key Idea: Preserve the entire KV cache and accelerate inference by reducing memory movement: only a constant number K of selected pages is loaded during attention. To avoid missing critical tokens, we want to select the pages that contain the tokens with the highest attention weights.
A token's criticality is dynamic and highly dependent on the current query Q. For example, the token 'B' is critical to the current query 'is' and has a high attention score, but for queries before 'is', 'B' is not critical and has a low attention score.
This slowdown is primarily caused by loading a large KV cache during attention. Previous methods have tried to compress the KV cache by deciding which tokens to discard based on historical information. However, this can discard tokens that later turn out to be important for future queries.
As the demand for long-context LLMs increases, models with context windows of up to 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows.
🚀Excited to introduce Quest: an efficient long-context LLM inference framework, accepted by ICML 2024!🌟
⚡️Quest leverages query-aware sparsity to achieve up to 2.23× e2e speedup for long-context LLM inference.
📄Paper: https://t.co/flH9rhwgd5
💻Code: https://t.co/HOJSzjiQX1
Finally! Love to see this work out! It’s an honor to work with this team! OmniH2O is a framework for robust and scalable humanoid teleoperation and autonomy! Check out our website:
Introduce OmniH2O, a learning-based system for whole-body humanoid teleop and autonomy:
🦾Robust loco-mani policy
🦸Universal teleop interface: VR, verbal, RGB
🧠Autonomy via @chatgpt4o or imitation
🔗Release the first whole-body humanoid dataset https://t.co/XRxdXIVbKv
Congrats to the AWQ authors on the MLSys acceptance. AWQ has been adopted by NVIDIA "Chat with RTX" and TensorRT-LLM for quantizing and accelerating LLMs.