Jiaming Tang Profile
Jiaming Tang

@jmtang42

Followers: 481 · Following: 665 · Media: 14 · Statuses: 26

Ph.D. student @MIT. I am interested in MLSys & Algo.

Cambridge, MA
Joined January 2023
@jmtang42
Jiaming Tang
6 days
On RTX5090, VLASH can reduce the control latency from ~530 ms to ~30 ms, achieving up to a 17× control latency reduction compared to synchronous inference. On RTX4090 and RTX5070, we can achieve ~15× and ~9× latency reduction, respectively. This low-latency control is essential
1
0
19
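A quick sanity check of the RTX 5090 numbers above (a minimal sketch; only the reported ~530 ms → ~30 ms pair from the tweet is used):

```python
# Reported RTX 5090 control latency: ~530 ms synchronous vs ~30 ms with VLASH.
sync_latency_ms = 530
vlash_latency_ms = 30
print(f"~{sync_latency_ms / vlash_latency_ms:.1f}x reduction")  # ~17.7x, i.e. "up to 17x"
```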
@jmtang42
Jiaming Tang
6 days
We also add a simple trick to make robots move even faster: “quantize” robot actions for speed. VLAs are trained on very fine-grained teleop data, so they output tiny action steps that are often more precise than necessary. VLASH groups every q fine-grained actions into one
1
1
12
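A minimal sketch of the grouping trick from the tweet above, assuming the VLA emits a chunk of per-step delta actions and that merging q consecutive steps amounts to summing their deltas (the merge rule is an assumption, since the tweet is truncated):

```python
import numpy as np

def quantize_actions(action_chunk: np.ndarray, q: int) -> np.ndarray:
    """Group every q consecutive fine-grained delta actions into one
    coarser action by summing them (assumed merge rule).

    action_chunk: (T, action_dim) array of per-step deltas from the VLA.
    Returns an array of shape (ceil(T / q), action_dim).
    """
    T, dim = action_chunk.shape
    pad = (-T) % q                                # pad so T is a multiple of q
    padded = np.vstack([action_chunk, np.zeros((pad, dim))])
    return padded.reshape(-1, q, dim).sum(axis=1)

# Example: 8 tiny steps merged into 4 larger steps with q = 2.
chunk = np.full((8, 7), 0.01)                     # 7-DoF delta actions
coarse = quantize_actions(chunk, q=2)
print(coarse.shape)                               # (4, 7)
```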
@jmtang42
Jiaming Tang
6 days
Our solution: future-state-aware async inference. During asynchronous inference, we know which actions will run during the delay, so we can roll the robot state forward to its future state and feed the current observation and future state into the VLA, instead of the stale state
2
1
13
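A minimal sketch of the future-state-aware idea described above, assuming delta-style actions so that rolling the state forward is just accumulating the actions already queued to run during the inference delay (the state-transition rule and the `vla.predict` interface are illustrative assumptions):

```python
import numpy as np

def roll_state_forward(current_state: np.ndarray,
                       queued_actions: list) -> np.ndarray:
    """Predict the robot state at the moment the new chunk will start,
    by applying the actions already queued to run during the inference
    delay (assumes simple additive delta actions)."""
    future_state = current_state.copy()
    for action in queued_actions:
        future_state += action                    # illustrative transition model
    return future_state

def async_step(vla, observation, current_state, queued_actions):
    """One future-state-aware inference step: condition the VLA on the
    predicted *future* state rather than the stale current one."""
    future_state = roll_state_forward(current_state, queued_actions)
    return vla.predict(observation, future_state)  # hypothetical VLA API
```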
@jmtang42
Jiaming Tang
6 days
Asynchronous inference is a promising way to address these problems: Instead of waiting for the actions to finish, the robot can execute the actions while simultaneously performing inference for the next actions. However, naive asynchronous inference doesn’t fix it: the model
1
0
10
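A minimal sketch of the asynchronous pattern described above: the robot executes the current action chunk while the model generates the next one in a background thread. The `policy`/`robot` interfaces are illustrative assumptions; note that the observation fed to the model is already stale by the time its output is executed, which is the problem the thread goes on to address:

```python
from concurrent.futures import ThreadPoolExecutor

def asynchronous_control_loop(policy, robot, steps=100):
    """Naive asynchronous inference: execute the current action chunk
    while the next chunk is generated in the background. The model only
    sees the observation/state captured *before* the current chunk runs,
    i.e. a stale state by the time its output is executed."""
    pool = ThreadPoolExecutor(max_workers=1)
    obs, state = robot.observe()                          # hypothetical API
    pending = pool.submit(policy.generate, obs, state)    # hypothetical API
    for _ in range(steps):
        actions = pending.result()                        # wait for next chunk
        obs, state = robot.observe()                      # captured now...
        pending = pool.submit(policy.generate, obs, state)
        for a in actions:                                 # ...but executed while
            robot.execute(a)                              # the model is thinking
    pool.shutdown()
```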
@jmtang42
Jiaming Tang
6 days
Problem: Today’s VLAs typically run with synchronous inference: 1️⃣ run the model to generate actions, 2️⃣ execute the actions, 3️⃣ then think again. This causes: • slow reactions: the robot can only respond after finishing action execution + another model inference • jittery,
1
0
10
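For contrast, a minimal sketch of the synchronous pattern described above, with the same illustrative `policy`/`robot` interfaces; the robot can only react after finishing an action chunk plus one more full inference:

```python
def synchronous_control_loop(policy, robot, steps=100):
    """Synchronous inference: the robot stalls during every model call,
    and can only react after finishing the current chunk *plus* another
    full inference, which is what drives the control latency up."""
    for _ in range(steps):
        obs, state = robot.observe()           # hypothetical robot API
        actions = policy.generate(obs, state)  # robot is idle while the VLA runs
        for a in actions:
            robot.execute(a)                   # model is idle while executing
```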
@jmtang42
Jiaming Tang
6 days
Even large VLAs can play ping-pong in real time! 🏓⚡️ In practice, VLAs struggle with fast, dynamic tasks: • slow reactions, jittery actions. • demos often shown at 5-10× speed to look “smooth”. We introduce VLASH: • future-state-aware asynchronous inference with >30Hz
16
82
428
@Guangxuan_Xiao
Guangxuan Xiao
1 year
Introducing DuoAttention: Our new framework slashes both memory and latency for long-context LLMs without sacrificing performance! By applying full KV cache only to critical heads, we achieve: ⚡ 2.55x memory reduction ⚡ 2.18x decoding speedup ⚡ 3.3M tokens on a single A100 GPU
6
63
294
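A minimal sketch of the head-level cache split described in the tweet above, assuming the non-critical ("streaming") heads keep only a few initial attention-sink tokens plus a recent window, as described in the DuoAttention paper; the sizes and tensor layout here are illustrative:

```python
import torch

def prune_kv_per_head(keys, values, is_retrieval_head,
                      n_sink=4, recent_window=256):
    """Per-head KV-cache policy: retrieval (critical) heads keep the full
    cache; streaming heads keep only a few initial 'sink' tokens plus a
    recent window. keys/values: (n_heads, seq_len, head_dim); assumes
    seq_len > n_sink + recent_window."""
    pruned_k, pruned_v = [], []
    for h in range(keys.shape[0]):
        if is_retrieval_head[h]:
            k, v = keys[h], values[h]                                  # full cache
        else:
            k = torch.cat([keys[h][:n_sink], keys[h][-recent_window:]])
            v = torch.cat([values[h][:n_sink], values[h][-recent_window:]])
        pruned_k.append(k)
        pruned_v.append(v)
    return pruned_k, pruned_v                                          # ragged per head
```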
@jmtang42
Jiaming Tang
1 year
This research was done during my summer internship at @MIT, with amazing collaborators including @ylzhao_dreamer, Kan Zhu, @Guangxuan_Xiao, @bariskasikci and my advisor @songhan_mit!
1
0
2
@jmtang42
Jiaming Tang
1 year
Efficiency Comparison: For all sequence lengths, Quest significantly outperforms FlashInfer. Increasing the sequence length only slightly changes Quest's latency. Quest speeds up e2e inference by 2.23× at a 30k sequence length with a 2k token budget and 4-bit AWQ weight quantization.
1
0
1
@jmtang42
Jiaming Tang
1 year
Accuracy of LongBench: We evaluate LongChat-7b-v1.5-32k across a wide range of long-context datasets. Quest with a budget of 2k tokens can achieve comparable performance with a full KV cache, while other baselines still exhibit a notable gap from full cache performance.
1
0
2
@jmtang42
Jiaming Tang
1 year
Accuracy of Needle-in-a-Haystack: (i) 10k-length passkey retrieval test on LongChat-7b-v1.5-32k. (ii) 100k-length passkey retrieval test on Yarn-Llama-2-7b-128k. Quest achieves nearly perfect accuracy with KV cache budgets of 64 and 1024 tokens, respectively (about 1% of the total length).
1
0
2
@jmtang42
Jiaming Tang
1 year
However, to select pages efficiently, we need an approximate attention score that follows this insight. We found that the upper bound of the attention weights within a page can be used to approximate the highest attention in that page.
1
0
1
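A minimal sketch of that upper-bound score, assuming each page stores the element-wise min and max of its keys so the largest possible query-key dot product within the page can be bounded channel by channel (variable names and the single-head layout are illustrative):

```python
import torch

def page_upper_bound_scores(query, key_pages):
    """Approximate per-page criticality: an upper bound on the highest
    attention logit any key in the page could get for this query.

    query:     (head_dim,)
    key_pages: list of (page_size, head_dim) key tensors.
    """
    scores = []
    for keys in key_pages:
        k_min = keys.min(dim=0).values        # per-channel minimum key value
        k_max = keys.max(dim=0).values        # per-channel maximum key value
        # For each channel, the largest possible q_i * k_i is attained at
        # either the min or max key value, depending on the sign of q_i.
        bound = torch.maximum(query * k_min, query * k_max).sum()
        scores.append(bound)
    return torch.stack(scores)                # one upper bound per page
```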
@jmtang42
Jiaming Tang
1 year
Key Idea: Preserve the entire KV cache and accelerate inference by reducing memory movement to a constant number of selected pages (top-K). To avoid missing critical tokens, we want to select the pages that contain the tokens with the highest attention weights.
1
0
2
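A minimal sketch of the page selection on top of a per-page score such as the upper bound sketched above; K, the single-head layout, and the names are illustrative. The point is that only the selected pages are loaded during attention while the full KV cache stays resident:

```python
import torch

def select_top_k_pages(page_scores: torch.Tensor, k: int) -> torch.Tensor:
    """Indices of the K pages with the highest (approximate) attention
    scores; only these pages are read during attention."""
    k = min(k, page_scores.numel())
    return torch.topk(page_scores, k).indices

def sparse_attention_over_pages(query, key_pages, value_pages, page_scores, k):
    """Attention restricted to the selected pages (single head, no masking),
    illustrating the reduced KV memory movement."""
    idx = select_top_k_pages(page_scores, k)
    keys = torch.cat([key_pages[i] for i in idx])       # (K * page_size, d)
    values = torch.cat([value_pages[i] for i in idx])
    attn = torch.softmax(keys @ query / keys.shape[-1] ** 0.5, dim=0)
    return attn @ values
```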
@jmtang42
Jiaming Tang
1 year
Token criticality is dynamic and highly dependent on the current query Q. For example, the token 'B' is critical for the current query 'is' and receives a high attention score, but for the queries before 'is', 'B' is not critical and receives a low attention score.
1
0
1
@jmtang42
Jiaming Tang
1 year
This slowdown is primarily caused by loading the large KV cache during attention. Previous methods have tried to compress the KV cache by deciding which tokens to discard based on historical information. However, this can discard tokens that turn out to be important for future queries.
1
0
2
@jmtang42
Jiaming Tang
1 year
As the demand for long-context LLMs increases, models with context windows of up to 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows.
1
0
2
@jmtang42
Jiaming Tang
1 year
🚀Excited to introduce Quest: an efficient long-context LLM inference framework, accepted by ICML 2024!🌟 ⚡️Quest leverages query-aware sparsity to achieve up to 2.23× e2e speedup for long-context LLM inference. 📄Paper: https://t.co/flH9rhwgd5 💻Code: https://t.co/HOJSzjiQX1
4
21
77
@Xialin_He
Xialin He
2 years
Finally! Love to see this work out! It’s an honor to work with this team! OmniH2O is a framework for robust and scalable humanoid teleoperation and autonomy! Check out our website:
@TairanHe99
Tairan He ✈️ NeurIPS 2025
2 years
Introduce OmniH2O, a learning-based system for whole-body humanoid teleop and autonomy: 🦾Robust loco-mani policy 🦸Universal teleop interface: VR, verbal, RGB 🧠Autonomy via @chatgpt4o or imitation 🔗Release the first whole-body humanoid dataset https://t.co/XRxdXIVbKv
1
6
25
@songhan_mit
Song Han
2 years
Congrats to AWQ authors for MLSys acceptance. AWQ is adopted by NVIDIA "Chat with RTX" and TensorRT-LLM for quantizing and accelerating LLMs.
4
15
118