Jiaming Tang
@jmtang42
Followers: 481 · Following: 665 · Media: 14 · Statuses: 26
Ph.D. student @MIT. I am interested in MLSys & Algo.
Cambridge, MA
Joined January 2023
On RTX5090, VLASH can reduce the control latency from ~530 ms to ~30 ms, achieving up to a 17× control latency reduction compared to synchronous inference. On RTX4090 and RTX5070, we can achieve ~15× and ~9× latency reduction, respectively. This low-latency control is essential
We also add a simple trick to make robots move even faster: “quantize” robot actions for speed. VLAs are trained on very fine-grained teleop data, so they output tiny action steps that are often more precise than necessary. VLASH groups every q fine-grained actions into one
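The grouping step might look like the minimal sketch below, assuming the VLA outputs additive delta actions; the function name and shapes are illustrative, not the released VLASH code.

```python
import numpy as np

def group_actions(action_chunk: np.ndarray, q: int) -> np.ndarray:
    """Merge every q consecutive delta actions into one coarser action.

    action_chunk: (T, D) array of fine-grained delta commands predicted by the VLA.
    Returns an array of shape (ceil(T / q), D) where each row is the sum of q
    consecutive deltas, so the trajectory endpoint is unchanged but the robot
    executes 1/q as many steps. (Assumes additive delta actions; absolute poses
    would instead keep the last action of each group.)
    """
    T, D = action_chunk.shape
    pad = (-T) % q                              # pad so T is divisible by q
    padded = np.vstack([action_chunk, np.zeros((pad, D))])
    return padded.reshape(-1, q, D).sum(axis=1)

# Example: 16 fine-grained steps grouped with q=4 -> 4 coarser steps
chunk = np.random.randn(16, 7) * 0.01           # e.g., 7-DoF delta actions
coarse = group_actions(chunk, q=4)
print(coarse.shape)                             # (4, 7)
```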
Our solution: future-state-aware async inference. During asynchronous inference, we know which actions will run during the delay, so we can roll the robot state forward to its future state and feed the current observation and future state into the VLA, instead of the stale state
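A minimal sketch of this idea, assuming additive delta actions and a placeholder policy/robot interface (`vla.predict`, `robot.get_state`, etc. are illustrative, not the actual VLASH API):

```python
import threading
import queue

def rollout_state(state, queued_actions):
    """Roll the robot state forward through the actions that will already have
    executed by the time the next prediction arrives (assumes additive deltas)."""
    for a in queued_actions:
        state = state + a
    return state

def run_async(vla, robot):
    """Control thread drains the queue while an inference thread refills it,
    conditioning the VLA on the rolled-forward (future) state."""
    action_queue = queue.Queue()

    def inference_loop():
        while True:
            obs = robot.get_observation()           # fresh camera observation
            queued = list(action_queue.queue)       # snapshot of not-yet-executed actions
            future_state = rollout_state(robot.get_state(), queued)
            # Feed the *future* state instead of the stale current state.
            for action in vla.predict(obs, future_state):
                action_queue.put(action)

    threading.Thread(target=inference_loop, daemon=True).start()
    while True:
        robot.apply(action_queue.get())             # executes while inference runs
```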
Asynchronous inference is a promising way to address these problems: Instead of waiting for the actions to finish, the robot can execute the actions while simultaneously performing inference for the next actions. However, naive asynchronous inference doesn’t fix it: the model
Problem: Today’s VLAs typically run with synchronous inference:
1️⃣ run the model to generate actions,
2️⃣ execute the actions,
3️⃣ then think again.
This causes:
• slow reactions: the robot can only respond after finishing action execution + another model inference
• jittery,
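For reference, the synchronous loop above boils down to something like this sketch (`vla` and `robot` are placeholder interfaces), which is why reaction latency is roughly chunk-execution time plus inference time:

```python
def synchronous_loop(vla, robot):
    """Synchronous inference: think, execute the whole chunk, then think again."""
    while True:
        obs, state = robot.get_observation(), robot.get_state()
        actions = vla.predict(obs, state)   # 1) generate an action chunk (robot is idle)
        for a in actions:                   # 2) execute every action in the chunk
            robot.apply(a)
        # 3) only now can the model think again, so the robot reacts only after
        #    t_execute_chunk + t_inference has elapsed.
```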
Even large VLAs can play ping-pong in real time! 🏓⚡️
In practice, VLAs struggle with fast, dynamic tasks:
• slow reactions, jittery actions.
• demos often shown at 5-10× speed to look “smooth”.
We introduce VLASH:
• future-state-aware asynchronous inference with >30Hz
Introducing DuoAttention: Our new framework slashes both memory and latency for long-context LLMs without sacrificing performance!
By applying a full KV cache only to critical heads, we achieve:
⚡ 2.55× memory reduction
⚡ 2.18× decoding speedup
⚡ 3.3M tokens on a single A100 GPU
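A rough sketch of the per-head idea, assuming non-critical heads keep only a few initial “sink” tokens plus a recent window (the criticality mask and the sink/recent sizes are illustrative, not the released configuration):

```python
import torch

def prune_kv_per_head(keys, values, is_critical, sink=16, recent=256):
    """Per-head KV-cache pruning in the spirit of DuoAttention (sketch only).

    keys, values: (num_heads, seq_len, head_dim) cached tensors for one layer.
    is_critical:  (num_heads,) bool mask; critical ("retrieval") heads keep the
                  full cache, the others keep only `sink` initial tokens plus the
                  `recent` most recent tokens. Returns per-head lists because the
                  heads now hold caches of different lengths.
    """
    pruned_k, pruned_v = [], []
    seq_len = keys.shape[1]
    for h in range(keys.shape[0]):
        if is_critical[h] or seq_len <= sink + recent:
            idx = torch.arange(seq_len)                      # full cache
        else:
            idx = torch.cat([torch.arange(sink),             # attention sinks
                             torch.arange(seq_len - recent, seq_len)])  # recent window
        pruned_k.append(keys[h, idx])
        pruned_v.append(values[h, idx])
    return pruned_k, pruned_v
```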
📄Paper: https://t.co/flH9rhwO2D
💻Code: https://t.co/yEx1UreXOf
🌍Website:
This research was done during my summer internship at @MIT, with amazing collaborators including @ylzhao_dreamer, Kan Zhu, @Guangxuan_Xiao, @bariskasikci and my advisor @songhan_mit!
Efficiency Comparison: For all sequence lengths, Quest significantly outperforms FlashInfer. Increasing the sequence length only slightly changes Quest's latency. Quest speeds up e2e inference by 2.23× at a sequence length of 30k with a 2k token budget and 4-bit AWQ weight quantization.
Accuracy on LongBench: We evaluate LongChat-7b-v1.5-32k across a wide range of long-context datasets. Quest with a budget of 2k tokens achieves performance comparable to a full KV cache, while other baselines still exhibit a notable gap from full-cache performance.
Accuracy on Needle-in-a-Haystack: (i) 10k-length passkey retrieval test on LongChat-7b-v1.5-32k. (ii) 100k-length passkey retrieval test on Yarn-Llama-2-7b-128k. Quest achieves nearly perfect accuracy with KV cache budgets of 64 and 1024 tokens, respectively (about 1% of the total length).
However, to select pages efficiently, we need an approximate attention score that follows this insight. We found that an upper bound on the attention weights within a page can be used to approximate the highest attention score in that page.
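A minimal sketch of this query-aware upper bound, assuming each page stores the element-wise min and max of its keys (shapes and function names are illustrative; a real implementation would fuse this into custom kernels):

```python
import torch

def page_criticality(q, key_min, key_max):
    """Query-aware upper bound on the attention score within each KV page.

    q:       (head_dim,) current query vector for one head.
    key_min: (num_pages, head_dim) element-wise minimum of the keys in each page.
    key_max: (num_pages, head_dim) element-wise maximum of the keys in each page.
    For each channel, q_i * k_i is maximized by key_max when q_i >= 0 and by
    key_min when q_i < 0, so summing the channel-wise maxima upper-bounds the
    dot product q·k over every token in the page.
    """
    upper = torch.maximum(q * key_max, q * key_min)   # (num_pages, head_dim)
    return upper.sum(dim=-1)                          # (num_pages,)

def select_pages(q, key_min, key_max, top_k):
    """Pick the top_k pages with the highest upper-bound scores; only these
    pages' KV entries are then loaded for exact attention."""
    scores = page_criticality(q, key_min, key_max)
    return torch.topk(scores, k=min(top_k, scores.numel())).indices
```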
Key Idea: Preserve the entire KV cache and accelerate inference by reducing memory movement: only a constant number K of selected pages is loaded during attention. To avoid missing critical tokens, we want to select the pages that contain the tokens with the highest attention weights.
A token's criticality is dynamic and highly dependent on the current query Q. For example, the token 'B' is critical to the current query 'is' and has a high attention score, but for queries before 'is', 'B' is not critical and has a low attention score.
This slowdown is primarily caused by loading a large KV cache during attention. Previous methods have tried to compress the KV cache by deciding which tokens to discard based on historical information. However, this can discard tokens that later turn out to be important for future queries.
As the demand for long-context LLMs increases, models with context windows of up to 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows.
🚀Excited to introduce Quest: an efficient long-context LLM inference framework, accepted by ICML 2024!🌟
⚡️Quest leverages query-aware sparsity to achieve up to 2.23× e2e speedup for long-context LLM inference.
📄Paper: https://t.co/flH9rhwgd5
💻Code: https://t.co/HOJSzjiQX1
Finally! Love to see this work out! It’s an honor to work with this team! OmniH2O is a framework for robust and scalable humanoid teleoperation and autonomy! Check out our website:
Introduce OmniH2O, a learning-based system for whole-body humanoid teleop and autonomy:
🦾Robust loco-mani policy
🦸Universal teleop interface: VR, verbal, RGB
🧠Autonomy via @chatgpt4o or imitation
🔗Release the first whole-body humanoid dataset https://t.co/XRxdXIVbKv
Congrats to the AWQ authors on the MLSys acceptance. AWQ has been adopted by NVIDIA "Chat with RTX" and TensorRT-LLM for quantizing and accelerating LLMs.