Guangxuan Xiao Profile
Guangxuan Xiao

@Guangxuan_Xiao

Followers: 1,113
Following: 524
Media: 10
Statuses: 66

Ph.D. student at @MITEECS Prev: CS & Finance @Tsinghua_Uni

Cambridge, MA
Joined February 2020
Pinned Tweet
@Guangxuan_Xiao
Guangxuan Xiao
9 months
Meet StreamingLLM! Use LLMs on infinite input streams without sacrificing efficiency or performance. Now you can build a chatbot that persistently works on your recent chats! 📄 Paper: 🔧 Code:
20
162
773
@Guangxuan_Xiao
Guangxuan Xiao
7 months
An insightful blog post () shows that attention sinks also exist in models like BERT, suggesting they are common across Transformers. Together with the "ViTs need registers" findings, it raises the questions of why attention sinks emerge and how we can leverage them for Transformer optimization.
2
9
77
@Guangxuan_Xiao
Guangxuan Xiao
11 months
I will present the SmoothQuant poster at #ICML2023, Exhibit Hall 1, 2 pm tomorrow (Wed). SmoothQuant is a W8A8 quantization method for LLMs that reduces large-scale deployment costs, and it has been adopted in many industrial systems. Glad to chat more with you then!
0
7
64
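For context on "W8A8" in the tweet above: both weights and activations are stored as 8-bit integers. Below is a minimal, illustrative symmetric INT8 quantize/dequantize sketch in Python (not SmoothQuant's actual kernels; the function names are made up for illustration):

import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0                        # symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())                  # small reconstruction error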
@Guangxuan_Xiao
Guangxuan Xiao
1 year
Offsite-Tuning is a new framework for fine-tuning foundation models that allows fine-tuning without exchanging the full model or the data. With this method, large models like ChatGPT can be fine-tuned efficiently without privacy leaks! @jilin_14 @songhan_mit
@_akhaliq
AK
1 year
Offsite-Tuning: Transfer Learning without Full Model abs: github:
4
30
141
0
17
58
@Guangxuan_Xiao
Guangxuan Xiao
9 months
Attention sinks echo @EvMill 's "SoftMax-off-by-One" and the paper "ViTs need registers". We all highlight potential SoftMax issues in attention.
1
2
34
@Guangxuan_Xiao
Guangxuan Xiao
6 months
Exciting news: StreamingLLM is now available on iPhone! 🎉 A huge thanks to @davidpissarra for his fantastic extension to our work. Can't wait to explore the possibilities with StreamingLLM!
@davidpissarra
David Pissarra
6 months
Run the Mistral-7B-Instruct-v0.2 model on iPhone! Now supports StreamingLLM for endless generation. Try the MLC Chat app via TestFlight. For native LLM deployment, attention sinks are particularly helpful for longer generation with lower memory requirements.
3
16
74
1
3
35
@Guangxuan_Xiao
Guangxuan Xiao
1 year
We've just updated the MT-NLG 530B model results for SmoothQuant. SmoothQuant enables single-server (8xA100) inference of the 530B model without compromising accuracy or efficiency. This reduces LLM serving costs by at least 50%!
@songhan_mit
Song Han
2 years
How to efficiently deploy large language models (LLMs)? Quantization can help. But LLMs' activations are hard to quantize. SmoothQuant enables 8-bit weights and 8-bit activations for LLMs, achieving faster inference with half the number of GPUs: (1/8)
3
25
103
0
4
28
@Guangxuan_Xiao
Guangxuan Xiao
9 months
We find an "attention sink" phenomenon: retaining the KV of the initial tokens restores window attention performance. Interestingly, these tokens receive strong attention scores and act as a "sink," even though they lack semantic importance.
2
5
23
@Guangxuan_Xiao
Guangxuan Xiao
9 months
Based on attention sinks, we propose StreamingLLM, an efficient framework that enables LLMs to accept super-long input streams without fine-tuning. StreamingLLM enables Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4M tokens!
1
3
22
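A minimal sketch of the cache policy described in the two tweets above, assuming a generic per-token KV cache; the parameter names and the 4-sink / 1020-token window values are illustrative, not a fixed API:

def streaming_kv_indices(seq_len, n_sink=4, window=1020):
    """Token positions whose KV entries stay cached under the sink + sliding-window policy."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))                        # everything still fits
    sinks = list(range(n_sink))                            # initial "attention sink" tokens are never evicted
    recent = list(range(seq_len - window, seq_len))        # rolling window of the most recent tokens
    return sinks + recent

# Example: after 10,000 generated tokens, only 4 + 1020 = 1024 KV entries remain cached.
print(len(streaming_kv_indices(10_000)))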
@Guangxuan_Xiao
Guangxuan Xiao
9 months
In StreamingLLM, we believe attention sinks arise from the SoftMax function: it demands that attention scores sum to one, even when no prior token is a strong match. Models therefore "dump" excess attention onto the attention sinks.
1
2
19
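A tiny numerical illustration of the point above, using standard softmax attention: even when no prior token is a strong match (all scores low and similar), softmax still forces the weights to sum to one, so the attention mass has to land somewhere.

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())                      # numerically stable softmax
    return e / e.sum()

weak_scores = np.array([-4.0, -4.1, -3.9, -4.2])           # no strong match anywhere
weights = softmax(weak_scores)
print(weights, weights.sum())                              # still sums to 1.0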
@Guangxuan_Xiao
Guangxuan Xiao
8 months
Excited to see StreamingLLM is now in Intel's library!
@HaihaoShen
Haihao Shen
8 months
📢StreamingLLM landed in Intel Extension for Transformers to support infinite LLM inference on CPU, up to 4M tokens! 🎯Check out the code: , search "StreamingLLM" and give it a try! #oneapi @intel @huggingface @Guangxuan_Xiao @_akhaliq
1
42
186
1
0
20
@Guangxuan_Xiao
Guangxuan Xiao
9 months
Previously, using LLMs for endless chats was hard. 1. Caching all past tokens' KV consumes too much memory. 2. When the chat length exceeds the pre-training length, performance plunges. Window attention keeps only recent tokens' KV, and it fails once the initial tokens' KV are evicted.
3
3
18
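A back-of-the-envelope take on point 1 in the tweet above, assuming Llama-2-7B-like dimensions (32 layers, 4096 hidden size, fp16 KV); the exact numbers are illustrative:

layers, hidden, bytes_fp16 = 32, 4096, 2
kv_per_token = 2 * layers * hidden * bytes_fp16            # keys + values across all layers
print(kv_per_token / 1024, "KiB per token")                 # ~512 KiB per token
print(4_000_000 * kv_per_token / 1024**4, "TiB for a 4M-token stream")   # ~1.9 TiB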
@Guangxuan_Xiao
Guangxuan Xiao
9 months
This research was partly done during my great summer internship at @AIatMeta , with amazing collaborators including @tydsh , @BeidiChen , @ml_perception , and my advisor @songhan_mit !
1
2
10
@Guangxuan_Xiao
Guangxuan Xiao
1 year
Excited to see SmoothQuant is now available in Intel Neural Compressor!
@HaihaoShen
Haihao Shen
1 year
🔥Happy to announce SmoothQuant is now available in Intel Neural Compressor: . 🎯Check out the INT8 LLM models and get significant performance speedups and model size reductions on Intel platforms. #oneAPI @songhan_mit @MosheWasserblat
0
10
29
0
0
8
@Guangxuan_Xiao
Guangxuan Xiao
6 months
@davidpissarra Also, thanks to the amazing work of the #mlcllm team!
0
0
7
@Guangxuan_Xiao
Guangxuan Xiao
2 years
Excited to share our work on LLM deployment: SmoothQuant! We enable quantizing 100B+ LLMs with 8-bit weights & activations. Now we have faster inference with half of the resources! Many thanks to @jilin_14 , @nvidia and @SongHan_Omni !
@songhan_mit
Song Han
2 years
How to efficiently deploy large language models (LLMs)? Quantization can help. But LLMs' activations are hard to quantize. SmoothQuant enables 8-bit weights and 8-bit activations for LLMs, achieving faster inference with half the number of GPUs: (1/8)
3
25
103
0
0
5
@Guangxuan_Xiao
Guangxuan Xiao
9 months
@itsclivetime Hi Clive, thanks! I think our conclusion may not imply that, since Longformer mainly targets encoder models, while our finding is on autoregressive models. We think it's better to interpret attention sinks as attention stabilizers, rather than as tokens that encode global information.
0
0
2
@Guangxuan_Xiao
Guangxuan Xiao
2 years
Thank you, Tim!
@Tim_Dettmers
Tim Dettmers
2 years
Reading the SmoothQuant paper (), which is quite ingenious, and wanted to share. Since matmul, A*B=C, is linear, we can shift information in A or B around. As such, we can balance the quantization difficulty across both matrices, leading to great performance!
2
13
119
0
0
1
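A small numpy sketch of the balancing trick Tim describes above, assuming a simplified per-channel scale (SmoothQuant's actual formula also uses a migration-strength hyperparameter): dividing activation channels by s and multiplying the matching weight rows by s leaves the matmul unchanged while moving quantization difficulty from activations to weights.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4)) * np.array([1.0, 50.0, 1.0, 1.0])   # one outlier activation channel
W = rng.normal(size=(4, 3))

s = np.abs(X).max(axis=0) ** 0.5           # simplified per-channel smoothing factor
X_smooth = X / s                           # activations become easier to quantize
W_smooth = W * s[:, None]                  # weights absorb the scale
print(np.allclose(X @ W, X_smooth @ W_smooth))   # True: the product is unchanged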
@Guangxuan_Xiao
Guangxuan Xiao
7 months
@KyriectionZhang Interesting study! I am curious about the performance of using 128 + 128 for StreamingLLM on the summarization task😄
2
0
1