Guangxuan Xiao Profile
Guangxuan Xiao

@Guangxuan_Xiao

Followers: 1,113
Following: 524
Media: 10
Statuses: 66

Ph.D. student at @MITEECS Prev: CS & Finance @Tsinghua_Uni

Cambridge, MA
Joined February 2020
Pinned Tweet
@Guangxuan_Xiao
Guangxuan Xiao
9 months
Meet StreamingLLM! Use LLMs on infinite input streams without sacrificing efficiency or performance. Now you can build a chatbot that persistently works on your recent chats! 📄 Paper: 🔧 Code:
20
162
773
@Guangxuan_Xiao
Guangxuan Xiao
7 months
An insightful blog post () shows that attention sinks also exist in models like BERT, suggesting they are common across Transformers. Together with the "ViTs need registers" findings, it raises the questions of why attention sinks emerge and how we can leverage them for Transformer optimization.
2
9
77
@Guangxuan_Xiao
Guangxuan Xiao
11 months
I will present the SmoothQuant poster at #ICML2023, Exhibit Hall 1, 2 pm tomorrow (Wed). SmoothQuant is a W8A8 quantization method for LLMs that reduces large-scale deployment costs, and it has been adopted in many industrial systems. Glad to chat more with you then!
0
7
64
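For context on "W8A8" in the tweet above: both weights and activations are stored as 8-bit integers. Below is a minimal, illustrative symmetric INT8 quantize/dequantize sketch in Python (not SmoothQuant's actual kernels; the function names are made up for illustration):

import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0                        # symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())                  # small reconstruction error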
@Guangxuan_Xiao
Guangxuan Xiao
1 year
Offsite-Tuning is a new framework for fine-tuning foundation models that allows fine-tuning without exchanging the full model or the data. With this method, large models like ChatGPT can be fine-tuned efficiently without privacy leaks! @jilin_14 @songhan_mit
@_akhaliq
AK
1 year
Offsite-Tuning: Transfer Learning without Full Model abs: github:
4
30
141
0
17
58
@Guangxuan_Xiao
Guangxuan Xiao
9 months
Attention sinks echo @EvMill 's "SoftMax-off-by-One" and the paper "ViTs need registers". We all highlight potential SoftMax issues in attention.
1
2
34
@Guangxuan_Xiao
Guangxuan Xiao
6 months
Exciting news: StreamingLLM is now available on iPhone! 🎉 A huge thanks to @davidpissarra for his fantastic extension to our work. Can't wait to explore the possibilities with StreamingLLM!
@davidpissarra
David Pissarra
6 months
Run the Mistral-7B-Instruct-v0.2 model on iPhone! Now supports StreamingLLM for endless generation. Try the MLC Chat app via TestFlight. For native LLM deployment, attention sinks are particularly helpful for longer generation with lower memory requirements.
3
16
74
1
3
35
@Guangxuan_Xiao
Guangxuan Xiao
1 year
We've just updated the MT-NLG 530B model results for SmoothQuant. SmoothQuant enables single-server (8xA100) inference of the 530B model without compromising accuracy or efficiency. This reduces LLM serving costs by at least 50%!
@songhan_mit
Song Han
2 years
How to efficiently deploy large language models (LLMs)? Quantization can help. But LLMs' activations are hard to quantize. SmoothQuant enables 8-bit weights and 8-bit activations for LLMs, achieving faster inference with half the number of GPUs: (1/8)
3
25
103
0
4
28
@Guangxuan_Xiao
Guangxuan Xiao
9 months
We find an "attention sink" phenomenon: retaining the KV of the initial tokens restores window attention performance. Interestingly, these tokens receive strong attention scores and act as a "sink," even though they lack semantic importance.
2
5
23
@Guangxuan_Xiao
Guangxuan Xiao
9 months
Based on attention sinks, we propose StreamingLLM, an efficient framework that enables LLMs to accept super-long input streams without fine-tuning. StreamingLLM enables Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4M tokens!
1
3
22
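A minimal sketch of the cache policy described in the two tweets above, assuming a generic per-token KV cache; the parameter names and the 4-sink / 1020-token window values are illustrative, not a fixed API:

def streaming_kv_indices(seq_len, n_sink=4, window=1020):
    """Token positions whose KV entries stay cached under the sink + sliding-window policy."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))                        # everything still fits
    sinks = list(range(n_sink))                            # initial "attention sink" tokens are never evicted
    recent = list(range(seq_len - window, seq_len))        # rolling window of the most recent tokens
    return sinks + recent

# Example: after 10,000 generated tokens, only 4 + 1020 = 1024 KV entries remain cached.
print(len(streaming_kv_indices(10_000)))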
@Guangxuan_Xiao
Guangxuan Xiao
9 months
In StreamingLLM, we believe attention sinks arise from the SoftMax function: it demands that attention scores sum to one, even when no prior token is a strong match. Models therefore "dump" excess attention onto the attention sinks.
1
2
19
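A tiny numerical illustration of the point above, using standard softmax attention: even when no prior token is a strong match (all scores low and similar), softmax still forces the weights to sum to one, so the attention mass has to land somewhere.

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())                      # numerically stable softmax
    return e / e.sum()

weak_scores = np.array([-4.0, -4.1, -3.9, -4.2])           # no strong match anywhere
weights = softmax(weak_scores)
print(weights, weights.sum())                              # still sums to 1.0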
@Guangxuan_Xiao
Guangxuan Xiao
8 months
Excited to see StreamingLLM is now in Intel's library!
@HaihaoShen
Haihao Shen
8 months
📢StreamingLLM landed in Intel Extension for Transformers to support infinite LLM inference on CPU, up to 4M tokens! 🎯Check out the code: , search "StreamingLLM" and give it a try! #oneapi @intel @huggingface @Guangxuan_Xiao @_akhaliq
1
42
186
1
0
20
@Guangxuan_Xiao
Guangxuan Xiao
9 months
Previously, using LLMs for endless chats was hard. 1. Caching all past tokens' KV consumes too much memory. 2. When the chat length exceeds the pre-training length, performance plunges. Window attention keeps only recent tokens' KV, and it fails once the initial tokens' KV are evicted.
3
3
18
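A back-of-the-envelope take on point 1 in the tweet above, assuming Llama-2-7B-like dimensions (32 layers, 4096 hidden size, fp16 KV); the exact numbers are illustrative:

layers, hidden, bytes_fp16 = 32, 4096, 2
kv_per_token = 2 * layers * hidden * bytes_fp16            # keys + values across all layers
print(kv_per_token / 1024, "KiB per token")                 # ~512 KiB per token
print(4_000_000 * kv_per_token / 1024**4, "TiB for a 4M-token stream")   # ~1.9 TiB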
@Guangxuan_Xiao
Guangxuan Xiao
9 months
This research was partly done during my great summer internship at @AIatMeta , with amazing collaborators including @tydsh , @BeidiChen , @ml_perception , and my advisor @songhan_mit !
1
2
10
@Guangxuan_Xiao
Guangxuan Xiao
1 year
Excited to see SmoothQuant is now available in Intel Neural Compressor!
@HaihaoShen
Haihao Shen
1 year
🔥Happy to announce SmoothQuant is now available in Intel Neural Compressor: . 🎯Check out the INT8 LLM models and get significant performance speedups and model size reductions on Intel platforms. #oneAPI @songhan_mit @MosheWasserblat
0
10
29
0
0
8
@Guangxuan_Xiao
Guangxuan Xiao
6 months
@davidpissarra Also, thanks to the amazing work of the #mlcllm team!
0
0
7
@Guangxuan_Xiao
Guangxuan Xiao
2 years
Excited to share our work on LLM deployment: SmoothQuant! We enable quantizing 100B+ LLMs with 8-bit weights & activations. Now we have faster inference with half of the resources! Many thanks to @jilin_14 , @nvidia and @SongHan_Omni !
@songhan_mit
Song Han
2 years
How to efficiently deploy large language models (LLMs)? Quantization can help. But LLMs' activations are hard to quantize. SmoothQuant enables 8-bit weights and 8-bit activations for LLMs, achieving faster inference with half the number of GPUs: (1/8)
3
25
103
0
0
5
@Guangxuan_Xiao
Guangxuan Xiao
9 months
@itsclivetime Hi Clive, thanks! I think our conclusion may not imply that, since Longformer mainly targets encoder models, while our finding is on autoregressive models. We think it's better to interpret attention sinks as attention stabilizers, rather than as tokens that encode global information.
0
0
2
@Guangxuan_Xiao
Guangxuan Xiao
2 years
Thank you, Tim!
@Tim_Dettmers
Tim Dettmers
2 years
Reading the SmoothQuant paper (), which is quite ingenious, and wanted to share. Since matmul, A*B=C, is linear, we can shift information in A or B around. As such, we can balance the quantization difficulty across both matrices, leading to great performance!
2
13
119
0
0
1
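A small numpy sketch of the balancing trick Tim describes above, assuming a simplified per-channel scale (SmoothQuant's actual formula also uses a migration-strength hyperparameter): dividing activation channels by s and multiplying the matching weight rows by s leaves the matmul unchanged while moving quantization difficulty from activations to weights.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4)) * np.array([1.0, 50.0, 1.0, 1.0])   # one outlier activation channel
W = rng.normal(size=(4, 3))

s = np.abs(X).max(axis=0) ** 0.5           # simplified per-channel smoothing factor
X_smooth = X / s                           # activations become easier to quantize
W_smooth = W * s[:, None]                  # weights absorb the scale
print(np.allclose(X @ W, X_smooth @ W_smooth))   # True: the product is unchanged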
@Guangxuan_Xiao
Guangxuan Xiao
7 months
@KyriectionZhang Interesting study! I am curious about the performance of using 128 + 128 for StreamingLLM on the summarization task😄
2
0
1