
Albert Tseng
@tsengalb99
Followers: 697 · Following: 88 · Media: 18 · Statuses: 83
Excited to announce our #AISTATS 📜 on training LLMs with MXFP4! We use stochastic rounding and random Hadamard transforms (all fast on HW) to get low-variance, unbiased gradient estimates with MXFP4 GEMMs. We get a ~30% speedup over FP8 with almost no PPL gap! (Toy sketch of the two tricks below.)
1 · 8 · 22
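The core trick from the post above, in a minimal NumPy sketch. Everything here is an illustrative assumption rather than code from the paper: the function names, the uniform quantization grid standing in for the real MXFP4 format (4-bit float elements with shared per-block scales), and the explicit Sylvester Hadamard matrix (practical kernels use a fast in-place transform).

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an n x n Hadamard matrix; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def random_hadamard_transform(x, signs):
    # Rotate by H @ diag(signs) / sqrt(n): random sign flips plus a Hadamard
    # rotation spread outliers across coordinates, so a coarse grid loses less.
    n = x.shape[-1]
    return (x * signs) @ hadamard(n).T / np.sqrt(n)

def stochastic_round(x, step, rng):
    # Round each entry up or down on a uniform grid with probability proportional
    # to its distance, so E[stochastic_round(x)] == x: unbiased, unlike nearest
    # rounding, which is what keeps the quantized-GEMM gradient estimates unbiased.
    scaled = x / step
    low = np.floor(scaled)
    round_up = rng.random(x.shape) < (scaled - low)
    return (low + round_up) * step

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
signs = rng.choice([-1.0, 1.0], size=64)
xr = random_hadamard_transform(x, signs)
# Averaging many independent stochastic roundings recovers xr (unbiasedness).
avg = np.mean([stochastic_round(xr, 0.25, rng) for _ in range(2000)], axis=0)
print(np.abs(avg - xr).max())  # small, and shrinks as the sample count grows
```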
RT @yingheng_wang: ❓ Are LLMs actually problem solvers or just good at regurgitating facts? 🚨 New Benchmark Alert! We built HeuriGym to ben…
0 · 25 · 0
RT @ellisk_kellis: New paper: World models + Program synthesis by @topwasu. 1. World modeling on-the-fly by synthesizing programs w/ 4000+ l…
0 · 105 · 0
RT @justachetan: I will be at #CVPR2025 presenting our work on differential operators for hybrid neural fields! Catch me at our poster: 🗓️…
0 · 4 · 0
RT @simran_s_arora: Check out CARTRIDGES, scaling cache-time compute! An alternative to ICL for settings where many different user messages…
0 · 4 · 0
RT @EyubogluSabri: When we put lots of text (e.g. a code repo) into LLM context, cost soars b/c of the KV cache’s size. What if we trained a…
0 · 70 · 0
RT @tri_dao: Albert and co continue to do excellent work on quantization. This time the trick is to minimize KL wrt the original model, wit…
0 · 19 · 0
Apparently I chose the worst day to release a paper, so ICYMI, we made a post-training quantization algorithm that outperforms even @Google's quantization-aware training recipe. We beat the prior SOTA by >30%, meaning faster and smaller models. More details in the original 🧵👇
📣 Introducing our latest work: Yet Another Quantization Algorithm! YAQA directly minimizes the KL divergence to the original model during rounding, cutting it by >30% over prior PTQ methods and giving an even closer model than Google’s QAT on Gemma! 🤯 (Toy sketch of the core idea below.)
2 · 1 · 20
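A toy, hedged illustration of the rounding objective described above: choose rounded weights that keep the quantized model's outputs close in KL to the original model's outputs on calibration data. The single softmax layer, the uniform `step` grid, the brute-force coordinate sweep, and the function names are all assumptions for illustration; the actual YAQA algorithm in the paper uses curvature information and scales to full LLMs.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # Mean KL(p || q) over a batch of categorical distributions.
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(-1).mean()

def round_to_minimize_kl(W, X, step):
    # Start from nearest rounding, then sweep the weights once, moving each one
    # to whichever of its two neighboring grid points gives the lower KL between
    # the original outputs and the quantized outputs on the calibration batch X.
    P = softmax(X @ W)                       # "teacher": the unquantized layer
    Wq = step * np.round(W / step)
    for idx in np.ndindex(*W.shape):
        lo, hi = step * np.floor(W[idx] / step), step * np.ceil(W[idx] / step)
        best, best_kl = Wq[idx], kl(P, softmax(X @ Wq))
        for cand in (lo, hi):
            Wq[idx] = cand
            d = kl(P, softmax(X @ Wq))
            if d < best_kl:
                best, best_kl = cand, d
        Wq[idx] = best
    return Wq

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))             # tiny linear "model"
X = rng.standard_normal((256, 16))           # calibration inputs
W_near = 0.5 * np.round(W / 0.5)             # round-to-nearest baseline
W_kl = round_to_minimize_kl(W, X, 0.5)
P = softmax(X @ W)
print(kl(P, softmax(X @ W_near)), kl(P, softmax(X @ W_kl)))  # second value <= first
```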
RT @togethercompute: 5/ Quantized models don't need to lose fidelity. Check out our paper and blog for details: 📝 Paper:
0 · 2 · 0
RT @austinsilveria: chipmunk is up on arxiv! across HunyuanVideo and Flux.1-dev, 5-25% of the intermediate activation values in attention…
0 · 7 · 0
RT @togethercompute: 🚀 New research: YAQA, Yet Another Quantization Algorithm (yes, pronounced like yaca/jackfruit). Led by @tsengalb99,…
0 · 5 · 0
@chrismdesa (6/6) We also have a blog post ( ) with @togethercompute, who graciously provided compute resources for this project!
1 · 1 · 6
(5/6) All this results in a lower KL and SOTA downstream performance across a wide range of models and quantizers. For more information, check out our (w/ Zhaofeng Sun & @chrismdesa) paper ( ) and code ( ).
1 · 1 · 9