
Stephen Panaro (@flat)
making coffee and other things. @BrewTimerApp
Boston · Joined May 2013
531 Followers · 1K Following · 130 Media · 889 Statuses
Turns out you don’t need R₅⁻¹ at all. 🫠 Fusing it into Q and K is enough! Cool paper from Qualcomm explains this and a few similar transforms. No code in the paper, so here’s a gist as proof 👇
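A minimal numpy sketch of the claim (my construction, not the paper’s or the gist’s code; head dim and weights are made up): build the rotation from per-pair 2×2 blocks so it commutes with RoPE, fuse it into W_Q and W_K, and the attention logits come out identical with no inverse applied anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dim (made up; must be even)

def rope(x, pos):
    """Apply RoPE to a (d,) vector at position `pos`, pairing dims (2i, 2i+1)."""
    out = x.astype(np.float64).copy()
    for i in range(d // 2):
        theta = pos * 10000.0 ** (-2 * i / d)
        c, s = np.cos(theta), np.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out

def pairwise_rotation(angles):
    """Block-diagonal rotation acting within each RoPE pair, so it commutes with RoPE."""
    R = np.zeros((d, d))
    for i, t in enumerate(angles):
        c, s = np.cos(t), np.sin(t)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

R = pairwise_rotation(rng.uniform(0, 2 * np.pi, d // 2))
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
x_m, x_n = rng.normal(size=d), rng.normal(size=d)  # tokens at positions 3 and 7

# Baseline attention logit: rope(q) . rope(k)
logit = rope(W_q @ x_m, pos=3) @ rope(W_k @ x_n, pos=7)

# Fuse R into both projections -- and apply no inverse anywhere
logit_fused = rope(R @ W_q @ x_m, pos=3) @ rope(R @ W_k @ x_n, pos=7)

print(np.allclose(logit, logit_fused))  # True: logits unchanged, R⁻¹ never needed
```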
Liking the line of research where you multiply LLM weights by rotation matrices and the model still works. Most approaches do it between layers, but you can also sneak one between the Q/K projections and RoPE. Extra parameters? None. Useful? …Maybe. Cool? I think so. (See R₅ below.)
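Why the between-layer version is “free” in the first place: RMSNorm (with its learned scale folded into the adjacent weight, a standard trick) commutes with any orthogonal matrix, so a rotation can be fused into the weights on both sides of the residual stream. A toy numpy check of that fact, not any particular paper’s code:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512  # model dim (made up)

def rmsnorm(x):
    """RMSNorm without its learned scale (assume the scale was folded into the next weight)."""
    return x / np.sqrt(np.mean(x ** 2))

# Random orthogonal matrix via QR decomposition of a Gaussian matrix
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

x = rng.normal(size=d)
print(np.allclose(rmsnorm(R @ x), R @ rmsnorm(x)))  # True: rotations preserve the RMS

# So for a weight writing to the residual stream and one reading from it,
# W_out -> R @ W_out and W_in -> W_in @ R.T leaves the function unchanged.
W_out, W_in = rng.normal(size=(d, d)), rng.normal(size=(d, d))
y = W_in @ rmsnorm(W_out @ x)
y_rot = (W_in @ R.T) @ rmsnorm(R @ W_out @ x)
print(np.allclose(y, y_rot))  # True: zero extra parameters at inference
```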
Wondering if the tiny codebook (16 elements) opens any opportunities for GPU kernels (or if the scaling vectors negate it).
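One way the kernel math could shake out (hypothetical layout: a single shared 16-entry codebook, 4-bit indices, per-output-channel scales): the scales factor out of each dot product, so they shouldn’t negate a lookup-table kernel. Sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
out_f, in_f = 256, 256  # made-up layer shape

# Hypothetical layout: one shared 16-entry codebook (small enough for
# registers/shared memory), 4-bit indices per weight, per-output-channel scales.
codebook = np.sort(rng.normal(size=16))
idx = rng.integers(0, 16, size=(out_f, in_f), dtype=np.uint8)
scales = rng.uniform(0.5, 2.0, size=out_f)

x = rng.normal(size=in_f)

# Reference: materialize the full weight matrix, then matmul.
W = scales[:, None] * codebook[idx]
y = W @ x

# Kernel-friendly order: LUT-dequant + dot product first, one scale per row after.
y_lut = scales * (codebook[idx] @ x)
print(np.allclose(y, y_lut))  # True: per-row scales factor out of the dot product
```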
Figured out 4-bit /per-tensor/ quantization for Qwen2.5-0.5B. It’s on par with per-group GPTQ, which is kinda cool (tbh the non-uniform codebook helps a lot). 🖇️ Weights, evals, and more details below.
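For flavor, a minimal sketch of what a non-uniform per-tensor 4-bit codebook looks like, using plain Lloyd’s k-means over the flattened tensor (illustrative only, not the posted method; a random Gaussian stands in for real weights, and the actual results are in the linked weights/evals):

```python
import numpy as np

def kmeans_codebook(w, k=16, iters=50):
    """Lloyd's k-means on the flattened tensor: one non-uniform k-level codebook."""
    flat = w.ravel()
    centers = np.quantile(flat, np.linspace(0, 1, k))  # init at evenly spaced quantiles
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centers[j] = flat[idx == j].mean()
    return centers, idx.reshape(w.shape)

rng = np.random.default_rng(3)
w = rng.normal(size=(512, 512))  # stand-in for a real weight tensor

centers, idx = kmeans_codebook(w)  # 16 levels -> 4 bits/weight, one codebook per tensor
w_hat = centers[idx]
print("rms error:", np.sqrt(np.mean((w - w_hat) ** 2)))
```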