Stephen Panaro

@flat

Followers
531
Following
1K
Media
130
Statuses
889

making coffee and other things. @BrewTimerApp

boston
Joined May 2013
@flat
Stephen Panaro
9 years
“We won’t run it in digital because we’re purists and maniacs.”
2
0
5
@flat
Stephen Panaro
20 days
🐙: 📄: (R₅ is a rotation matrix, so its transpose is its inverse and it naturally cancels out in Q@K.T).
0
0
1
@flat
Stephen Panaro
20 days
Turns out you don’t need R₅⁻¹ at all. 🫠 Fusing into Q and K is enough! Cool paper from Qualcomm explains this and a few similar transforms. No code in the paper, so gist proof👇
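The cancellation the thread describes can be checked in a few lines of NumPy. This toy sketch (tiny hypothetical dimensions, no RoPE in between — in the real setting the rotation has to be placed so it still cancels around RoPE) fuses a random rotation into the Q and K projections and shows the attention logits don’t change, because RᵀR = I inside Q@K.T:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4

# A random rotation via QR decomposition (orthogonal: R.T @ R = I).
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

Wq = rng.normal(size=(d, d))  # query projection (toy sizes)
Wk = rng.normal(size=(d, d))  # key projection
x = rng.normal(size=(n, d))   # token activations

# Attention logits with the original weights.
logits = (x @ Wq.T) @ (x @ Wk.T).T

# Fuse R into both projections: Q' = Q @ R and K' = K @ R.
Wq_fused = R.T @ Wq
Wk_fused = R.T @ Wk
logits_fused = (x @ Wq_fused.T) @ (x @ Wk_fused.T).T

# R @ R.T = I cancels inside Q' @ K'.T, so the logits are unchanged.
assert np.allclose(logits, logits_fused)
```

No inverse is ever materialized at inference time — the rotation lives entirely inside the fused weights.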
@flat
Stephen Panaro
4 months
Liking the line of research where you multiply LLM weights by rotation matrices and the model still works. Most do it in between layers, but you can also sneak one between Q/K and RoPE. Extra parameters? None. Useful? …Maybe. Cool? I think so. (See R₅ below.)
Tweet media one
1
0
5
@flat
Stephen Panaro
21 days
The Python library is interesting too. “Download files”:
0
1
0
@flat
Stephen Panaro
21 days
See for yourself:
1. Get the adapter training toolkit:
2. Clone:
3. Edit :
- delete all functions except the first
- rename it to: func main<ios18>(
4. Follow the readme to start netron, and open the .mil.
1
1
4
@flat
Stephen Panaro
21 days
Curious about the Apple Foundation Model architecture? I updated my netron fork to visualize the draft model*. *They say it might differ from the real model, but it looks convincing to me.
1
2
9
@flat
Stephen Panaro
21 days
Cool to see folks measuring KL too.
0
0
1
@flat
Stephen Panaro
21 days
btw, you can quantize the “hard-to-quantize” Llama 3.1 8B now. (LDLQ is GPTQ)
Tweet media one
1
0
1
@flat
Stephen Panaro
26 days
Wonder if we’re gonna get a new version of coremltools. Last year it dropped on Monday.
1
0
4
@flat
Stephen Panaro
27 days
Looks like a lot. The weights are there (in 32-bit) and there’s a Python package to load them.
@flat
Stephen Panaro
27 days
Either way, wonder how much we can learn about the model from this.
2
0
5
@flat
Stephen Panaro
27 days
Download link doesn’t seem to be working yet.
2
0
1
@flat
Stephen Panaro
27 days
Why not just release the weights at this point?
Tweet media one
Tweet media two
1
2
8
@flat
Stephen Panaro
28 days
WWDC wishes (all long shots):
- low-level ANE access (a la kernels)
- actual quantized activations (for KV cache)
- CoreML fast Hadamard transform
- share weights between CoreML and MLX (or MLX ANE backend)
- ANE HW metrics: GB/s, FLOPs
8
2
39
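For reference on the Hadamard wish above: the fast Walsh–Hadamard transform is O(n log n) versus O(n²) for a plain matrix multiply, which is why a built-in would matter for rotation-based quantization schemes. A minimal recursive sketch (nothing CoreML actually exposes), checked against the Hadamard matrix built from Kronecker products:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalized), O(n log n).
    Input length must be a power of two."""
    x = np.asarray(x, dtype=float)
    if len(x) == 1:
        return x
    half = len(x) // 2
    a = fwht(x[:half])
    b = fwht(x[half:])
    # H_{2n} = [[H_n, H_n], [H_n, -H_n]] applied blockwise.
    return np.concatenate([a + b, a - b])

# Sanity check: matches multiplication by the 8x8 Hadamard matrix.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H8 = np.kron(np.kron(H2, H2), H2)
v = np.arange(8.0)
assert np.allclose(fwht(v), H8 @ v)
```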
@flat
Stephen Panaro
1 month
Wondering if the tiny codebook (16 elements) opens any opportunities for GPU kernels (or if the scaling vectors negate it).
@flat
Stephen Panaro
1 month
Figured out 4-bit /per-tensor/ quantization for Qwen2.5-0.5B. It’s on par with per-group GPTQ which is kinda cool (tbh non-uniform helps a lot). 🖇️Weights, evals, more details below.
Tweet media one
0
0
4
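A rough sketch of the idea in the tweet above — non-uniform 4-bit quantization with a single per-tensor codebook (16 entries, fit here with a few k-means steps) plus a per-row scale vector. This is a hypothetical illustration, not the author’s exact recipe:

```python
import numpy as np

def quantize_per_tensor(W, bits=4, iters=20):
    """Sketch: one 2**bits-entry codebook shared by the whole tensor,
    fit with Lloyd-style k-means, plus a per-row scale vector
    (loosely in the spirit of OneBit-style scaling)."""
    scale = np.abs(W).mean(axis=1, keepdims=True) + 1e-8
    flat = (W / scale).ravel()
    # Initialize the centroids from quantiles of the normalized weights,
    # so density-heavy regions get more levels (the non-uniform part).
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, 2 ** bits))
    for _ in range(iters):
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(codebook.size):
            members = flat[idx == k]
            if members.size:
                codebook[k] = members.mean()
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx].reshape(W.shape) * scale

W = np.random.default_rng(0).normal(size=(64, 64))
Wq = quantize_per_tensor(W)
# For near-Gaussian weights the reconstruction error should be a small
# fraction of the typical weight magnitude.
assert np.abs(Wq - W).mean() < 0.2 * np.abs(W).mean()
```

The real method adds block finetuning on top; this only shows why a shared non-uniform codebook can stay competitive without per-group scales.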
@flat
Stephen Panaro
1 month
Weights/Evals:
Details:
• Scaling vectors (a la OneBit)
• Bias on all linears
• Block, but not model, finetuning
• Takes 1.5h on M1 Max
• Untested on other models :)
0
0
0
@flat
Stephen Panaro
2 months
Seems like there are no per-tensor LLM quants. Too challenging? No speedup opportunity? Grouping is just very bit efficient?
2
0
3
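On the “grouping is just very bit efficient” point, the back-of-envelope arithmetic supports it (assuming the usual one-fp16-scale-per-group layout; group size and scale width here are illustrative, not from the tweet):

```python
def effective_bits(weight_bits, scale_bits=16.0, group_size=None):
    """Bits per weight including scale overhead. group_size=None means a
    single scale for the whole tensor (per-tensor), so overhead is ~0."""
    if group_size is None:
        return float(weight_bits)
    return weight_bits + scale_bits / group_size

# fp16 scales at group size 128 add only 16/128 = 0.125 bits per weight,
# so going per-tensor saves about 3% of a 4-bit budget.
assert effective_bits(4, group_size=128) == 4.125
assert effective_bits(4) == 4.0
```

Which suggests the appeal of per-tensor is less about bits and more about simpler kernels and hardware-friendly layouts.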
@flat
Stephen Panaro
3 months
Sure, I’m still making a bunch of noob mistakes, and it’s quite messy. But if you’re curious: Repo: Kernel starts here:
0
0
2
@flat
Stephen Panaro
3 months
Have further tuned my lil’ quantization kernel.
0.3s (original, PyTorch)
0.12s (+MLX)
0.09s (+tiny Metal kernel)
0.051s (now, big fused kernel go brr)
Speedup is >2x for larger matrices. h/t MLX: learned/stole a lot from the mv kernel
Tweet media one
2
1
19