
Stephen Panaro (@flat)
making coffee and other things. @BrewTimerApp
Boston · Joined May 2013
531 Followers · 1K Following · 130 Media · 889 Statuses
Turns out you don’t need R₅⁻¹ at all. 🫠 Fusing it into Q and K is enough! Cool paper from Qualcomm explains this and a few similar transforms. No code in the paper, so here’s a gist as proof 👇
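A minimal numpy sketch of the claim (my construction, not the paper’s or the gist’s code; head dim and weights are made up): build the rotation from per-pair 2×2 blocks so it commutes with RoPE, fuse it into W_Q and W_K, and the attention logits come out identical with no inverse applied anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dim (made up; must be even)

def rope(x, pos):
    """Apply RoPE to a (d,) vector at position `pos`, pairing dims (2i, 2i+1)."""
    out = x.astype(np.float64).copy()
    for i in range(d // 2):
        theta = pos * 10000.0 ** (-2 * i / d)
        c, s = np.cos(theta), np.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out

def pairwise_rotation(angles):
    """Block-diagonal rotation acting within each RoPE pair, so it commutes with RoPE."""
    R = np.zeros((d, d))
    for i, t in enumerate(angles):
        c, s = np.cos(t), np.sin(t)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

R = pairwise_rotation(rng.uniform(0, 2 * np.pi, d // 2))
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
x_m, x_n = rng.normal(size=d), rng.normal(size=d)  # tokens at positions 3 and 7

# Baseline attention logit: rope(q) . rope(k)
logit = rope(W_q @ x_m, pos=3) @ rope(W_k @ x_n, pos=7)

# Fuse R into both projections -- and apply no inverse anywhere
logit_fused = rope(R @ W_q @ x_m, pos=3) @ rope(R @ W_k @ x_n, pos=7)

print(np.allclose(logit, logit_fused))  # True: logits unchanged, R⁻¹ never needed
```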
Liking the line of research where you multiply LLM weights by rotation matrices and the model still works. Most approaches do it between layers, but you can also sneak one between the Q/K projections and RoPE. Extra parameters? None. Useful? …Maybe. Cool? I think so. (See R₅ below.)
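Why the between-layer version is “free” in the first place: RMSNorm (with its learned scale folded into the adjacent weight, a standard trick) commutes with any orthogonal matrix, so a rotation can be fused into the weights on both sides of the residual stream. A toy numpy check of that fact, not any particular paper’s code:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512  # model dim (made up)

def rmsnorm(x):
    """RMSNorm without its learned scale (assume the scale was folded into the next weight)."""
    return x / np.sqrt(np.mean(x ** 2))

# Random orthogonal matrix via QR decomposition of a Gaussian matrix
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

x = rng.normal(size=d)
print(np.allclose(rmsnorm(R @ x), R @ rmsnorm(x)))  # True: rotations preserve the RMS

# So for a weight writing to the residual stream and one reading from it,
# W_out -> R @ W_out and W_in -> W_in @ R.T leaves the function unchanged.
W_out, W_in = rng.normal(size=(d, d)), rng.normal(size=(d, d))
y = W_in @ rmsnorm(W_out @ x)
y_rot = (W_in @ R.T) @ rmsnorm(R @ W_out @ x)
print(np.allclose(y, y_rot))  # True: zero extra parameters at inference
```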
Wondering if the tiny codebook (16 elements) opens any opportunities for GPU kernels (or if the scaling vectors negate it).
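One way the kernel math could shake out (hypothetical layout: a single shared 16-entry codebook, 4-bit indices, per-output-channel scales): the scales factor out of each dot product, so they shouldn’t negate a lookup-table kernel. Sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
out_f, in_f = 256, 256  # made-up layer shape

# Hypothetical layout: one shared 16-entry codebook (small enough for
# registers/shared memory), 4-bit indices per weight, per-output-channel scales.
codebook = np.sort(rng.normal(size=16))
idx = rng.integers(0, 16, size=(out_f, in_f), dtype=np.uint8)
scales = rng.uniform(0.5, 2.0, size=out_f)

x = rng.normal(size=in_f)

# Reference: materialize the full weight matrix, then matmul.
W = scales[:, None] * codebook[idx]
y = W @ x

# Kernel-friendly order: LUT-dequant + dot product first, one scale per row after.
y_lut = scales * (codebook[idx] @ x)
print(np.allclose(y, y_lut))  # True: per-row scales factor out of the dot product
```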
Figured out 4-bit /per-tensor/ quantization for Qwen2.5-0.5B. It’s on par with per-group GPTQ, which is kinda cool (tbh the non-uniform codebook helps a lot). 🖇️ Weights, evals, and more details below.
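For flavor, a minimal sketch of what a non-uniform per-tensor 4-bit codebook looks like, using plain Lloyd’s k-means over the flattened tensor (illustrative only, not the posted method; a random Gaussian stands in for real weights, and the actual results are in the linked weights/evals):

```python
import numpy as np

def kmeans_codebook(w, k=16, iters=50):
    """Lloyd's k-means on the flattened tensor: one non-uniform k-level codebook."""
    flat = w.ravel()
    centers = np.quantile(flat, np.linspace(0, 1, k))  # init at evenly spaced quantiles
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centers[j] = flat[idx == j].mean()
    return centers, idx.reshape(w.shape)

rng = np.random.default_rng(3)
w = rng.normal(size=(512, 512))  # stand-in for a real weight tensor

centers, idx = kmeans_codebook(w)  # 16 levels -> 4 bits/weight, one codebook per tensor
w_hat = centers[idx]
print("rms error:", np.sqrt(np.mean((w - w_hat) ** 2)))
```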