
Boris Dayma
@borisdayma
Followers: 14K · Following: 5K · Media: 354 · Statuses: 2K
Founder of Craiyon · Author of dalle-mini
Joined February 2012
muP worked quite well.
1/ A search over a tiny model found the optimal LR range; I selected the middle of that range on a log scale (1e-3 here).
2/ Scale up with muP and keep the same LR. I compare "large - mup tuned" to my "large - tuned" baseline and get the same perf, which is good. The large baseline…
muP has been on my mind forever! Now I came across this gem from @JesseFarebro: it automatically handles it in JAX/Flax. Just need to see what to adjust for Muon / Shampoo / PSGD-kron (init params + LR scaling).
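For concreteness, here is a minimal hand-rolled sketch of the recipe above: sweep the LR on a tiny proxy model, pick the middle of the good range on a log scale, then reuse that base LR at scale with muP-style init and per-layer LR scaling. It assumes plain Adam via optax and a toy dict of weight matrices; the widths, layer names, and constants are illustrative, and the library mentioned above automates this bookkeeping.

```python
# Sketch only: one common set of muP rules for Adam, applied by hand.
# The toy "model" is just a dict of weight matrices; all names are hypothetical.
import jax
import jax.numpy as jnp
import optax

base_width, width = 128, 1024      # width of the tiny proxy vs. the scaled-up model
mult = width / base_width          # width multiplier m
base_lr = 1e-3                     # middle of the optimal LR range found on the proxy

def init_params(key, d_in, d_hidden, d_out):
    k1, k2 = jax.random.split(key)
    return {
        # input weights: fan_in is fixed (d_in), so init and LR do not change with width
        "w_in": jax.random.normal(k1, (d_in, d_hidden)) / jnp.sqrt(d_in),
        # hidden weights: init std ~ 1/sqrt(fan_in); Adam LR scaled by 1/m
        "w_hidden": jax.random.normal(k2, (d_hidden, d_hidden)) / jnp.sqrt(d_hidden),
        # output weights: zero init (a common muP choice); Adam LR scaled by 1/m
        "w_out": jnp.zeros((d_hidden, d_out)),
    }

# Per-parameter LR: hidden/output matrices shrink with width, input does not.
lr_scales = {"w_in": 1.0, "w_hidden": 1.0 / mult, "w_out": 1.0 / mult}

optimizer = optax.multi_transform(
    {name: optax.adam(base_lr * scale) for name, scale in lr_scales.items()},
    param_labels={name: name for name in lr_scales},
)

params = init_params(jax.random.PRNGKey(0), d_in=512, d_hidden=width, d_out=512)
opt_state = optimizer.init(params)
```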
I would compare Gemini 2.5 Flash Lite with Gemini 2.0 Flash since they are at the same price per token. 2.5 Flash is much more expensive so it is in another category. Based on this index, 2.5 Lite is not as good. And it uses more tokens (which increases $$ further). Will wait for now…
Google Gemini 2.5 Flash-Lite is very verbose in non-reasoning and reasoning modes, using a significantly higher number of output tokens than even frontier models. However, Gemini 2.5 Flash-Lite is more price competitive than Gemini 2.5 Flash.
RT @giffmana: Gemini 2.5 paper TL;DR. Technical part in thread. Contributors: ~1k. 2.5 Pro timed out counting after 600s. 2.5 Flash counts 1…
"Based on developer feedback"?.Who asked for this "simplified pricing" of Gemini 2.5 flash in non-thinking mode:.- input $0.15 -> $0.30.- output $0.60 -> $2.50.Too bad, I was really excited about the GA π’.
The Gemini 2.5 Flash 05-20 variant is now the stable model we plan to support long term for Flash, and based on developer feedback, we have simplified the pricing and introduced an even smaller variant optimized for cost. (4/N)
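For scale, a quick back-of-the-envelope comparison of the old vs. new rates quoted above, assuming (as Gemini pricing is usually quoted) that they are USD per 1M tokens, and picking a hypothetical request size:

```python
# Back-of-the-envelope request cost under the old vs. new non-thinking rates above.
# Prices assumed to be USD per 1M tokens; the request size is hypothetical.
OLD = {"input": 0.15, "output": 0.60}
NEW = {"input": 0.30, "output": 2.50}

def cost(prices, input_tokens, output_tokens):
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

in_tok, out_tok = 10_000, 2_000
old, new = cost(OLD, in_tok, out_tok), cost(NEW, in_tok, out_tok)
print(f"old ${old:.4f} -> new ${new:.4f} ({new / old:.1f}x)")  # input 2x, output ~4.2x
```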
I like that idea! When you get a bad spike, you typically revert to an earlier checkpoint. But sometimes it keeps happening, so you revert further. Averaging a few recent checkpoints instead (if an EMA is not available) seems like a good idea.
Another interesting thing they observe is that merged models seem more stable, somehow? Left: grad norms of SFT on a merge look a lot healthier. Right: they suggest resuming pre-training from a merge instead of a rollback when hitting a "killer" spike. Though they…
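A minimal sketch of what averaging recent checkpoints could look like in JAX, assuming each checkpoint is a parameter pytree already loaded in memory (how they are loaded, and how optimizer state is handled, depends on your setup):

```python
# Average the parameters of the last few checkpoints into a "merged" state to resume
# from after a spike, instead of rolling back to a single earlier checkpoint.
# Assumes every checkpoint is a pytree of arrays with identical structure.
import jax
import jax.numpy as jnp

def average_checkpoints(checkpoints):
    """Element-wise mean over a list of parameter pytrees."""
    return jax.tree_util.tree_map(
        lambda *leaves: jnp.mean(jnp.stack(leaves), axis=0),
        *checkpoints,
    )

# Hypothetical usage: `recent` holds the last 3 saved parameter trees.
# merged_params = average_checkpoints(recent)
# Resume training from `merged_params`; optimizer state typically needs to be
# reloaded from one checkpoint or reset, since averaging it is less well defined.
```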