Boris Dayma 🖍️

@borisdayma

Followers
14K
Following
5K
Media
354
Statuses
2K

πŸ–οΈ Founder of Craiyon πŸ₯‘ Author of dalle-mini

Joined February 2012
@borisdayma
Boris Dayma 🖍️
1 year
🥳 Release of the open-source reproduction of CapPa.
🎉 Beats every open-source model on all but one benchmark of SugarCrepe.
📕 Apache 2 model weights.
👓 Can be used as a strong vision model for downstream tasks.
👉 Let's dive in.
6
56
246
@borisdayma
Boris Dayma 🖍️
22 hours
MUP worked quite well 🥳.
1/ search over tiny model found optimal LR range, selected middle of range on log scale (1e-3 here).
2/ scale up with mup and keep same LR. I compare "large - mup tuned" to my "large - tuned" baseline and get same perf, which is good. The large baseline
Tweet media one
Tweet media two
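For context, a minimal sketch of the recipe in this tweet, assuming the standard muP rule that hidden weight-matrix learning rates scale as 1/width while bias/embedding-like parameters keep the base LR. The widths and parameter names are illustrative, not the author's actual setup.

```python
base_lr = 1e-3      # middle of the optimal range found on the tiny model (log scale)
base_width = 256    # width of the tiny proxy model (illustrative)
large_width = 4096  # width of the scaled-up model (illustrative)

def mup_lr(param_kind, width):
    """Per-parameter LR under muP: hidden matrices scale as base_width/width."""
    if param_kind == "hidden_matrix":
        return base_lr * base_width / width
    return base_lr  # biases, embeddings, etc. keep the base LR

# The same base LR transfers across widths; only the per-layer scaling changes.
print(mup_lr("hidden_matrix", large_width))  # 6.25e-05
print(mup_lr("bias", large_width))           # 0.001
```

The point of the experiment above is exactly this property: the LR tuned on the tiny model stays optimal after scaling, so no new sweep is needed at large width.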
@borisdayma
Boris Dayma 🖍️
3 days
MUP has been on my mind forever!
Now I came across this gem from @JesseFarebro: it automatically handles it on JAX/Flax 😍. Just need to see what to adjust for Muon / Shampoo / PSGD-kron (init params + LR scaling).
1
0
16
@borisdayma
Boris Dayma 🖍️
8 days
Obvious new contender to replace old T5 embeddings.
@osanseviero
Omar Sanseviero
8 days
Introducing T5Gemma: the next generation of encoder-decoder/T5 models!
🔧 Decoder models adapted to be encoder-decoder.
🔥 32 models with different combinations.
🤗 Available on Hugging Face and Kaggle.
Tweet media one
0
0
9
@borisdayma
Boris Dayma 🖍️
9 days
Gemini-CLI:
- Let's implement this complex feature.
- Ok, I checked and it all works, now let me revert all the changes.
Happened a few times where the intermediate state is perfect and for some reason Gemini decides to undo everything. Claude Code couldn't fix those bugs.
0
0
2
@borisdayma
Boris Dayma 🖍️
22 days
I would compare Gemini 2.5 Flash Lite with Gemini 2.0 Flash since they are at the same price per token. 2.5 Flash is much more expensive, so it is in another category. Based on this index, 2.5 Lite is not as good. And it uses more tokens (increasing $$ further). Will wait for now…
Tweet media one
@ArtificialAnlys
Artificial Analysis
29 days
Google Gemini 2.5 Flash-Lite is very verbose in non-reasoning and reasoning modes, using a significantly higher number of output tokens than even frontier models. However, Gemini 2.5 Flash-Lite is more price competitive than Gemini 2.5 Flash.
Tweet media one
Tweet media two
0
0
2
@borisdayma
Boris Dayma 🖍️
29 days
Is there a way to have Muon run its heavy computation (orthogonalization) only every n steps?
I'm playing with it but it is much slower than Adam. The trick for Shampoo is to do the slow compute only every 10 or even 100 steps.
2
1
10
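One possible shape of the amortization asked about here, sketched with a plain Newton-Schulz polar iteration (the reference Muon uses a tuned quintic iteration instead; `LazyMuon`, the iteration count, and the caching policy are all assumptions for illustration, not the actual Muon optimizer):

```python
import numpy as np

def newton_schulz_orth(g, iters=5):
    """Approximately orthogonalize a matrix via Newton-Schulz iterations."""
    x = g / (np.linalg.norm(g) + 1e-7)  # scale so singular values lie in (0, 1]
    for _ in range(iters):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # pushes singular values toward 1
    return x

class LazyMuon:
    """Toy update rule: re-orthogonalize only every `every` steps,
    reusing the cached orthogonalized direction in between."""

    def __init__(self, lr=0.02, every=10):
        self.lr, self.every = lr, every
        self.step_count, self.cached = 0, None

    def step(self, w, grad):
        if self.cached is None or self.step_count % self.every == 0:
            self.cached = newton_schulz_orth(grad)  # expensive part, amortized
        self.step_count += 1
        return w - self.lr * self.cached

rng = np.random.default_rng(0)
opt = LazyMuon(every=10)
w = rng.normal(size=(64, 64))
w = opt.step(w, rng.normal(size=(64, 64)))
```

Whether reusing a stale orthogonalized direction between refreshes hurts convergence is exactly the open question in the tweet; this only shows where the caching would sit.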
@borisdayma
Boris Dayma 🖍️
29 days
RT @giffmana: Gemini 2.5 paper TL;DR. Technical part in thread. Contributors: ~1k. 2.5 Pro timed out counting after 600s. 2.5 Flash counts 1…
0
69
0
@borisdayma
Boris Dayma 🖍️
30 days
For reference, this was the pricing before GA. So my expectation at the time was a 50% cost increase vs Gemini 2.0 flash but ok because 2.5 is better than 2.0.
Tweet media one
1
0
4
@borisdayma
Boris Dayma 🖍️
30 days
"Based on developer feedback"?
Who asked for this "simplified pricing" of Gemini 2.5 Flash in non-thinking mode:
- input: $0.15 -> $0.30
- output: $0.60 -> $2.50
Too bad, I was really excited about the GA 😢.
@OfficialLoganK
Logan Kilpatrick
30 days
The Gemini 2.5 Flash 05-20 variant is now the stable model we plan to support long term for Flash, and based on developer feedback, we have simplified the pricing and introduced an even smaller variant optimized for cost. (4/N)
Tweet media one
2
1
13
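The relative increase implied by those figures is a one-line calculation (ratios only, taken from the numbers quoted in the tweet, so the pricing unit doesn't matter):

```python
# Price change quoted above: input $0.15 -> $0.30, output $0.60 -> $2.50
old_in, new_in = 0.15, 0.30
old_out, new_out = 0.60, 2.50
print(f"input:  x{new_in / old_in:.1f}")    # input price doubled
print(f"output: x{new_out / old_out:.2f}")  # output price ~4.17x
```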
@borisdayma
Boris Dayma 🖍️
1 month
My data processing was going to take all day so I decided to optimize it. Spent all day on it, but it's almost done and the new code should now take less than an hour 🤞.
Btw I'll likely never have to use it ever again.
4
1
16
@borisdayma
Boris Dayma 🖍️
1 month
Something that should be improved wrt the JAX way of training (with optax) is that you should not have to recompile anything just by changing the learning rate function (adding a new warmup, starting decay…). Passing the learning rate value as an input to the compiled function would make more sense.
1
1
9
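A sketch of what this could look like, assuming a hand-rolled SGD step rather than an optax optimizer: the learning rate becomes a traced argument of the jitted function, so editing the schedule never invalidates the compiled code.

```python
import jax
import jax.numpy as jnp

@jax.jit
def sgd_step(params, grads, lr):
    # lr is an ordinary (traced) argument: new values reuse the compiled code.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = {"w": jnp.ones(3)}
grads = {"w": jnp.ones(3)}

def schedule(step):  # plain Python, can be swapped out at any time
    return 1e-3 if step < 100 else 1e-4

for step in range(3):
    params = sgd_step(params, grads, schedule(step))  # no recompilation
```

For real optax optimizers, `optax.inject_hyperparams` serves a similar purpose: it moves hyperparameters such as the learning rate into the optimizer state instead of baking them into the update function.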
@borisdayma
Boris Dayma 🖍️
2 months
But then again sometimes problems just fix themselves… You just need to ignore what happened 🫣.
My workflow when things are bad:
1/ resume from latest checkpoint
2/ same but much earlier checkpoint
3/ same but add a warmup on learning rate
In this instance, 1/ and 2/ failed. I
Tweet media one
0
0
1
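Step 3/ of this workflow can be sketched as prepending a fresh warmup to the original schedule after the resume point (a hand-rolled cosine schedule stands in for the real one; all constants are illustrative):

```python
import math

PEAK_LR, TOTAL_STEPS = 1e-3, 100_000

def base_schedule(step):
    """Original cosine-decay schedule of the run (illustrative)."""
    frac = min(step, TOTAL_STEPS) / TOTAL_STEPS
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * frac))

def resumed_schedule(step, resume_step, warmup_steps=1_000):
    warm = min(step / warmup_steps, 1.0)             # fresh linear warmup
    return warm * base_schedule(step + resume_step)  # offset into original run

# LR restarts at 0 and rejoins the original schedule after warmup_steps.
```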
@borisdayma
Boris Dayma 🖍️
2 months
I like that idea!
When you get a bad spike you typically revert to a checkpoint earlier. But sometimes it keeps happening so you revert further. Doing an average of a few recent checkpoints (if EMA is not available) instead seems like a good idea.
@giffmana
Lucas Beyer (bl16)
2 months
Another interesting thing they observe is that merged models seem to be more stable, somehow. Left: gradnorms of SFT on a merge look a lot healthier. Right: they suggest resuming pre-training from a merge instead of a rollback when hitting a "killer" spike. Though they
Tweet media one
2
0
21
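The averaging fallback described here can be sketched as a simple element-wise mean over the last few saved checkpoints (structure and names are illustrative):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """checkpoints: list of dicts mapping parameter name -> np.ndarray."""
    n = len(checkpoints)
    return {name: sum(c[name] for c in checkpoints) / n for name in checkpoints[0]}

# e.g. the three most recent checkpoints saved before the spike
ckpts = [{"w": np.full(2, float(i))} for i in (1, 2, 3)]
merged = average_checkpoints(ckpts)  # {"w": array([2., 2.])}
```

This is the cheap stand-in mentioned in the tweet for when no EMA of the weights was kept during training.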
@borisdayma
Boris Dayma 🖍️
2 months
It's tricky to understand the level of "intelligence" of an AI model. Sometimes you get pleasantly surprised on fixing bugs, doing research, etc. Other times simple tasks for humans fail (count "r" in strawberry, make me a comparison table based on these 2 web pages).
0
0
4
@borisdayma
Boris Dayma 🖍️
3 months
From OpenAI image model pricing we know the number of tokens per image quality setting:
- small -> 32x32 tokens (same as Dalle-1)
- medium -> 64x64 tokens
- large -> 128x128 tokens
Doing some tests on the small setting we can see they have a pretty good image encoder. Also quite.
1
1
29
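As a sanity check, the absolute token counts follow directly from the grid sizes quoted in the tweet:

```python
# Tokens per image = grid side squared, using the grid sizes quoted above.
grids = {"small": 32, "medium": 64, "large": 128}
tokens = {name: side * side for name, side in grids.items()}
print(tokens)  # {'small': 1024, 'medium': 4096, 'large': 16384}
```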
@borisdayma
Boris Dayma 🖍️
4 months
Overall I'm still quite impressed but I don't think I can use it yet, except for simple functionality. My preferred way is to do "complex" things myself and "simple" things by asking multiple LLMs (ChatGPT/Gemini/Claude) and reviewing/comparing their answers. Claude Code is still way.
1
0
4
@borisdayma
Boris Dayma 🖍️
4 months
In the end, I always had some comments on the proposed edits (keep my comments, wrong logic, don't create new files) and Claude would just instantly give up trying to edit those files at all!
It continued "simmering" for 6 more minutes ($$) without any proposed change and without.
1
0
1
@borisdayma
Boris Dayma 🖍️
4 months
Introduction of hard-to-find bugs:
- 👎 it refactored a file that has complex logic. At first the code looked really impressive: clean, well documented. But digging into it I could see it introduced bugs and changed my workflow, while I had said that the code was already bug-free.
1
0
2
@borisdayma
Boris Dayma 🖍️
4 months
Experience of considering my feedback was not great:
- 👎 it suggested a nice reorganization of a file but removed important comments, so I selected not to do the edit and said "this is good but you removed important comments that I would like to keep". He acknowledged but just.
1
0
1
@borisdayma
Boris Dayma 🖍️
4 months
It was painful to prevent it from creating too many files:
- happened at my first attempt, which is not what I wanted (personal preference), but I had not explicitly asked for it so I restarted it (and cleared history)
- 👎 after I clarified my request and asked to minimize the.
1
0
1