
Boris Dayma
@borisdayma
Followers: 14K · Following: 5K · Media: 354 · Statuses: 2K
Founder of Craiyon · Author of dalle-mini
Joined February 2012
muP worked quite well.
1/ A search over a tiny model found the optimal LR range; I selected the middle of that range on a log scale (1e-3 here).
2/ Scale up with muP and keep the same LR. I compare "large - mup tuned" to my "large - tuned" baseline and get the same perf, which is good. The large baseline…
muP has been on my mind forever! Now I came across this gem from @JesseFarebro: it automatically handles it in JAX/Flax. Just need to see what to adjust for Muon / Shampoo / PSGD-kron (init params + LR scaling).
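For concreteness, here is a minimal hand-rolled sketch of the recipe above: sweep the LR on a tiny proxy model, pick the middle of the good range on a log scale, then reuse that base LR at scale with muP-style init and per-layer LR scaling. It assumes plain Adam via optax and a toy dict of weight matrices; the widths, layer names, and constants are illustrative, and the library mentioned above automates this bookkeeping.

```python
# Sketch only: one common set of muP rules for Adam, applied by hand.
# The toy "model" is just a dict of weight matrices; all names are hypothetical.
import jax
import jax.numpy as jnp
import optax

base_width, width = 128, 1024      # width of the tiny proxy vs. the scaled-up model
mult = width / base_width          # width multiplier m
base_lr = 1e-3                     # middle of the optimal LR range found on the proxy

def init_params(key, d_in, d_hidden, d_out):
    k1, k2 = jax.random.split(key)
    return {
        # input weights: fan_in is fixed (d_in), so init and LR do not change with width
        "w_in": jax.random.normal(k1, (d_in, d_hidden)) / jnp.sqrt(d_in),
        # hidden weights: init std ~ 1/sqrt(fan_in); Adam LR scaled by 1/m
        "w_hidden": jax.random.normal(k2, (d_hidden, d_hidden)) / jnp.sqrt(d_hidden),
        # output weights: zero init (a common muP choice); Adam LR scaled by 1/m
        "w_out": jnp.zeros((d_hidden, d_out)),
    }

# Per-parameter LR: hidden/output matrices shrink with width, input does not.
lr_scales = {"w_in": 1.0, "w_hidden": 1.0 / mult, "w_out": 1.0 / mult}

optimizer = optax.multi_transform(
    {name: optax.adam(base_lr * scale) for name, scale in lr_scales.items()},
    param_labels={name: name for name in lr_scales},
)

params = init_params(jax.random.PRNGKey(0), d_in=512, d_hidden=width, d_out=512)
opt_state = optimizer.init(params)
```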
I would compare Gemini 2.5 Flash Lite with Gemini 2.0 Flash since they are at the same price per token. 2.5 Flash is much more expensive so it is in another category. Based on this index, 2.5 Lite is not as good. And it uses more tokens (which increases $$ further). Will wait for now…
Google Gemini 2.5 Flash-Lite is very verbose in non-reasoning and reasoning modes, using a significantly higher number of output tokens than even frontier models. However, Gemini 2.5 Flash-Lite is more price competitive than Gemini 2.5 Flash.
RT @giffmana: Gemini 2.5 paper TL;DR. Technical part in thread. Contributors: ~1k. 2.5 Pro timed out counting after 600s. 2.5 Flash counts 1…
"Based on developer feedback"?.Who asked for this "simplified pricing" of Gemini 2.5 flash in non-thinking mode:.- input $0.15 -> $0.30.- output $0.60 -> $2.50.Too bad, I was really excited about the GA π’.
The Gemini 2.5 Flash 05-20 variant is now the stable model we plan to support long term for Flash, and based on developer feedback, we have simplified the pricing and introduced an even smaller variant optimized for cost. (4/N)
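For scale, a quick back-of-the-envelope comparison of the old vs. new rates quoted above, assuming (as Gemini pricing is usually quoted) that they are USD per 1M tokens, and picking a hypothetical request size:

```python
# Back-of-the-envelope request cost under the old vs. new non-thinking rates above.
# Prices assumed to be USD per 1M tokens; the request size is hypothetical.
OLD = {"input": 0.15, "output": 0.60}
NEW = {"input": 0.30, "output": 2.50}

def cost(prices, input_tokens, output_tokens):
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

in_tok, out_tok = 10_000, 2_000
old, new = cost(OLD, in_tok, out_tok), cost(NEW, in_tok, out_tok)
print(f"old ${old:.4f} -> new ${new:.4f} ({new / old:.1f}x)")  # input 2x, output ~4.2x
```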
I like that idea! When you get a bad spike, you typically revert to an earlier checkpoint. But sometimes it keeps happening, so you revert further. Averaging a few recent checkpoints instead (if an EMA is not available) seems like a good idea.
Another interesting thing they observe is that merged models seem more stable, somehow? Left: grad norms of SFT on a merge look a lot healthier. Right: they suggest resuming pre-training from a merge instead of a rollback when hitting a "killer" spike. Though they…
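A minimal sketch of what averaging recent checkpoints could look like in JAX, assuming each checkpoint is a parameter pytree already loaded in memory (how they are loaded, and how optimizer state is handled, depends on your setup):

```python
# Average the parameters of the last few checkpoints into a "merged" state to resume
# from after a spike, instead of rolling back to a single earlier checkpoint.
# Assumes every checkpoint is a pytree of arrays with identical structure.
import jax
import jax.numpy as jnp

def average_checkpoints(checkpoints):
    """Element-wise mean over a list of parameter pytrees."""
    return jax.tree_util.tree_map(
        lambda *leaves: jnp.mean(jnp.stack(leaves), axis=0),
        *checkpoints,
    )

# Hypothetical usage: `recent` holds the last 3 saved parameter trees.
# merged_params = average_checkpoints(recent)
# Resume training from `merged_params`; optimizer state typically needs to be
# reloaded from one checkpoint or reset, since averaging it is less well defined.
```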