sway (@SwayStar123)
1K followers · 9K following · 282 media · 1K statuses
Vegan btw, working on diffusion models. I speak Japanese, currently learning Chinese.
Joined July 2015
Not only is guidance distillation almost a free lunch training-wise (very cheap to distill), it also seems to improve the model by avoiding some failure cases! 2x faster sampling + slightly better samples :)
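For context, a minimal sketch of what guidance distillation looks like in PyTorch, assuming a noise-prediction teacher and a student that takes the guidance scale as an extra input (the function signatures are illustrative, not any lab's actual API):

```python
import torch
import torch.nn.functional as F

def cfg_distill_step(teacher, student, x_t, t, cond, uncond, scale, optimizer):
    # The student learns to match the teacher's CFG-combined output in a
    # single forward pass, so sampling drops from two teacher evaluations
    # (cond + uncond) per step to one student evaluation -> ~2x speedup.
    with torch.no_grad():
        eps_cond = teacher(x_t, t, cond)
        eps_uncond = teacher(x_t, t, uncond)
        target = eps_uncond + scale * (eps_cond - eps_uncond)  # standard CFG combo

    # Conditioning the student on `scale` lets one network cover a range
    # of guidance strengths, as in distilled-guidance setups.
    pred = student(x_t, t, cond, scale)
    loss = F.mse_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```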
Remember when people made fun of diffusion models for being unable to generate handshakes?
I don't think this is actually that big of a deal. MathArena Apex basically picks out all the problems that the majority of the then-top LLMs could not solve, so GPT-5 was going to score 0 by definition, and any improvement is bound to look impressive in comparison.
Trying out Gemini 3 in Cursor; it's a lot more methodical in finding bugs. After finding a suspect, it will make a test file with a minimal repro and useful logs, and only then continue to fix it. Whereas Claude/ChatGPT will assume they are correct and move on to their fix, which…
Windsurf is actually pretty good, and possibly even better than Cursor.
CFG/guidance distillation seems to be pretty much a free lunch. Why do no labs apart from BFL do it?
ML is so funny cuz sometimes you have bugs and the model still learns, and sometimes you have no bugs and the model still doesn't learn.
GPT-5.1 Codex: garbage, absolutely broken in Windsurf.
GPT-5.1: pretty good, and free, so you can use even high thinking unlimited. But I think the thinking is not well integrated (it doesn't remember its previous thoughts).
Sonnet 4.5: still the GOAT; you need Claude to solve bugs.
Managed to single-handedly revive a dead Slack channel by sending the first message there in 3 years, and now suddenly everyone's using it.
Anyone wanna give me 8xH100 for a few months? The aim is to get XL@100k = 2 FID or less. There's so much room for improvement I didn't have the compute/time to test:
- alternative optimizers (Muon, SOAP/Shampoo, Prodigy)
- newer techniques like SPRINT
- a custom VAE (can you outdo RAE with an…
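On the optimizer swap, a hedged sketch of how that ablation could be wired, assuming the `prodigyopt` package for Prodigy (Muon and SOAP/Shampoo would slot in the same way from their reference repos, whose APIs I'm not pinning down here):

```python
import torch

def build_optimizer(name, model, lr=1e-4):
    # Hypothetical ablation switch: one flag per optimizer candidate.
    params = model.parameters()
    if name == "adamw":
        return torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.95), weight_decay=0.01)
    if name == "prodigy":
        from prodigyopt import Prodigy  # assumption: pip install prodigyopt
        # Prodigy adapts its own step size, so lr is conventionally left at 1.0.
        return Prodigy(params, lr=1.0, weight_decay=0.01)
    raise ValueError(f"unknown optimizer: {name}")
```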
Black Myth: Wukong is such a beautiful game, I could make most of the random walkways my wallpapers.
Ok nvm, the tiny AE is unstable to train; switching to the invae from REPA-E and modifying it for my purposes.
Claude casually just adding its whole thinking process to my code... Why does Claude act like this when it's in Cursor?
Trying out a tiny-AE arch (f32c256) with REPA-E, aligning the latents, the encoder, and the decoder. Also added a structured-latents loss and noise shifting based on channel size from RAE. The image is diffusion model samples at 4k steps. Will eval at 100k to see if this is close to RAE.
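A hedged sketch of what the channel-size noise shifting could look like, assuming the SD3-style time shift t' = s*t / (1 + (s - 1)*t) with the shift factor derived from latent dimensionality (the sqrt scaling and the base dimension are my assumptions, not necessarily RAE's exact recipe):

```python
import math
import torch

def shifted_timesteps(t, dim, base_dim=16):
    # Higher-dimensional latents retain more signal at a given noise level,
    # so push sampled times toward higher noise as dim grows.
    s = math.sqrt(dim / base_dim)  # assumption: sqrt scaling in channel count
    return s * t / (1 + (s - 1) * t)

t = torch.rand(8)                          # uniform flow-matching times in [0, 1]
t_shifted = shifted_timesteps(t, dim=256)  # the f32c256 latents above
```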
So RAE is really cool, but I'm thinking that even with their DDT head, scaling this to higher compression would be really hard (naively, just increasing the patch size means modelling 768*4+ dims per token). So autoencoders might not quite be dead yet for deep compression. I'm thinking of experimenting…
So REPA works with both SigLIP and DINO, but most people use DINO as it has the best performance (by a tiny margin). But if you were doing T2I tasks, wouldn't using SigLIP be better? You can use it both for text encoding and image representation alignment. Only have to use a…
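For reference, a minimal sketch of a REPA-style alignment term with SigLIP patch features standing in for DINO's (tensor names and the projection head are illustrative):

```python
import torch
import torch.nn.functional as F

def repa_loss(dit_hidden, siglip_feats, proj):
    # dit_hidden:   (B, N, D_model) intermediate DiT activations
    # siglip_feats: (B, N, D_enc) frozen SigLIP patch embeddings
    # proj:         small trainable MLP mapping D_model -> D_enc
    z = proj(dit_hidden)
    # Maximize patchwise cosine similarity between projected activations
    # and the frozen encoder features (added on top of the diffusion loss).
    return -F.cosine_similarity(z, siglip_feats, dim=-1).mean()
```

The appeal for T2I being that the same SigLIP checkpoint also supplies the text tower for conditioning, so one pretrained model does double duty.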
FIBO/Bria4 will soon be 3x faster! Looking into h2 cache too (a new paper claiming to be better than TeaCache).
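A hedged sketch of the general idea behind TeaCache-style step caching: skip the full transformer evaluation when the input has barely drifted since the last real call and reuse the cached output (the drift metric and threshold here are illustrative; TeaCache's actual indicator is modulated by timestep embeddings):

```python
import torch

class StepCache:
    # Wraps a diffusion transformer; reuses the last output on steps
    # where the latent has changed very little. Illustrative only.
    def __init__(self, model, threshold=0.05):
        self.model, self.threshold = model, threshold
        self.prev_x, self.prev_out = None, None

    def __call__(self, x, t, cond):
        if self.prev_x is not None:
            drift = (x - self.prev_x).abs().mean() / self.prev_x.abs().mean()
            if drift.item() < self.threshold:
                return self.prev_out  # cheap step: reuse cached output
        out = self.model(x, t, cond)
        self.prev_x, self.prev_out = x.detach(), out.detach()
        return out
```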