sway (@SwayStar123)
1K followers · 9K following · 282 media · 1K statuses
Vegan btw, working on diffusion models. I speak Japanese, currently learning Chinese.
Joined July 2015
Not only is guidance distillation almost a free lunch training-wise (very cheap to distill), it also seems to improve the model by avoiding some failure cases! 2x faster sampling + slightly better samples :)
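For context, a minimal sketch of what guidance distillation looks like in PyTorch, assuming a noise-prediction teacher and a student that takes the guidance scale as an extra input (the function signatures are illustrative, not any lab's actual API):

```python
import torch
import torch.nn.functional as F

def cfg_distill_step(teacher, student, x_t, t, cond, uncond, scale, optimizer):
    # The student learns to match the teacher's CFG-combined output in a
    # single forward pass, so sampling drops from two teacher evaluations
    # (cond + uncond) per step to one student evaluation -> ~2x speedup.
    with torch.no_grad():
        eps_cond = teacher(x_t, t, cond)
        eps_uncond = teacher(x_t, t, uncond)
        target = eps_uncond + scale * (eps_cond - eps_uncond)  # standard CFG combo

    # Conditioning the student on `scale` lets one network cover a range
    # of guidance strengths, as in distilled-guidance setups.
    pred = student(x_t, t, cond, scale)
    loss = F.mse_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```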
Remember when people made fun of diffusion models for being unable to generate handshakes?
I don't think this is actually that big of a deal. MathArena Apex basically picks out all the problems that the majority of the then-top LLMs could not solve, so GPT-5 was going to score 0 by definition, and any improvement is bound to look impressive in comparison.
Trying out Gemini 3 in Cursor; it's a lot more methodical in finding bugs. After finding a suspect, it will make a test file with a minimal repro and useful logs, and only then continue to fix it. Whereas Claude/ChatGPT will assume they are correct and move on to their fix, which…
Windsurf is actually pretty good, and possibly even better than Cursor.
CFG/guidance distillation seems to be pretty much a free lunch. Why do no labs apart from BFL do it?
ML is so funny cuz sometimes you have bugs and the model still learns, and sometimes you have no bugs and the model still doesn't learn.
GPT-5.1 Codex: garbage, absolutely broken in Windsurf.
GPT-5.1: pretty good, and free, so you can use even high thinking unlimited. But I think the thinking is not well integrated (it doesn't remember its previous thoughts).
Sonnet 4.5: still the GOAT; you need Claude to solve bugs.
Managed to single-handedly revive a dead Slack channel by sending the first message there in 3 years, and now suddenly everyone's using it.
Anyone wanna give me 8xH100 for a few months? The aim is to get XL@100k = 2 FID or less. There's so much room for improvement I didn't have the compute/time to test:
- alternative optimizers (Muon, SOAP/Shampoo, Prodigy)
- newer techniques like SPRINT
- a custom VAE (can you outdo RAE with an…
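On the optimizer swap, a hedged sketch of how that ablation could be wired, assuming the `prodigyopt` package for Prodigy (Muon and SOAP/Shampoo would slot in the same way from their reference repos, whose APIs I'm not pinning down here):

```python
import torch

def build_optimizer(name, model, lr=1e-4):
    # Hypothetical ablation switch: one flag per optimizer candidate.
    params = model.parameters()
    if name == "adamw":
        return torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.95), weight_decay=0.01)
    if name == "prodigy":
        from prodigyopt import Prodigy  # assumption: pip install prodigyopt
        # Prodigy adapts its own step size, so lr is conventionally left at 1.0.
        return Prodigy(params, lr=1.0, weight_decay=0.01)
    raise ValueError(f"unknown optimizer: {name}")
```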
Black Myth: Wukong is such a beautiful game, I could make most of the random walkways my wallpapers.
Ok nvm, the tiny AE is unstable to train; switching to the invae from REPA-E and modifying it for my purposes.
Claude casually just adding its whole thinking process to my code... Why does Claude act like this when it's in Cursor?
Trying out a tiny-AE arch (f32c256) with REPA-E, aligning the latents, the encoder, and the decoder. Also added a structured-latents loss and noise shifting based on channel size from RAE. The image is diffusion model samples at 4k steps. Will eval at 100k to see if this is close to RAE.
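A hedged sketch of what the channel-size noise shifting could look like, assuming the SD3-style time shift t' = s*t / (1 + (s - 1)*t) with the shift factor derived from latent dimensionality (the sqrt scaling and the base dimension are my assumptions, not necessarily RAE's exact recipe):

```python
import math
import torch

def shifted_timesteps(t, dim, base_dim=16):
    # Higher-dimensional latents retain more signal at a given noise level,
    # so push sampled times toward higher noise as dim grows.
    s = math.sqrt(dim / base_dim)  # assumption: sqrt scaling in channel count
    return s * t / (1 + (s - 1) * t)

t = torch.rand(8)                          # uniform flow-matching times in [0, 1]
t_shifted = shifted_timesteps(t, dim=256)  # the f32c256 latents above
```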
So RAE is really cool, but I'm thinking that even with their DDT head, scaling this to higher compression would be really hard (naively, just increasing the patch size means modelling 768*4+ dims per token). So autoencoders might not quite be dead yet for deep compression. I'm thinking of experimenting…
So REPA works with both SigLIP and DINO, but most people use DINO as it has the best performance (by a tiny margin). But if you were doing T2I tasks, wouldn't using SigLIP be better? You can use it both for text encoding and image representation alignment. Only have to use a…
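For reference, a minimal sketch of a REPA-style alignment term with SigLIP patch features standing in for DINO's (tensor names and the projection head are illustrative):

```python
import torch
import torch.nn.functional as F

def repa_loss(dit_hidden, siglip_feats, proj):
    # dit_hidden:   (B, N, D_model) intermediate DiT activations
    # siglip_feats: (B, N, D_enc) frozen SigLIP patch embeddings
    # proj:         small trainable MLP mapping D_model -> D_enc
    z = proj(dit_hidden)
    # Maximize patchwise cosine similarity between projected activations
    # and the frozen encoder features (added on top of the diffusion loss).
    return -F.cosine_similarity(z, siglip_feats, dim=-1).mean()
```

The appeal for T2I being that the same SigLIP checkpoint also supplies the text tower for conditioning, so one pretrained model does double duty.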
FIBO/Bria4 will soon be 3x faster! Looking into h2 cache too (a new paper claiming to be better than TeaCache).
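A hedged sketch of the general idea behind TeaCache-style step caching: skip the full transformer evaluation when the input has barely drifted since the last real call and reuse the cached output (the drift metric and threshold here are illustrative; TeaCache's actual indicator is modulated by timestep embeddings):

```python
import torch

class StepCache:
    # Wraps a diffusion transformer; reuses the last output on steps
    # where the latent has changed very little. Illustrative only.
    def __init__(self, model, threshold=0.05):
        self.model, self.threshold = model, threshold
        self.prev_x, self.prev_out = None, None

    def __call__(self, x, t, cond):
        if self.prev_x is not None:
            drift = (x - self.prev_x).abs().mean() / self.prev_x.abs().mean()
            if drift.item() < self.threshold:
                return self.prev_out  # cheap step: reuse cached output
        out = self.model(x, t, cond)
        self.prev_x, self.prev_out = x.detach(), out.detach()
        return out
```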