Rosinality

@rosinality

Followers: 3K
Following: 23K
Media: 479
Statuses: 32K

ML Engineer

Seoul, Korea
Joined October 2008
@rosinality
Rosinality
12 hours
Again, I love this podcast. (This episode is about Kimi Linear with Songlin Yang).
1
0
10
@rosinality
Rosinality
1 day
Interleaved reasoning with image generation. Zoom-in behavior is able to emerge from training. Interesting. With better pretraining, maybe visual reasoning can start to be solved.
1
3
17
@rosinality
Rosinality
1 day
Autoregressive prediction of latents from an autoencoder that compresses K tokens. They used an energy loss for prediction instead of more popular choices like flow matching. Would this be effective for general continuous autoregression (e.g., images)?
4
13
119
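This reads like the energy-score objective some recent continuous-latent papers use: instead of learning a flow, the predictor emits a few noisy latent samples per step and is scored against the ground-truth latent. A minimal PyTorch sketch of such a loss (function and tensor names are mine, not from the paper):

import torch

def energy_score_loss(pred_samples, target, beta=1.0):
    # pred_samples: (batch, n_samples, dim) latents sampled from the model head
    #               (e.g., by feeding it different noise vectors).
    # target:       (batch, dim) ground-truth latent from the autoencoder.
    n = pred_samples.size(1)
    # Attraction: mean distance from each sample to the target latent.
    attract = (pred_samples - target.unsqueeze(1)).norm(dim=-1).pow(beta).mean(dim=1)
    # Repulsion: mean pairwise distance between samples (keeps them diverse).
    diff = pred_samples.unsqueeze(1) - pred_samples.unsqueeze(2)
    pair = (diff.pow(2).sum(dim=-1) + 1e-12).sqrt().pow(beta)
    repulse = pair.sum(dim=(1, 2)) / (n * (n - 1))  # diagonal terms are zero
    return (2 * attract - repulse).mean()

The attraction term pulls samples toward the target while the repulsion term keeps them spread out; since the energy score is a proper scoring rule, minimizing it can in principle recover the latent distribution with a single forward pass per step instead of an iterative flow-matching sampler.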
@rosinality
Rosinality
2 days
(Though this constantly causes an urge to reinvent the inference stack.)
0
0
1
@rosinality
Rosinality
2 days
The inference stack is becoming more and more sophisticated and fast-moving. It should be thoroughly verified, and if you develop a post-training stack, it is inevitable that you will need to understand it thoroughly.
1
0
6
@rosinality
Rosinality
2 days
It's highly probable that there is a bug in kernel-device matching if it happens only for certain devices.
@RichardYRLi
Yingru Li
2 days
@danielhanchen, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug
1
2
28
@rosinality
Rosinality
4 days
Using bfloat16 in RL does not worry me; maybe someday needing to do it in nvfp4 will.
1
2
42
@rosinality
Rosinality
4 days
Only one character diff!
0
0
17
@johannes_hage
Johannes Hagemann
4 days
shout out to @Grad62304977 for discovering this paper on arxiv tonight and then immediately reproducing it
4
1
32
@samsja19
samsja
4 days
@rosinality
Rosinality
4 days
FP16 can have a smaller training-inference gap than BFloat16, and thus fits RL better. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
8
39
837
@rosinality
Rosinality
4 days
New math benchmark. Scores are quite a bit weaker than I expected.
3
7
103
@rosinality
Rosinality
4 days
Training an LLM to predict temperature and top-p parameters for its own tokens. It is simply trained with the autoregressive loss. Hmm.
11
15
198
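One plausible way to wire this up (my reading, not necessarily the paper's): a small extra head maps each hidden state to (temperature, top-p), and the predicted temperature rescales the logits inside the ordinary next-token cross-entropy, so the whole thing really is trained with nothing but the autoregressive loss; top-p would only matter at sampling time. A rough PyTorch sketch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SamplingParamHead(nn.Module):
    # Predicts per-token (temperature, top_p) from the decoder hidden state.
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, 2)

    def forward(self, hidden):                    # hidden: (batch, seq, d_model)
        raw_t, raw_p = self.proj(hidden).unbind(dim=-1)
        temperature = F.softplus(raw_t) + 1e-3    # keep temperature strictly positive
        top_p = torch.sigmoid(raw_p)              # keep top-p in (0, 1)
        return temperature, top_p

def next_token_loss(logits, hidden, targets, head):
    # logits: (batch, seq, vocab), targets: (batch, seq).
    # The predicted temperature scales the logits before the usual cross-entropy,
    # so the head receives gradients from the same autoregressive loss;
    # top_p is carried along only for inference-time nucleus sampling.
    temperature, _top_p = head(hidden)
    scaled = logits / temperature.unsqueeze(-1)
    return F.cross_entropy(scaled.transpose(1, 2), targets)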
@rosinality
Rosinality
4 days
FP16 can have a smaller training-inference gap than BFloat16, and thus fits RL better. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
31
127
1K
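A toy illustration of why the formats behave differently (mine, not from the thread): bfloat16 keeps only 7 mantissa bits versus float16's 10, so near 1.0 it resolves steps of roughly 2^-7 ≈ 0.008 versus 2^-10 ≈ 0.001. Small logit gaps that survive fp16 rounding get flattened in bf16, which is one way a numerical mismatch between the training and inference stacks turns into a policy gap during RL rollouts:

import torch

logits = torch.tensor([1.0000, 1.0030])   # two candidate-token logits, 3e-3 apart

print(logits.to(torch.float16).float())   # tensor([1.0000, 1.0029]) - gap preserved
print(logits.to(torch.bfloat16).float())  # tensor([1., 1.])         - gap rounded away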