Rosinality (@rosinality)
ML Engineer · Seoul, Korea · Joined October 2008
3K Followers · 23K Following · 479 Media · 32K Statuses
            
           Again, I love this podcast. (This episode is about Kimi Linear with Songlin Yang). 
Interleaved reasoning with image generation. Zoom-in behavior emerges from training. Interesting. With better pretraining, maybe visual reasoning can start to be solved.
Autoregressive prediction of latents from an autoencoder that compresses K tokens. They used an energy loss for prediction instead of more popular choices like flow matching. Would this be effective for general continuous autoregression (e.g., images)?
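For context, a minimal sketch of what such an energy objective could look like, assuming the generic energy score between a few model samples and the target latent (the paper's exact formulation may differ; `model`, `z_prev`, and `z_next` are hypothetical names):

    import torch

    def energy_score_loss(samples: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # samples: (n, d) draws from the model's predictive distribution
        # target:  (d,)   next latent produced by the autoencoder
        # Minimizing E||x - y|| - 0.5 * E||x - x'|| is a strictly proper
        # scoring rule: it is optimized when the samples match the target
        # distribution, with no flow-matching ODE or adversary needed.
        n = samples.shape[0]
        attraction = (samples - target).norm(dim=-1).mean()   # E||x - y||
        pair_dists = torch.cdist(samples, samples)            # (n, n) pairwise distances
        repulsion = pair_dists.sum() / (n * (n - 1))          # mean over i != j
        return attraction - 0.5 * repulsion

    # Hypothetical usage: the AR model emits n stochastic samples of the
    # next latent given the previous ones.
    # loss = energy_score_loss(model(z_prev), z_next)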
             (Though this constantly causes an urge to reinvent the inference stack.) 
The inference stack is becoming more and more sophisticated and fast-moving. It needs to be thoroughly verified, and if you develop a post-training stack, you inevitably have to understand it thoroughly.
If it happens only on certain devices, there is very likely a bug in kernel-device matching.
          
@danielhanchen, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug.
Using bfloat16 in RL does not worry me; maybe someday needing to do it in NVFP4 will.
             Only one character diff! 
             shout out to @Grad62304977 for discovering this paper on arxiv tonight and then immediately reproducing it 
Training an LLM to predict temperature and top-p parameters for its own tokens. It is trained simply with the autoregressive loss. Hmm.
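A rough sketch of how such per-token sampling-parameter heads might be wired up; the head design and the way temperature folds into the loss are my assumptions, not necessarily the paper's:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SamplingParamHeads(nn.Module):
        # Predicts per-token temperature and top-p from the transformer's
        # hidden states. Temperature scales the logits before the standard
        # next-token cross-entropy, so the whole thing trains with the
        # plain AR loss; top-p is only applied at inference time.
        def __init__(self, d_model: int):
            super().__init__()
            self.temp_head = nn.Linear(d_model, 1)
            self.top_p_head = nn.Linear(d_model, 1)

        def forward(self, hidden: torch.Tensor):
            # softplus keeps temperature positive; sigmoid keeps top-p in (0, 1)
            temperature = F.softplus(self.temp_head(hidden)) + 1e-3  # (B, T, 1)
            top_p = torch.sigmoid(self.top_p_head(hidden))           # (B, T, 1)
            return temperature, top_p

    def ar_loss(logits, hidden, targets, heads: SamplingParamHeads):
        # logits: (B, T, V), hidden: (B, T, d), targets: (B, T)
        temperature, _ = heads(hidden)
        scaled = logits / temperature  # per-token temperature scaling
        return F.cross_entropy(scaled.flatten(0, 1), targets.flatten())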
FP16 can have a smaller training-inference gap than BFloat16, and thus fits RL better. Even the differences between RL algorithms vanish once FP16 is adopted. Surprising!
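The intuition: BF16 keeps FP32's 8 exponent bits but only 7 mantissa bits, while FP16 has 10, so values passing through mismatched training and inference kernels round roughly 8x more coarsely under BF16. A toy illustration of just that rounding gap (not the paper's experiment):

    import torch

    torch.manual_seed(0)
    x = torch.randn(10_000)  # activations at unit scale

    for dtype in (torch.float16, torch.bfloat16):
        err = (x - x.to(dtype).float()).abs()
        print(f"{dtype}: mean rounding error {err.mean():.2e}, max {err.max():.2e}")

    # bfloat16's rounding error comes out ~8x float16's (3 fewer mantissa
    # bits), which is the headroom FP16 buys for matching training and
    # inference numerics, as long as its narrow exponent range is managed.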