Rosinality (@rosinality)
ML Engineer · Seoul, Korea · Joined October 2008
3K Followers · 23K Following · 479 Media · 32K Statuses
            
           Again, I love this podcast. (This episode is about Kimi Linear with Songlin Yang). 
Interleaved reasoning with image generation. Zoom-in behavior emerges from training. Interesting. With better pretraining, maybe visual reasoning can start to be solved.
Autoregressive prediction of latents from an autoencoder that compresses K tokens. They used an energy loss for prediction instead of more popular choices like flow matching. Would this be effective for general continuous autoregression (e.g., images)?
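For context, a minimal sketch of what such an energy objective could look like, assuming the generic energy score between a few model samples and the target latent (the paper's exact formulation may differ; `model`, `z_prev`, and `z_next` are hypothetical names):

    import torch

    def energy_score_loss(samples: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # samples: (n, d) draws from the model's predictive distribution
        # target:  (d,)   next latent produced by the autoencoder
        # Minimizing E||x - y|| - 0.5 * E||x - x'|| is a strictly proper
        # scoring rule: it is optimized when the samples match the target
        # distribution, with no flow-matching ODE or adversary needed.
        n = samples.shape[0]
        attraction = (samples - target).norm(dim=-1).mean()   # E||x - y||
        pair_dists = torch.cdist(samples, samples)            # (n, n) pairwise distances
        repulsion = pair_dists.sum() / (n * (n - 1))          # mean over i != j
        return attraction - 0.5 * repulsion

    # Hypothetical usage: the AR model emits n stochastic samples of the
    # next latent given the previous ones.
    # loss = energy_score_loss(model(z_prev), z_next)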
             (Though this constantly causes an urge to reinvent the inference stack.) 
The inference stack is becoming more and more sophisticated and fast-moving. It needs to be thoroughly verified, and if you develop a post-training stack, you inevitably have to understand it thoroughly.
If it happens only on certain devices, there is very likely a bug in kernel-device matching.
          
@danielhanchen, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug.
Using bfloat16 in RL does not worry me; maybe someday needing to do it in NVFP4 will.
             Only one character diff! 
             shout out to @Grad62304977 for discovering this paper on arxiv tonight and then immediately reproducing it 
Training an LLM to predict temperature and top-p parameters for its own tokens. It is trained simply with the autoregressive loss. Hmm.
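A rough sketch of how such per-token sampling-parameter heads might be wired up; the head design and the way temperature folds into the loss are my assumptions, not necessarily the paper's:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SamplingParamHeads(nn.Module):
        # Predicts per-token temperature and top-p from the transformer's
        # hidden states. Temperature scales the logits before the standard
        # next-token cross-entropy, so the whole thing trains with the
        # plain AR loss; top-p is only applied at inference time.
        def __init__(self, d_model: int):
            super().__init__()
            self.temp_head = nn.Linear(d_model, 1)
            self.top_p_head = nn.Linear(d_model, 1)

        def forward(self, hidden: torch.Tensor):
            # softplus keeps temperature positive; sigmoid keeps top-p in (0, 1)
            temperature = F.softplus(self.temp_head(hidden)) + 1e-3  # (B, T, 1)
            top_p = torch.sigmoid(self.top_p_head(hidden))           # (B, T, 1)
            return temperature, top_p

    def ar_loss(logits, hidden, targets, heads: SamplingParamHeads):
        # logits: (B, T, V), hidden: (B, T, d), targets: (B, T)
        temperature, _ = heads(hidden)
        scaled = logits / temperature  # per-token temperature scaling
        return F.cross_entropy(scaled.flatten(0, 1), targets.flatten())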
FP16 can have a smaller training-inference gap than BFloat16, and thus fits RL better. Even the differences between RL algorithms vanish once FP16 is adopted. Surprising!
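The intuition: BF16 keeps FP32's 8 exponent bits but only 7 mantissa bits, while FP16 has 10, so values passing through mismatched training and inference kernels round roughly 8x more coarsely under BF16. A toy illustration of just that rounding gap (not the paper's experiment):

    import torch

    torch.manual_seed(0)
    x = torch.randn(10_000)  # activations at unit scale

    for dtype in (torch.float16, torch.bfloat16):
        err = (x - x.to(dtype).float()).abs()
        print(f"{dtype}: mean rounding error {err.mean():.2e}, max {err.max():.2e}")

    # bfloat16's rounding error comes out ~8x float16's (3 fewer mantissa
    # bits), which is the headroom FP16 buys for matching training and
    # inference numerics, as long as its narrow exponent range is managed.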