 
            
Grad (@Grad62304977) · Followers 8K · Following 40K · Media 113 · Statuses 3K · Joined October 2020
            
            
Many people are confused by Minimax’s recent return to full attention, especially since it made the first large-scale pivot toward hybrid linear attention, and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts like Qwen3-Next, or Qwen3.5). I actually 
          
                
Kimi Linear Tech Report has dropped! 🚀  https://t.co/LwNB2sQnzM  Kimi Linear: a novel architecture that outperforms full attention with faster speeds and better performance, ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi 
          
            
             Update: After an incredible year at @PrimeIntellect, I have decided to take my next step in August. Grateful that I got to work with such a talented team and build the best open-source RL infra! For now, I'm continuing to work on RL for coding agents. Will share updates :) 
          
                
              
Interested in seeing the ablations here compared to the original paper (where did they find it is actually weaker?); it would probably help the community a good amount. The original post mentions converting the model into a hybrid SWA, which seems pretty different to pretraining like 
          
              @JingweiZuo Yes. And this is one of the reasons that we do not use lightning attention in M2.
            
          
                
              
A year ago, AI could barely use a few tools. Now it can handle hundreds. Mike Krieger, Anthropic's CPO, explains why this changes everything for real-world agents. 
          
                
              
            
@yifan_zhang_ @OpenAI @_akhaliq @vllm_project A small correction: M2 is a full-attention model. Actually, during our pre-training phase, we tried to transform the full-attention model into an OSS-like structure using SWA. But we found that it hurt the performance of multi-hop reasoning, so we ultimately did not use this setting.
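The conversion being described swaps full causal attention for a sliding-window mask. A minimal sketch of just the mask difference (illustrative only, not Minimax's implementation; the window size here is a placeholder):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean causal attention mask: query position i may attend to key j
    iff j <= i and i - j < window (itself plus the window-1 previous tokens)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# Full causal attention is the special case window >= seq_len,
# so "converting to SWA" just shrinks the window each query can see.
full = sliding_window_mask(4, 4)
swa = sliding_window_mask(4, 2)
```

The multi-hop concern follows directly: with window w, information from a token more than w positions back can only reach a query through repeated layer-by-layer propagation, not in a single attention hop.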
          
          
                
              
             Looks like nvm no SWA?? Very weird 
          
          
                
              
             Looks like they've abandoned naive linear attention for SWA 
Minimax M2 is a 230B-A10B MoE. For comparison, Minimax-M1 was 456B total with 45.9B active, i.e. a typical V3-class model (with some differences like 'lightning attention'). M2 apparently beats the hell out of M1 and everything below it. Very good progress from Hailuo 
            
                
              
              
             Tbh I never really got 10+ year timelines. To me they just mean that we need 1 or more breakthroughs and we just assume a decade is enough to find them 
The @karpathy interview
0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 – Why self 
            
                
              
Seems like this was an important part of the paper. I’ve also found this instability, even without length normalisation, DeepSeek R1-style 
           Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or would RL even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs 
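For reference, the length-normalisation choice being discussed is whether the clipped objective is averaged per response or over all tokens in the batch. A sketch of the two standard forms (not necessarily this paper's exact objective):

```latex
\mathcal{J}_{\text{seq-mean}}
  = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\ell_{i,t}\right],
\qquad
\mathcal{J}_{\text{token-mean}}
  = \mathbb{E}\!\left[\frac{\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\ell_{i,t}}{\sum_{i=1}^{G}|o_i|}\right],
```

where $\ell_{i,t} = \min\!\bigl(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_i\bigr)$ and $\rho_{i,t}$ is the token-level importance ratio. The per-response $1/|o_i|$ factor in the first form is the "length normalisation" in question; dropping it changes how long responses are weighted in the gradient.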
            
                
              
Sneak peek from a paper about scaling RL compute for LLMs: probably the most compute-expensive paper I've worked on, but hoping that others can run experiments cheaply for the science of scaling RL. Coincidentally, this is similar motivation to what we had for the NeurIPS best 
          
                
              
Seems like no one saw this either. They propose RL with no-think, and that improves performance both with and without thinking, as well as shortening reasoning with thinking. Looks like pretty good non-thinking performance in Table 8, and seems like GPQA is boosted heavily with this 
          
                
              
I guess we’re doing a fine-grained MoE sideways now 
🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model! ✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. 👉 Now live on App, Web, and API. 💰 API prices cut by 50%+! 1/n 
          
                
              
The high GPQA-Diamond scores remind me of o1. Would not be surprised if o1 is straight RL and there's some magic in RL from a base 
Very interesting how R1-Zero is still far ahead of the final R1 in certain benchmarks like GPQA-Diamond and CNMO. Also, a GRPO clip ratio of 10 seems to pretty much confirm that they use a sequence-level importance ratio, as their formula shows, different from the original GRPO and 
            
                
              
Also, a prompt batch size of just 32, a group size of 16 (good, but I expected higher for a large-scale run like this, esp for R1-Zero), and 16 minibatch updates seem like pretty weak hparams (esp the batch size). Surprised R1 turned out as good as it did lol, I guess the V3 base is 
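For context, the sequence-level importance ratio being inferred here replaces GRPO's per-token ratio with one ratio for the whole response (a GSPO-style sketch, not necessarily the exact formula in the report):

```latex
\rho_i(\theta)
  = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}
  = \prod_{t=1}^{|o_i|}
    \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})},
\qquad
\mathcal{J}
  = \mathbb{E}\Bigl[\min\bigl(\rho_i \hat{A}_i,\
    \operatorname{clip}(\rho_i,\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_i\bigr)\Bigr].
```

Because $\rho_i$ is a product over many token ratios, it drifts much farther from 1 than any single token-level ratio, which is consistent with a clip range as wide as 10.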
          
                