Grad Profile
Grad

@Grad62304977

Followers: 8K
Following: 40K
Media: 113
Statuses: 3K

Joined October 2020
@SonglinYang4
Songlin Yang
6 hours
Many people are confused by Minimax’s recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5). I actually
7
36
284
@Kimi_Moonshot
Kimi.ai
12 hours
Kimi Linear Tech Report is dropped! 🚀 https://t.co/LwNB2sQnzM Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
(link card: huggingface.co)
21
121
818
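(Editor's note: the tweet doesn't spell out KDA, so as background on how linear attention can act as a drop-in for full attention, here is a minimal sketch of the generic linear-attention recurrence. The elu+1 feature map and the plain cumulative state are illustrative assumptions, not Kimi's KDA.)

```python
import torch

def linear_attention(q, k, v):
    """Generic causal linear-attention recurrence: O(T) in sequence length,
    versus O(T^2) for full softmax attention. q, k, v: (T, d) tensors."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0  # positive feature map (illustrative)
    T, d = q.shape
    S = torch.zeros(d, d)   # running sum of outer(phi(k_t), v_t): the constant-size state
    z = torch.zeros(d)      # running sum of phi(k_t), for normalisation
    out = torch.empty_like(v)
    for t in range(T):
        qt, kt = phi(q[t]), phi(k[t])
        S = S + torch.outer(kt, v[t])
        z = z + kt
        out[t] = (qt @ S) / (qt @ z + 1e-6)
    return out
```

The constant-size state is what makes decoding cost independent of context length; the recurrence above is the plain cumulative version, and KDA's actual state-update rule differs (see the report).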
@MatternJustus
Justus Mattern
2 days
Update: After an incredible year at @PrimeIntellect, I have decided to take my next step in August. Grateful that I got to work with such a talented team and build the best open-source RL infra! For now, I'm continuing to work on RL for coding agents. Will share updates :)
31
8
392
@Grad62304977
Grad
4 days
Interested in seeing the ablations here compared to the original paper (where did they find it is actually weaker); that would probably help the community a good amount. The original post mentions converting the model into a hybrid SWA, which seems pretty different from pretraining like
@zpysky1125
Pengyu Zhao
4 days
@JingweiZuo Yes. And this is one of the reasons that we do not use lightning attention in M2.
2
1
56
@superhuman_ai
Superhuman AI
17 hours
A year ago, AI could barely use a few tools. Now it can handle hundreds. Mike Krieger, Anthropic's CPO, explains why this changes everything for real-world agents.
0
0
7
@zpysky1125
Pengyu Zhao
4 days
@yifan_zhang_ @OpenAI @_akhaliq @vllm_project A small correction: M2 is a full-attention model. Actually, during our pre-training phase, we tried to transform the full-attention model into an OSS-like structure using SWA. But we found that it hurt multi-hop reasoning performance, so we ultimately did not use this setting.
4
9
132
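(Editor's note: for context on the trade-off Pengyu describes, a minimal sketch of the causal sliding-window attention mask, with an illustrative window size. Under SWA a token sees only a fixed recent window, which is one plausible reason multi-hop lookups across a long context degrade.)

```python
import torch

def sliding_window_mask(T: int, window: int) -> torch.Tensor:
    """Boolean mask for causal sliding-window attention (SWA):
    query position t may attend only to positions in [t - window + 1, t]."""
    i = torch.arange(T).unsqueeze(1)  # query positions
    j = torch.arange(T).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(T=8, window=3)
# Row 5 is True only at columns 3, 4, 5: anything earlier is invisible,
# so a fact mentioned far back in the context simply cannot be attended to.
```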
@Grad62304977
Grad
4 days
Looks like nvm no SWA?? Very weird
@shxf0072
Joey (e/λ)
4 days
@Grad62304977 nope, I think it's for different M2 variants, bigger maybe
5
0
40
@Grad62304977
Grad
4 days
Looks like they've abandoned naive linear attention for SWA
@teortaxesTex
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
6 days
Minimax M2 is a 230B MoE with 10B active. For comparison, Minimax-M1 was 456B total with 45.9B active, i.e. a typical V3-class model (with some differences like 'lightning attention'). M2 apparently beats the hell out of M1 and everything below it. Very good progress from Hailuo
10
8
234
@Grad62304977
Grad
7 days
😱😱😱
7
10
261
@Grad62304977
Grad
8 days
This is vLLM btw
0
0
36
@Grad62304977
Grad
8 days
Jumpscare for any LLM RL folk
6
8
296
@Grad62304977
Grad
10 days
Tbh I never really got 10+ year timelines. To me they just mean that we need one or more breakthroughs, and we assume a decade is enough to find them
@dwarkesh_sp
Dwarkesh Patel
13 days
The @karpathy interview
0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 – Why self
33
16
828
@Grad62304977
Grad
12 days
Seems like this was an important part of the paper. I've also found this instability, even without length normalisation, DeepSeek-R1 style
@Devvrit_Khatri
Devvrit
14 days
Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or would RL even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
6
2
113
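(Editor's note: a toy sketch of where the length-normalisation term Grad mentions sits in a GRPO-style objective; an illustrative simplification under assumed bookkeeping, not the paper's exact loss. With normalisation each response's token terms are averaged; without it they are summed, so long responses dominate the gradient, one possible source of instability.)

```python
import torch

def grpo_like_loss(logps, advantages, length_normalise=True):
    """Toy policy-gradient loss over a group of sampled responses.
    logps: list of 1-D tensors (per-token log-probs of each response);
    advantages: 1-D tensor, one group-relative advantage per response."""
    terms = []
    for lp, adv in zip(logps, advantages):
        t = -(adv * lp).sum()          # sum of per-token terms
        if length_normalise:
            t = t / lp.numel()         # GRPO-style 1/|o| averaging
        terms.append(t)
    return torch.stack(terms).mean()
```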
@ItsJonHowell
Jon Howell
1 day
We shipped the EveryWear app today 🥲 Sound on.
1
3
29
@agarwl_
Rishabh Agarwal
15 days
Sneak peek from a paper about scaling RL compute for LLMs: probably the most compute-expensive paper I've worked on, but hoping that others can run experiments cheaply for the science of scaling RL. Coincidentally, this is similar motivation to what we had for the NeurIPS best
11
36
417
@Grad62304977
Grad
23 days
Seems like no one saw this either. They propose doing RL with no-think, which improves performance both with and without thinking, and also shortens reasoning when thinking. Looks like pretty good non-thinking performance in Table 8, and it seems GPQA is boosted heavily by this
7
6
169
@Grad62304977
Grad
1 month
I guess we're doing fine-grained MoE sideways now
@deepseek_ai
DeepSeek
1 month
🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model! ✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. 👉 Now live on App, Web, and API. 💰 API prices cut by 50%+! 1/n
3
0
186
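(Editor's note: DSA's actual mechanism is in DeepSeek's report, not this tweet; as a generic illustration of the sparse-attention idea, a top-k sketch where each query attends only to its highest-scoring causal keys. The scoring and selection here are assumptions, not DSA's indexer.)

```python
import torch

def topk_sparse_attention(q, k, v, k_keep: int):
    """Generic top-k sparse attention (illustrative, NOT DeepSeek's DSA):
    each query scores all keys but softmaxes only over its k_keep
    highest-scoring causal positions."""
    T, d = q.shape
    scores = (q @ k.T) / d ** 0.5
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    topv, topi = scores.topk(min(k_keep, T), dim=-1)
    sparse = torch.full_like(scores, float("-inf"))
    sparse.scatter_(-1, topi, topv)      # keep only the selected positions
    return torch.softmax(sparse, dim=-1) @ v
```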
@Grad62304977
Grad
1 month
Funnily enough, DeepSeek R1 didn't even use the original GRPO
@MillionInt
Jerry Tworek
1 month
The GRPO release has in a large way accelerated the RL research programs of most US research labs
1
1
76
@Grad62304977
Grad
1 month
The high GPQA-Diamond scores remind me of o1. Would not be surprised if o1 is straight RL and there's some magic in RL from a base
@Grad62304977
Grad
1 month
Very interesting how R1-Zero is still far ahead of the final R1 on certain benchmarks like GPQA-Diamond and CNMO. Also, a GRPO clip ratio of 10 seems to pretty much confirm that they use a sequence-level importance ratio, as their formula shows, different from the original GRPO and
0
1
33
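(Editor's note: a toy contrast of the two importance-ratio choices Grad is pointing at; one reading of the tweet, not the R1 report's exact formula. A sequence-level ratio is the product of the per-token ratios, so it drifts far from 1 on long responses, which is what makes a loose clip range like 10 plausible.)

```python
import torch

def importance_ratio(logp_new, logp_old, sequence_level: bool):
    """logp_new / logp_old: 1-D per-token log-probs under the current and
    behaviour policies. Token-level (original GRPO): one ratio per token.
    Sequence-level: a single ratio for the whole response."""
    if sequence_level:
        return torch.exp((logp_new - logp_old).sum())  # scalar, product of token ratios
    return torch.exp(logp_new - logp_old)              # per-token ratios
```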
@Grad62304977
Grad
1 month
Also a prompt batch size of just 32, group size of 16 (good, but I expected higher for a large-scale run like this, esp for R1-Zero), and 16 minibatch updates seem like pretty weak hparams (esp the batch size). Surprised R1 turned out as good as it did lol, I guess the V3 base is
0
0
28
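(Editor's note: back-of-envelope on the quoted hyperparameters, assuming standard GRPO bookkeeping where each prompt draws group_size responses and the rollout batch is split evenly across minibatch updates; the tweet doesn't state this split.)

```python
# Assumed GRPO bookkeeping, not stated in the tweet.
prompts_per_step = 32          # prompt batch size
group_size = 16                # responses sampled per prompt
minibatch_updates = 16         # gradient steps per rollout batch

rollouts_per_step = prompts_per_step * group_size             # 512 responses
rollouts_per_update = rollouts_per_step // minibatch_updates  # 32 per gradient step
print(rollouts_per_step, rollouts_per_update)                 # -> 512 32
```

So 512 on-policy responses per step, consumed in 32-response gradient steps: small by large-scale RL standards, which is Grad's point.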