 
            
Grad (@Grad62304977) · Followers 8K · Following 40K · Media 113 · Statuses 3K · Joined October 2020
            
            
Many people are confused by Minimax’s recent return to full attention, especially since it made the first large-scale pivot toward hybrid linear attention, and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts like Qwen3-Next, or Qwen3.5). I actually 
          
                
Kimi Linear Tech Report has dropped! 🚀  https://t.co/LwNB2sQnzM  Kimi Linear: a novel architecture that outperforms full attention with faster speeds and better performance, ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi 
          
            
             Update: After an incredible year at @PrimeIntellect, I have decided to take my next step in August. Grateful that I got to work with such a talented team and build the best open-source RL infra! For now, I'm continuing to work on RL for coding agents. Will share updates :) 
          
                
              
Interested in seeing the ablations here compared to the original paper (where did they find it is actually weaker?); it would probably help the community a good amount. The original post mentions converting the model into a hybrid SWA, which seems pretty different to pretraining like 
          
              @JingweiZuo Yes. And this is one of the reasons that we do not use lightning attention in M2.
            
          
                
              
A year ago, AI could barely use a few tools. Now it can handle hundreds. Mike Krieger, Anthropic's CPO, explains why this changes everything for real-world agents. 
          
                
              
            
@yifan_zhang_ @OpenAI @_akhaliq @vllm_project A small correction: M2 is a full-attention model. Actually, during our pre-training phase, we tried to transform the full-attention model into an OSS-like structure using SWA. But we found that it hurt the performance of multi-hop reasoning, so we ultimately did not use this setting.
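The conversion being described swaps full causal attention for a sliding-window mask. A minimal sketch of just the mask difference (illustrative only, not Minimax's implementation; the window size here is a placeholder):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean causal attention mask: query position i may attend to key j
    iff j <= i and i - j < window (itself plus the window-1 previous tokens)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# Full causal attention is the special case window >= seq_len,
# so "converting to SWA" just shrinks the window each query can see.
full = sliding_window_mask(4, 4)
swa = sliding_window_mask(4, 2)
```

The multi-hop concern follows directly: with window w, information from a token more than w positions back can only reach a query through repeated layer-by-layer propagation, not in a single attention hop.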
          
          
                
              
             Looks like nvm no SWA?? Very weird 
          
          
                
              
             Looks like they've abandoned naive linear attention for SWA 
Minimax M2 is a 230B-A10B MoE. For comparison, Minimax-M1 was 456B total with 45.9B active, i.e. a typical V3-class model (with some differences like 'lightning attention'). M2 apparently beats the hell out of M1 and everything below it. Very good progress from Hailuo 
            
                
              
              
             Tbh I never really got 10+ year timelines. To me they just mean that we need 1 or more breakthroughs and we just assume a decade is enough to find them 
The @karpathy interview
0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 – Why self 
            
                
              
Seems like this was an important part of the paper. I’ve also found this instability, even without length normalisation, DeepSeek R1-style 
           Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or would RL even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs 
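For reference, the length-normalisation choice being discussed is whether the clipped objective is averaged per response or over all tokens in the batch. A sketch of the two standard forms (not necessarily this paper's exact objective):

```latex
\mathcal{J}_{\text{seq-mean}}
  = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\ell_{i,t}\right],
\qquad
\mathcal{J}_{\text{token-mean}}
  = \mathbb{E}\!\left[\frac{\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\ell_{i,t}}{\sum_{i=1}^{G}|o_i|}\right],
```

where $\ell_{i,t} = \min\!\bigl(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_i\bigr)$ and $\rho_{i,t}$ is the token-level importance ratio. The per-response $1/|o_i|$ factor in the first form is the "length normalisation" in question; dropping it changes how long responses are weighted in the gradient.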
            
                
              
Sneak peek from a paper about scaling RL compute for LLMs: probably the most compute-expensive paper I've worked on, but hoping that others can run experiments cheaply for the science of scaling RL. Coincidentally, this is similar motivation to what we had for the NeurIPS best 
          
                
              
Seems like no one saw this either. They propose RL with no-think, and that improves performance both with and without thinking, as well as shortening reasoning with thinking. Looks like pretty good non-thinking performance in Table 8, and seems like GPQA is boosted heavily with this 
          
                
              
I guess we’re doing a fine-grained MoE sideways now 
🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model! ✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention (DSA) for faster, more efficient training & inference on long context. 👉 Now live on App, Web, and API. 💰 API prices cut by 50%+! 1/n 
          
                
              
The high GPQA-Diamond scores remind me of o1. Would not be surprised if o1 is straight RL and there's some magic in RL from a base 
Very interesting how R1-Zero is still far ahead of the final R1 in certain benchmarks like GPQA-Diamond and CNMO. Also, a GRPO clip ratio of 10 seems to pretty much confirm that they use a sequence-level importance ratio, as their formula shows, different from the original GRPO and 
            
                
              
Also, a prompt batch size of just 32, a group size of 16 (good, but I expected higher for a large-scale run like this, esp for R1-Zero), and 16 minibatch updates seem like pretty weak hparams (esp the batch size). Surprised R1 turned out as good as it did lol, I guess the V3 base is 
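For context, the sequence-level importance ratio being inferred here replaces GRPO's per-token ratio with one ratio for the whole response (a GSPO-style sketch, not necessarily the exact formula in the report):

```latex
\rho_i(\theta)
  = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}
  = \prod_{t=1}^{|o_i|}
    \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})},
\qquad
\mathcal{J}
  = \mathbb{E}\Bigl[\min\bigl(\rho_i \hat{A}_i,\
    \operatorname{clip}(\rho_i,\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_i\bigr)\Bigr].
```

Because $\rho_i$ is a product over many token ratios, it drifts much farther from 1 than any single token-level ratio, which is consistent with a clip range as wide as 10.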
          
                