wh

@nrehiew_

Followers: 15K · Following: 3K · Media: 1K · Statuses: 4K

eng primarily, ml mostly, research previously

Joined October 2023
@nrehiew_
wh
19 hours
It is surprising, however, that their BF16 GRPO collapses to this extent. Especially since this is on what is effectively a subset of MATH and not some weird environment
@nrehiew_
wh
1 day
Is this what a free lunch looks like?
3
1
45
@nrehiew_
wh
23 hours
@nrehiew_
wh
1 day
Is this what a free lunch looks like?
2
11
192
@nrehiew_
wh
1 day
??
1
0
24
@nrehiew_
wh
1 day
Is this what a free lunch looks like?
4
6
145
@nrehiew_
wh
2 days
Nice post! While you get some overlap, you still need custom schedules since All2A is too slow.
Training/prefill: chunk the sequence and do chunked MoE GEMM during the A2A.
Inference: split up attention into 2 parts. The MoE GEMM is the only bubble, which you deal with via wide EP.
@_ueaj
ueaj
2 days
I was thinking about how to improve MFU on MoE models, and I wondered why you couldn't simply move the router to before the self-attn, so that you could overlap the routing and self-attn computations. Well actually you can just do that and someone already did, in fact it made
2
5
55
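For concreteness, a minimal PyTorch sketch of the "chunked MoE GEMM during A2A" schedule from the reply above: launch the all-to-all for chunk i+1 asynchronously, then run the expert GEMM for chunk i while that exchange is in flight. The function names, the chunk count, and the balanced-routing assumption (which makes the equal-split all_to_all_single valid) are illustrative, not the exact schedule being described.

import torch
import torch.distributed as dist

def chunked_moe_forward(routed_tokens: torch.Tensor, expert_ffn, num_chunks: int = 4):
    # routed_tokens: [T, d] tokens already permuted into send order for the A2A.
    # Assumes balanced routing so every rank sends/receives equal-sized chunks;
    # a real schedule would track per-expert split sizes.
    chunks = list(routed_tokens.chunk(num_chunks, dim=0))
    recv = [torch.empty_like(c) for c in chunks]
    work = [None] * num_chunks

    # Kick off the first token exchange before any expert compute.
    work[0] = dist.all_to_all_single(recv[0], chunks[0].contiguous(), async_op=True)

    outputs = []
    for i in range(num_chunks):
        # Launch the next chunk's A2A so NCCL runs it while this chunk's GEMM executes.
        if i + 1 < num_chunks:
            work[i + 1] = dist.all_to_all_single(
                recv[i + 1], chunks[i + 1].contiguous(), async_op=True)
        work[i].wait()                       # chunk i's tokens have arrived on this rank
        outputs.append(expert_ffn(recv[i]))  # expert GEMM hides the in-flight A2A
    return torch.cat(outputs, dim=0)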
@nrehiew_
wh
2 days
i think the easiest way to read this is to look at the recurrence as (kv)_new = old_kv * thing_in_red + current kv?
0
0
6
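For readers without the figure being replied to: written out, that reading likely corresponds to the gated delta-rule recurrence used in this line of work (DeltaNet / Gated DeltaNet / Kimi Delta Attention style), with the highlighted factor playing the role of a decay-plus-erase term. A LaTeX sketch, modulo each paper's exact gating conventions:

S_t \;=\; \underbrace{S_{t-1}\,\alpha_t\bigl(I - \beta_t\, k_t k_t^{\top}\bigr)}_{\text{old\_kv}\,\times\,\text{``thing in red''}} \;+\; \underbrace{\beta_t\, v_t k_t^{\top}}_{\text{current } kv}

Here S_t is the matrix-valued KV state, \alpha_t a learned decay gate, \beta_t the write strength, and the k_t k_t^{\top} term erases whatever was previously stored along direction k_t before the new value is written.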
@nrehiew_
wh
2 days
Some long but hopefully helpful thoughts about self learning these things: The list is actually in reverse chronological order imo if you are starting from 0. "familiarity with the AI research landscape" <> "high taste". I think it starts with reading. Read papers, read
@pli_cachete
Rota
4 days
Genuine question: where does one even go to learn a lot of this stuff? I doubt it is in school. Do people find resources online and self study? Is this just a sign of the sector maturing and people are expected to learn these skills at other jobs? I’ve learned a decent amount of
9
42
583
@nrehiew_
wh
2 days
Thankfully, there's a really nice glossary in the Kimi Delta Attention paper that covers most of the notable variants
@nrehiew_
wh
2 days
Unfortunately, this has left me with no choice but to read up on the Deltanet literature
4
30
401
@nrehiew_
wh
2 days
Unfortunately, this has left me with no choice but to read up on the Deltanet literature
6
11
247
@nrehiew_
wh
3 days
More papers are going viral because people seem to think they are making some mindblowing claim about AGI/Intelligence/Model Consciousness, etc. This is surprising since the actual papers themselves often don't say anything of that sort.
2
2
58
@nrehiew_
wh
3 days
Finally got access to the new B200s
2
0
24
@nrehiew_
wh
3 days
Pour one out tonight for the attention purists. Looks like linear attention is the meta now. No more just putting the quadratic attention in the bag
7
0
62
@nrehiew_
wh
4 days
Continual Learning is so overhyped right now that you can engagement farm by quoting any new research and tweeting something like “this solves continual learning btw”
10
1
93
@nrehiew_
wh
5 days
Most problems solvable by RL are already solvable by the base model at pass@k. (caveating potential contamination with Qwen2.5)
3
0
22
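For reference, pass@k here is the standard unbiased estimator over n sampled completions of which c are correct (Chen et al., 2021); a minimal Python sketch, with the example numbers purely illustrative:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k completions drawn (without replacement)
    # from n samples, c of which pass, is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 256 samples with 3 passing: pass@1 ≈ 0.0117, pass@256 = 1.0
print(pass_at_k(256, 3, 1), pass_at_k(256, 3, 256))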
@nrehiew_
wh
5 days
This holds true across many algorithms: ppo, grpo, dapo, reinforce++, etc.
1
1
17
@nrehiew_
wh
5 days
RL improves pass@1 but doesn’t change much for pass@{32,64,128,256}. Further empirical results backing up mode/entropy collapse in RL
19
5
141
@nrehiew_
wh
8 days
Nice paper, really well written and simple to understand. The method feels very similar to DSA. Some thoughts on this paper + continual learning in general. I think any form of continual learning is either going to be LoRA style (FT a subset of params that have large impact on
6
0
23
@nrehiew_
wh
8 days
Continual Learning via Sparse Memory Finetuning Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, Barlas Oğuz https://t.co/v7KhRIdXf1
arxiv.org
Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new...
1
1
12
@nrehiew_
wh
8 days
They do a large hyperparam sweep and in all cases their approach leads to almost no forgetting. In the last section, they show that you only need to finetune a few sparse KVs.
1
1
8
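A generic, hypothetical sketch of the "finetune only a few sparse KVs" idea: freeze every parameter, then mask gradients so only a chosen set of memory-slot rows is ever updated. The slot-selection step itself (TF-IDF ranking over accessed memories in the paper) is not shown, and the names are illustrative rather than the paper's API:

import torch

def finetune_only_slots(model: torch.nn.Module,
                        memory_values: torch.nn.Parameter,
                        slot_ids: torch.Tensor) -> None:
    # Freeze everything, then re-enable grads only for the memory value table.
    for p in model.parameters():
        p.requires_grad_(False)
    memory_values.requires_grad_(True)

    # Zero gradients for every slot not in slot_ids before the optimizer sees them.
    mask = torch.zeros(memory_values.shape[0], 1, device=memory_values.device)
    mask[slot_ids] = 1.0
    memory_values.register_hook(lambda grad: grad * mask)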
@nrehiew_
wh
8 days
The main thing to worry about is catastrophic forgetting. In these plots they train on the first few samples from one benchmark while still having good holdout set behaviour compared to LoRA or full FT.
1
1
7