wh

@nrehiew_

Followers: 15K · Following: 3K · Media: 1K · Statuses: 4K

eng primarily, ml mostly, research previously

Joined October 2023
@nrehiew_
wh
19 hours
It is surprising, however, that their BF16 GRPO collapses to this extent. Especially since this is on what is effectively a subset of MATH and not some weird environment
@nrehiew_
wh
1 day
Is this what a free lunch looks like?
3
1
45
@nrehiew_
wh
23 hours
@nrehiew_
wh
1 day
Is this what a free lunch looks like?
2
11
192
@nrehiew_
wh
1 day
??
1
0
24
@nrehiew_
wh
1 day
Is this what a free lunch looks like?
4
6
145
@nrehiew_
wh
2 days
Nice post! While you get some overlap, you still need custom schedules since All2A is too slow.
Training/prefill: chunk the sequence and do chunked MoE GEMM during the A2A.
Inference: split up attention into 2 parts. The MoE GEMM is the only bubble, which you deal with via wide EP.
@_ueaj
ueaj
2 days
I was thinking about how to improve MFU on MoE models, and I wondered why you couldn't simply move the router to before the self-attn, so that you could overlap the routing and self-attn computations. Well actually you can just do that and someone already did, in fact it made
2
5
55
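For concreteness, a minimal PyTorch sketch of the "chunked MoE GEMM during A2A" schedule from the reply above: launch the all-to-all for chunk i+1 asynchronously, then run the expert GEMM for chunk i while that exchange is in flight. The function names, the chunk count, and the balanced-routing assumption (which makes the equal-split all_to_all_single valid) are illustrative, not the exact schedule being described.

import torch
import torch.distributed as dist

def chunked_moe_forward(routed_tokens: torch.Tensor, expert_ffn, num_chunks: int = 4):
    # routed_tokens: [T, d] tokens already permuted into send order for the A2A.
    # Assumes balanced routing so every rank sends/receives equal-sized chunks;
    # a real schedule would track per-expert split sizes.
    chunks = list(routed_tokens.chunk(num_chunks, dim=0))
    recv = [torch.empty_like(c) for c in chunks]
    work = [None] * num_chunks

    # Kick off the first token exchange before any expert compute.
    work[0] = dist.all_to_all_single(recv[0], chunks[0].contiguous(), async_op=True)

    outputs = []
    for i in range(num_chunks):
        # Launch the next chunk's A2A so NCCL runs it while this chunk's GEMM executes.
        if i + 1 < num_chunks:
            work[i + 1] = dist.all_to_all_single(
                recv[i + 1], chunks[i + 1].contiguous(), async_op=True)
        work[i].wait()                       # chunk i's tokens have arrived on this rank
        outputs.append(expert_ffn(recv[i]))  # expert GEMM hides the in-flight A2A
    return torch.cat(outputs, dim=0)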
@nrehiew_
wh
2 days
i think the easiest way to read this is to look at the recurrence as (kv)_new = old_kv * thing_in_red + current kv?
0
0
6
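For readers without the figure being replied to: written out, that reading likely corresponds to the gated delta-rule recurrence used in this line of work (DeltaNet / Gated DeltaNet / Kimi Delta Attention style), with the highlighted factor playing the role of a decay-plus-erase term. A LaTeX sketch, modulo each paper's exact gating conventions:

S_t \;=\; \underbrace{S_{t-1}\,\alpha_t\bigl(I - \beta_t\, k_t k_t^{\top}\bigr)}_{\text{old\_kv}\,\times\,\text{``thing in red''}} \;+\; \underbrace{\beta_t\, v_t k_t^{\top}}_{\text{current } kv}

Here S_t is the matrix-valued KV state, \alpha_t a learned decay gate, \beta_t the write strength, and the k_t k_t^{\top} term erases whatever was previously stored along direction k_t before the new value is written.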
@nrehiew_
wh
2 days
Some long but hopefully helpful thoughts about self learning these things: The list is actually in reverse chronological order imo if you are starting from 0. "familiarity with the AI research landscape" <> "high taste". I think it starts with reading. Read papers, read
@pli_cachete
Rota
4 days
Genuine question: where does one even go to learn a lot of this stuff? I doubt it is in school. Do people find resources online and self study? Is this just a sign of the sector maturing and people are expected to learn these skills at other jobs? I’ve learned a decent amount of
9
42
583
@nrehiew_
wh
2 days
Thankfully, there's a really nice glossary in the Kimi Delta Attention paper that covers most of the notable variants
@nrehiew_
wh
2 days
Unfortunately, this has left me with no choice but to read up on the Deltanet literature
4
30
401
@nrehiew_
wh
2 days
Unfortunately, this has left me with no choice but to read up on the Deltanet literature
6
11
247
@nrehiew_
wh
3 days
More papers are going viral because people seem to think they are making some mindblowing claim about AGI/Intelligence/Model Consciousness, etc. This is surprising since the actual papers themselves often don't say anything of that sort.
2
2
58
@nrehiew_
wh
3 days
Finally got access to the new B200s
2
0
24
@nrehiew_
wh
3 days
Pour one out tonight for the attention purists. Looks like linear attention is the meta now. No more just putting the quadratic attention in the bag
7
0
62
@nrehiew_
wh
4 days
Continual Learning is so overhyped right now that you can engagement farm by quoting any new research and tweeting something like “this solves continual learning btw”
10
1
93
@nrehiew_
wh
5 days
Most problems solvable by RL are already solvable by the base model at pass@k. (caveating potential contamination with Qwen2.5)
3
0
22
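For reference, pass@k here is the standard unbiased estimator over n sampled completions of which c are correct (Chen et al., 2021); a minimal Python sketch, with the example numbers purely illustrative:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k completions drawn (without replacement)
    # from n samples, c of which pass, is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 256 samples with 3 passing: pass@1 ≈ 0.0117, pass@256 = 1.0
print(pass_at_k(256, 3, 1), pass_at_k(256, 3, 256))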
@nrehiew_
wh
5 days
This holds true across many algorithms: ppo, grpo, dapo, reinforce++, etc.
1
1
17
@nrehiew_
wh
5 days
RL improves pass@1 but doesn’t change much for pass@{32,64,128,256}. Further empirical results backing up mode/entropy collapse in RL
19
5
141
@nrehiew_
wh
8 days
Nice paper, really well written and simple to understand. The method feels very similar to DSA. Some thoughts on this paper + continual learning in general. I think any form of continual learning is either going to be LoRA style (FT a subset of params that have large impact on
6
0
23
@nrehiew_
wh
8 days
Continual Learning via Sparse Memory Finetuning Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, Barlas Oğuz https://t.co/v7KhRIdXf1
arxiv.org
Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new...
1
1
12
@nrehiew_
wh
8 days
They do a large hyperparam sweep and in all cases their approach leads to almost no forgetting. In the last section, they show that you only need to finetune a few sparse KVs.
1
1
8
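A generic, hypothetical sketch of the "finetune only a few sparse KVs" idea: freeze every parameter, then mask gradients so only a chosen set of memory-slot rows is ever updated. The slot-selection step itself (TF-IDF ranking over accessed memories in the paper) is not shown, and the names are illustrative rather than the paper's API:

import torch

def finetune_only_slots(model: torch.nn.Module,
                        memory_values: torch.nn.Parameter,
                        slot_ids: torch.Tensor) -> None:
    # Freeze everything, then re-enable grads only for the memory value table.
    for p in model.parameters():
        p.requires_grad_(False)
    memory_values.requires_grad_(True)

    # Zero gradients for every slot not in slot_ids before the optimizer sees them.
    mask = torch.zeros(memory_values.shape[0], 1, device=memory_values.device)
    mask[slot_ids] = 1.0
    memory_values.register_hook(lambda grad: grad * mask)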
@nrehiew_
wh
8 days
The main thing to worry about is catastrophic forgetting. In these plots they train on the first few samples from one benchmark while still having good holdout set behaviour compared to LoRA or full FT.
1
1
7