wh (@nrehiew_)
15K Followers · 3K Following · 1K Media · 4K Statuses
eng primarily, ml mostly, research previously
Joined October 2023
Nice post! While you get some overlap, you still need custom schedules since All-to-All is too slow. Training/prefill: chunk the sequence and do the chunked MoE GEMM during the A2A. Inference: split attention into two parts; the MoE GEMM is the only bubble, which you deal with via wide EP.
I was thinking about how to improve MFU on MoE models, and I wondered why you couldn't simply move the router to before the self-attn, so that you could overlap the routing and self-attn computations. Well actually you can just do that and someone already did, in fact it made…
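To make the chunked overlap from the reply above concrete, here is a minimal sketch, assuming a torchrun/NCCL launch with equal token splits; the shapes, chunk count, and function name are made up for illustration, and a real system would use the phase-specific schedules the reply describes rather than this generic loop.

```python
# Hypothetical sketch: overlap the All-to-All dispatch with chunked expert GEMMs.
# Launch with: torchrun --nproc_per_node=<N> overlap_sketch.py
import torch
import torch.distributed as dist

def chunked_moe_overlap(tokens, expert_weight, num_chunks=4):
    """tokens: [T, d] already permuted for dispatch; expert_weight: [d, d_ff].
    Assumes each chunk's row count is divisible by the world size."""
    chunks = tokens.chunk(num_chunks, dim=0)
    recv = [torch.empty_like(c) for c in chunks]
    outputs = []

    # Kick off the A2A for chunk 0 before any compute.
    handle = dist.all_to_all_single(recv[0], chunks[0].contiguous(), async_op=True)

    for i in range(num_chunks):
        handle.wait()                        # chunk i has arrived
        if i + 1 < num_chunks:               # start moving chunk i+1 while we compute
            handle = dist.all_to_all_single(
                recv[i + 1], chunks[i + 1].contiguous(), async_op=True)
        outputs.append(recv[i] @ expert_weight)   # expert GEMM hides the A2A latency

    return torch.cat(outputs, dim=0)

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    x = torch.randn(1024, 512, device="cuda")
    w = torch.randn(512, 2048, device="cuda")
    y = chunked_moe_overlap(x, w)
    print(dist.get_rank(), y.shape)
    dist.destroy_process_group()
```

The only point of the sketch is that the expert GEMM for chunk i runs while the communication for chunk i+1 is in flight, which is the bubble-hiding idea being discussed.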
i think the easiest way to read this is to look at the recurrence as (kv)_new = old_kv * thing_in_red + current kv?
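As a minimal illustration of that reading (hypothetical, since "thing_in_red" refers to a figure not shown here), this is the generic gated linear-attention state update, with a scalar decay standing in for whatever gate the figure highlights:

```python
# Hypothetical reading of the recurrence above: a generic (gated) linear-attention
# state update, S_t = gate * S_{t-1} + k_t v_t^T. "gate" plays the role of
# "thing_in_red"; in DeltaNet-style variants it would be a matrix-valued term.
import numpy as np

def linear_attn_step(S_old, k_t, v_t, gate):
    """S_old: [d_k, d_v] running KV state; k_t: [d_k]; v_t: [d_v]."""
    return gate * S_old + np.outer(k_t, v_t)   # (kv)_new = old_kv * gate + current kv

def readout(S, q_t):
    """Output for query q_t is just q_t @ S under this formulation."""
    return q_t @ S

d_k, d_v, T = 4, 4, 8
rng = np.random.default_rng(0)
S = np.zeros((d_k, d_v))
for t in range(T):
    k, v, q = rng.normal(size=d_k), rng.normal(size=d_v), rng.normal(size=d_k)
    S = linear_attn_step(S, k, v, gate=0.9)    # scalar decay, purely for illustration
    out = readout(S, q)
print(out.shape)  # (4,)
```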
Some long but hopefully helpful thoughts about self-learning these things: the list is actually in reverse chronological order imo if you are starting from 0. "familiarity with the AI research landscape" <> "high taste". I think it starts with reading. Read papers, read…
Genuine question: where does one even go to learn a lot of this stuff? I doubt it is in school. Do people find resources online and self-study? Is this just a sign of the sector maturing and people being expected to learn these skills at other jobs? I've learned a decent amount of…
Unfortunately, this has left me with no choice but to read up on the DeltaNet literature
More papers are going viral because people seem to think they are making some mind-blowing claim about AGI/intelligence/model consciousness etc. This is surprising, since the actual papers themselves often don't say anything of the sort.
Pour one out tonight for the attention purists. Looks like linear attention is the meta now. No more just putting the quadratic attention in the bag
Continual Learning is so overhyped right now that you can engagement farm by quoting any new research and tweeting something like “this solves continual learning btw”
Nice paper, really well written and simple to understand. The method feels very similar to DSA. Some thoughts on this paper + continual learning in general: I think any form of continual learning is either going to be LoRA style (FT a subset of params that have a large impact on…
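Since the tweet above is cut off right after "LoRA style (FT a subset of params...)", here is a minimal, generic LoRA-style adapter as one reference point for that family; it is only a sketch with illustrative ranks and shapes, not a claim about what the tweet goes on to say.

```python
# A minimal LoRA-style adapter: freeze the pretrained weight and train only
# a low-rank update B @ A on top of it.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # frozen pretrained weight
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = Wx + (alpha/r) * B A x  -- only A and B receive gradients
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
y = layer(torch.randn(2, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)   # torch.Size([2, 512]) 8192
```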
Continual Learning via Sparse Memory Finetuning
Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, Barlas Oğuz
https://t.co/v7KhRIdXf1 (arxiv.org)
"Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new…"
They do a large hyperparameter sweep and in all cases their approach leads to almost no forgetting. In the last section, they show that you only need to finetune a few sparse KVs.
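A generic sketch of the "finetune only a few sparse memory slots" idea discussed above; this is not the paper's exact recipe, just an illustration of letting gradients touch a handful of rows of a frozen memory table while everything else stays fixed. The slot indices and scoring step here are hypothetical.

```python
# Generic illustration: keep a large memory table frozen and train only a
# small delta over a few selected slots.
import torch
import torch.nn as nn

num_slots, d = 10_000, 64
memory = nn.Embedding(num_slots, d)            # stands in for a memory layer
for p in memory.parameters():
    p.requires_grad_(False)

# Suppose some scoring step (e.g. how often slots are hit on the new data
# versus background data) picked these slots as the ones worth updating.
trainable_ids = torch.tensor([3, 17, 42, 256, 9000])
delta = nn.Parameter(torch.zeros(len(trainable_ids), d))   # the only trainable params

def lookup(ids: torch.Tensor) -> torch.Tensor:
    """Frozen memory read, plus a learned delta on the selected slots."""
    out = memory(ids)
    # one-hot match of each looked-up id against the trainable slot list
    match = (ids.unsqueeze(-1) == trainable_ids).float()    # [batch, n_trainable]
    return out + match @ delta

opt = torch.optim.SGD([delta], lr=1e-2)
ids = torch.tensor([3, 5, 42])
loss = lookup(ids).pow(2).mean()
loss.backward()
opt.step()
print(delta.abs().sum() > 0)   # only the touched, selected slots moved
```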
The main thing to worry about is catastrophic forgetting. In these plots, they train on the first few samples from one benchmark while still maintaining good holdout-set behaviour compared to LoRA or full FT.