Naman Goyal Profile
Naman Goyal

@NamanGoyal21

Followers
1,035
Following
563
Media
1
Statuses
176

Research engineer, LLM scaling at GenAI Meta | Worked on: llama2, llama, OPT, blenderbot, XLMR, Bart, Roberta

Joined November 2012
@NamanGoyal21
Naman Goyal
1 year
It's crazy that, at 60% model FLOPS (FP8) utilization on H100, the original GPT-3 configuration can be trained in 3 days on 1024 H100s, and PaLM in 12 days on 2048 H100s. That's roughly 50x fewer GPU hours than the GPT-3 paper 3 years back, and 9x fewer than PaLM, released 9 months back.
13
37
392
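A quick sanity check of the arithmetic in the tweet above, assuming the usual training-cost approximation FLOPs ≈ 6·N·D, GPT-3's published configuration (175B parameters, ~300B tokens), and an H100 FP8 dense peak of roughly 1979 TFLOPS; those constants are illustrative assumptions, while the 60% MFU and 1024 GPUs come from the tweet.

```python
# Sanity-check of the GPT-3-on-1024-H100s claim above. Assumed constants:
# FLOPs ~= 6 * N * D, GPT-3 with N = 175e9 params and D = 300e9 tokens,
# and an H100 FP8 dense peak of ~1979 TFLOPS; the 60% MFU and 1024 GPUs
# are taken from the tweet.
N = 175e9             # parameters
D = 300e9             # training tokens
peak = 1979e12        # per-GPU FP8 peak FLOPS (dense)
mfu = 0.60
gpus = 1024

total_flops = 6 * N * D                          # ~3.15e23 FLOPs
days = total_flops / (gpus * peak * mfu) / 86400
print(f"{days:.1f} days")                        # ~3.0 days
```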
@NamanGoyal21
Naman Goyal
7 months
One of the unfortunate things the GPT-4 architecture leak caused is convincing many smart researchers across various labs that sparse models are the way to reach a GPT-4-quality model. "Data quality and FLOPS are all that matter" is such a simple yet hard paradigm to follow.
8
9
227
@NamanGoyal21
Naman Goyal
7 months
Finished 30/30 radiation therapy sessions today. The past 3-4 months have been one of the most challenging parts of my life. Recovery from surgery and radiation therapy was quite physically and mentally challenging. With due respect, cancer, please stay away from me from now on.
23
0
177
@NamanGoyal21
Naman Goyal
4 years
Facebook AI Research's sequence modeling library @fairseq has made its Twitter debut. Please follow for the latest updates.
2
10
43
@NamanGoyal21
Naman Goyal
6 months
@StasBekman The best GEMM TFLOPS across various large matmul sizes I got for A100 was ~286. For H100 it's currently about 750-800 for bf16 and 1550-1600 for fp8. I think and hope GEMM performance will improve over time as NVIDIA optimises matmuls for H100 further with newer cuBLAS versions.
2
0
12
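For reference, a minimal PyTorch sketch of how achieved GEMM TFLOPS like those above can be measured; the 8192x8192 bf16 matmul and the timing loop are illustrative choices, not the exact benchmark behind the quoted numbers (fp8 would need a different matmul path).

```python
# Rough GEMM throughput measurement, similar in spirit to the numbers above.
# Matrix size, dtype, and iteration counts are illustrative assumptions.
import torch

def gemm_tflops(n=8192, dtype=torch.bfloat16, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):                         # warmup
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    secs = start.elapsed_time(end) / 1000      # elapsed_time is in ms
    flops = 2 * n**3 * iters                   # 2*M*N*K per matmul
    return flops / secs / 1e12

print(f"{gemm_tflops():.0f} TFLOPS")
```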
@NamanGoyal21
Naman Goyal
1 year
@jekbradbury I do agree that it's gonna need a decent amount of engineering work! Though my guess is any team with access to a stable H100 cluster with 400Gbps InfiniBand (or a similar interconnect) should get there by the end of the year at the latest.
2
1
10
@NamanGoyal21
Naman Goyal
4 years
@ragtdata @facebookai @ylecun Yes, we are going to release the pretrained models soon.
0
0
6
@NamanGoyal21
Naman Goyal
5 months
@dome_271 The easiest way would be to disable flatten params and then set lr to 0.0 for the params you don't wanna update. I think that after setting flatten params to false, setting requires_grad=False should also work, but gotta check this to be sure.
2
0
4
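A minimal sketch of the "lr = 0.0" approach suggested above, using plain PyTorch optimizer parameter groups; the toy model and the choice of which parameters to freeze are hypothetical, and the FSDP flatten-params detail is left out.

```python
# Minimal sketch of "lr = 0.0 for the params you don't want to update".
# The two-layer model and the choice of frozen parameters are hypothetical.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))

optimizer = torch.optim.AdamW([
    {"params": list(model[0].parameters()), "lr": 0.0},   # effectively frozen
    {"params": list(model[1].parameters()), "lr": 1e-4},  # updated as usual
])

# Alternatively, the requires_grad route mentioned in the tweet:
# for p in model[0].parameters():
#     p.requires_grad = False
```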
@NamanGoyal21
Naman Goyal
2 years
@borisdayma @andrew_n_carr We recently noticed that the scalar of LN is also not needed, at least beyond the 6.7B model scale.
1
0
4
@NamanGoyal21
Naman Goyal
2 years
@giffmana @_arohan_ @achowdhery @arankomatsuzaki And also from one less inter-GPU communication within the tensor-parallel GPUs, which PaLM was doing. Interestingly, we are able to remove the scalar of LN as well without losing on PPL, so it's just the normalization that helps.
2
0
4
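In PyTorch terms, dropping the LN scalar discussed above can be expressed by disabling LayerNorm's learnable affine parameters; whether this matches the exact setup behind those experiments is an assumption.

```python
# LayerNorm with and without the learned gain/bias; only the normalization
# itself is kept in the second variant, as discussed in the tweets above.
import torch.nn as nn

dim = 4096  # illustrative model width
ln_default  = nn.LayerNorm(dim)                             # learnable gain + bias
ln_no_scale = nn.LayerNorm(dim, elementwise_affine=False)   # pure normalization
```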
@NamanGoyal21
Naman Goyal
1 year
@_arohan_ Congrats on the release. One quick question, though, if you don't mind: I am unable to understand what those tokens mean in Table 1, as FLOPs = 6ND doesn't seem to match and the absolute values look too low. Could there be a typo or some misunderstanding on my side? cc: @YiTayML
[image attached]
0
0
4
@NamanGoyal21
Naman Goyal
2 years
@SashaMTL @Jsevillamol I was curious too. The link above seems to show 90 kg of CO2 per hour per passenger, assuming 333 passengers. So for a full flight it comes to ~700 * 333 kg ≈ 233 tons, which seems very close to the ~271 tons.
1
0
4
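Spelling out that arithmetic; the ~700 kg per-passenger total implies a flight of roughly 700 / 90 ≈ 7.8 hours, which is an inference, not a figure from the thread.

```python
# Arithmetic from the tweet above. The ~700 kg/passenger total implies a
# flight of roughly 700 / 90 ~= 7.8 hours (an inference, not a quoted figure).
kg_per_passenger_per_hour = 90
kg_per_passenger = 700
passengers = 333

hours = kg_per_passenger / kg_per_passenger_per_hour   # ~7.8 h
total_tons = kg_per_passenger * passengers / 1000      # ~233 t, vs. the ~271 t estimate
print(hours, total_tons)
```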
@NamanGoyal21
Naman Goyal
9 months
@zhansheng I fine-tuned MLMs (RoBERTa, BART, XLM-R, from 100M to 10B scale) a bunch around that time, but can't remember this behavior being specific to fine-tuning. Mainly, big models overall were less stable, and for that I think two things changed: pre-layer-norm and bf16.
0
0
4
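For context, a minimal sketch of a pre-layer-norm block of the kind referred to above, where normalization is applied before each sublayer instead of after; the dimensions and block structure are illustrative, not any specific model's code.

```python
# Illustrative pre-LN transformer block: LayerNorm is applied before each
# sublayer, which in practice tends to be more stable at large scale.
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.ln1(x)                                      # normalize before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]    # residual add
        x = x + self.ffn(self.ln2(x))                        # normalize before FFN
        return x
```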
@NamanGoyal21
Naman Goyal
2 years
@StasBekman My guess would be 2, 4 and 3 in that order.
1
0
3
@NamanGoyal21
Naman Goyal
1 year
@arkerr Amazing!!! Also curious: what are the timelines for FP8 GEMM support in CUTLASS?
0
0
3
@NamanGoyal21
Naman Goyal
1 year
@OriolVinyalsML I have one request for Flamingo output: I am really curious how Flamingo does on this classic example from Andrej Karpathy's 10-year-old blog post.
1
0
3
@NamanGoyal21
Naman Goyal
1 year
@YiTayML @artetxem Congratulations, Mikel is amazing! Looking forward to great things!
0
0
3
@NamanGoyal21
Naman Goyal
1 year
@_arohan_ I remember this paper () from Google Brain that used MoE (GShard / Switch Transformer style configuration) in ViT models to scale up to 15B parameters. It's conditional compute, so maybe not what you meant to ask?
1
0
2
@NamanGoyal21
Naman Goyal
3 years
@StasBekman It's totally unrelated to divergence, but usually it's a good idea to keep model_dim % num_heads a power of two instead of 0. I have seen an empirical speedup with that.
2
0
1
@NamanGoyal21
Naman Goyal
6 months
@StasBekman I don't know of any publicly available semi-official numbers. Plus I think it might vary a bit with the exact configuration of the server: power capping or not, type of cooling, etc.
1
0
1
@NamanGoyal21
Naman Goyal
5 years
@annargrs @myleott @vesko_st @LukeZettlemoyer @omerlevy_ @YinhanL @mandarjoshi_ @danqi_chen Yes, the variance for RTE and MRPC is higher compared to tasks with bigger datasets. E.g., for the last row in the above table, the SD for some tasks is {RTE: 1.57, MRPC: 0.87, MNLI: 0.15, QNLI: 0.096, SST: 0.21} across 5 seeds. We will consider adding SDs in the updated version of the paper.
1
0
1
@NamanGoyal21
Naman Goyal
1 year
@alex_conneau @OpenAI Congrats Alexis!!
0
0
1
@NamanGoyal21
Naman Goyal
5 months
@Ethan_smith_20 @dome_271 If your frozen params are not at the end but in between the transformer layers, you anyway need to compute dgrad for everything. You will be computing wgrad unnecessarily though, which can make it at most ~1/3 slower. I agree it's not ideal, but it's an easy thing to try.
0
0
0
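A small illustration of the dgrad/wgrad point above: freezing a middle layer with requires_grad=False skips its weight gradients, but activation gradients still have to flow through it so the earlier trainable layers get updated. The toy three-layer model is hypothetical.

```python
# Freezing a middle layer: its weight grads (wgrad) are skipped, but the
# backward pass still propagates activation grads (dgrad) through it so the
# first layer can be trained. Toy three-layer model for illustration.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8))
for p in model[1].parameters():          # freeze the middle layer
    p.requires_grad = False

model(torch.randn(4, 8)).sum().backward()

print(model[1].weight.grad)              # None: wgrad skipped for the frozen layer
print(model[0].weight.grad is not None)  # True: dgrad still flowed through it
```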
@NamanGoyal21
Naman Goyal
2 years
@LChoshen @YebHavinga @BramVanroy @YinhanL @thoma_gu @xl_nlp Thanks for the question. Every individual sample instance was always from a single language, but within a batch, each sample could be from a different language.
2
0
1
@NamanGoyal21
Naman Goyal
3 years
@stefan_it_ @alex_conneau Looking into it; let's chat on the GitHub issue?
0
0
0