Naman Goyal Profile
Naman Goyal

@NamanGoyal21

Followers
1,035
Following
563
Media
1
Statuses
176

Research engineer, LLM scaling at GenAI Meta | Worked on: llama2, llama, OPT, blenderbot, XLMR, Bart, Roberta

Joined November 2012
@NamanGoyal21
Naman Goyal
1 year
It's crazy that, at 60% model FLOPS (FP8) utilization on H100, the original GPT-3 configuration can be trained in 3 days on 1024 H100s, and PaLM in 12 days on 2048 H100s. That's roughly 50x fewer GPU hours than the GPT-3 paper 3 years back, and 9x fewer than PaLM, released 9 months back.
13
37
392
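A quick sanity check of the arithmetic in the tweet above, assuming the usual training-cost approximation FLOPs ≈ 6·N·D, GPT-3's published configuration (175B parameters, ~300B tokens), and an H100 FP8 dense peak of roughly 1979 TFLOPS; those constants are illustrative assumptions, while the 60% MFU and 1024 GPUs come from the tweet.

```python
# Sanity-check of the GPT-3-on-1024-H100s claim above. Assumed constants:
# FLOPs ~= 6 * N * D, GPT-3 with N = 175e9 params and D = 300e9 tokens,
# and an H100 FP8 dense peak of ~1979 TFLOPS; the 60% MFU and 1024 GPUs
# are taken from the tweet.
N = 175e9             # parameters
D = 300e9             # training tokens
peak = 1979e12        # per-GPU FP8 peak FLOPS (dense)
mfu = 0.60
gpus = 1024

total_flops = 6 * N * D                          # ~3.15e23 FLOPs
days = total_flops / (gpus * peak * mfu) / 86400
print(f"{days:.1f} days")                        # ~3.0 days
```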
@NamanGoyal21
Naman Goyal
7 months
One of the unfortunate things the GPT-4 architecture leak caused is convincing many smart researchers across various labs that sparse models are the way to reach a GPT-4-quality model. "Data quality and FLOPS are all that matter" is such a simple yet hard paradigm to follow.
8
9
227
@NamanGoyal21
Naman Goyal
7 months
Finished 30/30 radiation therapy sessions today. The past 3-4 months have been one of the most challenging parts of my life. Recovery from surgery and radiation therapy was quite physically and mentally challenging. With due respect, cancer, please stay away from me from now on.
23
0
177
@NamanGoyal21
Naman Goyal
4 years
Facebook AI Research's sequence modeling library @fairseq has made its Twitter debut. Please follow for the latest updates.
2
10
43
@NamanGoyal21
Naman Goyal
6 months
@StasBekman The best GEMM TFLOPS across various large matmul sizes I got for A100 was ~286. For H100 it's currently about 750-800 for bf16 and 1550-1600 for fp8. I think and hope GEMM performance will improve over time as NVIDIA optimises matmuls for H100 further with newer cuBLAS versions.
2
0
12
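For reference, a minimal PyTorch sketch of how achieved GEMM TFLOPS like those above can be measured; the 8192x8192 bf16 matmul and the timing loop are illustrative choices, not the exact benchmark behind the quoted numbers (fp8 would need a different matmul path).

```python
# Rough GEMM throughput measurement, similar in spirit to the numbers above.
# Matrix size, dtype, and iteration counts are illustrative assumptions.
import torch

def gemm_tflops(n=8192, dtype=torch.bfloat16, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):                         # warmup
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    secs = start.elapsed_time(end) / 1000      # elapsed_time is in ms
    flops = 2 * n**3 * iters                   # 2*M*N*K per matmul
    return flops / secs / 1e12

print(f"{gemm_tflops():.0f} TFLOPS")
```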
@NamanGoyal21
Naman Goyal
1 year
@jekbradbury I do agree that it's gonna need a decent amount of engineering work! Though my guess is any team with access to a stable H100 cluster with 400Gbps InfiniBand (or a similar interconnect) should get there by the end of the year at the latest.
2
1
10
@NamanGoyal21
Naman Goyal
4 years
@ragtdata @facebookai @ylecun Yes, we are going to release the pretrained models soon.
0
0
6
@NamanGoyal21
Naman Goyal
5 months
@dome_271 The easiest way would be to disable flatten params and then set lr to 0.0 for the params you don't wanna update. I think that after setting flatten params to false, setting requires_grad=False should also work, but gotta check this to be sure.
2
0
4
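A minimal sketch of the "lr = 0.0" approach suggested above, using plain PyTorch optimizer parameter groups; the toy model and the choice of which parameters to freeze are hypothetical, and the FSDP flatten-params detail is left out.

```python
# Minimal sketch of "lr = 0.0 for the params you don't want to update".
# The two-layer model and the choice of frozen parameters are hypothetical.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))

optimizer = torch.optim.AdamW([
    {"params": list(model[0].parameters()), "lr": 0.0},   # effectively frozen
    {"params": list(model[1].parameters()), "lr": 1e-4},  # updated as usual
])

# Alternatively, the requires_grad route mentioned in the tweet:
# for p in model[0].parameters():
#     p.requires_grad = False
```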
@NamanGoyal21
Naman Goyal
2 years
@borisdayma @andrew_n_carr We recently noticed that the scalar of LN is also not needed, at least beyond the 6.7B model scale.
1
0
4
@NamanGoyal21
Naman Goyal
2 years
@giffmana @_arohan_ @achowdhery @arankomatsuzaki And also from one less inter-GPU communication within the tensor-parallel GPUs, which PaLM was doing. Interestingly, we are able to remove the scalar of LN as well without losing on PPL, so it's just the normalization that helps.
2
0
4
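In PyTorch terms, dropping the LN scalar discussed above can be expressed by disabling LayerNorm's learnable affine parameters; whether this matches the exact setup behind those experiments is an assumption.

```python
# LayerNorm with and without the learned gain/bias; only the normalization
# itself is kept in the second variant, as discussed in the tweets above.
import torch.nn as nn

dim = 4096  # illustrative model width
ln_default  = nn.LayerNorm(dim)                             # learnable gain + bias
ln_no_scale = nn.LayerNorm(dim, elementwise_affine=False)   # pure normalization
```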
@NamanGoyal21
Naman Goyal
1 year
@_arohan_ Congrats on the release. One quick question, though, if you don't mind: I am unable to understand what those tokens mean in Table 1, as FLOPs = 6ND doesn't seem to match and the absolute values look too low. Could there be a typo or some misunderstanding on my side? cc: @YiTayML
[image attached]
0
0
4
@NamanGoyal21
Naman Goyal
2 years
@SashaMTL @Jsevillamol I was curious too. The link above seems to show 90 kg of CO2 per hour per passenger, assuming 333 passengers. So for a full flight it comes to ~700 * 333 kg ≈ 233 tons, which seems very close to the ~271 tons.
1
0
4
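Spelling out that arithmetic; the ~700 kg per-passenger total implies a flight of roughly 700 / 90 ≈ 7.8 hours, which is an inference, not a figure from the thread.

```python
# Arithmetic from the tweet above. The ~700 kg/passenger total implies a
# flight of roughly 700 / 90 ~= 7.8 hours (an inference, not a quoted figure).
kg_per_passenger_per_hour = 90
kg_per_passenger = 700
passengers = 333

hours = kg_per_passenger / kg_per_passenger_per_hour   # ~7.8 h
total_tons = kg_per_passenger * passengers / 1000      # ~233 t, vs. the ~271 t estimate
print(hours, total_tons)
```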
@NamanGoyal21
Naman Goyal
9 months
@zhansheng I fine-tuned MLMs (RoBERTa, BART, XLM-R, from 100M to 10B scale) a bunch around that time, but can't remember this behavior being specific to fine-tuning. Mainly, big models overall were less stable, and for that I think two things changed: pre-layer-norm and bf16.
0
0
4
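For context, a minimal sketch of a pre-layer-norm block of the kind referred to above, where normalization is applied before each sublayer instead of after; the dimensions and block structure are illustrative, not any specific model's code.

```python
# Illustrative pre-LN transformer block: LayerNorm is applied before each
# sublayer, which in practice tends to be more stable at large scale.
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.ln1(x)                                      # normalize before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]    # residual add
        x = x + self.ffn(self.ln2(x))                        # normalize before FFN
        return x
```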
@NamanGoyal21
Naman Goyal
2 years
@StasBekman My guess would be 2, 4 and 3 in that order.
1
0
3
@NamanGoyal21
Naman Goyal
1 year
@arkerr Amazing!!! Also curious: what are the timelines for FP8 GEMM support in CUTLASS?
0
0
3
@NamanGoyal21
Naman Goyal
1 year
@OriolVinyalsML I have one request for Flamingo output: I am really curious how Flamingo does on this classic example from Andrej Karpathy's 10-year-old blog post.
1
0
3
@NamanGoyal21
Naman Goyal
1 year
@YiTayML @artetxem Congratulations, Mikel is amazing! Looking forward to great things!
0
0
3
@NamanGoyal21
Naman Goyal
1 year
@_arohan_ I remember this paper () from Google Brain that used MoE (GShard / Switch Transformer style configuration) in ViT models to scale up to 15B parameters. It's conditional compute, so maybe not what you meant to ask?
1
0
2
@NamanGoyal21
Naman Goyal
3 years
@StasBekman It's totally unrelated to divergence, but usually it's a good idea to keep model_dim % num_heads a power of two instead of 0. I have seen an empirical speedup with that.
2
0
1
@NamanGoyal21
Naman Goyal
6 months
@StasBekman I don't know of any publicly available semi-official numbers. Plus I think it might vary a bit with the exact configuration of the server: power capping or not, type of cooling, etc.
1
0
1
@NamanGoyal21
Naman Goyal
5 years
@annargrs @myleott @vesko_st @LukeZettlemoyer @omerlevy_ @YinhanL @mandarjoshi_ @danqi_chen Yes, the variance for RTE and MRPC is higher compared to tasks with bigger datasets. E.g., for the last row in the above table, the SD for some tasks is {RTE: 1.57, MRPC: 0.87, MNLI: 0.15, QNLI: 0.096, SST: 0.21} across 5 seeds. We will consider adding SDs in the updated version of the paper.
1
0
1
@NamanGoyal21
Naman Goyal
1 year
@alex_conneau @OpenAI Congrats Alexis!!
0
0
1
@NamanGoyal21
Naman Goyal
5 months
@Ethan_smith_20 @dome_271 If your frozen params are not at the end but in between the transformer layers, you anyway need to compute dgrad for everything. You will be computing wgrad unnecessarily though, which can make it at most ~1/3 slower. I agree it's not ideal, but it's an easy thing to try.
0
0
0
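A small illustration of the dgrad/wgrad point above: freezing a middle layer with requires_grad=False skips its weight gradients, but activation gradients still have to flow through it so the earlier trainable layers get updated. The toy three-layer model is hypothetical.

```python
# Freezing a middle layer: its weight grads (wgrad) are skipped, but the
# backward pass still propagates activation grads (dgrad) through it so the
# first layer can be trained. Toy three-layer model for illustration.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8))
for p in model[1].parameters():          # freeze the middle layer
    p.requires_grad = False

model(torch.randn(4, 8)).sum().backward()

print(model[1].weight.grad)              # None: wgrad skipped for the frozen layer
print(model[0].weight.grad is not None)  # True: dgrad still flowed through it
```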
@NamanGoyal21
Naman Goyal
2 years
@LChoshen @YebHavinga @BramVanroy @YinhanL @thoma_gu @xl_nlp Thanks for the question. Every individual sample instance was always from a single language, but within a batch, each sample could be from a different language.
2
0
1
@NamanGoyal21
Naman Goyal
3 years
@stefan_it_ @alex_conneau Looking into it; let's chat on the GitHub issue?
0
0
0