Unfortunately, A/B testing is tough: it requires lots of subjects and/or well-defined usage patterns. In lieu of that, my favorite eval method is "find someone who has spent way too much time using way too many models and ask them to do a vibe check". Reviewers don't love this tho.
A few (somewhat data-centric) thoughts on the Gemini whitepaper 🧵:
Can't be much more direct than this: "We find that data quality is critical to a highly-performing model". It feels especially pointed given that they provide next to no information about the training data.
Despite Gemini explicitly acknowledging the importance of data quality, I’m sure ML twitter will keep perseverating on the importance of architecture choices like the “efficient attention mechanisms” that the report also mentions
Gemini also continues the trend of training small models for looonger. As deep learning models transition from research artifact to production necessity, inference costs are going to increasingly dominate the economics. Llongboi just keeps getting llonger:
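The "train small, train llong" logic falls out of some standard back-of-envelope approximations (these are my assumptions, not numbers from the report): training cost ≈ 6·N·D FLOPs and inference cost ≈ 2·N FLOPs per token, so once you expect to serve a lot of tokens, a smaller model trained past its "compute-optimal" token count wins on lifetime cost. The model sizes and token counts below are hypothetical:

```python
# Back-of-envelope lifetime compute, using the standard approximations
# (assumptions, not from the Gemini report):
#   training FLOPs   ~= 6 * N * D      (N = params, D = training tokens)
#   inference FLOPs  ~= 2 * N per generated token

def lifetime_flops(n_params, train_tokens, inference_tokens):
    train = 6 * n_params * train_tokens
    infer = 2 * n_params * inference_tokens
    return train + infer

# Hypothetical models with roughly equal *training* budgets:
big   = lifetime_flops(70e9, 1.4e12, 1e12)   # big model, "compute-optimal" tokens
small = lifetime_flops(13e9, 7.5e12, 1e12)   # small model trained llonger

assert small < big  # the overtrained small model is cheaper over its lifetime
```

The gap only widens as the expected number of served tokens grows, which is the whole point once models become production infrastructure.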
Ok, for those wondering about the origin of our nickname "Llongboi", here it is.
(@jefrankle got mad at me for putting this in the wild. Once it's free, it's free!)
One very relevant consequence of growing token budgets: the need for data curation grows too! The quantity (and possibly even the proportion 😱) of redundant, noisy, and misleading examples increases with the size of your dataset!
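The redundancy half of that problem is the easiest to illustrate. Here's a toy exact-dedup pass via content hashing (a minimal sketch; real curation pipelines also do near-duplicate detection, e.g. MinHash, and the normalization here is my own simplification):

```python
import hashlib

def dedup_exact(docs):
    """Drop exact-duplicate documents (after light normalization)
    by hashing content. One pass, O(1) lookup per doc."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = ["The cat sat.", "the cat sat.", "A dog ran.", "The cat sat."]
print(dedup_exact(corpus))  # → ['The cat sat.', 'A dog ran.']
```

Exact dedup is the cheap first line of defense; the noisy and misleading examples are the ones that actually require judgment.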
The Gemini whitepaper also emphasizes the importance of training the tokenizer on a “large sample” of the dataset. IMO tokenizers as a vector for model improvement are vastly underexploited. Data curation and tokenization both suffer from the same root cause: researchers overlook data.
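The report doesn't say how that “large sample” was drawn, but for a corpus too big to hold in memory, one-pass uniform sampling (reservoir sampling) is the classic way to pull a tokenizer-training sample. A sketch under that assumption:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k items from a stream of unknown length in a
    single pass (Algorithm R) -- handy for pulling a tokenizer-training
    sample out of a corpus that doesn't fit in memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                sample[j] = item
    return sample

lines = (f"doc-{i}" for i in range(1_000_000))  # stand-in for a huge corpus
sample = reservoir_sample(lines, k=10_000)
assert len(sample) == 10_000
```

Whatever the sampling scheme, the point stands: the tokenizer only learns the distribution you show it, so the sample matters as much as the vocab size.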
Gemini Ultra training was distributed across datacenters! Model parallel within SuperPods (and datacenters) and data parallel across SuperPods (and datacenters)! This is impressive in part because gradients are notoriously shy and reluctant to leave their home datacenter.
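The trick that makes this tractable is hierarchy: reduce gradients cheaply over the fast local network first, so only one (pod-level) gradient per SuperPod has to cross the slow inter-datacenter link. A toy simulation of that two-level reduction (my own simplification, not Google's actual setup):

```python
# Toy two-level gradient reduction: workers -> pod -> global.
# Gradients are plain lists of floats here for illustration.

def local_reduce(worker_grads):
    """Average gradients within one SuperPod (fast, local network)."""
    n = len(worker_grads)
    return [sum(g) / n for g in zip(*worker_grads)]

def cross_dc_allreduce(pod_grads):
    """Average pod-level gradients across datacenters: only one
    gradient per pod ever crosses the slow inter-DC link."""
    n = len(pod_grads)
    return [sum(g) / n for g in zip(*pod_grads)]

pod_a = [[1.0, 2.0], [3.0, 4.0]]   # per-worker grads in datacenter A
pod_b = [[5.0, 6.0], [7.0, 8.0]]   # per-worker grads in datacenter B
global_grad = cross_dc_allreduce([local_reduce(pod_a), local_reduce(pod_b)])
print(global_grad)  # → [4.0, 5.0]
```

With equal-sized pods, the two-level average equals the flat average over all workers, so the math is unchanged; only the communication pattern is.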
TFW cosmic rays ruin your training run.
To be fair, most SDC events probably aren't due to cosmic rays, but it's fun to think about the universe extending a glittering tendril into the delicate gears of your trainer and whispering "nope".
The benchmark evals are pretty impressive compared to the baselines, though I'm always skeptical of benchmarks. A/B testing seems like the most direct way to measure model quality IMO...which they do! Though not against GPT-4 ☹️
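For anyone running their own head-to-head comparison, the basic readout is a win rate with an uncertainty bound. A minimal sketch using a normal-approximation interval, with ties split evenly (a crude but common convention; the counts below are made up):

```python
import math

def winrate_ci(wins, ties, losses, z=1.96):
    """Head-to-head win rate with a normal-approx 95% CI.
    Ties count as half a win for each side."""
    n = wins + ties + losses
    p = (wins + 0.5 * ties) / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, (p - half, p + half)

# Hypothetical side-by-side ratings over 500 prompts:
p, (lo, hi) = winrate_ci(wins=260, ties=80, losses=160)
print(f"win rate {p:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

If the interval clears 0.5, you can claim a preference; if it straddles 0.5, you need more raters or more prompts, which is exactly why A/B testing is tough.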
Overall the Gemini work is very technically impressive and highlights the importance of data quality. I'm excited for everyone to vibe check the Ultra model once it's available, and to see the subsequent data sheets/white papers when they're released.