Unfortunately, A/B testing is tough: it requires lots of subjects and/or well-defined usage patterns. In lieu of that, my favorite eval method is "find someone who has spent way too much time using way too many models and ask them to do a vibe check". Reviewers don't love this tho.
A few (somewhat data-centric) thoughts on the Gemini whitepaper 🧵:
Can't be much more direct than this: "We find that data quality is critical to a highly-performing model". It feels especially pointed given that they provide next to no information about the training data.
Despite Gemini explicitly acknowledging the importance of data quality, I’m sure ML twitter will keep perseverating on the importance of architecture choices like the “efficient attention mechanisms” that the report also mentions
Gemini also continues the trend of training small models for looonger. As deep learning models transition from research artifact to production necessity, inference costs are going to increasingly dominate the economics. Llongboi just keeps getting llonger:
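The "train small, train llong" logic falls out of some standard back-of-envelope approximations (these are my assumptions, not numbers from the report): training cost ≈ 6·N·D FLOPs and inference cost ≈ 2·N FLOPs per token, so once you expect to serve a lot of tokens, a smaller model trained past its "compute-optimal" token count wins on lifetime cost. The model sizes and token counts below are hypothetical:

```python
# Back-of-envelope lifetime compute, using the standard approximations
# (assumptions, not from the Gemini report):
#   training FLOPs   ~= 6 * N * D      (N = params, D = training tokens)
#   inference FLOPs  ~= 2 * N per generated token

def lifetime_flops(n_params, train_tokens, inference_tokens):
    train = 6 * n_params * train_tokens
    infer = 2 * n_params * inference_tokens
    return train + infer

# Hypothetical models with roughly equal *training* budgets:
big   = lifetime_flops(70e9, 1.4e12, 1e12)   # big model, "compute-optimal" tokens
small = lifetime_flops(13e9, 7.5e12, 1e12)   # small model trained llonger

assert small < big  # the overtrained small model is cheaper over its lifetime
```

The gap only widens as the expected number of served tokens grows, which is the whole point once models become production infrastructure.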
Ok, for those wondering about the origin of our nickname "Llongboi", here it is.
(@jefrankle got mad at me for putting this in the wild. Once it's free, it's free!)
One very relevant consequence of growing token budgets: the need for data curation grows too! The quantity (and possibly even the proportion 😱) of redundant, noisy, and misleading examples increases with the size of your dataset!
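The redundancy half of that problem is the easiest to illustrate. Here's a toy exact-dedup pass via content hashing (a minimal sketch; real curation pipelines also do near-duplicate detection, e.g. MinHash, and the normalization here is my own simplification):

```python
import hashlib

def dedup_exact(docs):
    """Drop exact-duplicate documents (after light normalization)
    by hashing content. One pass, O(1) lookup per doc."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = ["The cat sat.", "the cat sat.", "A dog ran.", "The cat sat."]
print(dedup_exact(corpus))  # → ['The cat sat.', 'A dog ran.']
```

Exact dedup is the cheap first line of defense; the noisy and misleading examples are the ones that actually require judgment.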
The Gemini whitepaper also emphasizes the importance of training the tokenizer on a “large sample” of the dataset. IMO tokenizers as a vector for model improvement are vastly underexploited. Data curation and tokenization both suffer from the same root cause: researchers overlook data.
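The report doesn't say how that “large sample” was drawn, but for a corpus too big to hold in memory, one-pass uniform sampling (reservoir sampling) is the classic way to pull a tokenizer-training sample. A sketch under that assumption:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k items from a stream of unknown length in a
    single pass (Algorithm R) -- handy for pulling a tokenizer-training
    sample out of a corpus that doesn't fit in memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                sample[j] = item
    return sample

lines = (f"doc-{i}" for i in range(1_000_000))  # stand-in for a huge corpus
sample = reservoir_sample(lines, k=10_000)
assert len(sample) == 10_000
```

Whatever the sampling scheme, the point stands: the tokenizer only learns the distribution you show it, so the sample matters as much as the vocab size.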
Gemini Ultra training was distributed across datacenters! Model parallel within SuperPods (and datacenters) and data parallel across SuperPods (and datacenters)! This is impressive in part because gradients are notoriously shy and reluctant to leave their home datacenter.
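The trick that makes this tractable is hierarchy: reduce gradients cheaply over the fast local network first, so only one (pod-level) gradient per SuperPod has to cross the slow inter-datacenter link. A toy simulation of that two-level reduction (my own simplification, not Google's actual setup):

```python
# Toy two-level gradient reduction: workers -> pod -> global.
# Gradients are plain lists of floats here for illustration.

def local_reduce(worker_grads):
    """Average gradients within one SuperPod (fast, local network)."""
    n = len(worker_grads)
    return [sum(g) / n for g in zip(*worker_grads)]

def cross_dc_allreduce(pod_grads):
    """Average pod-level gradients across datacenters: only one
    gradient per pod ever crosses the slow inter-DC link."""
    n = len(pod_grads)
    return [sum(g) / n for g in zip(*pod_grads)]

pod_a = [[1.0, 2.0], [3.0, 4.0]]   # per-worker grads in datacenter A
pod_b = [[5.0, 6.0], [7.0, 8.0]]   # per-worker grads in datacenter B
global_grad = cross_dc_allreduce([local_reduce(pod_a), local_reduce(pod_b)])
print(global_grad)  # → [4.0, 5.0]
```

With equal-sized pods, the two-level average equals the flat average over all workers, so the math is unchanged; only the communication pattern is.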
TFW cosmic rays ruin your training run.
To be fair, most SDC events probably aren't due to cosmic rays, but it's fun to think about the universe extending a glittering tendril into the delicate gears of your trainer and whispering "nope".
The benchmark evals are pretty impressive compared to the baselines, though I'm always skeptical of benchmarks. A/B testing seems like the most direct way to measure model quality IMO...which they do! Though not against GPT-4 ☹️
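For anyone running their own head-to-head comparison, the basic readout is a win rate with an uncertainty bound. A minimal sketch using a normal-approximation interval, with ties split evenly (a crude but common convention; the counts below are made up):

```python
import math

def winrate_ci(wins, ties, losses, z=1.96):
    """Head-to-head win rate with a normal-approx 95% CI.
    Ties count as half a win for each side."""
    n = wins + ties + losses
    p = (wins + 0.5 * ties) / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, (p - half, p + half)

# Hypothetical side-by-side ratings over 500 prompts:
p, (lo, hi) = winrate_ci(wins=260, ties=80, losses=160)
print(f"win rate {p:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

If the interval clears 0.5, you can claim a preference; if it straddles 0.5, you need more raters or more prompts, which is exactly why A/B testing is tough.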
Overall the Gemini work is very technically impressive and highlights the importance of data quality. I'm excited for everyone to vibe check the Ultra model once it's available, and to see the subsequent data sheets/white papers when they're released.