My Transformer tutorial slides are now available at
I'll append recordings to this thread as I get them.
If you want to use some of the slides for your lecture, you may, as long as you credit me.
If you'd like me to give the lecture: maybe; e-mail me.
Giving a lecture introducing the Transformer architecture in all gory details at
@M2lSchool
tomorrow. Also got permission to publish slides and will share recording if/when I get one.
It's a pretty cool set of slides, largely thanks to
@_basilM
for inspiration!
How good of a BERT can one get in ONE DAY on ONE GPU?
With all the recent studies about scaling compute up, this paper takes a refreshing turn and does a deep dive into scaling down compute.
It's well written and chock-full of insights. Here's my summary, with my opinions.
🧶 1/N
What makes CLIP work?
The contrast with negatives via softmax?
The more negatives, the better -> large batch-size?
We'll answer "no" to both in our ICCV oral🤓
By introducing SigLIP, a simpler CLIP that also works better and is more scalable, we can study the extremes.
Hop in🧶
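In case the softmax-vs-sigmoid distinction sounds abstract, here's a minimal numpy sketch of the pairwise sigmoid loss idea (variable names and the temperature/bias values are illustrative, not the paper's actual implementation):

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: each (image, text) pair is an independent
    binary classification problem, so no softmax over the batch is needed."""
    logits = img_emb @ txt_emb.T * t + b       # (n, n) similarity logits
    labels = 2 * np.eye(len(img_emb)) - 1      # +1 on the diagonal (matches), -1 elsewhere
    # mean of log(1 + exp(-label * logit)), computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, -labels * logits))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 8)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(siglip_loss(img, txt))
```

Because every pair is scored independently, the loss doesn't change meaning with batch size the way a softmax over in-batch negatives does.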
Ah yes, the well known "except you, FAANG" clause that's so common in *open source* licenses like GPL, MIT, BSD, Apache2, ...
Here I go again, this can't be for real lol
This is huge: Llama-v2 is open source, with a license that authorizes commercial use!
This is going to change the landscape of the LLM market.
Llama-v2 is available on Microsoft Azure and will be available on AWS, Hugging Face and other providers
Pretrained and fine-tuned…
lol, draft of full GPT-4 paper with architecture and data details is already leaked on torrent😂
The vision component in the architecture is an interesting twist to plain ViT, and scaled up quite a bit!
Link to the torrent for the curious:
This quote from
@demishassabis
is my favourite take on the "engineering vs science" debates in AI yet:
AI is an engineering science: unlike in natural sciences, the phenomenon you're studying doesn't exist in nature, so you have to build it first, and then you can study it.
1/N The return of patch-based self-supervision! It never worked well and you had to bend over backwards with ResNets (I tried).
Now with ViT, very simple patch-based self-supervised pre-training rocks! First BeIT, now Masked AutoEncoders i1k=87.8%
🧶
You're welcome, OpenAI. I'll share my home address in DM if you want to send us flowers and chocolate.
Actually, fun fact: one of the runners-up for ViT's name was "ToP", meaning "Transformer on Patches". However, we ditched it because "the ToP model" was kinda borderline.
Oh my, code editors could be so much more beautiful! Below are two different ways to display the exact same code, taking up the same space: standard way first, and a beautiful mock-up second.
I love the idea and style:
Ilya Sutskever unambiguously confirming what we all knew but just wanted to hear admitted:
OpenAI's current closing up is for competitive reasons, not because of safety concerns
This is exactly what I hate with all big frameworks. TF is terrible. PyTorch used to be straightforward but turned terrible too. Torch7 was very direct. JAX/Flax still ok, but I pray every day that it doesn’t end up with the same fate over time.
Have you ever wanted to train LLMs in pure C without 245MB of PyTorch and 107MB of CPython? No? Well now you can! With llm.c:
To start, it implements GPT-2 training on CPU/fp32 in only ~1,000 lines of clean code. It compiles and runs instantly, and exactly…
Holy cow, I consider myself an advanced matplotlib user, and I've never seen this before. So good. I should reconsider, and consider myself a noob again :)
`plt.subplot_mosaic(...)` is the single-most amazing
@matplotlib
function I'd never heard of 😍🤓🌍 Can't believe I've used Python for more than a decade and only just discovered it! Subplots will never be the same again 🌟
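For anyone who wants to try it, a minimal sketch (the layout string and figure contents are just illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# One ASCII-art string defines the whole layout: "B" spans both rows on the right.
fig, axs = plt.subplot_mosaic("AAB;CCB", figsize=(6, 4))

x = np.linspace(0, 2 * np.pi, 100)
axs["A"].plot(x, np.sin(x)); axs["A"].set_title("sin")
axs["C"].plot(x, np.cos(x)); axs["C"].set_title("cos")
axs["B"].imshow(np.outer(np.sin(x), np.cos(x)))
fig.tight_layout()
```

The returned `axs` is a dict keyed by the characters in the layout string, which makes the plotting code self-documenting.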
Hi everyone yes, I left OpenAI yesterday. First of all nothing "happened" and it’s not a result of any particular event, issue or drama (but please keep the conspiracy theories coming as they are highly entertaining :)). Actually, being at OpenAI over the last ~year has been…
Who killed non-contrastive image-text pretraining?
@AlecRad
and
@_jongwook_kim
with the below Fig2 in CLIP.
Who collected the 7 Dragonballs and asked Shenron to resurrect it? Yours truly, in this new paper of ours.
Generative captioning is not only competitive, it seems better!
Alright folks, after a full workday of discussion with lots of nuance and zero work done, we eventually solved everything. Alexey and
@MarioLucic_
suggested this solution, which I'm stealing to collect the likes.
So you think you know distillation; it's easy, right?
We thought so too with
@XiaohuaZhai
@__kolesnikov__
@_arohan_
and the amazing
@royaleerieme
and Larisa Markeeva.
Until we didn't. But now we do again. Hop on for a ride (+the best ever ResNet50?)
🧵👇
Here's what our (sub)team in Zürich has done for OSS vision over the past 5y, besides inventing ViT:
1) Make i21k a thing
Release:
2) best CLIP (siglip) by a large margin
3) best i1k ResNet50 ever
4) best pre-trained ResNets
5) >55k ViTs
6) Most efficient JAX/TPU CV code
deets👇
It can't be repeated enough: learning-rate is the single most bang-for-buck thing you can tune.
If you think you know *ze best* learning-rate, it just means you only train standard stuff!
This is not a "secret trick" either; it's stated very clearly in THE deep-learning book:
On the importance of tuning at least the learning rate for your experiments!!!
Here I just multiplied it by 10x and see the difference!
For most of my models, the same value is close to optimal, so initially I got lazy; but for this captioning model, a much larger value was needed.
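A toy illustration of the effect (gradient descent on a quadratic, obviously nothing like the actual captioning setup):

```python
def train(lr, steps=100):
    """Gradient descent on f(w) = 0.5 * w^2; the minimizer is w = 0."""
    w = 1.0
    for _ in range(steps):
        w -= lr * w  # gradient of 0.5 * w^2 is w
    return abs(w)

for lr in (0.001, 0.01, 0.1):
    print(f"lr={lr}: final |w| = {train(lr):.5f}")
```

Same model, same step budget; a 10x larger learning rate gets orders of magnitude closer to the optimum.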
Looks like the gzip paper I was enthusiastic about over-estimated its scores because of a bug in the code: it effectively measured top-2 accuracy instead of proper kNN with k=2.
We should remember this as (yet another) strong case for testing in ML code.
I still like that it put a new idea in my toolbox.
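For the record, here's my understanding of the difference, as a hypothetical sketch (not the paper's actual code):

```python
from collections import Counter

def knn_k2_label(neighbor_labels):
    """Proper k=2 kNN: commit to one prediction via a vote over the 2 nearest labels."""
    return Counter(neighbor_labels[:2]).most_common(1)[0][0]

def top2_correct(neighbor_labels, true_label):
    """The buggy variant: count it correct if *either* of the 2 nearest matches."""
    return true_label in neighbor_labels[:2]

# When the two nearest neighbors disagree, true kNN must pick one label,
# while the top-2 variant gets credit for both, inflating accuracy.
neighbors = ["cat", "dog"]
print(knn_k2_label(neighbors))
print(top2_correct(neighbors, "dog"))
```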
Want to turn any vision backbone into an image-text model? Want to show the age-old "your model wouldn't recognize a cow on the beach" is a red herring?
That's LiT🔥 (Locked-image Tuning), a new alternative to fine-tuning that combines the best of fine-tuning and zero-shot
1/n🧶
It's about time: analog clock reading in the wild
A great example of an applied vision paper, let me walk you through why I like it. 🧶
They also make good use of Spatial Transformer Networks (STN), one of the most elegant ideas that usually don't work :)
Especially for computer vision folks: beware the LayerNorm pitfall!
How LN is used in CNNs is actually different from how it's used in Transformers (including ViT)
Figure below from the paper by
@YueCao72324941
, Zhuliang Yao, Yutong Lin, et al.
This is wrong on multiple levels, ugh!
1. Don't get pressured into "not wasting your talent" bullshit. Just do whatever lets you enjoy life.
2. AI PhD != good founder or early engineer. Not by a mile.
3. Life with a proper salary is great.
4. There's cool research at BigCo.
PhD graduates in AI mostly take boring jobs at big tech companies due to short-term monetary incentives.
While understandable to some degree, it's also quite sad to see so many great researchers 'disappear' and give up their talent - join or do your own startup instead!
🧶PaLI-3 achieves SOTA across many vision-language (and video!) tasks while being 10x smaller than its predecessor PaLI-X.
At only 5B parameters, it's also smaller (and stronger) than the concurrent Fuyu-8B model, though sadly we cannot release the model (props to
@AdeptAILabs
)
It's Saturday. My 3yo is napping right now.
Once he wakes up, I'll go fire up some H100's and help him code some of the easy ideas I have in the back of my mind. We might do it just in time for NeurIPS.
Gotta start early and completely abuse my privilege, or so I heard🚀🚀🚀
This paper is all about large-scale pre-training of DL models. It completely lacks any mention of our work on this exact topic over the last >2 years. One of these is literally called "big transfer".
Do I need to write a Schmidthuber-like blogpost about all our group's work now?
Stanford's ~entire AI Department has just released a 200-page, 100-author Neural Scaling Laws Manifesto.
They're pivoting to positioning themselves as
#1
at academic ML Scaling (e.g. GPT-4) research.
"On the Opportunities and Risks of Foundation Models"
I recently overheard something like "Aren't Transformers standard in Vision now?"
I wasn't sure. So this weekend, whenever the kid was asleep, I scraped, parsed, analyzed ALL CVF proceedings of the last decade to find out!
Methodology, code, request for feedback in thread.
GPT-5 live-testing its Q* DotA mode (internally known as QpenAI-5), just one of many capabilities that I've heard emerge at ultramassive scale.
Still some work left to do though.
so just played a game where we encountered an actual AI learning program as a teammate
dude last picks Invoker, walks top and has Midas queued up. He nails sunstrikes, but plays super weird.
We decide to chat an AI shutdown code and he stops moving
??
I really had to travel to HQ for a week to convince everyone they should just add ai to whatever they're doing. It's been an uphill battle, but they are slowly starting to!
Did you know we can use scaling laws not only to predict the optimal number of params, but also the optimal model "shape"? (depth, width, MLP size)
Now you know!
With this, we get a 400M-param plain ViT to 90.3% on ImageNet, matching ViT-g on many benchmarks.
Read more below or in our paper:
Excited to share our work on optimizing vision transformers. We advance scaling laws to infer compute-optimal model shapes, achieving better results with smaller models, e.g. 90.3% on ImageNet with 400M params. This surpasses the much larger ViT-g!
abs:
Pleased to announce we are releasing checkpoints for our SigLIP models!
These are very strong image-text ViTs. We release them along with a colab to play around with. Most are english, but we also release a good i18n one.
Sorry, no magnet link mic drop. More in thread🧶
1/N After 3.8 wonderful years at Google Brain in Zürich, I have decided it is time for me to embark on a new adventure.
I'm thankful for all the amazing colleagues I've met so far, and hope to stay in touch and maybe even collaborate in the future.
My ambitious next venture:
Google presents:
Stealing Part of a Production Language Model
- Extracts the projection matrix of OpenAI’s ada and babbage LMs for <$20
- Confirms that their hidden dim is 1024 and 2048, respectively
- Also recovers the exact hidden dim size of gpt-3.5-turbo…
YES. Thanks Andrej. To this day, way Way WAY too many people doing DL are way Way WAY too careless.
I think each small DL team needs at least two people who are obsessed with detail. But the team shouldn't be composed of solely such people either, or it'll go nowhere.
Beautiful work / attention to detail trying to get Gemma to finetune correctly. There are so many foot guns here to be super careful with. All of these issues don't throw any errors, they silently make your network worse.
A great example of what I wrote about in my "A Recipe for…
Video generation is the one thing I've actually been a long-term pessimist on.
But with this, and the recent paper that generated 1h consistent (but low q) videos w/ diffusion, I may have to change my mind.
Maybe if a big lab jumps on it, we'll get seriously impressed next year
My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning, on the tasks I tried it, it matched our heavily tuned existing setup!
If there is one thing the deep learning revolution has taught us, it's that neural nets will outperform hand-designed heuristics, given enough compute and data.
But we still use hand-designed heuristics to train our models. Let's replace our optimizers with trained neural nets!
People are jumping on this as something special, meanwhile I'm just sitting here thinking «someone slid a few examples like that into the probably very large SFT/IT/FLAN/RLHF/... dataset and thought "this will be neat" as simple as that»
Am I over simplifying? 🫣
Fun story from our internal testing on Claude 3 Opus. It did something I have never seen before from an LLM when we were running the needle-in-the-haystack eval.
For background, this tests a model’s recall ability by inserting a target sentence (the "needle") into a corpus of…
Beyond classification in vision, it always feels weird to optimize for a loss which doesn't _really_ match how we'll use the model later on*, but happens to be differentiable.
In our latest work, we tackle this discrepancy🧶
*unless the model is 100% perfect, which it never is.
Vision meets RL! We reveal that policy gradient can be used for tuning vision models to optimize complex metrics, such as mAP, PQ or “color diversity”, observing large performance boosts on tasks like object detection, panoptic segmentation, etc.
Seeing many simple Q's re Grok, let me answer w/o inside knowledge:
1. 😐benchmarks: a) raw model b) trained for interaction, not benchmarks.
2. Why tanh(30) attn? Avoid exploding logits.
3. gelu approx? Default in jax, most efficient.
4. 340b useless? not made for u.
cont/
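Re point 2, the tanh capping looks roughly like this (a sketch; the cap value 30 comes from the question, and the rest of the attention code is omitted):

```python
import numpy as np

def soft_cap(logits, cap=30.0):
    """Smoothly squash attention logits into (-cap, cap): roughly the identity
    for small values, saturating instead of exploding for large ones."""
    return cap * np.tanh(logits / cap)

print(soft_cap(np.array([1.0, 50.0, 500.0])))
```

Unlike hard clipping, this stays differentiable everywhere, so gradients don't vanish abruptly at the boundary.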
The ConvNeXt paper is rightfully getting some attention: it's good work and has beautiful plots.
But Fig. 1 needs a little correction IMO. They compare heavily aug/reg Swin+ConvNeXt to plain ViT. We fixed this in , which is what should always be compared to.
A ConvNet for the 2020s
abs:
github:
Constructed entirely from standard ConvNet modules, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation
It happened!
Today I saw a video tutorial on TikTok explaining how to make a trading bot with the help of ChatGPT, which created a strategy that provides an insane 42,000% profit.
AI is amazing. I took tomorrow off to implement this, and then I'll see you on my yacht, suckers!
If you haven't read our latest ImageNet SOTA work "Vision Transformers (ViT)" yet, shame on you. But! There's hope! Here's the corresponding blogpost which is a nice tl;dr:
Disagree. As soon as you throw sparsity in (and depthwise/tiny-group conv is a form of sparsity) FLOPs detach from reality.
That's why sparse nets are hard (), and EffNetV2 actually UNDOES a lot of the depthwise convs.
EffNetV1 == MobileNetV3 == designed for CPU.
Some interesting discussion on r/machinelearning about EfficientNet and CNN efficiency.
TBH, I think FLOPS as a measurement of models sometimes gets a bad rap. It has its downsides, but it's one of the harder metrics to "game".
Wow! FAIR was home to the best computer vision researchers. But over the last couple years, one by one, they left. It's now a shell of its former self.
This is not a hate post: I like and admire them. But wonder what went wrong. I'd love to buy a book that tells their story.
Soooo language folks have rediscovered that generating and ensembling multiple predictions at test time helps.
In vision, it's multi-crop eval; we know it works, but collectively decided to stop reporting it.
but… somehow… now we call it agents? Did I get this right?
Now anyone can download and play with a TRILLION-parameter language model!
I'm obviously biased, but happy that Google Brain is showing some good leadership in the right direction for science here, by allowing them to release the model, no-nonsense.
PS: I haven't tried it yet!
Today we're releasing all Switch Transformer models in T5X/JAX, including the 1.6T param Switch-C and the 395B param Switch-XXL models. Pleased to have these open-sourced!
All thanks to the efforts of James Lee-Thorp,
@ada_rob
, and
@hwchung27
It literally does not matter the domain: sports, art, engineering, music, carpentry, ... I just love watching (or reading about) people at the absolute top of their game.
Even without knowing much about the domain, it's often easy to tell who's in a league of their own.
This comment I screenshotted below is a really on point description of the current vibe I feel. I share the commenter’s fear.
Though it’s not all lost yet! For instance, I’m involved in 4 neurips submissions and had 2 iccv ones. Really hoping our openness keeps going like this💪
I think this post channels what a lot of people in the AI community feel right now. As if the stark hypocrisy wasn't enough, there are now also blatant gatekeeping attempts. I've started thinking worse of people who choose to still work for them.
Most recent large transformer decoders use this trick of having multiple heads for the queries, but only one for the keys/values.
I always thought it was a small, not-well-documented trick of the trade. But no, there's a nice paper about "multi-query attention", of course by Noam.
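A rough numpy sketch of the idea (shapes and names are mine; I'm omitting masking, output projections, and other details):

```python
import numpy as np

def multi_query_attention(x, wq, wk, wv, n_heads):
    """Multi-query attention: n_heads query heads, but a single shared
    key/value head, which shrinks the KV cache by a factor of n_heads."""
    T, d = x.shape
    hd = d // n_heads
    q = (x @ wq).reshape(T, n_heads, hd)   # (T, H, hd): per-head queries
    k = x @ wk                             # (T, hd): one shared key head
    v = x @ wv                             # (T, hd): one shared value head
    att = np.einsum("thd,sd->hts", q, k) / np.sqrt(hd)
    att = np.exp(att - att.max(axis=-1, keepdims=True))  # stable softmax
    att /= att.sum(axis=-1, keepdims=True)
    out = np.einsum("hts,sd->thd", att, v)  # every head reuses the same k/v
    return out.reshape(T, d)

rng = np.random.default_rng(0)
T, d, H = 5, 16, 4
x = rng.normal(size=(T, d))
y = multi_query_attention(x, rng.normal(size=(d, d)),
                          rng.normal(size=(d, d // H)),
                          rng.normal(size=(d, d // H)), H)
print(y.shape)  # (5, 16)
```

The win is mostly at inference: the KV cache stores one head's worth of keys/values instead of H, which matters a lot for long sequences.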
Our Mixer has been cooking for a while!
We present a novel architecture composed of only MLPs. Look ma, no convs, no attention at all. And it works as well as the best ResNets and ViTs.
Major props to Ilya Tolstikhin and
@neilhoulsby
who led this investigation.
#Parti
: A photo of an astronaut riding a horse in the forest. There is a river in front of them with water lilies.
I like this visually very clear demonstration of the benefits of scale:
A new image generation model just dropped.
Great work by the team!
+ Auto-regressive, encoder->decoder Transformer
+ Classifier-free sampling.
+ ViT-VQGAN
Really amazing results: Image from the website.
Meanwhile, in central/west EU (east idk) as a PhD student you get:
- a standard wage you can live off just fine, no roommates needed.
- not just your own desk, but often a 2-person office (!)
- in return, you have to teach ~half the time & grade exams
I recommend doing PhD here.
LongNet/1B seqlen. Saving you the click:
- uses hierarchical dilated attention, similar to (but not the same as) BigBird.
- no experiment longer than 32k
- the scaling curve at least seems not pessimistic
So I’ll wait for v2 which actually scales this.
LongNet: Scaling Transformers to 1,000,000,000 Tokens
Presents LONGNET, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences
abs:
repo:
Allow me a brief moment of not-so-humble brag time?
I had 6 CVPR submissions*, for which:
6/6 I wrote code/ran experiments.
4/6 I'm co-first-author.
6/6 avg(reviews) > borderline.
4/6 accepted.
Pretty happy!
*fine-print: but 3 of them are re-submissions, I'm no super-human :)
If this pans out to work robustly across models and tasks, I think this could be one of those rare huge breakthroughs where in a few years we'll wonder «what took us this long?»
For the first time, we show that the Llama 7B LLM can be trained on a single consumer-grade GPU (RTX 4090) with only 24GB memory. This represents more than 82.5% reduction in memory for storing optimizer states during training.
Training LLMs from scratch currently requires huge…
This shit has to be taught in schools when my kid gets to puberty. This is a whole new genre like cubism, baroque, etc.
(I'm uneducated about art, but love this one - does it already have a name?)
Testing how our Unified Vision Model (UViM) works on the notoriously difficult, AGI-hard, "cow on beach" task.
Of course, no such picture exists: it's completely OOD from the real world. So I had to
#imagen
some.
Then ask UViM to panoptic segment them.
Please read full🧶
I've always been frustrated that, beyond image classification, computer vision is full of complex and task-specific components.
Thus, very excited to share our new work, where we propose a unified modeling approach for vision: .
More in the thread🧵.
Paper writing protip:
Most papers are not read end-to-end. Ain't nobody got time. Write with that in mind. Make sections, figs, tables and their captions as self-contained and "guessable" as reasonably possible.
Example: call your models Foo-M and Foo-S instead of Foo and Foo*
9/9 final thoughts.
- I really like the "trend reversal" of seeing how much can be done with limited compute.
- I am a big fan of the gray text passages for things that were tried but didn't work.
- The lr sched part is fishy, but not super important.
- Impressive bibliography!
You know what's my favourite part with our Gemma release?
That we do not misuse the term "open source" like other labs have. It was explicit in the comms briefing that we should call them "open models" and not "open source models". Much respect to the team.
What are some computer-vision tasks that are actually useful IRL and cannot be done by any of the current-gen LLM chatbots with image input?
Not looking for academic made-up benchmarks or brain-teaser tasks, only things that actually help you do stuff IRL.
Yann is trying to erase history!!
Before luatorch, there was in fact Torch3 (C++) and it had the most legendary author pictures of a software library to date. I’m not making this up:
If Claude 2 turns out to be as strong as GPT-4, thereby breaking the OpenAI monopoly on strong LMing, the number of companies building products on top of LMs will increase substantially.
There's (almost) nothing better on this earth than polishing a fancy matplotlib figure while listening to nice music and having a good (Belgian) beer or cappuccino.
Can't share the current one yet, so here are some past ones that I like, just because. (arxiv links in alt-text.)
This is *exactly* what I had in mind when disliking the term "emergent" recently.
It seems it's due to the metrics (like binary correct/incorrect); in reality, the model does smoothly approach the right answer.
But I was too lazy to verify this intuition myself, glad this paper did!
Are Emergent Abilities of Large Language Models a Mirage?
Presents an explanation in a simple mathematical model, then tests it in three complementary ways: (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with…
@eladgil
@patrickc
In AI at least, the real 30 under 30 imo you have never heard of. They are 5 layers down the org chart from the CEO. They are usually not on Twitter, they have an unmaintained LinkedIn, they don’t go on podcasts, and they maybe published at one point but don’t do so anymore. They…
Our NeurIPS'21 workshop on "ImageNet: past, present, and future" has been accepted!
I'm excited about our speaker line-up. I'm even more excited to see what papers researchers will submit to the workshop!
Please spread the word, and consider submitting.
This is actually the most sensible take I’ve read so far: Sam may have tried starting/running too many other startups on the side, that could become huge on the back of OpenAI, and may not have openly disclosed all of them?
They all make a lot of sense too!
1/4 Did you know bfloat16 stands for Brain Float16 and was invented by Google Brain for stable and fast NN training?
I feel like the rest of the world thinks half-precision training has to be painful, because NVIDIA didn't implement bf16 for the longest time and f16 sucks (loss scaling??).
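You can see why bf16 is so convenient with a tiny sketch: bf16 is essentially the top 16 bits of a float32 (real hardware rounds to nearest rather than truncating, so this is only an approximation):

```python
import struct

def to_bf16(x):
    """Truncate a float32 to its top 16 bits: bf16 keeps the full f32
    exponent range but only ~3 decimal digits of mantissa precision."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bf16(3.14159))  # coarse mantissa: close to, but not exactly, pi
print(to_bf16(1e38))     # still finite; float16 overflows past ~65504
```

Same exponent range as f32 means no loss scaling needed; you just lose mantissa bits, which NN training is remarkably tolerant of.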
I agree. Personally I still like the term "pre-trained models". It's short, clear and to the point.
The "large" part I feel is a current necessity, but not a key property. I think currently it's used to imply "rly good", but in the future we might get equally good small models.
Reminder to everyone starting to publish in ML: "Foundation models" is *not* a recognized ML term; it was coined by Stanford alongside announcing their center named for it, and continues to be pushed by Stanford as *the* term for what we've all generally (reasonably) called "base models".
Most people have absolutely no sense for the insane diversity of things covered in O(billion) web images.
I'm not sure it is meaningful to talk about ood, distribution shift, generalisation, etc. anymore at that scale.
It will take the collective us some time to digest this.
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales?
In a new paper with
@tegmark
we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
So OpenAI is being sued now. Stability and Midjourney are already being sued. Things are getting "interesting".
Is there a website or someone to follow that writes summaries and updates covering these "ai lawsuits"?
I want to closely follow, but also am lazy.
I promised a thread this weekend about OpenAI and the lawsuit I filed against them, and an explanation of what I hope to achieve here. Sorry for the length, but there's a lot going on here.
To begin with, we need to understand what “OpenAI” really is: a poorly constructed scheme…
That’s why we need to get rid of tokenizers and try to use raw input/output, like in vision! ByT5 () and MEGABYTE () make nice first steps, we need more of that.
🔥Hot Take?🧨🧑🚒
LLM alignment starts with biases in token embeddings.
If you can't get that part right, reinforcement learning and/or a few thousand example chats isn't going to help!
We don't cite our tools enough.
I want to "boilerplate cite" all important tools in future papers, they deserve the credit. My candidates:
- numpy
- matplotlib
- jax
- TPUs (XLA?)
- Jupyter (colab)
What are yours? Which do I miss?
PS: I used to do this a bit, but lost habit:
@sama
A big Transformer-style robot taking an image, cutting it into a grid of 16x16 small patches, eating those patches up. Once done, a comicbook-style text bubble is shown, indicating the robot saying the words "mmm, this image was definitely worth 16x16 patches."
See, LLMs don’t magically get skills out of thin air, as some papers suggest.
This is a very nice paper taking a deep dive into one of them (translation skill) and it clearly comes from it being in the data.
I think that’s great and a good motivator for training on everything!
🔎1.4% of PaLM’s training instances are detected as bilingual, while 0.34% contain at least one translated sentence pair. We were able to mine such pairs across all languages studied; therefore, none of these languages is truly zero-shot in the context of translation.
In the same spirit, I keep preaching:
In today’s age, please stop taking the test set as an IID split from the training data. Create large noisy training data (or even none!), but *small, very high-quality* test data.
We currently suffer from benchmarking on low-quality test sets.
Can you reliably evaluate your model with just a handful of test examples? Yes, you often can!
Anchor Points are tiny -- but surprisingly representative -- subsets of benchmarks. They can predict which other points the model will fail on… without evaluating on those points! 🧵
getting a lot of DMs asking how to get into computer vision. i am no expert, i can only share what i did:
1. follow
@giffmana
2. read all of his papers
3. watch recordings of all of his talks on youtube
4. study every tweet he posts for extra alpha
Matting = creating an alpha mask to cutout a foreground object. Think of background effects in video-conf.
ViTMatte shows how to adapt plain, generally pre-trained ViTs to perform SOTA Matting.
I'll walk you through the paper and give context on ViT for detailed outputs:
Anyone else permanently annoyed by the mismatched length of True/False keywords? Even the strings yes/no have mismatched lengths, who the fuck invented English?
I present to you my newest solution to this eternal thorn in the eye: