Lucas Beyer (bl16)

@giffmana

Followers
56,300
Following
447
Media
939
Statuses
11,041

Researcher (Google DeepMind/Brain in Zürich, ex-RWTH Aachen), Gamer, Hacker, Belgian. Mostly gave up trying Mastodon as lb@sigmoid.social

Zürich, Switzerland
Joined December 2013
Pinned Tweet
@giffmana
Lucas Beyer (bl16)
2 years
My Transformer tutorial slides are now available at I'll append recordings to this thread as I get them. If you want to use some of the slides for your lecture, you may, as long as you credit me. If you'd like me to give the lecture: maybe; e-mail me.
Tweet media one
@giffmana
Lucas Beyer (bl16)
2 years
Giving a lecture introducing the Transformer architecture in all gory details at @M2lSchool tomorrow. Also got permission to publish slides and will share recording if/when I get one. It's a pretty cool set of slides, largely thanks to @_basilM for inspiration!
11
14
324
48
502
3K
@giffmana
Lucas Beyer (bl16)
1 year
How good of a BERT can one get in ONE DAY on ONE GPU? With all the recent studies about scaling compute up, this paper takes a refreshing turn and does a deep dive into scaling down compute. It's well written and chock-full of insights. Here is my summary and my opinions. 🧶 1/N
Tweet media one
47
638
3K
@giffmana
Lucas Beyer (bl16)
9 months
What makes CLIP work? The contrast with negatives via softmax? The more negatives, the better -> large batch-size? We'll answer "no" to both in our ICCV oral🤓 By introducing SigLIP, a simpler CLIP that also works better and is more scalable, we can study the extremes. Hop in🧶
Tweet media one
27
290
2K
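For readers who want the gist of the SigLIP loss mentioned above, here is a minimal sketch (my code, not the authors'): each image-text pair becomes an independent binary classification with a sigmoid, instead of one softmax over the whole batch. Embeddings are assumed L2-normalized; the temperature and bias are learnable in the real model and fixed constants here.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss sketch (not the authors' code).

    img_emb, txt_emb: (n, d) L2-normalized embeddings; row i of each
    is a matched image-text pair. t and b are learnable in the real
    model; fixed here for brevity.
    """
    logits = img_emb @ txt_emb.T * t + b          # (n, n) pair similarities
    labels = 2.0 * np.eye(len(img_emb)) - 1.0     # +1 on diagonal, -1 off
    # Every pair is an independent binary problem: -log sigmoid(l * z).
    # No batch-wide softmax normalizer -- that is the point.
    return np.mean(np.logaddexp(0.0, -labels * logits))
```

Because each pair is scored on its own, the loss no longer couples every example in the batch the way the softmax does, which is what lets the paper probe batch-size extremes.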
@giffmana
Lucas Beyer (bl16)
10 months
Ah yes, the well-known "except you, FAANG" clause that's so common in *open source* licenses like GPL, MIT, BSD, Apache2, ... Here I go again, this can't be for real lol
Tweet media one
@ylecun
Yann LeCun
10 months
This is huge: Llama-v2 is open source, with a license that authorizes commercial use! This is going to change the landscape of the LLM market. Llama-v2 is available on Microsoft Azure and will be available on AWS, Hugging Face and other providers Pretrained and fine-tuned…
428
4K
16K
62
160
2K
@giffmana
Lucas Beyer (bl16)
1 year
lol, draft of full GPT-4 paper with architecture and data details is already leaked on torrent😂 The vision component in the architecture is an interesting twist to plain ViT, and scaled up quite a bit! Link to the torrent for the curious:
62
194
1K
@giffmana
Lucas Beyer (bl16)
2 years
This quote from @demishassabis is my favourite take on the "engineering vs science" debates in AI yet: AI is an engineering science: unlike in natural sciences, the phenomenon you're studying doesn't exist in nature, so you have to build it first, and then you can study.
24
142
1K
@giffmana
Lucas Beyer (bl16)
2 years
1/N The return of patch-based self-supervision! It never worked well and you had to bend over backwards with ResNets (I tried). Now with ViT, very simple patch-based self-supervised pre-training rocks! First BeIT, now Masked AutoEncoders i1k=87.8% 🧶
Tweet media one
16
230
1K
@giffmana
Lucas Beyer (bl16)
3 months
You're welcome, OpenAI. I'll share my home address in DM if you want to send us flowers and chocolate. Actually, fun fact: one of the runners-up for ViT's name was "ToP", meaning "Transformer on Patches". However, we ditched it because "the ToP model" was kinda borderline.
Tweet media one
29
74
1K
@giffmana
Lucas Beyer (bl16)
8 months
Oh my, code editors could be so much more beautiful! Below are two different ways to display the exact same code, taking up the same space: standard way first, and a beautiful mock-up second. I love the idea and style:
Tweet media one
117
67
804
@giffmana
Lucas Beyer (bl16)
1 year
Ilya Sutskever unambiguously confirming what we all knew but just wanted to hear admitted: OpenAI's current closing up is for competitive reasons, not because of safety concerns
28
75
707
@giffmana
Lucas Beyer (bl16)
25 days
This is exactly what I hate with all big frameworks. TF is terrible. PyTorch used to be straightforward but turned terrible too. Torch7 was very direct. JAX/Flax still ok, but I pray every day that it doesn’t end up with the same fate over time.
Tweet media one
@karpathy
Andrej Karpathy
25 days
Have you ever wanted to train LLMs in pure C without 245MB of PyTorch and 107MB of cPython? No? Well now you can! With llm.c: To start, implements GPT-2 training on CPU/fp32 in only ~1,000 lines of clean code. It compiles and runs instantly, and exactly…
306
2K
13K
27
44
715
@giffmana
Lucas Beyer (bl16)
2 years
Holy cow, I consider myself an advanced matplotlib user, and I've never seen this before. So good. I should reconsider, and consider myself a noob again :)
@leifdenby
Leif Denby
2 years
`plt.subplot_mosaic(...)` is the single-most amazing @matplotlib function I'd never heard of 😍🤓🌍 Can't believe I've used Python for more than a decade and only just discovered it! Subplots will never be the same again 🌟
Tweet media one
49
517
4K
9
56
673
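For anyone who, like the quoted tweet, hasn't met it: a minimal `plt.subplot_mosaic` example (the layout string here is made up). Each letter in the string names an axes, and repeating a letter makes that panel span cells.

```python
import matplotlib.pyplot as plt

# "A" spans a 2x2 block; "B" and "C" are single cells on the right.
fig, axs = plt.subplot_mosaic(
    """
    AAB
    AAC
    """
)
axs["A"].plot([0, 1, 2], [0, 1, 4])   # big 2x2 panel
axs["B"].hist([1, 2, 2, 3, 3, 3])     # top-right
axs["C"].imshow([[0, 1], [1, 0]])     # bottom-right
plt.show()
```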
@giffmana
Lucas Beyer (bl16)
6 months
Jeff Dean facts are no joke. If you don’t know: it’s the CS equivalent of Chuck Norris facts, for example:
Tweet media one
15
62
665
@giffmana
Lucas Beyer (bl16)
3 months
Friends telling me his last two yolo-runs were only given 256 and 64 GPUs, respectively, so he was like "fuck that shit".
@karpathy
Andrej Karpathy
3 months
Hi everyone yes, I left OpenAI yesterday. First of all nothing "happened" and it’s not a result of any particular event, issue or drama (but please keep the conspiracy theories coming as they are highly entertaining :)). Actually, being at OpenAI over the last ~year has been…
2K
1K
23K
17
17
604
@giffmana
Lucas Beyer (bl16)
11 months
Who killed non-contrastive image-text pretraining? @AlecRad and @_jongwook_kim with the below Fig2 in CLIP. Who collected the 7 Dragonballs and asked Shenron to resurrect it? Yours truly, in this new paper of ours. Generative captioning is not only competitive, it seems better!
Tweet media one
Tweet media two
Tweet media three
17
93
582
@giffmana
Lucas Beyer (bl16)
3 years
Alright folks, after a full workday of discussion with lots of nuance and zero work done, we eventually solved everything. Alexey and @MarioLucic_ suggested this solution, which I'm stealing and collecting the likes.
Tweet media one
44
41
563
@giffmana
Lucas Beyer (bl16)
3 years
So you think you know distillation; it's easy, right? We thought so too with @XiaohuaZhai @__kolesnikov__ @_arohan_ and the amazing @royaleerieme and Larisa Markeeva. Until we didn't. But now we do again. Hop on for a ride (+the best ever ResNet50?) 🧵👇
Tweet media one
8
114
567
@giffmana
Lucas Beyer (bl16)
6 months
Here's what our (sub)team in Zürich has done for OSS vision over the past 5y, besides inventing ViT: 1) Make i21k a thing Release: 2) best CLIP (siglip) by a large margin 3) best i1k ResNet50 ever 4) best pre-trained ResNets 5) >55k ViTs 6) Most efficient JAX/TPU CV code deets👇
11
56
554
@giffmana
Lucas Beyer (bl16)
3 months
It can't be repeated enough: learning-rate is the single most bang-for-buck thing you can tune. If you think you know *ze best* learning-rate, it just means you only train standard stuff! This is not a "secret trick" either; it's stated very clearly in THE deep-learning book:
Tweet media one
@borisdayma
Boris Dayma 🖍️
3 months
On the importance of tuning at least learning rate for your experiments!!! Here I just multiplied it by 10x and see the difference! For most of my models, the same value is close to optimal so initially I got lazy but for this captioning model, a much larger value was needed.
Tweet media one
6
6
126
19
46
550
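To make the advice concrete, a toy sweep (my example, nothing from the quoted thread): full-batch gradient descent on a least-squares problem, over a log-spaced learning-rate grid, which is the usual way to tune it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
w_true = rng.normal(size=8)
y = X @ w_true

def final_loss(lr, steps=200):
    """Full-batch gradient descent on least squares; loss after `steps` updates."""
    w = np.zeros(8)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)
        w -= lr * grad
    return float(np.mean((X @ w - y) ** 2))

for lr in np.logspace(-4, 0, num=9):   # 1e-4 ... 1e0, log-spaced
    print(f"lr={lr:.0e}  loss={final_loss(lr):.3g}")
```

On real models the pattern is the same: loss versus learning rate is roughly U-shaped on a log axis, and the optimum moves when the architecture, batch size, or data changes.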
@giffmana
Lucas Beyer (bl16)
10 months
Looks like the gzip paper I was enthusiastic about over-estimated its scores because of a bug in the code: it did top-2 kNN instead of k=2. We should remember this as (yet another) strong case for testing in ML code. I still like that it put a new idea in my toolbox.
@amilios
Aristides
10 months
As much as I wanted the gzip-beats-BERT to be true, it doesn't seem like it is:
22
125
706
12
62
539
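For the record, my reading of the reported bug, sketched with hypothetical variable names (see the linked thread for the actual code):

```python
from collections import Counter

def knn_predict(neighbor_labels, k=2):
    """Proper kNN: majority vote among the k nearest neighbors' labels."""
    return Counter(neighbor_labels[:k]).most_common(1)[0][0]

def buggy_is_correct(neighbor_labels, true_label, k=2):
    """The reported bug, roughly: count a hit if the true label shows up
    ANYWHERE in the top k. That's top-k accuracy, not kNN accuracy, and
    it inflates the score whenever the top-k labels disagree."""
    return true_label in neighbor_labels[:k]
```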
@giffmana
Lucas Beyer (bl16)
2 years
Want to turn any vision backbone into an image-text model? Want to show the age-old "your model wouldn't recognize a cow on the beach" is a red herring? That's LiT🔥 (Locked-image Tuning), a new alternative to fine-tuning that combines the best of fine-tuning and zero-shot 1/n🧶
Tweet media one
6
96
523
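The core LiT recipe in one screenful, as a hedged PyTorch sketch: lock a pre-trained image tower, train only the text tower contrastively. The towers here are stand-in linear layers, the temperature is a constant, and only one direction of the contrastive loss is shown; the real setup follows the paper.

```python
import torch
import torch.nn as nn

image_tower = nn.Linear(2048, 512)   # stand-in for a pre-trained backbone
text_tower = nn.Linear(768, 512)     # trained from scratch

# LiT: lock the image tower, tune only the text side.
for p in image_tower.parameters():
    p.requires_grad = False
image_tower.eval()

optimizer = torch.optim.AdamW(text_tower.parameters(), lr=1e-4)

img = torch.randn(4, 2048)
txt = torch.randn(4, 768)
with torch.no_grad():                      # image features stay frozen
    zi = nn.functional.normalize(image_tower(img), dim=-1)
zt = nn.functional.normalize(text_tower(txt), dim=-1)
logits = zi @ zt.T / 0.07                  # contrastive logits
loss = nn.functional.cross_entropy(logits, torch.arange(4))
loss.backward()
optimizer.step()
```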
@giffmana
Lucas Beyer (bl16)
2 years
It's about time: analog clock reading in the wild. A great example of an applied vision paper; let me walk you through why I like it. 🧶 They also make good use of Spatial Transformer Networks (STN), one of the most elegant ideas that usually don't work :)
Tweet media one
8
97
518
@giffmana
Lucas Beyer (bl16)
2 months
Holy fucking fuckshit, you can't be serious!? There goes my precious coding Friday, ffs.
Tweet media one
43
15
516
@giffmana
Lucas Beyer (bl16)
1 year
Especially for computer vision folks: beware the LayerNorm pitfall! How LN is used in CNNs is actually different from how it's used in Transformers (including ViT). Figure below from the paper by @YueCao72324941, Zhuliang Yao, Yutong Lin, et al.
Tweet media one
@francoisfleuret
François Fleuret
1 year
Tweet media one
13
81
515
16
88
504
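A concrete PyTorch illustration of the pitfall (my example): "LayerNorm" on an (N, C, H, W) feature map can mean normalizing over all of (C, H, W) per sample, or over channels only per spatial position as in Transformers/ViT, and the two are different ops.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 8, 8)  # (N, C, H, W) feature map

# CNN-style "LayerNorm": one mean/var per sample over ALL of (C, H, W).
ln_cnn = nn.LayerNorm([64, 8, 8])
y_cnn = ln_cnn(x)

# Transformer/ViT-style: per position, normalize over channels only.
ln_vit = nn.LayerNorm(64)
y_vit = ln_vit(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print((y_cnn - y_vit).abs().max())  # clearly nonzero: not the same op
```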
@giffmana
Lucas Beyer (bl16)
8 months
This is wrong on multiple levels, ugh! 1. Don't get pressured into "not wasting your talent" bullshit. Just do whatever lets you enjoy life. 2. AI PhD != good founder or early engineer. Not by a mile. 3. Life with a proper salary is great. 4. There's cool research at BigCo.
@MattNiessner
Matthias Niessner
8 months
PhD graduates in AI mostly take boring jobs at big tech companies due to short-term monetary incentives. While understandable to some degree, it's also quite sad to see so many great researchers 'disappear' and give up their talent - join or do your own startup instead!
47
46
573
18
9
491
@giffmana
Lucas Beyer (bl16)
5 months
Not at NeurIPS and missing out on so much, but... nothing beats lone heads-down coding in my favourite spot.
Tweet media one
20
5
472
@giffmana
Lucas Beyer (bl16)
2 years
At CVPR I've been asked a few times already what I'd recommend to researchers in small labs without big compute. Here's a thread with my answers:
Tweet media one
5
84
460
@giffmana
Lucas Beyer (bl16)
6 months
Is there a template factory somewhere that I don't know of? And can I get this shit off my timeline?
Tweet media one
Tweet media two
58
14
447
@giffmana
Lucas Beyer (bl16)
7 months
🧶PaLI-3 achieves SOTA across many vision-language (and video!) tasks while being 10x smaller than its predecessor PaLI-X. At only 5B parameters, it's also smaller (and stronger) than the concurrent Fuyu-8B model, though sadly we cannot release the model (props to @AdeptAILabs )
Tweet media one
9
58
446
@giffmana
Lucas Beyer (bl16)
20 days
It's Saturday. My 3yo is napping right now. Once he wakes up, I'll go fire up some H100's and help him code some of the easy ideas I have in the back of my mind. We might do it just in time for NeurIPS. Gotta start early and completely abuse my privilege, or so I heard🚀🚀🚀
14
10
438
@giffmana
Lucas Beyer (bl16)
3 years
This paper is all about large-scale pre-training of DL models. It completely lacks any mention of our work on this exact topic over the last >2 years. One of these is literally called "big transfer". Do I need to write a Schmidhuber-like blogpost about all our group's work now?
@ethanCaballero
Ethan Caballero is busy
3 years
Stanford's ~entire AI Department has just released a 200 page 100 author Neural Scaling Laws Manifesto. They're pivoting to positioning themselves as #1 at academic ML Scaling (e.g. GPT-4) research. "On the Opportunities and Risks of Foundation Models"
Tweet media one
17
394
2K
16
49
433
@giffmana
Lucas Beyer (bl16)
26 days
I recently overheard something like "Aren't Transformers standard in Vision now?" I wasn't sure. So this weekend, whenever the kid was asleep, I scraped, parsed, analyzed ALL CVF proceedings of the last decade to find out! Methodology, code, request for feedback in thread.
Tweet media one
16
38
435
@giffmana
Lucas Beyer (bl16)
2 months
GPT-5 live-testing its Q* DotA mode (internally known as QpenAI-5), just one of many capabilities that I've heard emerge at ultramassive scale. Still some work left to do though.
@SirActionSlacks
SirActionSlacks
2 months
so just played a game where we encountered an actual AI learning program as a teammate. dude last picks invoker, walks top and has midas queued up. He nails sunstrikes, but plays super weird. We decide to chat an AI shutdown code and he stops moving ??
32
51
654
19
39
431
@giffmana
Lucas Beyer (bl16)
3 months
I really had to travel to HQ for a week to convince everyone they should just add ai to whatever they're doing. It's been an uphill battle, but they are slowly starting to!
Tweet media one
26
10
425
@giffmana
Lucas Beyer (bl16)
7 months
ICLR submissions are online: Looks like there's: - ~700 with diffusion in it, - less than 100 with nerf, - ~900 LLM - ~100 chatgpt (8 bard, 16 claude) - vs ~170 llama (yay) - ~200 clip (but not "clipping") - ~200 NLP - ~750 vision(!?)
Tweet media one
17
59
419
@giffmana
Lucas Beyer (bl16)
1 year
Did you know we can use scaling laws not only to predict the optimal nparams, but actually the optimal model "shape"? (depth, width, MLP size) Now you know! With this, we get a 400M-param plain ViT to 90.3 on ImageNet, matching ViT-g on many benchmarks. Read more below or in our paper:
Tweet media one
@ibomohsin
Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن
1 year
Excited to share our work on optimizing vision transformers. We advance scaling laws to infer compute-optimal model shapes, achieving better results with smaller models, eg. 90.3% in ImageNet with 400M params. This surpasses the much larger ViT-g! abs:
Tweet media one
1
39
234
9
64
401
@giffmana
Lucas Beyer (bl16)
7 months
Pleased to announce we are releasing checkpoints for our SigLIP models! These are very strong image-text ViTs. We release them along with a colab to play around with. Most are english, but we also release a good i18n one. Sorry, no magnet link mic drop. More in thread🧶
14
67
398
@giffmana
Lucas Beyer (bl16)
2 years
1/N After 3.8 wonderful years at Google Brain in Zürich, I have decided it is time for me to embark on a new adventure. I'm thankful for all the amazing colleagues I've met so far, and hope to stay in touch and maybe even collaborate in the future. My ambitious next venture:
11
10
394
@giffmana
Lucas Beyer (bl16)
2 months
Every single paper where Nicholas Carlini is a lead is a banger. A responsible, grown-up, banger.
Tweet media one
@arankomatsuzaki
Aran Komatsuzaki
2 months
Google presents: Stealing Part of a Production Language Model - Extracts the projection matrix of OpenAI’s ada and babbage LMs for <$20 - Confirms that their hidden dim is 1024 and 2048, respectively - Also recovers the exact hidden dim size of gpt-3.5-turbo…
Tweet media one
16
151
971
12
32
397
@giffmana
Lucas Beyer (bl16)
2 months
YES. Thanks Andrej. To this date still, way Way WAY too many people doing DL are way Way WAY too careless. I think each small DL team needs at least two people who are obsessed with detail. But the team shouldn't be composed of solely such people either, or it'll go nowhere.
@karpathy
Andrej Karpathy
2 months
Beautiful work / attention to detail trying to get Gemma to finetune correctly. There are so many foot guns here to be super careful with. All of these issues don't throw any errors, they silently make your network worse. A great example of what I wrote about in my "A Recipe for…
88
317
3K
6
27
380
@giffmana
Lucas Beyer (bl16)
2 years
Video generation is the one thing I've actually been a long-term pessimist on. But with this, and the recent paper that generated 1h consistent (but low q) videos w/ diffusion, I may have to change my mind. Maybe if a big lab jumps on it, we'll get seriously impressed next year
@_akhaliq
AK
2 years
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers github:
49
362
1K
9
40
375
@giffmana
Lucas Beyer (bl16)
1 year
My colleagues managed to *actually* learn a generic optimizer. What was impressive to me is that with absolutely zero tuning, on the tasks I tried it, it matched our heavily tuned existing setup!
@jaschasd
Jascha Sohl-Dickstein
1 year
If there is one thing the deep learning revolution has taught us, it's that neural nets will outperform hand-designed heuristics, given enough compute and data. But we still use hand-designed heuristics to train our models. Let's replace our optimizers with trained neural nets!
Tweet media one
25
135
905
3
36
371
@giffmana
Lucas Beyer (bl16)
3 years
I have to admit, this is a pretty cool move, jealous I didn't think of doing that first! (and the DINO paper itself is cool too, of course)
Tweet media one
7
21
368
@giffmana
Lucas Beyer (bl16)
2 months
People are jumping on this as something special, meanwhile I'm just sitting here thinking «someone slid a few examples like that into the probably very large SFT/IT/FLAN/RLHF/... dataset and thought "this will be neat" as simple as that» Am I over simplifying? 🫣
@alexalbert__
Alex Albert
2 months
Fun story from our internal testing on Claude 3 Opus. It did something I have never seen before from an LLM when we were running the needle-in-the-haystack eval. For background, this tests a model’s recall ability by inserting a target sentence (the "needle") into a corpus of…
Tweet media one
589
2K
12K
35
21
369
@giffmana
Lucas Beyer (bl16)
10 months
@_aidan_clark_ Your OpenAI onboarding material is leaking ;-)
5
1
360
@giffmana
Lucas Beyer (bl16)
1 year
Beyond classification in vision, it always feels weird to optimize for a loss which doesn't _really_ match how we'll use the model later on*, but happens to be differentiable. In our latest work, we tackle this discrepancy🧶 *unless the model is 100% perfect, which it never is.
@__kolesnikov__
Alexander Kolesnikov
1 year
Vision meets RL! We reveal that policy gradient can be used for tuning vision models to optimize complex metrics, such as mAP, PQ or “color diversity”, observing large performance boosts on tasks like object detection, panoptic segmentation, etc.
Tweet media one
4
131
642
12
48
360
@giffmana
Lucas Beyer (bl16)
2 months
Seeing many simple Q's re Grok, let me answer w/o inside knowledge: 1. 😐benchmarks: a) raw model b) trained for interaction, not benchmarks. 2. Why tanh(30) attn? Avoid exploding logits. 3. gelu approx? Default in jax, most efficient. 4. 340b useless? not made for u. cont/
13
16
357
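On point 2, the "tanh(30)" trick as I read it: soft-capping attention logits so they can never exceed ±30. A sketch, not Grok's actual code:

```python
import numpy as np

def soft_cap(logits, cap=30.0):
    """Soft-cap logits to (-cap, cap) to avoid blow-ups.

    Near zero this is ~identity (tanh(z) ~ z for small z), so ordinary
    logits pass through untouched; huge logits saturate smoothly at the
    cap instead of exploding.
    """
    return cap * np.tanh(logits / cap)

print(soft_cap(np.array([0.5, 10.0, 1000.0])))  # ~[0.5, 9.6, 30.0]
```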
@giffmana
Lucas Beyer (bl16)
2 years
The ConvNeXt paper is rightfully getting some attention: it's good work and has beautiful plots. But, Fig1 needs a little correction IMO. They compare heavily aug/reg swin+convnext to plain ViT. We fixed this in which is what should always be compared to.
Tweet media one
@_akhaliq
AK
2 years
A ConvNet for the 2020s abs: github: Constructed entirely from standard ConvNet modules, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation
Tweet media one
11
210
977
4
44
349
@giffmana
Lucas Beyer (bl16)
2 years
Looks like they released some code and pre-trained models today: even with a pretty sweet colab.
@giffmana
Lucas Beyer (bl16)
2 years
1/N The return of patch-based self-supervision! It never worked well and you had to bend over backwards with ResNets (I tried). Now with ViT, very simple patch-based self-supervised pre-training rocks! First BeIT, now Masked AutoEncoders i1k=87.8% 🧶
Tweet media one
16
230
1K
4
58
345
@giffmana
Lucas Beyer (bl16)
6 months
Bullish on Google Meet
13
13
343
@giffmana
Lucas Beyer (bl16)
1 year
It happened! Today I saw a video tutorial on TikTok explaining how to make a trading bot with the help of chatGPT, which created a strategy that provides an insane 42000% profit. AI is amazing. I took tomorrow off to implement this, and then I'll see you on my yacht, suckers!
20
9
335
@giffmana
Lucas Beyer (bl16)
2 years
And I bet even this will not be enough to kill the "dumb scaling won't make the model understand cow on beach! We need XYZ" argument forever.
Tweet media one
28
18
328
@giffmana
Lucas Beyer (bl16)
2 months
I’m not a SF kinda guy, but _finally_ it looks like we did something for Gemini that’s actually cool, as opposed to some SVP copypasta blogpost.
@BenHolfeld
Ben Holfeld
2 months
Google Gemini 1.5 Hackathon with the Google Founder at AGI House
49
146
1K
5
15
336
@giffmana
Lucas Beyer (bl16)
3 years
If you haven't read our latest ImageNet SOTA work "Vision Transformers (ViT)" yet, shame on you. But! There's hope! Here's the corresponding blogpost which is a nice tl;dr:
Tweet media one
7
69
331
@giffmana
Lucas Beyer (bl16)
2 years
Disagree. As soon as you throw sparsity in (and depthwise/tiny-group conv is a form of sparsity) FLOPs detach from reality. That's why sparse nets are hard (), and EffNetV2 actually UNDOES a lot of depthwise. EffNetV1 == MobileNetV3 == designed for CPU.
@cHHillee
Horace He
2 years
Some interesting discussion on r/machinelearning about EfficientNet and CNN efficiency. TBH, I think FLOPS as a measurement of models sometimes gets a bad rap. It has its downsides, but it's one of the harder metrics to "game".
Tweet media one
3
9
65
27
26
306
@giffmana
Lucas Beyer (bl16)
5 months
Wow! FAIR was home to the best computer vision researchers. But over the last couple of years, one by one, they left. It's now a shell of its former self. This is not a hate post: I like and admire them. But I wonder what went wrong. I'd love to buy a book that tells their story.
27
6
321
@giffmana
Lucas Beyer (bl16)
2 months
whichever one of you guys wrote this, thanks, you made my week😊
Tweet media one
3
13
323
@giffmana
Lucas Beyer (bl16)
26 days
Soooo language folks have rediscovered that generating and ensembling multiple predictions at test time helps. In vision, it's multi-crop eval; we know it works, but collectively decided to stop reporting it. but… somehow… now we call it agents? Did I get this right?
29
14
317
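For younger readers, multi-crop eval in a nutshell (a sketch; `model` returning class probabilities and `crops` as view functions are both hypothetical):

```python
import numpy as np

def multi_crop_predict(model, image, crops):
    """Classic test-time ensembling: run the model on several crops/flips
    of one image and average the class probabilities before the argmax."""
    probs = np.mean([model(crop(image)) for crop in crops], axis=0)
    return int(np.argmax(probs))
```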
@giffmana
Lucas Beyer (bl16)
2 years
Now anyone can download and play with a TRILLION-parameter language model! I'm obviously biased, but happy that Google Brain is showing some good leadership in the right direction for science here, by allowing them to release the model, no-nonsense. PS: I haven't tried it yet!
@LiamFedus
William Fedus
2 years
Today we're releasing all Switch Transformer models in T5X/JAX, including the 1.6T param Switch-C and the 395B param Switch-XXL models. Pleased to have these open-sourced! All thanks to the efforts of James Lee-Thorp, @ada_rob , and @hwchung27
19
208
1K
10
24
312
@giffmana
Lucas Beyer (bl16)
1 month
The domain literally does not matter: sports, art, engineering, music, carpentry, ... I just love watching (or reading about) people at the absolute top of their game. Even without knowing much about the domain, it's often easy to tell who's in a league of their own.
@mippl3
myq
1 month
This explains how the xz backdoor was found
47
2K
14K
9
22
313
@giffmana
Lucas Beyer (bl16)
1 year
This comment I screenshotted below is a really on-point description of the current vibe I feel. I share the commenter's fear. Though it's not all lost yet! For instance, I'm involved in 4 NeurIPS submissions and had 2 ICCV ones. Really hoping our openness keeps going like this💪
Tweet media one
Tweet media two
@togelius
Julian Togelius
1 year
I think this post channels what a lot of people in the AI community feels right now. As if the stark hypocrisy wasn't enough, there is now also the blatant gatekeeping attempts. I've started thinking worse of people who choose to still work for them.
25
87
574
12
48
305
@giffmana
Lucas Beyer (bl16)
11 months
Most recent large transformer decoders use this trick of having multiple heads for the queries, but only one for the keys/values. I always thought it was a small, not-well-documented trick of the trade. But no, there's a nice paper about "multi-query attention", of course by Noam.
@arankomatsuzaki
Aran Komatsuzaki
11 months
When all you need is one write-head, not a co-author or many citations
Tweet media one
0
7
80
8
43
308
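The whole trick in a few lines of numpy (my sketch of multi-query attention; single sequence, no masking): many query heads, but one shared key/value projection, so the KV cache at decode time shrinks by the number of heads.

```python
import numpy as np

def multi_query_attention(x, wq, wk, wv):
    """Sketch of multi-query attention.

    wq: (h, d, dk) -- one projection per query head.
    wk, wv: (d, dk) -- a SINGLE shared key/value head; that's the trick.
    """
    q = np.einsum('td,hdk->htk', x, wq)          # (h, t, dk)
    k = np.einsum('td,dk->tk', x, wk)            # (t, dk), shared
    v = np.einsum('td,dk->tk', x, wv)            # (t, dk), shared
    att = np.einsum('htk,sk->hts', q, k) / np.sqrt(k.shape[-1])
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)            # softmax over keys
    return np.einsum('hts,sk->htk', att, v)      # (h, t, dk)

t, d, h, dk = 5, 16, 4, 8
rng = np.random.default_rng(0)
out = multi_query_attention(rng.normal(size=(t, d)),
                            rng.normal(size=(h, d, dk)),
                            rng.normal(size=(d, dk)),
                            rng.normal(size=(d, dk)))
print(out.shape)  # (4, 5, 8)
```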
@giffmana
Lucas Beyer (bl16)
3 years
Our Mixer has been cooking for a while! We present a novel architecture composed of only MLPs. Look 'ma, no conv, no attention at all. And it works as well as the best ResNets and ViTs. Major props to Ilya Tolstikhin and @neilhoulsby who led this investigation.
@_akhaliq
AK
3 years
MLP-Mixer: An all-MLP Architecture for Vision pdf: abs: MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs)
Tweet media one
9
110
537
18
63
303
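The architecture really is that simple; here is my minimal sketch of one Mixer block (LayerNorms and biases omitted for brevity, weight names mine):

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer MLP with (tanh-approximated) GELU on the last axis."""
    h = x @ w1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2

def mixer_block(x, token_w1, token_w2, chan_w1, chan_w2):
    """One Mixer block on x: (tokens, channels).

    No convolutions, no attention: all cross-token communication is the
    transpose + MLP on the first line. (LayerNorms omitted.)
    """
    x = x + mlp(x.T, token_w1, token_w2).T   # mix across tokens
    x = x + mlp(x, chan_w1, chan_w2)         # mix across channels
    return x

rng = np.random.default_rng(0)
t, c, hdim = 196, 512, 1024
x = rng.normal(size=(t, c))
y = mixer_block(x,
                rng.normal(size=(t, hdim)) * 0.02, rng.normal(size=(hdim, t)) * 0.02,
                rng.normal(size=(c, hdim)) * 0.02, rng.normal(size=(hdim, c)) * 0.02)
print(y.shape)  # (196, 512)
```

Token mixing is just the same MLP applied to the transposed table, so all cross-patch communication happens through plain matrix multiplies.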
@giffmana
Lucas Beyer (bl16)
2 years
#Parti : A photo of an astronaut riding a horse in the forest. There is a river in front of them with water lilies. I like this visually very clear effect of the benefits of scale:
Tweet media one
@_arohan_
rohan anil
2 years
A new image generation model just dropped. Great work by the team! + Auto-regressive, encoder->decoder Transformer + Classifier-free sampling. + ViT-VQGAN Really amazing results: Image from the website.
Tweet media one
13
105
482
8
36
298
@giffmana
Lucas Beyer (bl16)
1 year
Meanwhile, in central/west EU (east idk) as a PhD student you get: - a standard wage you can live off just fine, no roommates needed. - not just your own desk, but often a 2-person office (!) - in return, you have to teach ~half the time & grade exams I recommend doing PhD here.
26
16
294
@giffmana
Lucas Beyer (bl16)
10 months
LongNet/1B seqlen. Saving you the click: - it uses hierarchical dilated attention, similar to (but not the same as) BigBird. - no experiment longer than 32k - the scaling curves at least seem not pessimistic So I'll wait for v2, which actually scales this.
Tweet media one
@arankomatsuzaki
Aran Komatsuzaki
10 months
LongNet: Scaling Transformers to 1,000,000,000 Tokens Presents LONGNET, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences abs: repo:
Tweet media one
31
286
1K
7
35
290
@giffmana
Lucas Beyer (bl16)
2 years
Allow me a brief moment of not-so-humble brag time? I had 6 CVPR submissions*, for which: 6/6 I wrote code/ran experiments. 4/6 I'm co-first-author. 6/6 avg(reviews) > borderline. 4/6 accepted. Pretty happy! *fine-print: but 3 of them are re-submissions, I'm no super-human :)
15
3
293
@giffmana
Lucas Beyer (bl16)
2 months
If this pans out to work robustly across models and tasks, I think this could be one of the rare huge breakthroughs where in a few years we'll wonder «what took us this long?»
@AnimaAnandkumar
Prof. Anima Anandkumar
2 months
For the first time, we show that the Llama 7B LLM can be trained on a single consumer-grade GPU (RTX 4090) with only 24GB memory. This represents more than 82.5% reduction in memory for storing optimizer states during training. Training LLMs from scratch currently requires huge…
48
391
2K
8
16
290
@giffmana
Lucas Beyer (bl16)
8 months
This shit has to be taught in schools when my kid gets to puberty. This is a whole new genre like cubism, baroque, etc. (I'm uneducated about art, but love this one - does it already have a name?)
@AndrewCurran_
Andrew Curran
8 months
This one is so much better than the spiral for me. Probably some weird brain thing.
Tweet media one
71
362
5K
17
4
290
@giffmana
Lucas Beyer (bl16)
2 years
Testing how our Unified Vision Model (UViM) works on the notoriously difficult, AGI-hard, "cow on beach" task. Of course, no such picture exists: it's completely OOD from the real world. So I had to #imagen some. Then ask UViM to panoptic segment them. Please read full🧶
Tweet media one
@__kolesnikov__
Alexander Kolesnikov
2 years
I've always been frustrated that, beyond image classification, computer vision is full of complex and task-specific components. Thus, very excited to share our new work, where we propose a unified modeling approach for vision: . More in the thread🧵.
Tweet media one
6
127
602
7
32
290
@giffmana
Lucas Beyer (bl16)
11 months
Paper writing protip: Most papers are not read end-to-end. Ain't nobody got time. Write with that in mind. Make sections, figs, tables and their captions as self-contained and "guessable" as reasonably possible. Example: call your models Foo-M and Foo-S instead of Foo and Foo*
8
18
285
@giffmana
Lucas Beyer (bl16)
1 year
9/9 final thoughts. - I really like the "trend reversal" of seeing how much can be done with limited compute. - I am a big fan of the gray text passages for things that were tried but didn't work. - The lr sched part is fishy, but not super important. - Impressive bibliography!
4
8
279
@giffmana
Lucas Beyer (bl16)
2 years
How it started How it's going
Tweet media one
Tweet media two
6
17
272
@giffmana
Lucas Beyer (bl16)
2 months
You know what's my favourite part of our Gemma release? That we do not misuse the term "open source" like other labs have. It was explicit in the comms briefing that we should call them "open models" and not "open source models". Much respect to the team.
Tweet media one
10
19
277
@giffmana
Lucas Beyer (bl16)
2 months
What are some computer-vision tasks that are actually useful IRL and cannot be done by any of the current-gen LLM chatbots with image input? Not looking for academic made-up benchmarks or brain-teaser tasks, only things that actually help you do stuff IRL.
108
23
275
@giffmana
Lucas Beyer (bl16)
5 months
Yann is trying to erase history!! Before luatorch, there was in fact Torch3 (C++) and it had the most legendary author pictures of a software library to date. I’m not making this up:
Tweet media one
@ylecun
Yann LeCun
5 months
@JosephJacks_ @ai_for_success @teoliphant @gvanrossum Actually, the history is PyTorch (FB) <- LuaTorch (NEC+NYU) <- Torch5 (IDIAP) <- Lush (AT&T) <- SN3 (Léon Bottou+Patrice Simard+me) <- SN (Léon & me)
3
1
47
9
16
268
@giffmana
Lucas Beyer (bl16)
10 months
Can I just say that the x-axis on this plot is an actual masterpiece? I would love to know who at @AnthropicAI did that.
@OfirPress
Ofir Press 🖋
10 months
If Claude 2 turns out to be as strong as GPT-4, thereby breaking the OpenAI monopoly on strong LMing, the number of companies building products on top of LMs will increase substantially.
Tweet media one
15
18
199
11
10
265
@giffmana
Lucas Beyer (bl16)
8 months
Perfect fit regarding the recent "Oh my, AI PhDs should join/create startups!!"
@tautologer
tautologer
8 months
Linus Torvalds on his net worth. pretty based tbh
Tweet media one
25
520
6K
6
8
264
@giffmana
Lucas Beyer (bl16)
1 year
There's (almost) nothing better on this earth than polishing a fancy matplotlib figure while listening to nice music and having a good (Belgian) beer or cappuccino. Can't share the current one yet, so here are some past ones that I like, just because. (arxiv links in alt-text.)
Tweet media one
Tweet media two
Tweet media three
Tweet media four
16
8
260
@giffmana
Lucas Beyer (bl16)
1 year
This is *exactly* what I had in mind when disliking the term "emergent" recently. It seems due to the metrics (like binary correct/incorrect), in reality the model does smoothly approach the right answer. But I was too lazy to verify this intuition myself, glad this paper did!
Tweet media one
Tweet media two
@_akhaliq
AK
1 year
Are Emergent Abilities of Large Language Models a Mirage? present explanation in a simple mathematical model, then test it in three complementary ways: (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with…
Tweet media one
19
142
654
8
32
255
@giffmana
Lucas Beyer (bl16)
3 months
OpenAI recruiters 👀 on this thread lol Well played, Andrej!
@karpathy
Andrej Karpathy
3 months
@eladgil @patrickc In AI at least, the real 30 under 30 imo you have never heard of. They are 5 layers down the org chart from the CEO. They are usually not on Twitter, they have an unmaintained LinkedIn, they don’t go on podcasts, and they maybe published at one point but don’t do so anymore. They…
141
448
5K
6
3
255
@giffmana
Lucas Beyer (bl16)
3 years
Our NeurIPS'21 workshop on "ImageNet: past, present, and future" has been accepted! I'm excited about our speaker line-up. I'm even more excited to see what papers researchers will submit to the workshop! Please spread the word, and consider submitting.
Tweet media one
2
44
255
@giffmana
Lucas Beyer (bl16)
6 months
This is actually the most sensible take I've read so far: Sam may have tried starting/running too many other startups on the side that could become huge on the back of OpenAI, and may not have openly disclosed all of them? They all make a lot of sense too!
@cto_junior
TDM (e/λ)
6 months
Hmm, so board hates Sam cause he wants to secure the company's future by reducing dependence on hardware providers
Tweet media one
36
32
615
16
19
251
@giffmana
Lucas Beyer (bl16)
6 months
1/4 Did you know bfloat16 stands for Brain Float16 and was invented by Google Brain for stable and fast NN training? I feel like the rest of the world thinks half-precision training has to be painful, because NVIDIA took forever to implement bf16 and f16 sucks (loss scaling??).
Yeah I knew
592
Huh, today I learned
868
Who cares, shut up Lucas
267
14
28
250
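The practical difference in a few lines (my example, using JAX's bfloat16 dtype): bf16 keeps float32's 8 exponent bits, so it trades precision for range rather than losing range.

```python
import numpy as np
import jax.numpy as jnp

# float16 has 5 exponent bits (max ~65504): big values overflow,
# hence loss scaling. bfloat16 keeps float32's 8 exponent bits
# (max ~3.4e38) and pays with a coarser mantissa.
x = 70000.0
print(np.float16(x))    # inf   -- out of float16's range
print(jnp.bfloat16(x))  # 70144 -- imprecise but representable
```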
@giffmana
Lucas Beyer (bl16)
1 year
I have never felt this seen
@daisyldixon
Daisy Dixon
2 years
Tweet media one
63
1K
15K
5
15
248
@giffmana
Lucas Beyer (bl16)
2 years
I agree. Personally I still like the term "pre-trained models". It's short, clear and to the point. The "large" part I feel is a current necessity, but not a key property. I think currently it's used to imply "rly good", but in the future we might get equally good small models.
@mmitchell_ai
MMitchell
2 years
Reminder to everyone starting to publish in ML: "Foundation models" is *not* a recognized ML term; was coined by Stanford alongside announcing their center named for it; continues to be pushed by Sford as *the* term for what we've all generally (reasonably) called "base models".
35
66
467
18
21
249
@giffmana
Lucas Beyer (bl16)
2 years
Most people have absolutely no sense for the insane diversity of things covered in O(billion) web images. I'm not sure it is meaningful to talk about ood, distribution shift, generalisation, etc. anymore at that scale. It will take the collective us some time to digest this.
@hardmaru
hardmaru
2 years
Whenever I think #Dalle is being creative, I also think of all the weird pics posted on the Internet that could have been in the training set.
6
9
179
18
19
241
@giffmana
Lucas Beyer (bl16)
7 months
> next token prediction cannot lead to learning a world model. It's dumb. It's "just" stats. Next token prediction:
@wesg52
Wes Gurnee
7 months
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales? In a new paper with @tegmark we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
183
1K
6K
11
15
238
@giffmana
Lucas Beyer (bl16)
1 year
So OpenAI is being sued now. Stability and Midjourney are already being sued. Things are getting "interesting". Is there a website or someone to follow that writes summaries and updates covering these "ai lawsuits"? I want to closely follow, but also am lazy.
@short_straw
The Short Straw
1 year
I promised a thread this weekend about OpenAI and the lawsuit I filed against them, and an explanation of what I hope to achieve here. Sorry for the length, but there's a lot going on here. To begin with, we need to understand what “OpenAI” really is: a poorly constructed scheme…
Tweet media one
90
848
2K
27
15
235
@giffmana
Lucas Beyer (bl16)
7 months
That’s why we need to get rid of tokenizers and try to use raw inoutput, like in vision! ByT5 () and MEGABYTE () make nice first steps, we need more of that.
@alexjc
Alex J. Champandard 🌱
7 months
🔥Hot Take?🧨🧑‍🚒 LLM alignment starts with biases in token embeddings. If you can't get that part right, reinforcement learning and/or a few thousand example chats isn't going to help!
3
6
48
10
21
235
@giffmana
Lucas Beyer (bl16)
2 years
We don't cite our tools enough. I want to "boilerplate cite" all important tools in future papers, they deserve the credit. My candidates: - numpy - matplotlib - jax - TPUs (XLA?) - Jupyter (colab) What are yours? Which do I miss? PS: I used to do this a bit, but lost habit:
Tweet media one
Tweet media two
Tweet media three
22
11
232
@giffmana
Lucas Beyer (bl16)
3 months
@sama A big Transformer-style robot taking an image, cutting it into a grid of 16x16 small patches, eating those patches up. Once done, a comicbook-style text bubble is shown, indicating the robot saying the words "mmm, this image was definitely worth 16x16 patches."
10
4
232
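Since the prompt is basically the ViT paper title acted out: cutting an image into 16x16 patches is a two-reshape job. A sketch (shapes assume H and W divisible by the patch size):

```python
import numpy as np

def patchify(image, p=16):
    """Cut an (H, W, C) image into non-overlapping p x p patches,
    flattened one per row, as ViT feeds them to the Transformer."""
    h, w, c = image.shape
    x = image.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, p, p, c)
    return x.reshape(-1, p * p * c)             # one row per patch

print(patchify(np.zeros((224, 224, 3))).shape)  # (196, 768)
```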
@giffmana
Lucas Beyer (bl16)
1 year
See, LLMs don’t magically get skills out of thin air, as some papers suggest. This is a very nice paper taking a deep dive into one of them (translation skill) and it clearly comes from it being in the data. I think that’s great and a good motivator for training on everything!
@ebriakou
Eleftheria Briakou
1 year
🔎1.4% of PALM’s training instances are detected as bilingual, while 0.34% contain at least one translated sentence pair. We were able to mine such pairs across all languages studied; therefore, none of these languages is truly zero-shot in the context of translation.
Tweet media one
Tweet media two
2
16
155
6
28
230
@giffmana
Lucas Beyer (bl16)
1 month
Let me translate this list: "has a PhD (or equivalent) and has done at least one nontrivial half-year project fully end to end"
@lorgiusti
Lorenzo Giusti
1 month
A friend sent this asking if they're looking for a researcher or an entire AI lab. Does anyone have all this stuff with a good work/life balance?
Tweet media one
35
16
209
8
19
231
@giffmana
Lucas Beyer (bl16)
1 year
PS: This thread took me almost as long as a paper review. Looks like I procrastinate my CVPR reviews by making twitter paper reviews instead ¯\_(ツ)_/¯
21
4
225
@giffmana
Lucas Beyer (bl16)
8 months
In the same spirit, I keep preaching: In today's age, please stop taking the test set as an IID split of the training data. Create large noisy training data (or even none!), but *small, very high quality* test data. We currently suffer from benchmarking on low-quality test sets.
@RajanVivek52643
Rajan Vivek
8 months
Can you reliably evaluate your model with just a handful of test examples? Yes, you often can! Anchor Points are tiny -- but surprisingly representative -- subsets of benchmarks. They can predict which other points the model will fail on… without evaluating on those points! 🧵
Tweet media one
5
37
261
5
27
225
@giffmana
Lucas Beyer (bl16)
3 months
so much pressure omg
@vikhyatk
vik
3 months
getting a lot of DMs asking how to get into computer vision. i am no expert, i can only share what i did: 1. follow @giffmana 2. read all of his papers 3. watch recordings of all of his talks on youtube 4. study every tweet he posts for extra alpha
8
9
279
11
2
227
@giffmana
Lucas Beyer (bl16)
11 months
Matting = creating an alpha mask to cut out a foreground object. Think of background effects in video-conf. ViTMatte shows how to adapt plain, generally pre-trained ViTs to perform SOTA Matting. I'll walk you through the paper and give context on ViT for detailed outputs:
Tweet media one
3
31
224
@giffmana
Lucas Beyer (bl16)
3 months
Anyone else permanently annoyed by the mismatched lengths of the True/False keywords? Even the strings yes/no have mismatched lengths; who the fuck invented English? I present to you my newest solution to this eternal thorn in my side:
Tweet media one
60
4
222