I have mad respect for Karpathy. But RL agents will not find exploits in physics that give us infinite energy, or anything like that.
As somebody who knows a thing or two about both AI and physics, I am quite certain of this. The so-called standard model of particle physics…
Banning GPU sales to China will slow them down in AI for a few years, but will push them to develop their own GPUs faster. Ironically this political move threatens NVIDIA's global dominance by encouraging a well-funded competitor to invest urgently.
Waaat? You can find eigenvectors of a Hermitian matrix from the eigenvalues of it and its submatrices? New linear algebra fact discovered by physicists researching neutrinos
I didn't quite believe it so I tried it myself: It works.
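For anyone else who wants to try it: here's a minimal numpy sketch of the identity (strictly, it gives the squared magnitude of eigenvector component j, from the eigenvalues of A and of A with row/column j deleted):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # random real symmetric (Hermitian) matrix

lam, V = np.linalg.eigh(A)             # eigenvalues ascending; columns are eigenvectors
i, j = 1, 2                            # check component j of eigenvector i
M = np.delete(np.delete(A, j, axis=0), j, axis=1)   # submatrix: delete row/col j
mu = np.linalg.eigvalsh(M)             # eigenvalues of the submatrix

lhs = abs(V[j, i]) ** 2 * np.prod([lam[i] - lam[k] for k in range(n) if k != i])
rhs = np.prod(lam[i] - mu)
print(np.isclose(lhs, rhs))  # True
```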
Deep Learning replaced linear models because it automated feature engineering by trying thousands of possible nonlinearities and learning which ones worked. Similarly, transformers are replacing task-specific NN architectures by learning how to combine the input signals. 1/
This is pretty cool - AWS announces a public quantum computing (QC) platform named after my grandfather's notation for wave functions. Great that even in its infancy, QC is already getting democratized, not centralized in the hands of a few powerful research groups.
Stochastic Weight Averaging (SWA) is totally magic! It takes me days of training to push my validation score from 2.40 down to 2.30, but just taking the mean of a bunch of checkpoints I had lying around gets me down to 2.07. Thanks
@andrewgwils
and team for figuring this stuff out.
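The core of the trick is just parameter averaging. A toy sketch with hypothetical numpy "checkpoints" (real SWA averages snapshots along an SGD trajectory with a suitable learning-rate schedule):

```python
import numpy as np

def average_checkpoints(state_dicts):
    """Average each parameter across checkpoints -- the heart of SWA."""
    return {name: np.mean([sd[name] for sd in state_dicts], axis=0)
            for name in state_dicts[0]}

# Hypothetical checkpoints saved along one training run:
ckpts = [{"w": np.array([1.0, 2.0])},
         {"w": np.array([3.0, 4.0])},
         {"w": np.array([5.0, 6.0])}]
swa = average_checkpoints(ckpts)
print(swa["w"])  # [3. 4.]
```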
Nice video interview of my grandpa. Content rambles between symmetry, gravity, time -- the deep physics questions of the late 20th century. Not scientifically important, but I think the best video recording of him I've seen.
My thoughts on Google’s
#QuantumSupremacy
claim, which I haven't seen elsewhere. Even though my last name is Dirac, and my grandpa discovered lots of quantum theory, I’m no expert in QC. But I did get A’s in my quantum classes, and understand numerical computing very well. 1/8
Most ML hyper-parameters are naturally log-scaled. If you're doing an automated configuration sweep, don't let your tools linearly search over things like learning rate, weight decay, or embedding dimension.
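A minimal sketch of what log-scaled search means (function name is illustrative):

```python
import numpy as np

def log_uniform(low, high, size, seed=0):
    """Sample uniformly in log space -- the natural prior for scale-like
    hyperparameters such as learning rate or weight decay."""
    rng = np.random.default_rng(seed)
    return np.exp(rng.uniform(np.log(low), np.log(high), size))

lrs = log_uniform(1e-5, 1e-1, size=1000)
# Each decade [1e-5,1e-4), [1e-4,1e-3), ... gets roughly 25% of the samples,
# whereas a linear sweep over [1e-5, 1e-1] would put almost all of them above 1e-4.
```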
Trying to speed up your python programs? Tired of writing "print(time.time() - start_time)"? Check out "timebudget", a new tool I built for very simple profiling in python. Inspired by tqdm's simplicity. Literally just a few lines of code.
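The core idea fits in a context manager. This is just the shape of it, not the library's actual implementation:

```python
import time
from contextlib import contextmanager

@contextmanager
def timebudget(name):
    """Print elapsed wall-clock time for a block of code."""
    start = time.monotonic()
    try:
        yield
    finally:
        print(f"{name} took {(time.monotonic() - start) * 1000:.1f}ms")

with timebudget("napping"):
    time.sleep(0.05)   # prints something like "napping took 50.3ms"
```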
This is currently my favorite quick introduction to the standard model. Does a great job of separating the parts of physics that affect our reality from all the details we understand but that have no bearing on anyone's life except physicists'.
When building Amazon Machine Learning in 2013, I was the only deep learner on the team. I remember in an early planning meeting sharing that I wanted to be able to check my training jobs from my phone. Never happened. Now that I'm using
@weights_biases
I finally can.
When will we see NN chips built specifically for Transformers? (I'm often asked.) With NVIDIA, I'd say we're already there. Look at benchmarks for the new H100 chip - the task it is best at is a Transformer. Seems they're already optimizing their silicon for Transformers. 1/
People still love this talk I gave on how Transformers work and the history of NLP that led to them. Years later people keep finding and watching it. Makes me think I should record more.
LSTM is dead. Long Live Transformers
This is one of the best talks explaining the downsides of Recurrent Networks and diving deep into the Transformer architecture.
Transformer models have dramatically changed NLP in recent years, outperforming previous techniques like LSTM in almost every way. An exception has been that they don’t scale well to large documents because they cost O(N^2) in document length. This paper offers an O(N) solution.
Thrilled to share our new work! "Linformer: Self-attention with Linear Complexity".
We show that self-attention is low rank, and introduce a linear-time transformer that performs on par with traditional transformers.
Check it out here:
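A toy numpy sketch of the idea (shapes and projection matrices here are illustrative, not the paper's code): project the length-n key/value sequences down to k ≪ n before attention, so the score matrix is n×k instead of n×n.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention: compress n keys/values to k << n,
    so cost is O(n*k) rather than O(n^2)."""
    K_proj = E @ K                                  # (k, d)
    V_proj = F @ V                                  # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])    # (n, k) score matrix
    return softmax(scores) @ V_proj                 # (n, d)

n, d, k = 16, 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)
print(out.shape)  # (16, 8)
```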
There are so many good reasons to use on-prem GPU workstations for AI instead of putting everything in the cloud. Cloud is great, of course. But local workstation hardware can be not only cheaper, but also more productive than shared cloud resources.
Cool new AutoML from AWS: SageMaker Autopilot. Unlike other AutoML black boxes which just give you a model, this also gives you working code that you can learn from, adapt, and customize. Great to set baselines and get started. (Disclosure: this was my baby before I left Amazon.)
Previously anybody could buy GPUs, so it was a commercial matter for Chinese companies to compete on, who would need to justify investment for possible return. Now it's the CCP's problem, and they have a massive firehose of money suddenly motivated to fix the problem.
It's typically an order of magnitude more work to write code that's generic for lots of purposes than to write one-off code that just does what you're trying to do right now.
I was recently asked if I'd prefer to use
@PyTorch
or
@TensorFlow
for a small task, with indication they'd prefer TensorFlow. I invoked this tweet to support my choice.
Self-supervised learning (SSL) can seem confusingly magical - without labels, how could it learn a semantically useful representation? It learns from the input data's distribution, plus the augmentations that change the input but which you insist must not change the representation.
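A toy sketch of that invariance objective (the encoder and augmentation here are hypothetical stand-ins, not any particular SSL method):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))                      # stand-in "encoder" weights
encode = lambda x: np.tanh(W @ x)
augment = lambda x: x + 0.05 * rng.standard_normal(x.shape)  # small perturbation

x = rng.standard_normal(8)
z1, z2 = encode(augment(x)), encode(augment(x))      # two views of the same input
# Invariance loss: minimizing this trains the encoder to ignore the
# augmentation while still depending on the input's distribution.
loss = np.sum((z1 - z2) ** 2)
```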
In this sense, every spectrometer in every chemistry lab could be considered a quantum computer that has achieved
#QuantumSupremacy
for decades. But only if you include very specific computational problems that are intrinsically quantum in nature. 8/8
I built this “electronic funhouse mirror” for Halloween to turn people’s faces into animals and monsters using modern deep learning methods in real time. Thanks to great tools from
@nvidia
and
@PyTorch
I could do this in a weekend on my laptop!
I've said it before and I'll say it again: SWA is like magic. I think for many deep learning practitioners this will be the fastest, easiest way to improve your model quality with almost trivial code changes.
Stochastic Weight Averaging (SWA) is a simple procedure that improves generalization in deep learning over Stochastic Gradient Descent (SGD). PyTorch 1.6 now includes SWA natively. Learn more from
@Pavel_Izmailov
,
@andrewgwils
and Vincent:
So it does seem fairly inevitable to me that Transformers will effectively take over deep learning, ML, and AI in coming years. Even if they're not optimal for every task, their generality will lead to a useful standardization of tools, algorithms, and even hardware. 6/
"AI" in 2020 := Any algorithm that is so complex it defies explanation or interpretation, and is capable of both impressive results and embarrassing failures.
Instead of manually engineering the inductive bias by carefully picking which neurons to connect, and which weights to re-use, Transformers connect all the neurons in every combination, and learn which connections to actually use through attention mechanisms. 2/
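For concreteness, this is plain scaled dot-product attention in numpy; the softmax weights are the learned, input-dependent "which connections to actually use":

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every position attends to every
    position, and the softmax weights pick which connections matter."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, w = attention(Q, K, V)
# Each row of w sums to 1: a learned, input-dependent connection pattern.
```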
Playing with
@PyTorch
JIT compiler. Very cool stuff that lets you write your code within standard python, and later compile it to TorchScript for production inference or embedded use. I wrote a simple function decorator to make it easier to try:
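The rough shape of such a decorator (a hypothetical sketch, not the actual code) might be:

```python
def maybe_jit(fn):
    """Hypothetical sketch: compile a function with TorchScript when torch
    is available, otherwise fall back to eager python unchanged."""
    try:
        import torch
        return torch.jit.script(fn)
    except Exception:   # no torch (or scripting failed): run eagerly
        return fn

@maybe_jit
def add_one(x: int) -> int:   # type annotations help TorchScript
    return x + 1
```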
My company Groundlight is coming out of stealth today, offering Computer Vision powered by Natural Language for industrial and commercial applications, integrated from edge to cloud to real-time human monitoring.
In this way Transformers perform something a lot like NAS (Neural Architecture Search) but within a single simple SGD process instead of complex inner and outer optimization loops needing RL techniques or surrogate models. 5/
My kids' room has glow-in-the-dark stars on the ceiling, accurately showing Orion and Taurus. About a year ago during some horseplay, Betelgeuse got knocked down. Now it seems this accident might just turn out to be prophetic.
@hardmaru
Open source licenses. Newer versions of bash use GPLv3 which is a pretty business hostile license. Apple was stuck on a very old version of bash that had GPLv2. Licenses matter.
I wish GitHub issues were more like StackOverflow questions. I see at the top it's closed, but I have to sift through pages and pages of comments to figure out why it was closed. Was it ever fixed? Did the team say they'd never fix it? Was it merged? Auto-closed?
Postgres is adding the ability to query for similar vectors with kNN. All those years at AWS lobbying for a kNN service, seeing so many groups with huge databases of embeddings struggle to deploy them in production.
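The operation itself is simple; the hard part is doing it at scale inside a database. A brute-force numpy sketch of what a kNN vector query computes (names and data here are illustrative):

```python
import numpy as np

def knn(query, vectors, k):
    """Brute-force k-nearest-neighbors by cosine similarity."""
    vn = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    qn = query / np.linalg.norm(query)
    sims = vn @ qn
    return np.argsort(-sims)[:k]          # indices of the k most similar rows

rng = np.random.default_rng(0)
db = rng.standard_normal((100, 16))       # hypothetical embedding table
q = db[42] + 0.01 * rng.standard_normal(16)   # near-duplicate of row 42
print(knn(q, db, k=3)[0])  # 42
```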
However everybody should be fully aware and honest about the fact that QC has exactly zero practical applications today. Top QC researchers are still trying to find anything that today's QCs can do that's actually useful. You wouldn't know this by reading
For me, Google’s
#QuantumSupremacy
claim is similarly unimpressive. They carefully designed a problem that could be solved much faster on QC than on a classical computer. But it’s not an interesting problem that anybody would want to solve in any other setting. 5/8
Many hyperparameter optimization (HPO) runs yield negligible gains. Too often these "gains" are just a lucky sample from the training noise. BUT some big steps can also be seen as HPO. VGG was AlexNet with HPO. EfficientNet largely too. Scaling up often requires careful HPO.
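A quick way to feel the noise problem: simulate trials that are identical except for training randomness, and look at the "best" one.

```python
import numpy as np

# 50 "HPO trials" of the exact same config; only the training noise
# differs. The best trial still looks like a real improvement.
rng = np.random.default_rng(0)
true_skill, noise = 0.900, 0.010
scores = true_skill + noise * rng.standard_normal(50)
print(f"best 'gain': {scores.max() - true_skill:.3f}")  # pure luck
```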
Transformers can learn to apply something a lot like a convolution when needed, or like a recurrent connection if only the previous input is useful to process the next. But critically they can learn much more complex relationships between inputs. 3/
As with linear models -> NNs, with transformers we again have a step up in computational complexity in exchange for less problem-specific analysis, which is almost certainly sub-optimal from a model quality perspective. 4/
Overheard, my 5yo talking to herself: "Live from NPR news, I'm Jack Speer. President Trump said Blah blah-blah blah BLAH." My little self-supervised learner seems to be overfitting.
Reading RL papers with amazing results from
@OpenAI
and
@DeepMindAI
I'm struck by a stylistic difference. One spends most of the paper proving how awesome their result is compared to other techniques. The other goes into great detail explaining how and why their techniques work.
I don't see a fundamental problem using massive compute to advance AI - research means pushing what's possible with today's tech to inform tomorrow. But I fully support the proposal by
@etzioni
and others to publish compute cost and efficiency in papers.
Impressive AutoML paper from my old team at Amazon: An elegant solution to the critical real-world problem in hyperparameter tuning of picking your search ranges. A simple data-driven approach works surprisingly well. Hope this gets into SageMaker soon!
I think it's awesomely hilarious when people assume the lab workspace behind me is a zoom background. Then I walk back into it and play with the robot.
Automatic Domain Randomization (ADR) is one of the key innovations enabling
@OpenAI
's recent impressive Rubik's cube result. Having absorbed the whole paper today (great use of a day!) I'll summarize the ADR algorithm here
#MLTLDR
1/7
I designed SageMaker Autopilot to be AutoML for all skill levels. New users can build a decent model, and look at the generated code for data prep and training to learn good practices. Hand it to expert coworkers for an easy baseline to build upon.
If I wasn't clear, I'm a huge fan of the
@huggingface
transformer package. NLP problems that took dozens of engineer/scientist years of effort just a few years back are now straightforward for motivated individuals. Total sea-change.
Today I ran to every Dick’s Drive-In in Seattle, eating at every one.
#BurgerRally2020
with
@rachelbeda
and others. 5 stops, 18 miles, three milkshakes, three fries, three deluxe and a special. (Minus beef.) Still kinda hungry.
It's very satisfying to finish a careful hyperparameter search and realize that the configuration you'd already hand selected after a bit of experimenting is, in fact, just about optimal.
My mind is kinda blown. Did you know you can just run python+numpy code directly in a browser? On my mac it's 15x faster than plain javascript (but still 40x slower than native CPU). I'm becoming convinced WASM is an important trend.
In 2014, a bot beat the Turing Test by pretending to be a non-native-English speaking teenager who understandably avoided lots of questions. They won, but who cares? We're still a long way from AI capable of convincing conversation. 4/8
@_brohrer_
It really doesn’t help to leave out sensitive features like race or gender. These are highly correlated with many useful behavioral features. And you can’t leave out everything they’re correlated with - there would be nothing left sometimes.
Now that we understand deep learning reasonably well, it means any process or system that can be differentiated through can be directly optimized with neural networks. Making a differentiable physics simulator is a difficult task, but would enable some amazing things.
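A toy illustration of the principle (this is a closed-form "simulator", not a real physics engine): if the simulation is differentiable, gradient methods can optimize directly through it.

```python
import numpy as np

g = 9.8
def sim_range(theta, v=10.0):
    """Toy differentiable simulator: ballistic range vs launch angle."""
    return v ** 2 * np.sin(2 * theta) / g

def d_range(theta, v=10.0):
    """Hand-written derivative (an autodiff system would give this for free)."""
    return 2 * v ** 2 * np.cos(2 * theta) / g

theta = 0.2
for _ in range(200):
    theta += 0.01 * d_range(theta)   # gradient ascent on range
# Converges to the textbook optimum of 45 degrees (pi/4).
print(round(theta, 3))  # 0.785
```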
Breakfast options for the kindergartner: “Would you like corn spheres, oat toroids, wheat manifolds, or corn matrices?”
“Corn matrices please.”
#geekparenting
By analogy, it's possible in theory, but computationally intractable, to discover through quantum simulation that water molecules “vibrate” at 2.45 GHz. But this result is very easy to obtain in a physical lab by measuring the interacting wave functions of actual water. 7/8
I have a feeling this technique could become standard for all computer vision in coming years. Big claim I know. But this seems to elegantly solve a fundamental problem with CNNs.
Neural Networks Are Not Robust Enough: they exploit the association between local patches (e.g., background) and the label. Our
#NeurIPS2019
paper w.
@zacharylipton
fights against this tendency.
Camera-ready:
It also introduced ImageNet-Sketch dataset.
Google's problem involves simulating interacting quantum wave functions. Computational chemists know these calculations are notoriously hard. But ironically (or obviously) physical systems perform these “calculations” almost instantaneously in the real world. 6/8
Classic Google: "Cloud Print needs some engineers to do some boring work to keep it running.
Bueller?
Let's just cancel it. The world doesn't need our expertise for this anymore."
Contrast Amazon's customer obsession against Google's employee obsession.
One of my friends that I used to go to this thing in the desert with became fascinated with a little town in the desert we drove through, got to know some people there, wrote a book about them, then that book became a movie, and people voted it the best movie of the year. Wow.
Yesterday I ran my first marathon. Surprised at how fast I ran - under 4 hours including a wrong turn (extra 1/3 mile) and a couch break for some cake and to snuggle the Blerch. Thanks
@Oatmeal
for organizing a super fun event! Finished 10th out of 131.
I wonder if any countries other than China will be able to contain their covid-19 outbreaks. I wouldn't be surprised if it's only been possible there because of a population that truly believes in collective responsibility, coupled with a very decisive government.
I dropped my phone in the ocean today while SUP’ing. I’m about to go see a friend for dessert. Without a phone. Leaving the house without Internet. First time in … years. Slightly terrified. Send me strength. I can do it.
So what comes next after feature engineering and then NN architecture engineering? I’m guessing “sequence engineering” where we try to figure out the right way to phrase our tasks as questions that Transformers can answer. 7/
I realized how silly it was to have a space heater under my desk in the garage when I have a rack of GPUs in the other corner. Now the GPUs vent right at my desk. Mmmmm, toasty warm training jobs.
I gave a talk this week on methods for quantifying uncertainty in neural nets, and posted the notebooks I used to generate the example data. Fun to compare Deep Ensembles, MC Dropout and GP's on a toy regression problem.
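The Deep Ensembles idea in miniature (polynomial fits standing in for neural nets, bootstrap resampling standing in for random initializations; all names here are illustrative):

```python
import numpy as np

# Fit several models with different randomness, then read the spread of
# their predictions as uncertainty.
x = np.linspace(-1, 1, 20)
y = np.sin(3 * x) + 0.1 * np.random.default_rng(0).standard_normal(x.size)

preds = []
for seed in range(5):
    rng = np.random.default_rng(seed)
    idx = rng.choice(x.size, size=x.size)          # bootstrap resample
    coefs = np.polyfit(x[idx], y[idx], deg=5)
    preds.append(np.polyval(coefs, x))
preds = np.array(preds)

mean, std = preds.mean(axis=0), preds.std(axis=0)  # prediction + uncertainty band
```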
I see this as very similar to when the Turing Test was officially beaten in 2014. While the accomplishment met the original criterion, it missed the spirit of the goal. The Turing Test was a (THE?) key goalpost in AI for decades, but the actual victory seems meaningless now. 3/8
@dhuang26
True! But it's a far cry from exploiting buffer overflows in reality - this is the natural 2024 extension of many decades of fusion research. Dream big. But ground in realism for getting things done. AI won't discover magic, it can help us build things that will seem magical.
"Four years ago! Shocking."
"This aged like fine wine."
"I need that belt."
^^^ three most recent youtube comments on my 2019 video explaining transformers.
My hard-working GPU laptop sounds like it grew an angry floppy drive that's constantly seeking. Poor thing wasn't built to optimize neural nets 24/7. And the warranty just expired. So I'm grabbing the precision screwdrivers and going in! Wish me luck...
I have a number in [-1,1] range I want to stretch toward 0, so I'll raise it to a power like 2 or 3. I'll have to correct the sign, right? Let's check...
>>> -0.5**2
-0.25
Huh - I guess python has some math magic here. Cool.
NOPE!
>>> (-0.5)**2
0.25
Gets me every time. 😂
Finished installing solar panels on our VW camper van (named Beethoven). People often ask if it can drive on solar power. It can't. But if it could, it would probably take about 2 hours of full sun to get enough charge to drive a single mile. 😂
HR tip: Don't schedule interviews for the day before a holiday. Four years ago I interviewed at OpenAI, on the Wed before Thanksgiving. I thought all the interviews went well except the last one of the day, where the interviewers seemed just annoyed from the start. Go figure.
One-off code is especially important in data science, where you often face questions like "does this technique help or even work for us?" The answer is often "No" and the less code you can write to reach that answer the better.
I am super impressed with University of Washington's COVID testing. I developed some symptoms, so yesterday morning I called them at 8am. Got a test appointment for 2 hrs later, and (neg) results posted by 10pm that night. 14hrs from calling for a full PCR test result. So good!
Chewing on this paper showing that a single biological neuron needs a temporal convolutional network (TCN) with at least 5 layers to replicate its behavior. RNNs seem like a much better choice than TCNs - harder to train, but much more biologically plausible.
I love the sentiment here to try to avoid algorithmic bias. But the scientific advice is actually bad and misleading. Excluding sensitive features like gender from your inputs really doesn’t help. Better to identify them and force algorithmic fairness.
LoRA is the fine-tuning algorithm I always wished I had when I was building models in the early days. It's easy to ensure you don't catastrophically forget the original data, and gives you a simple knob to say how important your fine-tuned dataset is vs the original.
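The math behind that knob, in a numpy sketch (shapes and names are illustrative): the frozen pretrained weight W gets a low-rank update (alpha/r)·BA, and alpha scales how much the fine-tuned data matters.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """LoRA sketch: frozen weight W plus low-rank update (alpha/r) * B @ A."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

d_in, d_out, r = 8, 4, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01     # trainable low-rank factor
B = np.zeros((d_out, r))                      # B=0, so fine-tuning starts exactly at W

x = rng.standard_normal(d_in)
# With B zero, the adapted model matches the original pretrained model:
assert np.allclose(lora_forward(x, W, A, B, alpha=16), x @ W.T)
```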