@Luke_Metz
Luke Metz
3 years
New paper: when to use gradients. DL researchers often compute derivatives through just about everything (physics simulators, optimization procedures, renderers). Sometimes these gradients are useful, other times they are not. We explore why. 1/7
13
158
796

Replies

@Luke_Metz
Luke Metz
3 years
We show that when computing a gradient through an iterative system, we need to compute terms which consist of a product of the state-transition Jacobians. This product is what causes issues. If the Jacobian's eigenvalues are > 1, gradients explode; if < 1, gradients vanish 😱 2/7
Tweet media one
3
1
34
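A minimal sketch (my illustration, not the paper's code) of the effect above, using a one-dimensional linear system s_{t+1} = a * s_t unrolled in JAX: the chain rule multiplies T copies of the per-step Jacobian a, so the gradient with respect to the initial state vanishes for |a| < 1 and explodes for |a| > 1.

```python
# Sketch (not the paper's code): backprop through an unrolled linear system
# s_{t+1} = a * s_t. The chain rule multiplies T copies of the per-step Jacobian a,
# so the gradient vanishes for |a| < 1 and explodes for |a| > 1.
import jax

def rollout_loss(s0, a, T=50):
    s = s0
    for _ in range(T):
        s = a * s            # state transition; Jacobian ds_{t+1}/ds_t = a
    return s ** 2            # terminal loss L = s_T^2

grad_wrt_s0 = jax.grad(rollout_loss)          # dL/ds0 = 2 * a**(2T) * s0

for a in [0.9, 1.0, 1.1]:
    print(f"a = {a}: dL/ds0 = {float(grad_wrt_s0(1.0, a)):.3e}")
```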
@Luke_Metz
Luke Metz
3 years
We demonstrate exploding gradients in physics simulation, molecular dynamics, and learned optimization. In the absence of noise, the loss surface can be highly curved, causing large gradients. While averaging smooths the loss, the gradient variance still grows exponentially. 3/7
Tweet media one
Tweet media two
Tweet media three
1
0
31
@Luke_Metz
Luke Metz
3 years
Looking at the Jacobians of two different initializations, we see that the stable initialization has smaller eigenvalues, and thus more controlled gradients. 4/7
Tweet media one
1
0
18
@Luke_Metz
Luke Metz
3 years
So what to do about this? A few approaches. One is to use truncated backprop through time. While this somewhat works, it’s finicky and less performant than simply training *without* gradients. 5/7
Tweet media one
1
0
20
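One way to write truncated backprop through time is sketched below (my code, not the paper's): a stop_gradient every K steps cuts the chain of per-step Jacobians, so only the last window enters the product, which bounds the gradient at the cost of bias. The logistic map and the constants are just illustrative.

```python
# Sketch of truncated backprop through time (TBPTT) on a chaotic iterative system:
# stop_gradient cuts the chain of per-step Jacobians every K steps, so only the
# last window enters the product -- bounded gradients, added bias.
import jax

def step(s, r):
    return r * s * (1.0 - s)                  # logistic map; chaotic at r ~ 3.9

def loss(r, s0=0.5, T=100, K=None):
    s = s0
    for t in range(T):
        if K is not None and t % K == 0:
            s = jax.lax.stop_gradient(s)      # truncate: drop dependence on earlier states
        s = step(s, r)
    return (s - 0.5) ** 2                     # arbitrary terminal loss

g_full = jax.grad(loss)(3.9)                       # product of ~100 Jacobians: enormous
g_trunc = jax.grad(lambda r: loss(r, K=10))(3.9)   # at most ~10 Jacobians per term: tame
print(f"full BPTT grad:    {float(g_full):.3e}")
print(f"TBPTT (K=10) grad: {float(g_trunc):.3e}")
```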
@Luke_Metz
Luke Metz
3 years
Getting rid of gradients leads to faster training?!? While this seems counterintuitive, consider trying to optimize a wiggly function convolved with a Gaussian. The more wiggly the function, the higher the grad variance. With blackbox/evolution, variance remains constant. 6/7
Tweet media one
2
3
34
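The sketch below (illustrative, not the paper's experiment) makes this concrete for f(x) = sin(k*x) smoothed by Gaussian noise: the pathwise/reparameterized gradient estimator's variance grows with the wiggliness k, while the blackbox/ES (score-function) estimator's variance stays roughly constant.

```python
# Illustrative sketch: gradient estimators for the Gaussian-smoothed objective
# E_eps[f(theta + eps)] with the "wiggly" function f(x) = sin(k*x).
# The pathwise/reparameterized estimator's variance grows with k;
# the ES / score-function estimator's variance stays roughly constant.
import jax
import jax.numpy as jnp

sigma, n, theta = 0.3, 100_000, 0.7
eps = sigma * jax.random.normal(jax.random.PRNGKey(0), (n,))   # shared noise samples

def f(x, k):
    return jnp.sin(k * x)

for k in [1.0, 10.0, 100.0]:
    # pathwise / reparameterized estimator: f'(theta + eps)
    g_rp = jax.vmap(jax.grad(f), in_axes=(0, None))(theta + eps, k)
    # blackbox / evolution-strategies estimator: f(theta + eps) * eps / sigma^2
    g_es = f(theta + eps, k) * eps / sigma**2
    print(f"k={k:6.1f}  var(reparam) = {float(jnp.var(g_rp)):9.3g}"
          f"  var(ES) = {float(jnp.var(g_es)):.3g}")
```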
@Luke_Metz
Luke Metz
3 years
Thanks Daniel Freeman( @bucketofkets ), @sschoenholz and @TalKachman for the great collaboration. This was a really fun project to work on! Key takeaway: take gradients with care. Just because you can backprop doesn’t mean you should! 7/7
1
0
46
@zzznah
Alex Mordvintsev
3 years
@Luke_Metz @sschoenholz What do you think about this paper?
2
0
9
@followML_
FollowML
3 years
Tweet media one
0
0
1
@examachine1
Examachine Mk I 🤖⏫️
3 years
@Luke_Metz Can we please start using proper grammar and titles for papers? :D
0
0
1
@ChrisRackauckas
Dr. Chris Rackauckas
3 years
@Luke_Metz There are alternative differentiation techniques which are stable to the exploding gradients of chaotic systems. Check out this thread which explains how the #julialang @SciML_Org tools include chaos-robust adjoints.
@ChrisRackauckas
Dr. Chris Rackauckas
3 years
4th of July, we're freeing you from chaos. #julialang @SciML_Org is releasing AD capable of handling chaotic systems. Why does standard AD fail? Because you cannot trust the solution! Look at Float32 vs Float64: O(1) difference => O(1) derivative error.
2
68
243
1
2
38
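A quick way to see the precision point in the quoted tweet (my sketch, not from the linked thread): run the same chaotic map in float32 and float64 and watch the trajectories drift apart to an O(1) difference within a few dozen steps, at which point a long-horizon derivative of the rollout is equally untrustworthy.

```python
# Sketch: a chaotic map rolled out in float32 vs float64 drifts to an O(1)
# difference within a few dozen steps.
import jax
jax.config.update("jax_enable_x64", True)     # allow float64 alongside float32
import jax.numpy as jnp

def rollout(s0, r=3.9, T=60):
    s, traj = s0, []
    for _ in range(T):
        s = r * s * (1.0 - s)                 # logistic map, chaotic at r = 3.9
        traj.append(s)
    return jnp.stack(traj)

lo = rollout(jnp.asarray(0.2, dtype=jnp.float32))
hi = rollout(jnp.asarray(0.2, dtype=jnp.float64))
for t in [10, 30, 60]:
    print(f"t={t:2d}  |float32 - float64| = {float(jnp.abs(lo[t-1] - hi[t-1])):.2e}")
```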
@josephdviviano
Joseph Viviano
3 years
@Luke_Metz Thanks for this awesome paper, takes the legs off a lot of assumptions I was taking for granted
0
0
1
@bjornsing
Björn Smedman
3 years
@Luke_Metz Interesting read! But I have a feeling that flat / difficult-to-explore loss landscapes are more of an issue in practice... Do you know any papers that explore that? Like what sort of problems are solvable with overparameterized gradient descent, and which are not?
0
0
1
@connerver
Conner Vercellino
3 years
@Luke_Metz Really cool work! Your blog post on chaos-like behavior in a simple "metalearning" setup is still one of my favorites. Have you seen: ?
0
0
1
@TAndersen_nSCIr
Tom Andersen
3 years
@Luke_Metz @farrwill So it's not all downhill from here?
0
0
2
@DurstewitzLab
DurstewitzLab
3 years
@Luke_Metz Interesting work! Seems closely related to a similar analysis we posted recently: (our analysis is based on Lyapunov exponents; one conclusion is that exploding gradients cannot be avoided in chaotic systems)
0
0
4
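The Lyapunov-exponent view connects directly to the Jacobian products earlier in the thread. As a sketch (mine, not the linked paper's code): the largest Lyapunov exponent of a map is the long-run average of log|ds_{t+1}/ds_t| along a trajectory, and when it is positive (chaos) the product of T Jacobians, and hence the gradient norm, grows like exp(lambda * T).

```python
# Sketch: estimate the largest Lyapunov exponent of the logistic map as the
# trajectory average of log|ds_{t+1}/ds_t|. Positive exponent (chaos) means the
# product of T per-step Jacobians grows like exp(lambda * T).
import jax
import jax.numpy as jnp

def step(s, r):
    return r * s * (1.0 - s)                  # logistic map

def lyapunov(r, s0=0.3, T=2000, burn=200):
    def body(s, _):
        log_jac = jnp.log(jnp.abs(jax.grad(step)(s, r)))   # log |ds_{t+1}/ds_t|
        return step(s, r), log_jac
    _, logs = jax.lax.scan(body, jnp.asarray(s0), None, length=T)
    return jnp.mean(logs[burn:])              # discard the transient, then average

for r in [3.5, 3.9]:                          # periodic regime vs chaotic regime
    lam = float(lyapunov(r))
    print(f"r = {r}: lambda ~ {lam:+.2f}  (gradient norm ~ exp({lam:+.2f} * T))")
```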
@tomamoral
Moreau Thomas
3 years
@Luke_Metz Super cool and useful paper! Thanks for sharing. In a recent preprint w/ @BenoitMX_ML and M. Kowalski, we also study a similar phenomenon in a particular case: the unrolled ISTA. (an updated version of this work will be released soon)
Tweet media one
1
0
2
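For context, unrolled ISTA is itself an iterative system of the kind discussed in the thread: each iteration is a soft-thresholded gradient step on the lasso objective, and backpropagating through the unrolled loop (e.g. to learn the step size) again chains per-iteration Jacobians. A minimal sketch (mine; the linked preprint's setup will differ):

```python
# Sketch of unrolled ISTA: soft-thresholded gradient steps on the lasso objective,
# differentiated end to end through the unrolled loop.
import jax
import jax.numpy as jnp

def soft_threshold(x, tau):
    return jnp.sign(x) * jnp.maximum(jnp.abs(x) - tau, 0.0)

def unrolled_ista(step_size, A, b, lam=0.1, n_iter=50):
    x = jnp.zeros(A.shape[1])
    for _ in range(n_iter):
        grad_smooth = A.T @ (A @ x - b)       # gradient of 0.5 * ||A x - b||^2
        x = soft_threshold(x - step_size * grad_smooth, step_size * lam)
    return x

A = jax.random.normal(jax.random.PRNGKey(0), (20, 10)) / jnp.sqrt(20.0)
b = jax.random.normal(jax.random.PRNGKey(1), (20,))

def loss(step_size):
    x = unrolled_ista(step_size, A, b)
    return 0.5 * jnp.sum((A @ x - b) ** 2)

print(float(jax.grad(loss)(0.2)))             # gradient flows through all 50 iterations
```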
@ludgerpaehler
Ludger Paehler
3 years
@Luke_Metz Interesting paper! Some of the insights do seem to be related to the earlier work of Qiqi Wang et al. on adjoints, i.e. reverse-mode derivatives of chaotic systems, for the purpose of sensitivity analysis, e.g.
1
0
2
@NicolasChapados
Nicolas Chapados
2 years
@Luke_Metz Great paper, and a timely reminder that bad gradients remain as bad a problem as ever! One of the first discussions of this in the ML community was Bengio et al. (1994). Would be a great complement to your existing bibliography.
0
0
0