New paper: when to use gradients
DL researchers often compute derivatives through just about everything (physics simulators, optimization procedures, renderers). Sometimes these gradients are useful; other times they are not.
We explore why.
1/7
We show that when computing a gradient through an iterative system, we need to compute terms which consist of a product of the state-transition Jacobians, one per step. This product is what causes issues.
If the Jacobian's eigenvalues have magnitude > 1, gradients explode. If < 1, gradients vanish 😱
2/7
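A minimal JAX sketch of the effect (toy scalar dynamics of my own choosing, not the paper's code): the gradient through an unrolled recurrence contains the product of all T state-transition Jacobians, which here is just a**T.

```python
import jax

def unrolled_loss(a, s0, T=50):
    # Iterate s_{t+1} = a * s_t; the gradient w.r.t. s0 contains the
    # product of T state-transition Jacobians, which here is just a**T.
    s = s0
    for _ in range(T):
        s = a * s
    return s ** 2

grad_s0 = jax.grad(unrolled_loss, argnums=1)
print(grad_s0(1.05, 1.0))  # |a| > 1: gradient grows like a**(2T)
print(grad_s0(0.95, 1.0))  # |a| < 1: gradient decays toward zero
```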
We demonstrate exploding gradients in physics simulation, molecular dynamics, and learned optimization.
In the absence of noise, the loss surface can have high curvature, causing large gradients. While averaging over noise smooths the loss, the gradient variance still grows exponentially with unroll length.
3/7
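As an illustration (a toy sketch of my own using a logistic map, not one of the paper's systems): averaging over random initializations smooths the mean loss, but per-sample gradients remain wild.

```python
import jax
import jax.numpy as jnp

def chaotic_loss(r, key, T=30):
    # Logistic map x_{t+1} = r * x * (1 - x); chaotic for r near 4.
    x = jax.random.uniform(key, minval=0.1, maxval=0.9)
    for _ in range(T):
        x = r * x * (1.0 - x)
    return (x - 0.5) ** 2

keys = jax.random.split(jax.random.PRNGKey(0), 1000)
grads = jax.vmap(jax.grad(chaotic_loss), in_axes=(None, 0))(3.9, keys)
# The mean gradient can look reasonable while the variance is enormous.
print(jnp.mean(grads), jnp.var(grads))
```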
Looking at the Jacobians of two different initializations, we see that the stable initialization has smaller eigenvalues, and thus more controlled gradients.
4/7
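One way to check this kind of thing yourself (a hypothetical toy recurrent step; the weight scales are my own illustration): compute the spectral radius of the state-transition Jacobian at each initialization.

```python
import jax
import jax.numpy as jnp

def step(s, W):
    # One state transition of a toy recurrent system.
    return jnp.tanh(W @ s)

s = jnp.zeros(8)  # evaluate at the fixed point s = 0, where tanh'(0) = 1
for scale in (0.5, 2.0):  # "stable" vs. "unstable" initialization scale
    W = scale * jax.random.normal(jax.random.PRNGKey(1), (8, 8)) / jnp.sqrt(8.0)
    J = jax.jacobian(step)(s, W)  # state-transition Jacobian ds'/ds
    print(scale, jnp.max(jnp.abs(jnp.linalg.eigvals(J))))  # spectral radius ~ scale
```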
So what to do about this? There are a few approaches. One is truncated backprop through time. While this works to some extent, it's finicky and less performant than simply training *without* gradients.
5/7
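For concreteness, a minimal sketch of truncation in JAX (toy scalar dynamics, my own illustration): cutting the gradient chain every k steps bounds the Jacobian product, at the cost of a biased estimate.

```python
import jax
import jax.numpy as jnp

def tbptt_loss(theta, s0, T=50, k=5):
    # Stop the gradient chain every k steps, so backprop never multiplies
    # more than k state-transition Jacobians together (a biased estimate).
    s = s0
    for t in range(T):
        if t % k == 0:
            s = jax.lax.stop_gradient(s)
        s = jnp.tanh(theta * s)
    return s ** 2

print(jax.grad(tbptt_loss)(1.5, 0.3))
```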
Getting rid of gradients leads to faster training?!?
While this seems counterintuitive, consider optimizing a wiggly function convolved with a Gaussian. The wigglier the function, the higher the variance of the backprop gradient estimate.
With a blackbox/evolution estimator, the variance remains constant.
6/7
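A minimal sketch of such a blackbox estimator (antithetic evolution strategies; the objective and parameters are my own toy example): it estimates the gradient of the Gaussian-smoothed loss from function values alone, with no backprop through the function.

```python
import jax
import jax.numpy as jnp

def es_grad(loss_fn, theta, key, sigma=0.1, n=256):
    # Antithetic ES estimate of the gradient of E[loss(theta + sigma * eps)].
    # Uses only loss evaluations; unlike backprop through an unroll, its
    # variance doesn't blow up with wiggliness or unroll length.
    eps = jax.random.normal(key, (n,) + theta.shape)
    deltas = jax.vmap(loss_fn)(theta + sigma * eps) - jax.vmap(loss_fn)(theta - sigma * eps)
    return jnp.mean(deltas[:, None] * eps, axis=0) / (2 * sigma)

wiggly = lambda th: jnp.sum(jnp.sin(20.0 * th) + th ** 2)  # very wiggly objective
print(es_grad(wiggly, jnp.ones(3), jax.random.PRNGKey(0)))
```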
Thanks Daniel Freeman (@bucketofkets), @sschoenholz, and @TalKachman for the great collaboration. This was a really fun project to work on!
Key takeaway: take gradients with care. Just because you can backprop doesn’t mean you should!
7/7
@Luke_Metz There are alternative differentiation techniques which are stable to the exploding gradients of chaotic systems. Check out this thread, which explains how the #julialang @SciML_Org tools include chaos-robust adjoints.
4th of July, we're freeing you from chaos. #julialang @SciML_Org is releasing AD capable of handling chaotic systems.
Why does standard AD fail? Because you cannot trust the solution! Look at Float32 vs Float64: an O(1) difference in the trajectory => an O(1) derivative error.
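To see the precision point concretely, here's a toy Python sketch of my own (plain Euler on the Lorenz system, not the SciML tooling): two trajectories that differ only in floating-point precision end up O(1) apart.

```python
import numpy as np

def lorenz_step(state, dt=0.01):
    # One naive Euler step of the Lorenz system (chaotic parameters).
    x, y, z = state
    dxdt = np.array([10.0 * (y - x), x * (28.0 - z) - y, x * y - (8.0 / 3.0) * z],
                    dtype=state.dtype)
    return state + dt * dxdt

s32 = np.array([1.0, 1.0, 1.0], dtype=np.float32)
s64 = np.array([1.0, 1.0, 1.0], dtype=np.float64)
for _ in range(5000):
    s32, s64 = lorenz_step(s32), lorenz_step(s64)
# Rounding differences grow at the Lyapunov rate until they are O(1).
print(np.abs(s64 - s32))
```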
@Luke_Metz Interesting read! But I have a feeling that flat / difficult-to-explore loss landscapes are more of an issue in practice... Do you know any papers that explore that? Like what sorts of problems are solvable with overparameterized gradient descent, and which are not?
@Luke_Metz Interesting work! Seems closely related to a similar analysis we posted recently:
(our analysis is based on Lyapunov exponents; one conclusion is that exploding gradients cannot be avoided in chaotic systems)
@Luke_Metz Super cool and useful paper! Thanks for sharing.
In a recent preprint w/ @BenoitMX_ML and M. Kowalski, we also study a similar phenomenon in a particular case: the unrolled ISTA.
(an updated version of this work will be released soon)
@Luke_Metz Interesting paper! Some of the insights seem closely related to the earlier work of Qiqi Wang et al. on adjoints, i.e. reverse-mode derivatives, of chaotic systems for the purpose of sensitivity analysis, e.g.
@Luke_Metz Great paper, and a timely reminder that bad gradients remain as bad a problem as ever! One of the first discussions of this in the ML community was Bengio et al. (1994). It would be a great complement to your existing bibliography.