New paper: when to use gradients
DL researchers often compute derivatives through just about everything (physics simulators, optimization procedures, renderers). Sometimes these gradients are useful; other times they are not.
We explore why.
1/7
We show that when computing a gradient through an iterative system, we need to compute terms which consist of a product of the state-transition Jacobians, one per step. This product is what causes issues.
If the Jacobian's eigenvalues have magnitude > 1, gradients explode. If < 1, gradients vanish 😱
2/7
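A minimal JAX sketch of the effect (toy scalar dynamics of my own choosing, not the paper's code): the gradient through an unrolled recurrence contains the product of all T state-transition Jacobians, which here is just a**T.

```python
import jax

def unrolled_loss(a, s0, T=50):
    # Iterate s_{t+1} = a * s_t; the gradient w.r.t. s0 contains the
    # product of T state-transition Jacobians, which here is just a**T.
    s = s0
    for _ in range(T):
        s = a * s
    return s ** 2

grad_s0 = jax.grad(unrolled_loss, argnums=1)
print(grad_s0(1.05, 1.0))  # |a| > 1: gradient grows like a**(2T)
print(grad_s0(0.95, 1.0))  # |a| < 1: gradient decays toward zero
```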
We demonstrate exploding gradients in physics simulation, molecular dynamics, and learned optimization.
In the absence of noise, the loss surface can have high curvature, causing large gradients. While averaging over noise smooths the loss, the gradient variance still grows exponentially with unroll length.
3/7
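As an illustration (a toy sketch of my own using a logistic map, not one of the paper's systems): averaging over random initializations smooths the mean loss, but per-sample gradients remain wild.

```python
import jax
import jax.numpy as jnp

def chaotic_loss(r, key, T=30):
    # Logistic map x_{t+1} = r * x * (1 - x); chaotic for r near 4.
    x = jax.random.uniform(key, minval=0.1, maxval=0.9)
    for _ in range(T):
        x = r * x * (1.0 - x)
    return (x - 0.5) ** 2

keys = jax.random.split(jax.random.PRNGKey(0), 1000)
grads = jax.vmap(jax.grad(chaotic_loss), in_axes=(None, 0))(3.9, keys)
# The mean gradient can look reasonable while the variance is enormous.
print(jnp.mean(grads), jnp.var(grads))
```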
Looking at the Jacobians of two different initializations, we see that the stable initialization has smaller eigenvalues, and thus more controlled gradients.
4/7
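One way to check this kind of thing yourself (a hypothetical toy recurrent step; the weight scales are my own illustration): compute the spectral radius of the state-transition Jacobian at each initialization.

```python
import jax
import jax.numpy as jnp

def step(s, W):
    # One state transition of a toy recurrent system.
    return jnp.tanh(W @ s)

s = jnp.zeros(8)  # evaluate at the fixed point s = 0, where tanh'(0) = 1
for scale in (0.5, 2.0):  # "stable" vs. "unstable" initialization scale
    W = scale * jax.random.normal(jax.random.PRNGKey(1), (8, 8)) / jnp.sqrt(8.0)
    J = jax.jacobian(step)(s, W)  # state-transition Jacobian ds'/ds
    print(scale, jnp.max(jnp.abs(jnp.linalg.eigvals(J))))  # spectral radius ~ scale
```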
So what to do about this? There are a few approaches. One is truncated backprop through time. While this works to some extent, it's finicky and less performant than simply training *without* gradients.
5/7
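For concreteness, a minimal sketch of truncation in JAX (toy scalar dynamics, my own illustration): cutting the gradient chain every k steps bounds the Jacobian product, at the cost of a biased estimate.

```python
import jax
import jax.numpy as jnp

def tbptt_loss(theta, s0, T=50, k=5):
    # Stop the gradient chain every k steps, so backprop never multiplies
    # more than k state-transition Jacobians together (a biased estimate).
    s = s0
    for t in range(T):
        if t % k == 0:
            s = jax.lax.stop_gradient(s)
        s = jnp.tanh(theta * s)
    return s ** 2

print(jax.grad(tbptt_loss)(1.5, 0.3))
```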
Getting rid of gradients leads to faster training?!?
While this seems counterintuitive, consider optimizing a wiggly function convolved with a Gaussian. The wigglier the function, the higher the variance of the backprop gradient estimate.
With a blackbox/evolution estimator, the variance remains constant.
6/7
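A minimal sketch of such a blackbox estimator (antithetic evolution strategies; the objective and parameters are my own toy example): it estimates the gradient of the Gaussian-smoothed loss from function values alone, with no backprop through the function.

```python
import jax
import jax.numpy as jnp

def es_grad(loss_fn, theta, key, sigma=0.1, n=256):
    # Antithetic ES estimate of the gradient of E[loss(theta + sigma * eps)].
    # Uses only loss evaluations; unlike backprop through an unroll, its
    # variance doesn't blow up with wiggliness or unroll length.
    eps = jax.random.normal(key, (n,) + theta.shape)
    deltas = jax.vmap(loss_fn)(theta + sigma * eps) - jax.vmap(loss_fn)(theta - sigma * eps)
    return jnp.mean(deltas[:, None] * eps, axis=0) / (2 * sigma)

wiggly = lambda th: jnp.sum(jnp.sin(20.0 * th) + th ** 2)  # very wiggly objective
print(es_grad(wiggly, jnp.ones(3), jax.random.PRNGKey(0)))
```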
Thanks Daniel Freeman (@bucketofkets), @sschoenholz, and @TalKachman for the great collaboration. This was a really fun project to work on!
Key takeaway: take gradients with care. Just because you can backprop doesn’t mean you should!
7/7
@Luke_Metz There are alternative differentiation techniques which are stable to the exploding gradients of chaotic systems. Check out this thread, which explains how the #julialang @SciML_Org tools include chaos-robust adjoints.
4th of July, we're freeing you from chaos. #julialang @SciML_Org is releasing AD capable of handling chaotic systems.
Why does standard AD fail? Because you cannot trust the solution! Look at Float32 vs Float64: an O(1) difference in the trajectory => an O(1) derivative error.
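To see the precision point concretely, here's a toy Python sketch of my own (plain Euler on the Lorenz system, not the SciML tooling): two trajectories that differ only in floating-point precision end up O(1) apart.

```python
import numpy as np

def lorenz_step(state, dt=0.01):
    # One naive Euler step of the Lorenz system (chaotic parameters).
    x, y, z = state
    dxdt = np.array([10.0 * (y - x), x * (28.0 - z) - y, x * y - (8.0 / 3.0) * z],
                    dtype=state.dtype)
    return state + dt * dxdt

s32 = np.array([1.0, 1.0, 1.0], dtype=np.float32)
s64 = np.array([1.0, 1.0, 1.0], dtype=np.float64)
for _ in range(5000):
    s32, s64 = lorenz_step(s32), lorenz_step(s64)
# Rounding differences grow at the Lyapunov rate until they are O(1).
print(np.abs(s64 - s32))
```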
@Luke_Metz Interesting read! But I have a feeling that flat / difficult-to-explore loss landscapes are more of an issue in practice... Do you know any papers that explore that? Like what sorts of problems are solvable with overparameterized gradient descent, and which are not?
@Luke_Metz Interesting work! Seems closely related to a similar analysis we posted recently:
(our analysis is based on Lyapunov exponents; one conclusion is that exploding gradients cannot be avoided in chaotic systems)
@Luke_Metz Super cool and useful paper! Thanks for sharing.
In a recent preprint w/ @BenoitMX_ML and M. Kowalski, we also study a similar phenomenon in a particular case: the unrolled ISTA.
(an updated version of this work will be released soon)
@Luke_Metz Interesting paper! Some of the insights seem closely related to the earlier work of Qiqi Wang et al. on adjoints, i.e. reverse-mode derivatives, of chaotic systems for the purpose of sensitivity analysis, e.g.
@Luke_Metz Great paper, and a timely reminder that bad gradients remain as bad a problem as ever! One of the first discussions of this in the ML community was Bengio et al. (1994). It would be a great complement to your existing bibliography.