2023-04-05 Differentiation
Contents
2023-04-05 Differentiation#
Last time#
Nonlinear models
Computing derivatives
numeric
analytic by hand
Today#
Project sharing in groups
Computing derivatives in maintainable code
Forward and reverse
Algorithmic (automatic) differentiation
using LinearAlgebra
using Plots
default(linewidth=4, legendfontsize=12)
function vander(x, k=nothing)
if isnothing(k)
k = length(x)
end
m = length(x)
V = ones(m, k)
for j in 2:k
V[:, j] = V[:, j-1] .* x
end
V
end
function vander_chebyshev(x, n=nothing)
if isnothing(n)
n = length(x) # Square by default
end
m = length(x)
T = ones(m, n)
if n > 1
T[:, 2] = x
end
for k in 3:n
#T[:, k] = x .* T[:, k-1]
T[:, k] = 2 * x .* T[:,k-1] - T[:, k-2]
end
T
end
runge(x) = 1 / (1 + 10*x^2)
runge_noisy(x, sigma) = runge.(x) + randn(size(x)) * sigma
function grad_descent(loss, grad, c0; gamma=1e-3, tol=1e-5)
"""Minimize loss(c) via gradient descent with initial guess c0
using learning rate gamma. Declares convergence when gradient
is less than tol or after 500 steps.
"""
c = copy(c0)
chist = [copy(c)]
lhist = [loss(c)]
for it in 1:500
g = grad(c)
c -= gamma * g
push!(chist, copy(c))
push!(lhist, loss(c))
if norm(g) < tol
break
end
end
(c, hcat(chist...), lhist)
end
grad_descent (generic function with 1 method)
Constructive project feedback/questions#
You can start now: look at a few projects in
#projects
on Zulip
Examples#
Explain (in a sentence or two) something you learned:
It was neat to see how interpolation showed up in this graphics application, and how B-splines were used to have both sharp corners and smooth curves in the same representation.
Ask a question about the software, the underlying math, or about how people interact with or are affected by the software.
Give a suggestion for improving or extending the project, or a new perspective by which to understand its context.
Nonlinear models#
Instead of the linear model
We will also need the gradient
Fitting a rational function#
f(x, c) = 1 ./ (c[1] .+ c[2].*x + c[3].*x.^2)
function gradf(x, c)
f2 = f(x, c).^2
[-f2 -f2.*x -f2.*x.^2]
end
function loss(c)
r = f(x, c) - y
0.5 * r' * r
end
function gradient(c)
r = f(x, c) - y
vec(r' * gradf(x, c))
end
gradient (generic function with 1 method)
x = LinRange(-1, 1, 200)
y = runge_noisy(x, .5)
c, _, lhist = grad_descent(loss, gradient, [1., 0, 10.],
gamma=1e-2)
@show c
plot(lhist, yscale=:log10)
c = [0.9114095056721961, 0.008071686190744132, 9.700373675187228]
Compare fits on noisy data#
scatter(x, y)
V = vander_chebyshev(x, 20)
plot!(x -> runge(x), color=:black, label="Runge")
plot!(x, V * (V \ y), label="Chebyshev fit")
plot!(x -> f(x, c), label="Rational fit")
How do we compute those derivatives as the model gets complicated?#
How should we choose \(h\)?
Too big: discretization error dominates (think of truncating the Taylor series)
Too small: rounding error dominates
Automatic step size selection#
Walker and Pernice
Dennis and Schnabel
diff(f, x; h=1e-8) = (f(x+h) - f(x)) / h
function diff_wp(f, x; eps=1e-8)
"""Diff using Walker and Pernice (1998) choice of step"""
h = eps * (1 + abs(x))
(f(x+h) - f(x)) / h
end
x = 1
diff_wp(sin, x) - cos(x)
-2.9698852266335507e-9
Symbolic differentiation#
using Symbolics
@variables x
Dx = Differential(x)
y = sin(x)
Dx(y)
expand_derivatives(Dx(y))
Awesome, what about products?#
y = x
for _ in 1:2
y = cos(y^pi) * log(y)
end
expand_derivatives(Dx(y))
The size of these expressions can grow exponentially
Hand-coding derivatives#
function f(x)
y = x
for _ in 1:2
a = y^pi
b = cos(a)
c = log(y)
y = b * c
end
y
end
f(1.9), diff_wp(f, 1.9)
(-1.5346823414986814, -34.032439961925064)
function df(x, dx)
y = x
dy = dx
for _ in 1:2
a = y ^ pi
da = pi * y^(pi - 1) * dy
b = cos(a)
db = -sin(a) * da
c = log(y)
dc = dy/y
y = b * c
dy = db * c + b * dc
end
y, dy
end
df(1.9, 1)
(-1.5346823414986814, -34.032419599140475)
We can go the other way#
We can differentiate a composition \(h(g(f(x)))\) as
What we’ve done above is called “forward mode”, and amounts to placing the parentheses in the chain rule like
The expression means the same thing if we rearrange the parentheses,
which we can compute with in reverse order via
A reverse mode example#
function g(x)
a = x^pi
b = cos(a)
c = log(x)
y = b * c
y
end
(g(1.4), diff_wp(g, 1.4))
(-0.32484122107701546, -1.2559760384500684)
function gback(x, y_)
a = x^pi
b = cos(a)
c = log(x)
y = b * c
# backward pass
c_ = y_ * b
b_ = c * y_
a_ = -sin(a) * b_
x_ = 1/x * c_ + pi * x^(pi-1) * a_
x_
end
gback(1.4, 1)
-1.2559761698835525
Automatic differentiation#
using Zygote
WARNING: using Zygote.gradient in module Main conflicts with an existing identifier.
Zygote.gradient(f, 1.9)
(-34.03241959914049,)
How does Zygote work?#
square(x) = x^2
@code_llvm square(1.5)
; @ In[15]:1 within `square`
define double @julia_square_11670(double %0) #0 {
top:
; ┌ @ intfuncs.jl:321 within `literal_pow`
; │┌ @ float.jl:385 within `*`
%1
= fmul double %0, %0
; └└
ret double %1
}
@code_llvm Zygote.gradient(square, 1.5)
; @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:95 within `gradient`
define [1 x double] @julia_gradient_11787(double %0) #0 {
top:
; @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:97 within `gradient`
; ┌ @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:45 within `#60`
; │┌ @ In[15]:1 within `Pullback`
; ││┌ @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/chainrules.jl:206 within `ZBack`
; │││┌ @ /home/jed/.julia/packages/Zygote/dABKa/src/lib/number.jl:12 within `literal_pow_pullback`
; ││││┌ @ promotion.jl:389 within `*` @ float.jl:385
%1 = fmul double %0, 2.000000e+00
; └└└└└
; @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:98 within `gradient`
%.fca.0.insert = insertvalue [1 x double] zeroinitializer, double %1, 0
ret [1 x double] %.fca.0.insert
}
Kinds of algorithmic differentation#
Source transformation: Fortran code in, Fortran code out
Duplicates compiler features, usually incomplete language coverage
Produces efficient code
Operator overloading: C++ types
Hard to vectorize
Loops are effectively unrolled/inefficient
Just-in-time compilation: tightly coupled with compiler
JIT lag
Needs dynamic language features (JAX) or tight integration with compiler (Zygote, Enzyme)
Some sharp bits
Forward or reverse?#
It all depends on the shape.
One input, many outputs: use forward mode
“One input” can be looking in one direction
Many inputs, one output: use reverse mode
Will need to traverse execution backwards (“tape”)
Hierarchical checkpointing
About square? Forward is usually a bit more efficient.
Ill-conditioned optimization#
Gradient of \(L\) requires the Jacobian \(J\) of the model \(f\).
We can solve \(g(c) = 0\) using a Newton method
The Hessian requires the second derivative of \(f\), which can cause problems.
Newton-like methods for optimization#
Solve
Update \(c \gets c + \gamma \delta c\)$ using a line search or trust region.
Outlook#
The optimization problem can be solved using a Newton method. It can be onerous to implement the needed derivatives.
The Gauss-Newton method (see activity) is often more practical than Newton while being faster than gradient descent, though it lacks robustness.
The Levenberg-Marquardt method provides a sort of middle-ground between Gauss-Newton and gradient descent.
Many globalization techniques are used for models that possess many local minima.
One pervasive approach is stochastic gradient descent, where small batches (e.g., 1 or 10 or 20) are selected randomly from the corpus of observations (500 in our current example), and a step of gradient descent is applied to that reduced set of observations. This helps to escape saddle points and weak local minima.
Among expressive models \(f(x,c)\), some may converge much more easily than others. Having a good optimization algorithm is essential for nonlinear regression with complicated models, especially those with many parameters \(c\).
Classification is a very similar problem to regression, but the observations \(y\) are discrete, thus
models \(f(x,c)\) must have discrete output
the least squares loss function is not appropriate.