2022-04-08 Optimization

Last time

  • Project feedback

  • Computing derivatives in maintainable code

  • Forward and reverse

  • Algorithmic (automatic) differentiation

  • Projects

Today

  • Reflect on differentiation

  • Second order (Newton type) optimization

  • Project discussion

using LinearAlgebra
using Plots
default(linewidth=4, legendfontsize=12)

function vander(x, k=nothing)
    if isnothing(k)
        k = length(x)
    end
    m = length(x)
    V = ones(m, k)
    for j in 2:k
        V[:, j] = V[:, j-1] .* x
    end
    V
end

function vander_chebyshev(x, n=nothing)
    if isnothing(n)
        n = length(x) # Square by default
    end
    m = length(x)
    T = ones(m, n)
    if n > 1
        T[:, 2] = x
    end
    for k in 3:n
        #T[:, k] = x .* T[:, k-1]
        T[:, k] = 2 * x .* T[:,k-1] - T[:, k-2]
    end
    T
end

runge(x) = 1 / (1 + 10*x^2)
runge_noisy(x, sigma) = runge.(x) + randn(size(x)) * sigma

function grad_descent(loss, grad, c0; gamma=1e-3, tol=1e-5)
    """Minimize loss(c) via gradient descent with initial guess c0
    using learning rate gamma.  Declares convergence when gradient
    is less than tol or after 500 steps.
    """
    c = copy(c0)
    chist = [copy(c)]
    lhist = [loss(c)]
    for it in 1:500
        g = grad(c)
        c -= gamma * g
        push!(chist, copy(c))
        push!(lhist, loss(c))
        if norm(g) < tol
            break
        end
    end
    (c, hcat(chist...), lhist)
end

function diff_wp(f, x; eps=1e-8)
    """Diff using Walker and Pernice (1998) choice of step"""
    h = eps * (1 + abs(x))
    (f(x+h) - f(x)) / h
end
diff_wp (generic function with 1 method)

Hand-coding derivatives

\[ df = f'(x) dx \]
function f(x)
    y = x
    for _ in 1:2
        a = y^pi
        b = cos(a)
        c = log(y)
        y = b * c
    end
    y
end 

f(1.9), diff_wp(f, 1.9)
(-1.5346823414986814, -34.032439961925064)
function df(x, dx)
    y = x
    dy = dx
    for _ in 1:2
        a = y ^ pi
        da = pi * y^(pi - 1) * dy
        b = cos(a)
        db = -sin(a) * da
        c = log(y)
        dc = 1/y * dy
        y = b * c
        dy = db * c + b * dc
    end
    y, dy
end

df(1.9, 1)
(-1.5346823414986814, -34.03241959914048)

We can go the other way

We can differentiate a composition \(h(g(f(x)))\) as

(28)\[\begin{align} \operatorname{d} h &= h' \operatorname{d} g \\ \operatorname{d} g &= g' \operatorname{d} f \\ \operatorname{d} f &= f' \operatorname{d} x. \end{align}\]

What we’ve done above is called “forward mode”, and amounts to placing the parentheses in the chain rule like

\[ \operatorname d h = \frac{dh}{dg} \left(\frac{dg}{df} \left(\frac{df}{dx} \operatorname d x \right) \right) .\]

The expression means the same thing if we rearrange the parentheses,

\[ \operatorname d h = \left( \left( \left( \frac{dh}{dg} \right) \frac{dg}{df} \right) \frac{df}{dx} \right) \operatorname d x \]

which we can compute with in reverse order via

\[ \underbrace{\bar x}_{\frac{dh}{dx}} = \underbrace{\bar g \frac{dg}{df}}_{\bar f} \frac{df}{dx} .\]

A reverse mode example

\[ \underbrace{\bar x}_{\frac{dh}{dx}} = \underbrace{\bar g \frac{dg}{df}}_{\bar f} \frac{df}{dx} .\]
function g(x)
    a = x^pi
    b = cos(a)
    c = log(x)
    y = b * c
    y
end
(g(1.4), diff_wp(g, 1.4))
(-0.32484122107701546, -1.2559760384500684)
function gback(x, y_)
    a = x^pi
    b = cos(a)
    c = log(x)
    y = b * c
    # backward pass
    c_ = y_ * b 
    b_ = c * y_
    a_ = -sin(a) * b_
    x_ = 1/x * c_ + pi * x^(pi-1) * a_
    x_
end
gback(1.4, 1)
-1.2559761698835525

Automatic differentiation

using Zygote
Zygote.gradient(f, 1.9)
(-34.03241959914049,)

How does Zygote work?

square(x) = x^2
@code_llvm square(1.5)
;  @ In[8]:1 within `square`
define double @julia_square_5092(double %0) #0 {
top:
; ┌ @ intfuncs.jl:312 within `literal_pow`
; │┌ @ float.jl:405 within `*`
    %1 = fmul double %0, %0
; └└
  ret double %1
}
@code_llvm Zygote.gradient(square, 1.5)
;  @ /home/jed/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:74 within `gradient`
define [1 x double] @julia_gradient_5209(double %0) #0 {
top:
;  @ /home/jed/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:76 within `gradient`
; ┌ @ /home/jed/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:41 within `#56`
; │┌ @ In[8]:1 within `Pullback`
; ││┌ @ /home/jed/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:67 within `#1822#back`
; │││┌ @ /home/jed/.julia/packages/Zygote/H6vD3/src/lib/number.jl:6 within `#254`
; ││││┌ @ promotion.jl:380 within `*` @ float.jl:405
       %1 = fmul double %0, 2.000000e+00
; └└└└└
;  @ /home/jed/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:77 within `gradient`
  %.fca.0.insert = insertvalue [1 x double] zeroinitializer, double %1, 0
  ret [1 x double] %.fca.0.insert
}

Kinds of algorithmic differentation

  • Source transformation: Fortran code in, Fortran code out

    • Duplicates compiler features, usually incomplete language coverage

    • Produces efficient code

  • Operator overloading: C++ types

    • Hard to vectorize

    • Loops are effectively unrolled/inefficient

  • Just-in-time compilation: tightly coupled with compiler

    • JIT lag

    • Needs dynamic language features (JAX) or tight integration with compiler (Zygote, Enzyme)

    • Some sharp bits

Forward or reverse?

It all depends on the shape.

  • One input, many outputs: use forward mode

    • “One input” can be looking in one direction

  • Many inputs, one output: use reverse mode

    • Will need to traverse execution backwards (“tape”)

    • Hierarchical checkpointing

  • Small intermediate results?

    • Break into pieces and use forward/reverse for the smaller pieces.

  • About square? Forward is usually a bit more efficient.

Can you differentiate an algorithm?

  • Input \(c\), output \(x\) such that \(f(x,c) = 0\)

  • Input \(A\), output \(\lambda\) such that \(A x = \lambda x\) for some nonzero vector \(x\)

  • Input buffer, output sha256(buffer)

Ill-conditioned optimization

\[ L(c; x, y) = \frac 1 2 \lVert \underbrace{f(x, c) - y}_{r(c)} \rVert_{C^{-1}}^2 \]

Gradient of \(L\) requires the Jacobian \(J\) of the model \(f\).

\[ g(c) = \nabla_c L = r^T \underbrace{\nabla_c f}_{J} \]

We can solve \(g(c) = 0\) using a Newton method

\[ g(c + \delta c) = g(c) + \underbrace{\nabla_c g}_H\delta c + O((\delta c)^2) \]

The Hessian requires the second derivative of \(f\), which can cause problems.

\[ H = J^T J + r^T (\nabla_c J)\]

Newton-like methods for optimization

Solve

\[ H \delta c = -g(c) \]

Update \(c \gets c + \gamma \delta c\) using a line search or trust region.

  • Gauss-Newton: \(H = J^T J\)

  • Levenberg-Marquardt: \(H = J^T J + \alpha I\)

Outlook

  • The optimization problem can be solved using a Newton method. It can be onerous to implement the needed derivatives.

  • The Gauss-Newton method is often more practical than Newton while being faster than gradient descent, though it lacks robustness.

  • The Levenberg-Marquardt method provides a sort of middle-ground between Gauss-Newton and gradient descent.

  • Many globalization techniques are used for models that possess many local minima.

  • One pervasive approach is stochastic gradient descent, where small batches (e.g., 1 or 10 or 20) are selected randomly from the corpus of observations (500 in our current example), and a step of gradient descent is applied to that reduced set of observations. This helps to escape saddle points and weak local minima.

  • Among expressive models \(f(x,c)\), some may converge much more easily than others. Having a good optimization algorithm is essential for nonlinear regression with complicated models, especially those with many parameters \(c\).

  • Classification is a very similar problem to regression, but the observations \(y\) are discrete, thus

  • models \(f(x,c)\) must have discrete output

  • the least squares loss function is not appropriate.

  • Why momentum really works

Team project workshopping

  1. Identify one or more software packages and execution environments that will be used.

  2. Identify one or more stakeholders in this problem space.

  3. Identify one or more decisions that depend on the use or behavior of the methods or software.

  4. What will you do?

  • Identify a test you can conduct to inform the decision. Think about the stakeholders and context.

  • Create something new using the software and reflect on how it works.

  • Make a contribution to improve the software or community (diagnostics, decumentation, efficiency). (Please share your ideas with me before choosing this.)

  1. Create a thread in #projects summarizing the above and naming your group members.