2022-04-01 Nonlinear Regression

Last time

  • Discuss projects

  • Gradient-based optimization for linear models

  • Effect of conditioning on convergence rate

Today

  • Nonlinear models

  • Computing derivatives

    • numeric

    • analytic by hand

    • algorithmic (automatic) differentiation

using LinearAlgebra
using Plots
default(linewidth=4, legendfontsize=12)

function vander(x, k=nothing)
    if isnothing(k)
        k = length(x)
    end
    m = length(x)
    V = ones(m, k)
    for j in 2:k
        V[:, j] = V[:, j-1] .* x
    end
    V
end

function vander_chebyshev(x, n=nothing)
    if isnothing(n)
        n = length(x) # Square by default
    end
    m = length(x)
    T = ones(m, n)
    if n > 1
        T[:, 2] = x
    end
    for k in 3:n
        #T[:, k] = x .* T[:, k-1]
        T[:, k] = 2 * x .* T[:,k-1] - T[:, k-2]
    end
    T
end

function chebyshev_regress_eval(x, xx, n)
    V = vander_chebyshev(x, n)
    vander_chebyshev(xx, n) / V
end

runge(x) = 1 / (1 + 10*x^2)
runge_noisy(x, sigma) = runge.(x) + randn(size(x)) * sigma

CosRange(a, b, n) = (a + b)/2 .+ (b - a)/2 * cos.(LinRange(-pi, 0, n))

vcond(mat, points, nmax) = [cond(mat(points(-1, 1, n))) for n in 2:nmax]
vcond (generic function with 1 method)

Gradient descent

Instead of solving the least squares problem using linear algebra (QR factorization), we could solve it using gradient descent. That is, on each iteration, we’ll take a step in the direction of the negative gradient.

function grad_descent(loss, grad, c0; gamma=1e-3, tol=1e-5)
    """Minimize loss(c) via gradient descent with initial guess c0
    using learning rate gamma.  Declares convergence when gradient
    is less than tol or after 500 steps.
    """
    c = copy(c0)
    chist = [copy(c)]
    lhist = [loss(c)]
    for it in 1:500
        g = grad(c)
        c -= gamma * g
        push!(chist, copy(c))
        push!(lhist, loss(c))
        if norm(g) < tol
            break
        end
    end
    (c, hcat(chist...), lhist)
end
grad_descent (generic function with 1 method)

Quadratic model

A = [1 1; 1 20]
@show cond(A)
loss(c) = .5 * c' * A * c
grad(c) = A * c

c, chist, lhist = grad_descent(loss, grad, [.9, .9],
    gamma=.09)
plot(lhist, yscale=:log10, xlims=(0, 80))
cond(A) = 21.163274649089434
../_images/2022-04-01-nonlinear-regression_5_1.svg
plot(chist[1, :], chist[2, :], marker=:circle)
x = LinRange(-1, 1, 30)
contour!(x, x, (x,y) -> loss([x, y]))
../_images/2022-04-01-nonlinear-regression_6_0.svg

Chebyshev regression via optimization

x = LinRange(-1, 1, 200)
sigma = 0.5; n = 8
y = runge_noisy(x, sigma)
V = vander_chebyshev(x, n)
function loss(c)
    r = V * c - y
    .5 * r' * r
end
function grad(c)
    r = V * c - y
    V' * r
end
c, _, lhist = grad_descent(loss, grad, ones(n),
    gamma=0.008)
c
8-element Vector{Float64}:
  0.2574425176426252
 -0.07677183142880879
 -0.45869836761594074
 -0.03424918979535217
  0.0015455111399231755
 -0.04040206116680208
 -0.2646122776404021
 -0.022719353732199264
c0 = V \ y
l0 = 0.5 * norm(V * c0 - y)^2
@show cond(V' * V)
plot(lhist .- l0, yscale=:log10)
#plot!(i -> l0, color=:black)
cond(V' * V) = 9.259357328426905
../_images/2022-04-01-nonlinear-regression_9_1.svg

Why use QR vs gradient-based optimization?

Nonlinear models

Instead of the linear model

$$ f(x, c) = V(x) c = c_0 + c_1 \underbrace{x}_{T_1(x)} + c_2 T_2(x) + \dotsb $$
let’s consider a rational model with only three parameters
$$ f(x, c) = \frac{1}{c_1 + c_2 x + c_3 x^2} = (c_1 + c_2 x + c_3 x^2)^{-1} . $$
We’ll use the same loss function
$$ L(c; x, y) = \frac 1 2 \lVert f(x, c) - y \rVert^2 . $$

We will also need the gradient

$$ \nabla_c L(c; x, y) = \big( f(x, c) - y \big)^T \nabla_c f(x, c) $$
where

$$ \begin{aligned}
\frac{\partial f(x, c)}{\partial c_1} &= -(c_1 + c_2 x + c_3 x^2)^{-2} = -f(x, c)^2 \\
\frac{\partial f(x, c)}{\partial c_2} &= -(c_1 + c_2 x + c_3 x^2)^{-2} x = -f(x, c)^2 x \\
\frac{\partial f(x, c)}{\partial c_3} &= -(c_1 + c_2 x + c_3 x^2)^{-2} x^2 = -f(x, c)^2 x^2 .
\end{aligned} \tag{24} $$

Fitting a rational function

f(x, c) = 1 ./ (c[1] .+ c[2].*x + c[3].*x.^2)
function gradf(x, c)
    f2 = f(x, c).^2
    [-f2 -f2.*x -f2.*x.^2]
end
function loss(c)
    r = f(x, c) - y
    0.5 * r' * r
end
function gradient(c)
    r = f(x, c) - y
    vec(r' * gradf(x, c))
end
gradient (generic function with 1 method)
c, _, lhist = grad_descent(loss, gradient, [1., 0, 10.],
    gamma=1e-2)
@show c
plot(lhist, yscale=:log10)
c = [0.9360587066416243, 0.49240962714220576, 10.23410523331987]
../_images/2022-04-01-nonlinear-regression_14_1.svg

Compare fits on noisy data

scatter(x, y)
V = vander_chebyshev(x, 20)
plot!(x -> runge(x), color=:black, label="Runge")
plot!(x, V * (V \ y), label="Chebyshev fit")
plot!(x -> f(x, c), label="Rational fit")
../_images/2022-04-01-nonlinear-regression_16_0.svg

How do we compute those derivatives as the model gets complicated?

$$ \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} $$
  • How should we choose h?

Taylor series

Classical accuracy analysis assumes that functions are sufficiently smooth, meaning that derivatives exist and Taylor expansions are valid within a neighborhood. In particular,

$$ f(x + h) = f(x) + f'(x) h + f''(x) \frac{h^2}{2!} + f'''(x) \frac{h^3}{3!} + O(h^4). $$

The big-$O$ notation is meant in the limit $h \to 0$. Specifically, a function $g(h) \in O(h^p)$ (sometimes written $g(h) = O(h^p)$) when there exists a constant $C$ such that

$$ g(h) \le C h^p $$

for all sufficiently small $h$.

Rounding error

We have an additional source of error, rounding error, which comes from not being able to compute $f(x)$ or $f(x+h)$ exactly, nor subtract them exactly. Suppose that we can, however, compute these functions with a relative error on the order of $\epsilon_{\text{machine}}$. This leads to

$$ \begin{aligned}
\tilde f(x) &= f(x)(1 + \epsilon_1) \\
\tilde f(x + h) &= \tilde f\big( (x + h)(1 + \epsilon_2) \big) \\
&= f\big( (x + h)(1 + \epsilon_2) \big)(1 + \epsilon_3) \\
&= \big[ f(x + h) + f'(x + h)(x + h)\epsilon_2 + O(\epsilon_2^2) \big](1 + \epsilon_3) \\
&= f(x + h)(1 + \epsilon_3) + f'(x + h) x \epsilon_2 + O\big(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h\big)
\end{aligned} $$

Tedious error propagation

$$ \begin{aligned}
\left\lvert \frac{\tilde f(x+h) - \tilde f(x)}{h} - \frac{f(x+h) - f(x)}{h} \right\rvert
&= \frac{\left\lvert f(x+h)(1+\epsilon_3) + f'(x+h) x \epsilon_2 + O\big(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h\big) - f(x)(1+\epsilon_1) - f(x+h) + f(x) \right\rvert}{h} \\
&\le \frac{|f(x+h)\epsilon_3| + |f'(x+h) x \epsilon_2| + |f(x)\epsilon_1| + O\big(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h\big)}{h} \\
&\le \frac{\big( 2 \max_{[x,x+h]} |f| + \max_{[x,x+h]} |f' x| \big) \epsilon_{\text{machine}} + O\big(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h\big)}{h} \\
&= \big( 2 \max |f| + \max |f' x| \big) \frac{\epsilon_{\text{machine}}}{h} + O(\epsilon_{\text{machine}})
\end{aligned} $$

where we have assumed that $h \gtrsim \epsilon_{\text{machine}}$. This error becomes large (relative to $f'$ – we are concerned with relative error, after all) when

  • $|f|$ is large compared to $|f'|$

  • $x$ is large

  • $h$ is too small

The sweep below illustrates this tradeoff numerically.
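A sketch added here (not part of the original notes): sweep the step size for $f(x) = \sin x$ at $x = 1$ and compare the forward-difference error against the truncation and rounding estimates above.

hs = 10.0 .^ (-16:0.5:-1)
fd_error = [abs((sin(1 + h) - sin(1)) / h - cos(1)) for h in hs]
plot(hs, fd_error, xscale=:log10, yscale=:log10,
    xlabel="h", ylabel="error", label="forward difference", marker=:circle)
plot!(hs, hs ./ 2, label="O(h) truncation", linestyle=:dash)
plot!(hs, 2 * eps() ./ hs, label="O(eps/h) rounding", linestyle=:dash)

The error decreases like $O(h)$ until rounding error (growing like $\epsilon_{\text{machine}}/h$) takes over near $h \approx \sqrt{\epsilon_{\text{machine}}}$.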

Automatic step size selection

  • Walker and Pernice

  • Dennis and Schnabel

diff(f, x; h=1e-8) = (f(x+h) - f(x)) / h

function diff_wp(f, x; eps=1e-8)
    """Diff using Walker and Pernice (1998) choice of step"""
    h = eps * (1 + abs(x))
    (f(x+h) - f(x)) / h
end

x = 1
diff_wp(sin, x) - cos(x)
-2.9698852266335507e-9

Symbolic differentiation

using Symbolics

@variables x
Dx = Differential(x)

y = sin(x)
Dx(y)
$\frac{\mathrm{d}\, \sin(x)}{\mathrm{d} x}$
expand_derivatives(Dx(y))
cos(x)

Awesome, what about products?

y = x
for _ in 1:3
    y = cos(y^pi) * log(y)
end
expand_derivatives(Dx(y))
(The output is an enormous expression; the notebook's LaTeX renderer fails with "Double exponent: use braces to clarify".)
  • The size of these expressions can grow exponentially (see the sketch below)
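A rough check (my sketch, not from the notes; expression_swell is a made-up name): measure how the printed length of the expanded derivative grows with the depth of the composition.

function expression_swell(depth)
    y = x                      # the Symbolics variable defined above
    for _ in 1:depth
        y = cos(y^pi) * log(y)
    end
    # size of the printed form of the expanded derivative
    length(string(expand_derivatives(Dx(y))))
end

[expression_swell(d) for d in 1:3]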

Hand-coding derivatives

$$ df = f'(x) \, dx $$
function f(x)
    y = x
    for _ in 1:2
        a = y^pi
        b = cos(a)
        c = log(y)
        y = b * c
    end
    y
end

f(1.9), diff_wp(f, 1.9)
(-1.5346823414986814, -34.032439961925064)
function df(x, dx)
    y = x
    dy = dx
    for _ in 1:2
        a = y ^ pi
        da = pi * y^(pi - 1) * dy
        b = cos(a)
        db = -sin(a) * da
        c = log(y)
        dc = dy / y
        y = b * c
        dy = db * c + b * dc
    end
    y, dy
end

df(1.9, 1)
(-1.5346823414986814, -34.03241959914048)

We can go the other way

We can differentiate a composition h(g(f(x))) as

$$ dh = h' \, dg \qquad dg = g' \, df \qquad df = f' \, dx . \tag{25} $$

What we’ve done above is called “forward mode”, and amounts to placing the parentheses in the chain rule like

$$ dh = \frac{dh}{dg} \left( \frac{dg}{df} \left( \frac{df}{dx} \, dx \right) \right) . $$

The expression means the same thing if we rearrange the parentheses,

$$ dh = \left( \left( \frac{dh}{dg} \right) \frac{dg}{df} \right) \frac{df}{dx} \, dx . $$
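Rearranging the parentheses this way corresponds to reverse-mode (adjoint) differentiation. As a concrete sketch (added here, not from the original notes; df_reverse is a made-up name), we can hand-code it for the same f: record the intermediates during the forward pass, then sweep backwards accumulating adjoints.

function df_reverse(x)
    # forward pass: record what each iteration will need in the reverse sweep
    y = x
    tape = []
    for _ in 1:2
        a = y^pi
        b = cos(a)
        c = log(y)
        push!(tape, (y, a, b, c))
        y = b * c
    end
    # reverse pass: start from d(output)/d(output) = 1 and apply the chain
    # rule in the opposite order, accumulating adjoints (the "bar" variables)
    ybar = one(x)
    for (yin, a, b, c) in reverse(tape)
        bbar = ybar * c                      # y = b * c
        cbar = ybar * b
        abar = -sin(a) * bbar                # b = cos(a)
        yinbar = cbar / yin                  # c = log(yin)
        yinbar += pi * yin^(pi - 1) * abar   # a = yin^pi
        ybar = yinbar                        # adjoint of the previous iterate's output
    end
    y, ybar
end

df_reverse(1.9)  # should agree with df(1.9, 1) and Zygote.gradient(f, 1.9) below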

Automatic differentiation

using Zygote
WARNING: using Zygote.gradient in module Main conflicts with an existing identifier.
Zygote.gradient(f, 1.9)
(-34.03241959914049,)

How does Zygote work?

square(x) = x^2
@code_llvm square(1.5)
;  @ In[26]:1 within `square`
define double @julia_square_11120(double %0) #0 {
top:
; ┌ @ intfuncs.jl:312 within `literal_pow`
; │┌ @ float.jl:405 within `*`
    %1 = fmul double %0, %0
; └└
  ret double %1
}
@code_llvm Zygote.gradient(square, 1.5)
;  @ /home/jed/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:74 within `gradient`
define [1 x double] @julia_gradient_11254(double %0) #0 {
top:
;  @ /home/jed/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:76 within `gradient`
; ┌ @ /home/jed/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:41 within `#56`
; │┌ @ In[26]:1 within `Pullback`
; ││┌ @ /home/jed/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:67 within `#1822#back`
; │││┌ @ /home/jed/.julia/packages/Zygote/H6vD3/src/lib/number.jl:6 within `#254`
; ││││┌ @ promotion.jl:380 within `*` @ float.jl:405
       %1 = fmul double %0, 2.000000e+00
; └└└└└
;  @ /home/jed/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:77 within `gradient`
  %.fca.0.insert = insertvalue [1 x double] zeroinitializer, double %1, 0
  ret [1 x double] %.fca.0.insert
}