2023-04-03 Nonlinear Regression#
Last time#
Discuss projects
Gradient-based optimization for linear models
Effect of conditioning on convergence rate
Today#
Nonlinear models
Computing derivatives
numeric
analytic by hand
algorithmic (automatic) differentiation
using LinearAlgebra
using Plots
default(linewidth=4, legendfontsize=12)
function vander(x, k=nothing)
if isnothing(k)
k = length(x)
end
m = length(x)
V = ones(m, k)
for j in 2:k
V[:, j] = V[:, j-1] .* x
end
V
end
function vander_chebyshev(x, n=nothing)
if isnothing(n)
n = length(x) # Square by default
end
m = length(x)
T = ones(m, n)
if n > 1
T[:, 2] = x
end
for k in 3:n
#T[:, k] = x .* T[:, k-1]
T[:, k] = 2 * x .* T[:,k-1] - T[:, k-2]
end
T
end
function chebyshev_regress_eval(x, xx, n)
V = vander_chebyshev(x, n)
vander_chebyshev(xx, n) / V
end
runge(x) = 1 / (1 + 10*x^2)
runge_noisy(x, sigma) = runge.(x) + randn(size(x)) * sigma
CosRange(a, b, n) = (a + b)/2 .+ (b - a)/2 * cos.(LinRange(-pi, 0, n))
vcond(mat, points, nmax) = [cond(mat(points(-1, 1, n))) for n in 2:nmax]
vcond (generic function with 1 method)
Gradient descent#
Instead of solving the least squares problem using linear algebra (QR factorization), we could solve it using gradient descent. That is, on each iteration, we'll take a step in the direction of the negative gradient,
\[ c_{k+1} = c_k - \gamma \nabla_c L(c_k), \]
where \(\gamma\) is the learning rate (step size).
function grad_descent(loss, grad, c0; gamma=1e-3, tol=1e-5)
"""Minimize loss(c) via gradient descent with initial guess c0
using learning rate gamma. Declares convergence when gradient
is less than tol or after 500 steps.
"""
c = copy(c0)
chist = [copy(c)]
lhist = [loss(c)]
for it in 1:500
g = grad(c)
c -= gamma * g
push!(chist, copy(c))
push!(lhist, loss(c))
if norm(g) < tol
break
end
end
(c, hcat(chist...), lhist)
end
grad_descent (generic function with 1 method)
Quadratic model#
A = [1 1; 1 8]
@show cond(A)
loss(c) = .5 * c' * A * c
grad(c) = A * c
c, chist, lhist = grad_descent(loss, grad, [.9, .9],
gamma=.22)
plot(lhist, yscale=:log10, xlims=(0, 80))
cond(A) = 9.46578492882319
plot(chist[1, :], chist[2, :], marker=:circle)
x = LinRange(-1, 1, 30)
contour!(x, x, (x,y) -> loss([x, y]))
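How large can the step be? A minimal sketch based on standard gradient-descent theory for the quadratic loss \(\tfrac 1 2 c^T A c\) (not part of the original notebook): the iteration contracts only if \(\gamma < 2/\lambda_{\max}(A)\), and even at the best \(\gamma\) the error shrinks by a factor of \((\kappa - 1)/(\kappa + 1)\) per step, where \(\kappa = \operatorname{cond}(A)\).
# Sketch: stability and convergence-rate bounds for gradient descent on
# the quadratic loss 0.5 * c' * A * c (standard theory, not from the notes).
lam = eigvals(A)                           # A is symmetric positive definite here
gamma_max = 2 / maximum(lam)               # steps diverge if gamma exceeds this
rate_best = (cond(A) - 1) / (cond(A) + 1)  # error factor per step at the optimal gamma
@show lam gamma_max rate_best
With the \(\gamma = 0.22\) used above (just under this bound), the per-step contraction works out to about 0.81, essentially the optimal rate for this \(A\).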
Chebyshev regression via optimization#
x = LinRange(-1, 1, 200)
sigma = 0.5; n = 8
y = runge_noisy(x, sigma)
V = vander(x, n) # or vander_chebyshev
function loss(c)
r = V * c - y
.5 * r' * r
end
function grad(c)
r = V * c - y
V' * r
end
c, _, lhist = grad_descent(loss, grad, ones(n),
gamma=0.008)
c
8-element Vector{Float64}:
0.874660924654058
-0.12842683730993662
-2.0798930881443
1.017859789565291
0.9494221843839521
-0.29122263995838943
0.5595095146898109
-0.7532270031764107
c0 = V \ y
l0 = 0.5 * norm(V * c0 - y)^2
@show cond(V)
@show cond(V' * V)
plot(lhist, yscale=:log10, ylim=(15, 50))
plot!(i -> l0, color=:black)
cond(V) = 230.00549982014527
cond(V' * V) = 52902.52994792632
Why use QR vs gradient-based optimization?#
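One concrete way to compare (a sketch, not from the original notes): the QR-based solve `V \ y` returns the minimizer after a single factorization, while gradient descent needs a tuned step size and an iteration count that grows with \(\operatorname{cond}(V^T V) = \operatorname{cond}(V)^2\).
# Sketch: compare the direct QR-based solve with the gradient-descent
# result computed above (same V, y, loss, grad, and n).
c_qr = V \ y                    # backslash uses a QR factorization for tall V
c_gd, _, _ = grad_descent(loss, grad, ones(n), gamma=0.008)
@show norm(c_qr - c_gd)         # gradient descent has not fully converged after 500 steps
@show cond(V), cond(V' * V)     # forming the normal equations squares the condition number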
Nonlinear models#
Instead of the linear model
\[ f(x, c) = V(x)\, c, \]
whose coefficients \(c\) enter linearly, let's fit a model that is nonlinear in the coefficients, the rational function
\[ f(x, c) = \frac{1}{c_1 + c_2 x + c_3 x^2} . \]
We keep the same loss,
\[ L(c; x, y) = \frac 1 2 \lVert f(x, c) - y \rVert^2 . \]
We will also need the gradient
\[ \nabla_c L(c; x, y) = \big( f(x, c) - y \big)^T \nabla_c f(x, c), \]
where
\[ \frac{\partial f}{\partial c_1} = -f(x,c)^2, \qquad \frac{\partial f}{\partial c_2} = -f(x,c)^2\, x, \qquad \frac{\partial f}{\partial c_3} = -f(x,c)^2\, x^2 . \]
Fitting a rational function#
f(x, c) = 1 ./ (c[1] .+ c[2].*x + c[3].*x.^2)
function gradf(x, c)
f2 = f(x, c).^2
[-f2 -f2.*x -f2.*x.^2]
end
function loss(c)
r = f(x, c) - y
0.5 * r' * r
end
function gradient(c)
r = f(x, c) - y
vec(r' * gradf(x, c))
end
gradient (generic function with 1 method)
c, _, lhist = grad_descent(loss, gradient, ones(3),
gamma=5e-2)
plot(lhist, yscale=:linear, ylim=(20, 60), title="Loss $(lhist[end])")
┌ Warning: scale linear is unsupported with Plots.GRBackend(). Choose from: [:identity, :log10]
└ @ Plots /home/jed/.julia/packages/Plots/4UTBj/src/args.jl:1662
Compare fits on noisy data#
scatter(x, y)
V = vander_chebyshev(x, 20)
plot!(x -> runge(x), color=:black, label="Runge")
plot!(x, V * (V \ y), label="Chebyshev fit")
plot!(x -> f(x, c), label="Rational fit")
How do we compute those derivatives as the model gets complicated?#
A first approach is a finite difference,
\[ \frac{\partial f}{\partial x} \approx \frac{f(x+h) - f(x)}{h} . \]
How should we choose \(h\)?
Taylor series#
Classical accuracy analysis assumes that functions are sufficiently smooth, meaning that derivatives exist and Taylor expansions are valid within a neighborhood. In particular,
\[ f(x+h) = f(x) + f'(x)\, h + \frac{f''(x)}{2} h^2 + O(h^3), \]
so the difference quotient satisfies
\[ \frac{f(x+h) - f(x)}{h} = f'(x) + \frac{f''(x)}{2} h + O(h^2) = f'(x) + O(h) . \]
The big-\(O\) notation is meant in the limit \(h\to 0\). Specifically, a function \(g(h) \in O(h^p)\) (sometimes written \(g(h) = O(h^p)\)) when there exists a constant \(C\) such that
\[ \lvert g(h) \rvert \le C h^p \]
for all sufficiently small \(h\).
Rounding error#
We have an additional source of error, rounding error, which comes from not being able to compute \(f(x)\) or \(f(x+h)\) exactly, nor subtract them exactly. Suppose that we can, however, compute these functions with a relative error on the order of \(\epsilon_{\text{machine}}\). This leads to
\[ \tilde f(x) = f(x)(1 + \epsilon_1), \qquad \tilde f(x+h) = f(x+h)(1 + \epsilon_2), \qquad \lvert \epsilon_1 \rvert, \lvert \epsilon_2 \rvert \le \epsilon_{\text{machine}} . \]
In addition, the rounded argument \(x \oplus h\) differs from \(x + h\) by about \(\epsilon_{\text{machine}} \lvert x \rvert\), which perturbs \(f(x+h)\) by about \(\epsilon_{\text{machine}} \lvert x \rvert\, \lvert f'(x) \rvert\).
Tedious error propagation#
Substituting these perturbations into the difference quotient gives
\[ \begin{split} \frac{\tilde f(x \oplus h) - \tilde f(x)}{h} &= \frac{f(x+h) - f(x)}{h} + \frac{f(x+h)\,\epsilon_2 - f(x)\,\epsilon_1}{h} + O\!\left( \frac{\lvert f'(x) \rvert\, \lvert x \rvert\, \epsilon_{\text{machine}}}{h} \right) \\ &= f'(x) + O(h) + O\!\left( \frac{\big( \lvert f(x) \rvert + \lvert f'(x) \rvert\, \lvert x \rvert \big)\, \epsilon_{\text{machine}}}{h} \right), \end{split} \]
where we have assumed that \(h \ge \epsilon_{\text{machine}}\). This error becomes large (relative to \(f'\) – we are concerned with relative error after all; see the sketch after this list) when
\(f\) is large compared to \(f'\)
\(x\) is large
\(h\) is too small
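To see both regimes directly, here is a small sketch (not from the original notebook) that scans \(h\) for \(f = \sin\) at \(x = 1\): truncation error dominates for large \(h\), rounding error for tiny \(h\), with the minimum near \(h \approx \sqrt{\epsilon_{\text{machine}}}\).
# Sketch: forward-difference error vs. step size h for f = sin at x = 1.
# Truncation error ~ h dominates for large h; rounding error ~ eps/h
# dominates for tiny h, so the total error is smallest near h ~ sqrt(eps).
fd(f, x, h) = (f(x + h) - f(x)) / h
hs = 10.0 .^ (-15:0.5:-1)
errors = [abs(fd(sin, 1.0, h) - cos(1.0)) for h in hs]
plot(hs, errors, xscale=:log10, yscale=:log10,
     xlabel="h", ylabel="error", label="|fd - exact|")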
Automatic step size selection#
Walker and Pernice
Dennis and Schnabel
diff(f, x; h=1e-8) = (f(x+h) - f(x)) / h
function diff_wp(f, x; h=1e-8)
"""Diff using Walker and Pernice (1998) choice of step"""
h *= (1 + abs(x))
(f(x+h) - f(x)) / h
end
x = 1000
diff_wp(sin, x) - cos(x)
-4.139506408429305e-6
Symbolic differentiation#
using Symbolics
@variables x
Dx = Differential(x)
y = sin(x)
Dx(y)
expand_derivatives(Dx(y))
Awesome, what about products?#
y = x
for _ in 1:2
y = cos(y^pi) * log(y)
end
expand_derivatives(Dx(y))
The size of these expressions can grow exponentially with the depth of the composition, as the sketch below illustrates.
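One way to watch the growth (a sketch, not in the original notebook): iterate the same recurrence to increasing depth and measure the length of the printed derivative.
# Sketch: printed length of the symbolic derivative as the recurrence
# y = cos(y^pi) * log(y) is nested more deeply (rough proxy for expression size).
for depth in 1:4
    y = x            # `x` is the symbolic variable defined above
    for _ in 1:depth
        y = cos(y^pi) * log(y)
    end
    dy = expand_derivatives(Dx(y))
    println("depth $depth: ", length(string(dy)), " characters")
end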
Hand-coding derivatives#
function f(x)
y = x
for _ in 1:2
a = y^pi
b = cos(a)
c = log(y)
y = b * c
end
y
end
f(1.9), diff_wp(f, 1.9)
(-1.5346823414986814, -34.032439961925064)
function df(x, dx)
y = x
dy = dx
for _ in 1:2
a = y ^ pi
da = pi * y^(pi - 1) * dy
b = cos(a)
db = -sin(a) * da
c = log(y)
dc = dy / y
y = b * c
dy = db * c + b * dc
end
y, dy
end
df(1.9, 1)
(-1.5346823414986814, -34.032419599140475)
We can go the other way#
We can differentiate a composition \(h(g(f(x)))\) as
\[ \operatorname{d} h = h'\, \operatorname{d} g, \qquad \operatorname{d} g = g'\, \operatorname{d} f, \qquad \operatorname{d} f = f'\, \operatorname{d} x . \]
What we've done above is called "forward mode", and amounts to placing the parentheses in the chain rule like
\[ \operatorname{d} h = \frac{dh}{dg} \left( \frac{dg}{df} \left( \frac{df}{dx}\, \operatorname{d} x \right) \right) . \]
The expression means the same thing if we rearrange the parentheses,
\[ \operatorname{d} h = \left( \left( \frac{dh}{dg}\, \frac{dg}{df} \right) \frac{df}{dx} \right) \operatorname{d} x , \]
which accumulates the derivatives from the output back toward the input; this is "reverse mode" (also known as back-propagation).
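Here is a hand-coded sketch of reverse mode for the same `f` (an illustration, not from the original notes): the forward sweep records the intermediate values, and the reverse sweep accumulates the derivative of the output with respect to each intermediate.
# Sketch: reverse-mode ("adjoint") differentiation of f, hand-coded.
# Forward sweep records the intermediates; reverse sweep accumulates
# derivatives of the output with respect to each intermediate.
function df_reverse(x)
    ys = [x]; as = Float64[]; bs = Float64[]; cs = Float64[]
    for _ in 1:2                    # same computation as f
        y = ys[end]
        a = y^pi;   push!(as, a)
        b = cos(a); push!(bs, b)
        c = log(y); push!(cs, c)
        push!(ys, b * c)
    end
    ybar = 1.0                      # d(output)/d(output)
    for i in 2:-1:1                 # walk the loop in reverse
        y, a, b, c = ys[i], as[i], bs[i], cs[i]
        bbar = ybar * c             # output of this iteration was b * c
        cbar = ybar * b
        abar = -sin(a) * bbar       # b = cos(a)
        ybar = abar * pi * y^(pi - 1) + cbar / y   # a = y^pi, c = log(y)
    end
    ys[end], ybar
end

df_reverse(1.9)  # should agree with df(1.9, 1) up to rounding
The forward-mode `df` above propagates one directional derivative alongside the values; the reverse sweep instead needs the stored intermediates but delivers the full derivative of a scalar output in one backward pass.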
Automatic differentiation#
import Zygote
Zygote.gradient(f, 1.9)
(-34.03241959914049,)
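Zygote does this reverse sweep automatically. As a check, here is a self-contained sketch (not in the original notebook; the data are rebuilt because `x` was rebound to a symbolic variable above) comparing Zygote's gradient of the rational-fit loss with the hand-derived gradient from earlier.
# Sketch: Zygote's gradient of the rational-fit loss vs. the hand-coded
# gradient from the "Fitting a rational function" section.
xd = collect(LinRange(-1, 1, 200))        # fresh data; `x` is now symbolic
yd = runge_noisy(xd, 0.5)
frat(c) = 1 ./ (c[1] .+ c[2].*xd + c[3].*xd.^2)
loss_rat(c) = 0.5 * sum(abs2, frat(c) - yd)
c0 = ones(3)
f2 = frat(c0).^2
g_hand = vec((frat(c0) - yd)' * [-f2 -f2.*xd -f2.*xd.^2])
g_ad = Zygote.gradient(loss_rat, c0)[1]
norm(g_ad - g_hand)                       # should be near rounding error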
g(x) = exp(x) + x^2
@code_llvm Zygote.gradient(g, 1.9)
; @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:95 within `gradient`
define [1 x double] @julia_gradient_12688(double %0) #0 {
top:
; @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:96 within `gradient`
; ┌ @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:42 within `pullback` @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:44
; │┌ @ In[18]:1 within `_pullback` @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:9
; ││┌ @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl within `macro expansion`
; │││┌ @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/chainrules.jl:218 within `chain_rrule`
; ││││┌ @ /home/jed/.julia/packages/ChainRulesCore/C73ay/src/rules.jl:134 within `rrule` @ /home/jed/.julia/packages/ChainRules/hVHC4/src/rulesets/Base/fastmath_able.jl:56
%1 = call double @j_exp_12690(double %0) #0
; └└└└└
; @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:97 within `gradient`
; ┌ @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:45 within `#60`
; │┌ @ In[18]:1 within `Pullback`
; ││┌ @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/chainrules.jl:206 within `ZBack`
; │││┌ @ /home/jed/.julia/packages/Zygote/dABKa/src/lib/number.jl:12 within `literal_pow_pullback`
; ││││┌ @ promotion.jl:389 within `*` @ float.jl:385
%2 = fmul double %0, 2.000000e+00
; │└└└└
; │┌ @ /home/jed/.julia/packages/Zygote/dABKa/src/lib/lib.jl:17 within `accum`
; ││┌ @ float.jl:383 within `+`
%3 = fadd double %2, %1
; └└└
; @ /home/jed/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:98 within `gradient`
%.fca.0.insert = insertvalue [1 x double] zeroinitializer, double %3, 0
ret [1 x double] %.fca.0.insert
}
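The generated code amounts to \(g'(x) = \exp(x) + 2x\): the `call` computes \(\exp(x)\), the `fmul` doubles \(x\), and the `fadd` sums them. A quick check (a sketch):
# The LLVM IR above computes exp(x) + 2x, the hand derivative of g.
gprime(x) = exp(x) + 2x
Zygote.gradient(g, 1.9)[1] ≈ gprime(1.9)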