# 2022-03-14 Regression

## Today

• Discussion

• Linear models

• Loss functions and partial derivatives

using LinearAlgebra
using Plots
default(linewidth=4, legendfontsize=12)

function vander(x, k=nothing)
    if isnothing(k)
        k = length(x)  # square by default
    end
    m = length(x)
    V = ones(m, k)
    for j in 2:k
        V[:, j] = V[:, j-1] .* x  # monomial basis: V[:, j] = x .^ (j-1)
    end
    V
end

function vander_chebyshev(x, n=nothing)
    if isnothing(n)
        n = length(x) # Square by default
    end
    m = length(x)
    T = ones(m, n)
    if n > 1
        T[:, 2] = x
    end
    for k in 3:n
        # Chebyshev three-term recurrence: T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)
        T[:, k] = 2 * x .* T[:, k-1] - T[:, k-2]
    end
    T
end

runge(x) = 1 / (1 + 10*x^2)

CosRange(a, b, n) = (a + b)/2 .+ (b - a)/2 * cos.(LinRange(-pi, 0, n))

vcond(mat, points, nmax) = [cond(mat(points(-1, 1, n))) for n in 2:nmax]

vcond (generic function with 1 method)


The expected error in our approximation $$\hat f(x)$$ of noisy data $$y = f(x) + \epsilon$$ (with $$\epsilon \sim \mathcal N(0, \sigma)$$) can be decomposed as

$E[(\hat f(x) - y)^2] = \sigma^2 + \big(\underbrace{E[\hat f(x)] - f(x)}_{\text{Bias}}\big)^2 + \underbrace{E[\hat f(x)^2] - E[\hat f(x)]^2}_{\text{Variance}} .$
The $$\sigma^2$$ term is irreducible error (purely due to observation noise), but bias and variance can be controlled by model selection. More complex models are better able to express the underlying function $$f(x)$$ and can therefore reduce bias. However, they are also more affected by noise, which increases variance.
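
To make this decomposition concrete, we can estimate the bias and variance of a fixed-degree fit empirically by averaging over many independent noise realizations. A minimal sketch (added here, not one of the original cells; the degree 7, noise level 0.5, and number of realizations are arbitrary choices):

```julia
# Sketch: empirically estimate bias and variance of a degree-7 Chebyshev
# least-squares fit to noisy Runge data, averaged over many realizations.
xs = LinRange(-1, 1, 100)
V = vander_chebyshev(xs, 7)
fits = [V * (V \ (runge.(xs) + 0.5 * randn(length(xs)))) for _ in 1:1000]
fmean = sum(fits) / length(fits)                            # estimate of E[f̂(x)]
bias2 = (fmean - runge.(xs)) .^ 2                           # (E[f̂(x)] - f(x))²
var_hat = sum(f -> (f - fmean) .^ 2, fits) / length(fits)   # E[f̂(x)²] - E[f̂(x)]²
plot(xs, [bias2 var_hat], label=["bias²" "variance"])
```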

## Regression using polynomials

function chebyshev_regress_eval(x, xx, n)
    V = vander_chebyshev(x, n)
    vander_chebyshev(xx, n) / V
end
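
An aside (my remark, not in the original notes): since V is tall, the right-division `vander_chebyshev(xx, n) / V` solves a least-squares problem, producing the matrix that maps observed values y directly to fitted values at the points xx. An equivalent, more explicit sketch:

```julia
# Hypothetical explicit version of what chebyshev_regress_eval(x, xx, n) * y computes:
# fit Chebyshev coefficients by least squares, then evaluate on xx.
function chebyshev_regress_eval_explicit(x, xx, n, y)
    V = vander_chebyshev(x, n)
    c = V \ y                       # least-squares coefficients
    vander_chebyshev(xx, n) * c     # fitted values at xx
end
```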

runge(x) = 1 / (1 + 10*x^2)
runge_noisy(x, sigma) = runge.(x) + randn(size(x)) * sigma

x = LinRange(-1, 1, 500)
ytrain = runge_noisy(x, 0.5)
yfit = chebyshev_regress_eval(x, x, 7) * ytrain
size(ytrain), size(yfit)

((500,), (500,))

plot(x, runge.(x), label="runge(x)")
plot!(x, yfit, label="yfit")
scatter!(x, ytrain, markersize=2)

ytest = runge_noisy(x, 0.5)
@show norm(yfit - ytrain)
@show norm(yfit - ytest);

norm(yfit - ytrain) = 11.260549155614475
norm(yfit - ytest) = 11.397657573604622


## What happens as we increase polynomial degree?

ks = 2:2:20
p = plot(legend=:none)
function residuals(k)
    # Fit a polynomial of degree k to ytrain. The rest of this cell is a
    # plausible completion of the truncated source: plot the fit and return
    # the train/test residual norms.
    yfit = chebyshev_regress_eval(x, x, k) * ytrain
    plot!(p, x, yfit)
    norm(yfit - ytrain), norm(yfit - ytest)
end

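To answer the question in the heading, we can evaluate the residuals over the range of degrees in `ks` and compare training and test error. A sketch of such a follow-up (my addition; the original cell is truncated):

```julia
# Train error keeps shrinking as the degree grows, while test error levels off
# (and can eventually grow) once the fit starts chasing noise.
res = [residuals(k) for k in ks]
plot(ks, [first.(res) last.(res)], xlabel="polynomial degree",
     ylabel="residual norm", label=["train" "test"])
```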

## Variance over the realizations

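The cells below use a matrix `Y` whose construction is not shown in this excerpt; the intent is that each column is one independent realization. A minimal sketch of such a setup (the variable names and the number of realizations are my assumptions; here each column holds the fitted values of the degree-7 regression for one noisy realization):

```julia
# Assumed setup: columns of Y are fitted values of the degree-7 regression,
# one column per independent noisy realization of the data.
A = chebyshev_regress_eval(x, x, 7)
Y = hcat([A * runge_noisy(x, 0.5) for _ in 1:100]...)
size(Y)
```
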
function variance(Y)
    """Compute the Variance as defined at the top of this activity"""
    ## BEGIN SOLUTION
    n = size(Y, 2)
    sum(Y.^2, dims=2) / n - (sum(Y, dims=2) / n) .^ 2
    ## END SOLUTION
end

Yvar = variance(Y)
@show size(Yvar)
plot(x, Yvar)

size(Yvar) = (500, 1)

@assert size(variance(Y)) == (size(Y, 1), 1)


## Another take on the Runge phenomenon

The fact that the variance blows up toward the ends of our interval is a property of the approximation space (polynomials). Recall that it doesn’t depend on the basis used for fitting (Chebyshev in this case); that choice only relates to stability. If we could choose an approximation space such that the variance was flat across the interval $$[-1, 1]$$, we would be able to solve interpolation problems on equally spaced grids without numerical artifacts like the Runge phenomenon. Finding spaces of functions that have flat variance and are rich enough to approximate interesting functions is “hard” (math-speak for “has no general solution”). It is possible in special circumstances, such as for periodic functions, in which case the Fourier basis (sine and cosine functions) can be used.

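For illustration (an added sketch, not part of the original notes), a trigonometric analogue of `vander` for periodic functions on $$[-1, 1]$$ might look like the following; least-squares fits in this basis on equally spaced points do not suffer from the Runge phenomenon when the target is periodic:

```julia
# Hypothetical Fourier "Vandermonde" matrix on [-1, 1]: columns are
# 1, sin(πx), cos(πx), sin(2πx), cos(2πx), ...
function vander_fourier(x, n)
    m = length(x)
    F = ones(m, n)
    for j in 2:n
        freq = div(j, 2) * pi   # frequency increases every two columns
        F[:, j] = iseven(j) ? sin.(freq * x) : cos.(freq * x)
    end
    F
end
```
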
In practice, we often use regularization to modify the least-squares objective so that we can reduce variance while still using function spaces rich enough to keep bias low.
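
As a preview (my sketch, not from the original notes), the simplest such modification is ridge (Tikhonov) regularization, which penalizes large coefficients; the function name, the degree 20, and the parameter `lambda` below are hypothetical choices:

```julia
# Ridge-regularized Chebyshev fit: minimize ‖V c − y‖² + λ‖c‖² by solving an
# augmented least-squares problem, then evaluate the fit at the points xx.
function chebyshev_fit_ridge(x, y, xx, n, lambda)
    V = vander_chebyshev(x, n)
    A = [V; sqrt(lambda) * I(n)]   # extra rows penalize the coefficient norm
    b = [y; zeros(n)]              # zero targets for the penalty rows
    c = A \ b                      # regularized least-squares coefficients
    vander_chebyshev(xx, n) * c
end

yfit_ridge = chebyshev_fit_ridge(x, ytrain, x, 20, 0.1)
```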