2022-03-11 Noisy data

2022-03-11 Noisy data¶

Last time¶

Discussion
Compare accuracy and conditioning of splines
Generalization: boundary value problems
Cost for interpolation in higher dimensions

Today¶

Noisy data
Bias-variance tradeoff
Experiments with this notebook (https://github.com/cu-numcomp/spring22/blob/main/slides/2022-03-11-noisy-data.ipynb)

using LinearAlgebra
using Plots
default(linewidth=4, legendfontsize=12)

function vander(x, k=nothing)
    if isnothing(k)
        k = length(x)
    end
    m = length(x)
    V = ones(m, k)
    for j in 2:k
        V[:, j] = V[:, j-1] .* x
    end
    V
end

function vander_chebyshev(x, n=nothing)
    if isnothing(n)
        n = length(x) # Square by default
    end
    m = length(x)
    T = ones(m, n)
    if n > 1
        T[:, 2] = x
    end
    for k in 3:n
        #T[:, k] = x .* T[:, k-1]
        T[:, k] = 2 * x .* T[:,k-1] - T[:, k-2]
    end
    T
end

function interp_nearest(x, s)
    A = zeros(length(s), length(x))
    for (i, t) in enumerate(s)
        loc = nothing
        dist = Inf
        for (j, u) in enumerate(x)
            if abs(t - u) < dist
                loc = j
                dist = abs(t - u)
            end
        end
        A[i, loc] = 1
    end
    A
end

runge(x) = 1 / (1 + 10*x^2)

CosRange(a, b, n) = (a + b)/2 .+ (b - a)/2 * cos.(LinRange(-pi, 0, n))

vcond(mat, points, nmax) = [cond(mat(points(-1, 1, n))) for n in 2:nmax]

vcond (generic function with 1 method)

Lower-degree polynomials to fit noise-free data¶

We can fit \(m\) data points using an \(n < m\) dimensional space of functions. This involves solving a least squares problem for the coefficients \( \min_c \lVert V c - y \rVert \)

function chebyshev_regress_eval(x, xx, n)
    V = vander_chebyshev(x, n)
    @show cond(V)
    vander_chebyshev(xx, n) / V
end
ndata, nbasis = 50, 20
x = LinRange(-1, 1, ndata)
xx = LinRange(-1, 1, 500)
C = chebyshev_regress_eval(x, xx, nbasis)
plot(xx, [runge.(xx), C * runge.(x)])
scatter!(x, runge)

cond(V) = 4.540300267483983

../_images/2022-03-11-noisy-data_3_1.svg

S = svdvals(C)
scatter(S, yscale=:log10)

Noisy data¶

runge_noisy(x, sigma) = runge.(x) + randn(size(x)) * sigma

x = LinRange(-1, 1, 200)
y = runge_noisy(x, 0.5)
C = chebyshev_regress_eval(x, x, 20)
plot(x, [runge.(x), C * y])
scatter!(x, y, markersize=2, size=(1000, 500))

cond(V) = 3.473045048110253

../_images/2022-03-11-noisy-data_6_1.svg

What do you like?
What do you not like?

Probability distributions and simulation¶

To interpret real data, we need a model for noise. We’ve used the most common and convenient choice when creating the data above; the randn function draws from the “standard normal” or “Gaussian” distribution,

\[ p(t) = \frac{1}{\sqrt{2 \pi}} e^{-t^2/2} \]

stdnormal(t) = exp(-t^2/2.) / sqrt(2*pi)
n = 10000
w = randn(n)
histogram(w, bins=40, normalize=:density, xlims=(-4, 4))
plot!(t -> n*stdnormal(t))

../_images/2022-03-11-noisy-data_9_0.svg

Regression with noisy data¶

runge_noisy(x, sigma) = runge.(x) + randn(size(x)) * sigma

x = LinRange(-1, 1, 200)
sigma = 0.5
C = chebyshev_regress_eval(x, x, 20)
plot(x, runge.(x), color=:black)
plot!(x, [C * runge_noisy(x, sigma) for n in 1:20], legend=nothing)

cond(V) = 3.473045048110253

../_images/2022-03-11-noisy-data_11_1.svg

Bias-variance tradeoff¶

The expected error in our approximation \(\hat f(x)\) of noisy data \(y = f(x) + \epsilon\) (with \(\epsilon \sim \mathcal N(0, \sigma)\)), can be decomposed as

\[ E[(\hat f(x) - y)^2] = \sigma^2 + \big(\underbrace{E[\hat f(x)] - f(x)}_{\text{Bias}}\big)^2 + \underbrace{E[\hat f(x)^2] - E[\hat f(x)]^2}_{\text{Variance}} . \]

The \(\sigma^2\) term is irreducible error (purely due to observation noise), but bias and variance can be controlled by model selection. More complex models are more capable of expressing the underlying function \(f(x)\), thus are capable of reducing bias. However, they are also more affected by noise, thereby increasing variance.

Regression using polynomials¶

function chebyshev_regress_eval(x, xx, n)
    V = vander_chebyshev(x, n)
    vander_chebyshev(xx, n) / V
end

runge(x) = 1 / (1 + 10*x^2)
runge_noisy(x, sigma) = runge.(x) + randn(size(x)) * sigma

x = LinRange(-1, 1, 500)
ytrain = runge_noisy(x, 0.25)
yfit = chebyshev_regress_eval(x, x, 7) * ytrain
size(ytrain), size(yfit)

((500,), (500,))

plot(x, runge.(x), label="runge(x)")
plot!(x, yfit, label="yfit")
scatter!(x, ytrain, markersize=2)

../_images/2022-03-11-noisy-data_15_0.svg

ytest = runge_noisy(x, 0.25)
@show norm(yfit - ytrain)
@show norm(yfit - ytest);

norm(yfit - ytrain) = 5.778851168038822
norm(yfit - ytest) = 5.549990380953337

What happens as we increase polynomial degree?¶

ks = 2:4:50
p = plot()
function residuals(k)
    # Fit polynomial of degree k to ytrain.
    yfit = chebyshev_regress_eval(x, x, k) * ytrain
    plot!(x, yfit, label="k=$k")
    [norm(yfit - ytrain) norm(yfit - ytest)]
end

res = vcat([residuals(k) for k in ks]...)
p

../_images/2022-03-11-noisy-data_18_0.svg

@show size(res)

plot(ks, res[:,1], label="train", xlabel="polynomial degree", ylabel="residual")
plot!(ks, res[:,2], label="test")
plot!(ks, _ -> norm(runge.(x)-ytrain), label="perfect train")
plot!(ks, _ -> norm(runge.(x)-ytest), label="perfect test")

size(res) = (13, 2)

../_images/2022-03-11-noisy-data_19_1.svg

Interpretation questions¶

Think about these questions, re-run the notebook, and try to formulate an answer. Please discuss online (Piazza or with a friend).

Is “perfect train” (residual for the noisy sample of the zero-noise function) always greater than (or less than) “perfect test”?
Can you identify when we begin “overfitting” by comparing “train” with “perfect train”? Does it happen at about the same degree each time?
In the real world, we don’t have access to the zero-noise function, thus can’t mark “perfect train”. By looking at just “train” and “test”, can you identify (roughly) when we begin overfitting?

Bias and variance over multiple training sets¶

What have we just done?

We took one noisy sample of a function
Fit polynomials of increasing degree to it
Computed the residual of that fit on
- the training data
- an independent “test” sample

What happens if we repeat this process?¶

Scroll up and re-run above
We’ll do it many times below

Stacking many realizations¶

degree = 7
Y = []
for i in 1:50
    yi = runge_noisy(x, 0.25)
    push!(Y, chebyshev_regress_eval(x, x, degree) * yi)
end

Y = hcat(Y...)
@show size(Y) # (number of points in each fit, number of fits)
plot(x, Y, label=nothing);
plot!(x, runge.(x), color=:black)

size(Y) = (500, 50)

../_images/2022-03-11-noisy-data_23_1.svg

Interpretation¶

Re-run the cell above for different values of degree. (Set it back to a number around 7 to 10 before moving on.)
Low-degree polynomials are not rich enough to capture the peak of the function.
As we increase degree, we are able to resolve the peak better, but see more eratic behavior near the ends of the interval. This erratic behavior is overfitting, which we’ll quantify as variance.
This tradeoff is fundamental: richer function spaces are more capable of approximating the functions we want, but they are more easily distracted by noise.

Mean over all the realizations¶

Ymean = sum(Y, dims=2) / size(Y, 2)
plot(x, Ymean, label="\$ E[\\hat{f}(x)] \$")
plot!(x, runge.(x), label="\$ f(x) \$")

../_images/2022-03-11-noisy-data_26_0.svg

Variance over the realizations¶

function variance(Y)
    """Compute the Variance as defined at the top of this activity"""
    ## BEGIN SOLUTION
    
    ## END SOLUTION
end

Yvar = variance(Y)
@show size(Yvar)
plot(x, Yvar)

MethodError: no method matching size(::String)
Closest candidates are:
  size(::Union{QR, LinearAlgebra.QRCompactWY, QRPivoted}) at /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/qr.jl:567
  size(::Union{QR, LinearAlgebra.QRCompactWY, QRPivoted}, ::Integer) at /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/qr.jl:566
  size(::Union{Cholesky, CholeskyPivoted}) at /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/cholesky.jl:494
  ...

Stacktrace:
 [1] top-level scope
   @ show.jl:1047
 [2] eval
   @ ./boot.jl:373 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1196

@assert size(variance(Y)) == (size(Y, 1), 1)

MethodError: no method matching size(::String)
Closest candidates are:
  size(::Union{QR, LinearAlgebra.QRCompactWY, QRPivoted}) at /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/qr.jl:567
  size(::Union{QR, LinearAlgebra.QRCompactWY, QRPivoted}, ::Integer) at /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/qr.jl:566
  size(::Union{Cholesky, CholeskyPivoted}) at /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/cholesky.jl:494
  ...

Stacktrace:
 [1] top-level scope
   @ In[48]:1
 [2] eval
   @ ./boot.jl:373 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1196

Another take on the Runge phenomenon¶

The fact that variance blows up toward the end of our interval is a property of the approximation space (polynomials). Recall that it doesn’t depend on the basis used for fitting (Chebyshev in this case); that choice only relates to stability. If we could choose an approximation space such that variance was flat across the interval \([-1, 1]\), we would be able to solve interpolation problems on equally spaced grids without numerical artifacts like the Runge phenomenon. Finding spaces of functions have flat variance and are rich enough to approximate interesting functions is “hard” (math speak for has no general solution). It is possible in special circumstances, such as for periodic functions, in which the Fourier basis (sine and cosine functions) can be used.

In practice, we often use regularization to modify the least squares objective such that we can reduce variance while using function spaces rich enough to keep bias low.

2022-03-09 Higher dimensions

2022-03-14 Regression

Numerical Computation