Summary

When $X^TX$ is singular or nearly singular, the OLS closed-form breaks down. Ridge regression fixes this by penalizing large $\theta$, introducing a hyperparameter $\lambda$. Cross-validation provides a principled way to choose $\lambda$ by evaluating on held-out data.


Continuing from Intro and Linear Regression, recall the OLS objective in matrix-vector form:

$$J(\theta) = \frac{1}{n}(X\theta - Y)^T(X\theta - Y)$$

Note

All objective functions must evaluate to a scalar.
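To make this concrete, here is a minimal NumPy sketch (the function name `ols_objective` and the use of NumPy are my own illustrative choices, not from these notes) that evaluates the objective and indeed returns a scalar:

```python
import numpy as np

def ols_objective(theta, X, Y):
    """J(theta) = (1/n) * (X theta - Y)^T (X theta - Y), the mean squared error."""
    n = X.shape[0]
    residual = X @ theta - Y           # shape (n,)
    return (residual @ residual) / n   # inner product of the residual with itself -> scalar
```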

By matrix calculus and optimization, we found

$$\theta = (X^TX)^{-1}X^TY$$

This is only possible when $X^TX$ is invertible.
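As a sketch of this closed form (using `np.linalg.solve` rather than forming the explicit inverse is a standard numerical choice on my part):

```python
import numpy as np

def ols_closed_form(X, Y):
    """Solve (X^T X) theta = X^T Y; only valid when X^T X is invertible."""
    return np.linalg.solve(X.T @ X, X.T @ Y)
```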

The Problem with Closed-Form

When there isn’t a unique minimum, there’s no closed-form solution.

Example

Suppose we have two features $x_1$ and $x_2$, but $x_2 = 2x_1$ for every data point. We’re fitting $y = \theta_1 x_1 + \theta_2 x_2$. Then $\theta = (5, 0)$ and $\theta = (1, 2)$ both give the same predictions, since $5x_1 = 1 \cdot x_1 + 2 \cdot 2x_1$.
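This redundancy is easy to verify numerically (the data below is illustrative, chosen by me):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
X = np.column_stack([x1, 2 * x1])   # second feature is exactly 2 * x1

theta_a = np.array([5.0, 0.0])
theta_b = np.array([1.0, 2.0])

print(X @ theta_a)   # [ 5. 10. 15.]
print(X @ theta_b)   # [ 5. 10. 15.]  -- identical predictions
```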

This issue often occurs when $n < d$ (common in genomics and NLP). Similarly, when features are linearly dependent (e.g. age vs. birth year, which are perfectly linearly related), the same issue occurs. In these cases, our loss function is no longer bowl-shaped: there are infinitely many optimal $\theta$, and the closed-form formula becomes undefined.

Mathematically, when $X$ is not full column rank, its columns are linearly dependent, so $X^TX$ is singular (its determinant is zero and no inverse exists) and $(X^TX)^{-1}$ is undefined.
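Continuing the sketch above, NumPy confirms the rank deficiency and singularity:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
X = np.column_stack([x1, 2 * x1])   # columns are linearly dependent
XtX = X.T @ X

print(np.linalg.matrix_rank(X))   # 1, even though X has d = 2 columns
print(np.linalg.det(XtX))         # zero: XtX is singular
# np.linalg.inv(XtX) would raise LinAlgError ("Singular matrix") here
```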

|                  | $X$ not full column rank | $X$ full column rank    |
|------------------|--------------------------|-------------------------|
| Loss surface     | Flat bottom              | Curves up (bowl-shaped) |
| Closed-form      | Not well-defined         | Well-defined            |
| Optimal $\theta$ | Infinitely many          | Unique                  |

Almost-Singularity

Sometimes we have situations where $X^TX$ is almost singular: like $1/x$, which asymptotically approaches 0 but never reaches it, the determinant gets arbitrarily close to zero without actually being zero. Technically, the formula is still well-defined and gives a unique hyperplane, but if minor perturbations are made to the data, the plane changes significantly. $\theta$ also tends to have huge magnitude, and lots of other $\theta$s fit the training data almost equally well.
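A small synthetic illustration of this instability (the data, noise scales, and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = 2 * x1 + 1e-6 * rng.normal(size=n)      # almost, but not exactly, 2 * x1
X = np.column_stack([x1, x2])
Y = 5 * x1 + 0.01 * rng.normal(size=n)       # "true" model uses only x1

theta = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta)            # large, offsetting coefficients instead of roughly (5, 0)

Y_perturbed = Y + 0.01 * rng.normal(size=n)  # a tiny change to the targets...
theta_perturbed = np.linalg.solve(X.T @ X, X.T @ Y_perturbed)
print(theta_perturbed)  # ...gives a very different theta
```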

Regularization & Ridge Regression

Our goal is to mitigate this issue of almost-singularity by penalizing large $\theta$ in our objective, a.k.a. regularization. Remember that a large-magnitude $\theta$ leads to easily perturbed hyperplanes, which is not good.

The ridge objective is:

$$J_{\text{ridge}}(\theta) = \frac{1}{n}(X\theta - Y)^T(X\theta - Y) + \lambda \|\theta\|^2$$

The final term is our penalty, where $\lambda > 0$ controls how much we penalize the magnitude of $\theta$ relative to the mean squared error.
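A minimal sketch of this objective (the function and argument names are mine; `lam` stands in for $\lambda$):

```python
import numpy as np

def ridge_objective(theta, X, Y, lam):
    """(1/n) * ||X theta - Y||^2 + lam * ||theta||^2"""
    n = X.shape[0]
    residual = X @ theta - Y
    return (residual @ residual) / n + lam * (theta @ theta)
```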

Solving gives:

$$\theta_{\text{ridge}} = (X^TX + n\lambda I)^{-1}X^TY$$

For $\lambda > 0$, $X^TX + n\lambda I$ is always invertible, so $\theta_{\text{ridge}}$ always exists and is unique.
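A sketch of this closed form, which now succeeds even on the exactly-collinear $X$ from the earlier example:

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """theta_ridge = (X^T X + n * lam * I)^{-1} X^T Y, well-defined for any lam > 0."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

# The collinear example that broke OLS now has a unique, finite solution.
x1 = np.array([1.0, 2.0, 3.0])
X = np.column_stack([x1, 2 * x1])
Y = 5 * x1
print(ridge_closed_form(X, Y, lam=0.1))
```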

Note

$\lambda$ is a hyperparameter: something used as an input to the regression but not learned from the data. The engineer sets it.

How do we set this $\lambda$?

Cross-Validation

We can’t use training error to pick $\lambda$: minimizing training error would just set $\lambda = 0$ (no regularization, best fit to the training data). And we don’t have access to test data.

The idea: hold out part of the training data as validation data $\mathcal{D}_{\text{val}}$, train on the rest, and evaluate on the held-out portion. For each candidate $\lambda$, compute the validation error $\varepsilon_{\text{val}}(\lambda)$, then choose the $\lambda$ that minimizes it.
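A sketch of this single-split procedure (the 80/20 split, the grid of candidate values, and all function names are illustrative choices of mine):

```python
import numpy as np

def ridge_fit(X, Y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

def mse(theta, X, Y):
    residual = X @ theta - Y
    return (residual @ residual) / X.shape[0]

def pick_lambda_holdout(X, Y, lambdas, val_fraction=0.2, seed=0):
    """Hold out a validation split, train on the rest, return the lambda with lowest validation error."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    n_val = int(val_fraction * n)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    errors = [mse(ridge_fit(X[train_idx], Y[train_idx], lam), X[val_idx], Y[val_idx])
              for lam in lambdas]
    return lambdas[int(np.argmin(errors))]
```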

K-Fold Cross-Validation

A single held-out set could be unrepresentative. K-fold cross-validation addresses this:

  1. Split the training data into $K$ equal folds
  2. For each fold $k = 1, \ldots, K$: train on all folds except $k$, and compute the validation error $\varepsilon_k$ on fold $k$
  3. Average across folds:
$$\varepsilon_{\text{val}}(\lambda) = \frac{\varepsilon_1 + \cdots + \varepsilon_K}{K}$$
  4. Choose the $\lambda$ that minimizes $\varepsilon_{\text{val}}(\lambda)$ (see the sketch below)

Why average?

Averaging over $K$ folds gives a more reliable estimate of how well a given $\lambda$ generalizes than any single, possibly unrepresentative, held-out split.
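Here is a sketch of K-fold cross-validation for ridge (again, the helper names, the $\lambda$ grid, and the synthetic data at the bottom are my own illustrative choices):

```python
import numpy as np

def ridge_fit(X, Y, lam):                  # same helper as in the previous sketch
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

def mse(theta, X, Y):
    residual = X @ theta - Y
    return (residual @ residual) / X.shape[0]

def kfold_cv_error(X, Y, lam, K=5, seed=0):
    """Average validation MSE of ridge with this lambda over K folds."""
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        theta = ridge_fit(X[train_idx], Y[train_idx], lam)
        errors.append(mse(theta, X[val_idx], Y[val_idx]))
    return float(np.mean(errors))

# Example: pick lambda on a grid, then fit with the chosen value.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

lambdas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
best_lam = min(lambdas, key=lambda lam: kfold_cv_error(X, Y, lam))
theta_final = ridge_fit(X, Y, best_lam)    # train using the chosen lambda
print(best_lam, theta_final)
```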

Then, use that $\lambda$ in the aforementioned closed-form calculation of $\theta_{\text{ridge}}$.