Summary
When $X^\top X$ is singular or nearly singular, the OLS closed-form solution breaks down. Ridge regression fixes this by penalizing large $\|\theta\|$, introducing a hyperparameter $\lambda$. Cross-validation provides a principled way to choose $\lambda$ by evaluating on held-out data.
Continuing from Intro and Linear Regression, recall the OLS objective in matrix-vector form:

$$L(\theta) = \|X\theta - y\|_2^2$$
Note
All objective functions must evaluate to a scalar.
By matrix calculus and optimization, we found

$$\hat\theta = (X^\top X)^{-1} X^\top y$$
This is only possible when $X^\top X$ is invertible.
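To make this concrete, here's a minimal NumPy sketch of the closed-form solution. The data is made up for illustration, and the normal equations are solved with `np.linalg.solve` rather than explicitly inverting $X^\top X$:

```python
import numpy as np

# Toy data: n = 5 points, d = 2 features (placeholder values).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.0, 2.5, 4.5, 7.0, 7.5])

# Closed-form OLS: solve the normal equations (X^T X) theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)
```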
The Problem with Closed-Form
When the loss doesn't have a unique minimum, the closed-form solution breaks down.
Example
Suppose we have two features $x_1$ and $x_2$, but $x_1 = x_2$ for every data point. We're fitting $\hat{y} = \theta_1 x_1 + \theta_2 x_2$. Then $\theta = (1, 0)$ and $\theta = (0, 1)$ both give the same predictions, since $1 \cdot x_1 + 0 \cdot x_2 = 0 \cdot x_1 + 1 \cdot x_2 = x_1$.
When $d > n$ (more features than data points, as often occurs in genomics and NLP), this issue can occur. Similarly, when features are linearly dependent (e.g. age vs. birth year, which fall on a perfect line), the same issue can occur. In these cases, our loss function is no longer strictly bowl-shaped: there are infinitely many optimal $\theta$, and the closed-form formula becomes undefined.
Mathematically, the columns of $X$ are linearly dependent ($X$ is not full column rank), so $X^\top X$ is singular (determinant is zero, no inverse exists) and $(X^\top X)^{-1} X^\top y$ is undefined.
| | $X$ not full column rank | $X$ full column rank |
|---|---|---|
| Loss surface | Flat bottom | Curves up (bowl-shaped) |
| Closed-form $(X^\top X)^{-1} X^\top y$ | Not well-defined | Well-defined |
| Optimal $\theta$ | Infinitely many | Unique |
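A quick sketch of the duplicated-feature example above, assuming NumPy and made-up numbers: when one column of $X$ repeats another, $X^\top X$ is singular and the normal equations have no unique solution.

```python
import numpy as np

# Two "features" that are identical for every data point (x1 == x2).
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, x1])           # columns are linearly dependent
y = np.array([2.0, 4.0, 6.0, 8.0])

print(np.linalg.matrix_rank(X))         # 1, not 2: X is not full column rank
print(np.linalg.det(X.T @ X))           # 0: X^T X is singular

# The closed form is undefined: solving the normal equations fails.
try:
    np.linalg.solve(X.T @ X, X.T @ y)
except np.linalg.LinAlgError as err:
    print("no unique solution:", err)
```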
Almost-Singularity
Sometimes we have situations where $X^\top X$ is almost singular. For instance, $\det(X^\top X)$ asymptotically approaches 0 but never reaches it. Technically, the closed-form formula is well-defined and gives a unique hyperplane, but if minor perturbations are made to the data, the plane changes significantly. The resulting $\hat\theta$ also tends to have huge magnitude, and lots of other $\theta$'s fit the training data almost equally well.
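The following sketch illustrates this with synthetic, nearly collinear features (not data from these notes): the closed form still runs, but the coefficients blow up and shift noticeably under a tiny perturbation of $y$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)     # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)       # underlying relationship uses only x1

# X^T X is technically invertible, but very badly conditioned.
print(np.linalg.cond(X.T @ X))          # huge condition number (roughly 1e12 here)

theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                            # coefficients typically have huge magnitude

# A tiny perturbation of the targets moves the fitted hyperplane by far more
# than the scale of the underlying coefficients (about 1 and 0).
y_perturbed = y + 0.01 * rng.normal(size=n)
theta_perturbed = np.linalg.solve(X.T @ X, X.T @ y_perturbed)
print(theta_perturbed - theta)
```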
Regularization & Ridge Regression
Our goal is to mitigate this issue of almost-singularity by penalizing large $\|\theta\|$ in our objective, a.k.a. regularization. Remember that a large-magnitude $\theta$ leads to easily-perturbed hyperplanes, which is not good.
The ridge objective is:

$$L_{\text{ridge}}(\theta) = \|X\theta - y\|_2^2 + \lambda \|\theta\|_2^2$$
The final term, $\lambda \|\theta\|_2^2$, is our penalty, where $\lambda \ge 0$ controls how much we penalize the magnitude of $\theta$ relative to the squared-error term.
Solving $\nabla_\theta L_{\text{ridge}}(\theta) = 0$ gives:

$$\hat\theta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$
For $\lambda > 0$, $X^\top X + \lambda I$ is always invertible, so $\hat\theta_{\text{ridge}}$ always exists uniquely.
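A minimal sketch of the ridge closed form in NumPy; the helper name `ridge_fit` and the toy data are made up for illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Even with duplicated columns (where OLS had no unique solution),
# ridge returns a unique, finite theta for any lam > 0.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, x1])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(ridge_fit(X, y, lam=0.1))
```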
Note
$\lambda$ is a hyperparameter: something used as an input to the regression but not learned from the data. The engineer sets this.
How do we set this $\lambda$?
Cross-Validation
We can't use training error to pick $\lambda$; it would just set $\lambda = 0$ (no regularization, best fit to the training data). And we don't have access to test data.
The idea: hold out part of the training data as validation data $D_{\text{val}}$, train on the rest, and evaluate on the held-out portion. For each candidate $\lambda$, compute the validation error $E_{\text{val}}(\lambda)$, then choose the $\lambda$ that minimizes it.
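Here's one way this might look in code: a sketch with synthetic data, an arbitrary 80/20 split, and a hypothetical candidate grid for $\lambda$.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge: (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, theta):
    return np.mean((X @ theta - y) ** 2)

# Synthetic training data (placeholder).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)

# Hold out the last 20 points as a validation set; train on the first 80.
X_tr, y_tr = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

# Evaluate each candidate lambda on the held-out data and pick the best.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = {lam: mse(X_val, y_val, ridge_fit(X_tr, y_tr, lam)) for lam in candidates}
best_lam = min(val_errors, key=val_errors.get)
print(best_lam, val_errors[best_lam])
```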
K-Fold Cross-Validation
A single held-out set could be unrepresentative. K-fold cross-validation addresses this:
- Split training data into $K$ equal folds
- For each fold $k = 1, \dots, K$: train on all folds except $k$, compute validation error $E_k(\lambda)$ on fold $k$
- Average across folds: $E_{\text{CV}}(\lambda) = \frac{1}{K} \sum_{k=1}^{K} E_k(\lambda)$
- Choose the $\lambda$ that minimizes $E_{\text{CV}}(\lambda)$
Why average?
A single held-out set could be unrepresentative. Averaging over $K$ folds gives a more reliable estimate of how well each candidate $\lambda$ generalizes.
Then, use that $\lambda$ in the aforementioned calculation of $\hat\theta_{\text{ridge}}$.
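Putting the pieces together, here is a from-scratch sketch of K-fold cross-validation for $\lambda$, using synthetic data, an assumed grid of candidates, and a final refit on the full training set (a library implementation such as scikit-learn's `KFold` would work equally well):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge: (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv_error(X, y, lam, K=5):
    """Average validation MSE of ridge with this lambda across K folds."""
    folds = np.array_split(np.arange(len(y)), K)
    errors = []
    for k in range(K):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        theta = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((X[val_idx] @ theta - y[val_idx]) ** 2))
    return np.mean(errors)

# Synthetic training data (placeholder).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)

# Choose the lambda with the lowest average validation error ...
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: kfold_cv_error(X, y, lam))

# ... then plug it back into the ridge closed form on the full training set.
theta_final = ridge_fit(X, y, best_lam)
print(best_lam, theta_final)
```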