Summary

When $X^TX$ is singular or nearly singular, the OLS closed-form breaks down. Ridge regression fixes this by penalizing large $\theta$, introducing a hyperparameter $\lambda$. Cross-validation provides a principled way to choose $\lambda$ by evaluating on held-out data.


Continuing from Intro and Linear Regression, recall the OLS objective in matrix-vector form:

$$J(\theta) = \frac{1}{n}(X\theta - Y)^T(X\theta - Y)$$

Note

All objective functions must evaluate to a scalar.
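To make this concrete, here is a minimal NumPy sketch (the function name `ols_objective` and the use of NumPy are my own illustrative choices, not from these notes) that evaluates the objective and indeed returns a scalar:

```python
import numpy as np

def ols_objective(theta, X, Y):
    """J(theta) = (1/n) * (X theta - Y)^T (X theta - Y), the mean squared error."""
    n = X.shape[0]
    residual = X @ theta - Y           # shape (n,)
    return (residual @ residual) / n   # inner product of the residual with itself -> scalar
```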

By matrix calculus and optimization, we found

$$\theta = (X^TX)^{-1}X^TY$$

This is only possible when $X^TX$ is invertible.
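As a sketch of this closed form (using `np.linalg.solve` rather than forming the explicit inverse is a standard numerical choice on my part):

```python
import numpy as np

def ols_closed_form(X, Y):
    """Solve (X^T X) theta = X^T Y; only valid when X^T X is invertible."""
    return np.linalg.solve(X.T @ X, X.T @ Y)
```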

The Problem with Closed-Form

When there isn’t a unique minimum, there’s no closed-form solution.

Example

Suppose we have two features $x_1$ and $x_2$, but $x_2 = 2x_1$ for every data point. We’re fitting $y = \theta_1 x_1 + \theta_2 x_2$. Then $\theta = (5, 0)$ and $\theta = (1, 2)$ both give the same predictions, since $5x_1 = 1 \cdot x_1 + 2 \cdot 2x_1$.
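This redundancy is easy to verify numerically (the data below is illustrative, chosen by me):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
X = np.column_stack([x1, 2 * x1])   # second feature is exactly 2 * x1

theta_a = np.array([5.0, 0.0])
theta_b = np.array([1.0, 2.0])

print(X @ theta_a)   # [ 5. 10. 15.]
print(X @ theta_b)   # [ 5. 10. 15.]  -- identical predictions
```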

This issue often occurs when $n < d$ (common in genomics and NLP). Similarly, when features are linearly dependent (e.g. age vs. birth year, which are perfectly linearly related), the same issue occurs. In these cases, our loss function is no longer bowl-shaped: there are infinitely many optimal $\theta$, and the closed-form formula becomes undefined.

Mathematically, when $X$ is not full column rank, its columns are linearly dependent, so $X^TX$ is singular (its determinant is zero and no inverse exists) and $(X^TX)^{-1}$ is undefined.
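Continuing the sketch above, NumPy confirms the rank deficiency and singularity:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
X = np.column_stack([x1, 2 * x1])   # columns are linearly dependent
XtX = X.T @ X

print(np.linalg.matrix_rank(X))   # 1, even though X has d = 2 columns
print(np.linalg.det(XtX))         # zero: XtX is singular
# np.linalg.inv(XtX) would raise LinAlgError ("Singular matrix") here
```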

|                  | $X$ not full column rank | $X$ full column rank    |
|------------------|--------------------------|-------------------------|
| Loss surface     | Flat bottom              | Curves up (bowl-shaped) |
| Closed-form      | Not well-defined         | Well-defined            |
| Optimal $\theta$ | Infinitely many          | Unique                  |

Almost-Singularity

Sometimes we have situations where $X^TX$ is almost singular: like $1/x$, which asymptotically approaches 0 but never reaches it, the determinant gets arbitrarily close to zero without actually being zero. Technically, the formula is still well-defined and gives a unique hyperplane, but if minor perturbations are made to the data, the plane changes significantly. $\theta$ also tends to have huge magnitude, and lots of other $\theta$s fit the training data almost equally well.
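A small synthetic illustration of this instability (the data, noise scales, and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = 2 * x1 + 1e-6 * rng.normal(size=n)      # almost, but not exactly, 2 * x1
X = np.column_stack([x1, x2])
Y = 5 * x1 + 0.01 * rng.normal(size=n)       # "true" model uses only x1

theta = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta)            # large, offsetting coefficients instead of roughly (5, 0)

Y_perturbed = Y + 0.01 * rng.normal(size=n)  # a tiny change to the targets...
theta_perturbed = np.linalg.solve(X.T @ X, X.T @ Y_perturbed)
print(theta_perturbed)  # ...gives a very different theta
```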

Regularization & Ridge Regression

Our goal is to mitigate this issue of almost-singularity by penalizing large $\theta$ in our objective, a.k.a. regularization. Remember that a large-magnitude $\theta$ leads to easily perturbed hyperplanes, which is not good.

The ridge objective is:

$$J_{\text{ridge}}(\theta) = \frac{1}{n}(X\theta - Y)^T(X\theta - Y) + \lambda \|\theta\|^2$$

The final term is our penalty, where $\lambda > 0$ controls how much we penalize the magnitude of $\theta$ relative to the mean squared error.
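A minimal sketch of this objective (the function and argument names are mine; `lam` stands in for $\lambda$):

```python
import numpy as np

def ridge_objective(theta, X, Y, lam):
    """(1/n) * ||X theta - Y||^2 + lam * ||theta||^2"""
    n = X.shape[0]
    residual = X @ theta - Y
    return (residual @ residual) / n + lam * (theta @ theta)
```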

Solving gives:

$$\theta_{\text{ridge}} = (X^TX + n\lambda I)^{-1}X^TY$$

For $\lambda > 0$, $X^TX + n\lambda I$ is always invertible, so $\theta_{\text{ridge}}$ always exists and is unique.
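A sketch of this closed form, which now succeeds even on the exactly-collinear $X$ from the earlier example:

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """theta_ridge = (X^T X + n * lam * I)^{-1} X^T Y, well-defined for any lam > 0."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

# The collinear example that broke OLS now has a unique, finite solution.
x1 = np.array([1.0, 2.0, 3.0])
X = np.column_stack([x1, 2 * x1])
Y = 5 * x1
print(ridge_closed_form(X, Y, lam=0.1))
```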

Note

$\lambda$ is a hyperparameter: something used as an input to the regression but not learned from the data. The engineer sets it.

How do we set this $\lambda$?

Cross-Validation

We can’t use training error to pick $\lambda$: minimizing training error would just set $\lambda = 0$ (no regularization, best fit to the training data). And we don’t have access to test data.

The idea: hold out part of the training data as validation data $\mathcal{D}_{\text{val}}$, train on the rest, and evaluate on the held-out portion. For each candidate $\lambda$, compute the validation error $\varepsilon_{\text{val}}(\lambda)$, then choose the $\lambda$ that minimizes it.
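A sketch of this single-split procedure (the 80/20 split, the grid of candidate values, and all function names are illustrative choices of mine):

```python
import numpy as np

def ridge_fit(X, Y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

def mse(theta, X, Y):
    residual = X @ theta - Y
    return (residual @ residual) / X.shape[0]

def pick_lambda_holdout(X, Y, lambdas, val_fraction=0.2, seed=0):
    """Hold out a validation split, train on the rest, return the lambda with lowest validation error."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    n_val = int(val_fraction * n)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    errors = [mse(ridge_fit(X[train_idx], Y[train_idx], lam), X[val_idx], Y[val_idx])
              for lam in lambdas]
    return lambdas[int(np.argmin(errors))]
```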

K-Fold Cross-Validation

A single held-out set could be unrepresentative. K-fold cross-validation addresses this:

  1. Split the training data into $K$ equal folds
  2. For each fold $k = 1, \ldots, K$: train on all folds except $k$, and compute the validation error $\varepsilon_k$ on fold $k$
  3. Average across folds:
$$\varepsilon_{\text{val}}(\lambda) = \frac{\varepsilon_1 + \cdots + \varepsilon_K}{K}$$
  4. Choose the $\lambda$ that minimizes $\varepsilon_{\text{val}}(\lambda)$ (see the sketch below)

Why average?

Averaging over $K$ folds gives a more reliable estimate of how well a given $\lambda$ generalizes than any single, possibly unrepresentative, held-out split.
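Here is a sketch of K-fold cross-validation for ridge (again, the helper names, the $\lambda$ grid, and the synthetic data at the bottom are my own illustrative choices):

```python
import numpy as np

def ridge_fit(X, Y, lam):                  # same helper as in the previous sketch
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

def mse(theta, X, Y):
    residual = X @ theta - Y
    return (residual @ residual) / X.shape[0]

def kfold_cv_error(X, Y, lam, K=5, seed=0):
    """Average validation MSE of ridge with this lambda over K folds."""
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        theta = ridge_fit(X[train_idx], Y[train_idx], lam)
        errors.append(mse(theta, X[val_idx], Y[val_idx]))
    return float(np.mean(errors))

# Example: pick lambda on a grid, then fit with the chosen value.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

lambdas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
best_lam = min(lambdas, key=lambda lam: kfold_cv_error(X, Y, lam))
theta_final = ridge_fit(X, Y, best_lam)    # train using the chosen lambda
print(best_lam, theta_final)
```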

Then, use that $\lambda$ in the aforementioned closed-form calculation of $\theta_{\text{ridge}}$.