Summary

We framed machine learning as making decisions/predictions from data and defined the three major problem classes. We then focused on supervised learning and regression, particularly ordinary least squares (OLS) regression. We derived the closed-form OLS solution, then extended it with offsets, centering, and ridge regression (regularization). Finally, we covered how to evaluate learning algorithms via cross-validation and hyperparameter tuning.

See also Linear Regression Lab (practice problems + evaluation).


AI has permeated the world for much longer than LLMs have been around. In medicine and science, we have been using it to diagnose diseases and make discoveries. In robotics, vision-language-action models learn from demonstrations and practice rather than from deterministic programs.

At its core, machine learning is about making decisions or predictions based on data. This is distinct from statistics (which focuses on model fitting and inference) or economics/psychology (which seek causal understanding); ML prioritizes prediction quality. But human engineering is still critical: framing problems, acquiring data, choosing hypothesis spaces, and understanding deployment impacts.

Traditionally, we have 3 buckets:

  1. Supervised learning - learns from labeled data (input-output pairs); predicts outputs for new inputs (e.g., classification, regression)
  2. Unsupervised learning - finds patterns in unlabeled data; discovers hidden structure (e.g., clustering, dimensionality reduction)
  3. Reinforcement learning - based on rewards, adjusts along the way; agent learns through trial and error in an environment (e.g., PPO in MIT Pokerbots 2026)

Intersections:

  • Supervised + Unsupervised: Self-supervised / contrastive learning (DALL-E)
  • Unsupervised + Reinforcement: Unsupervised skill discovery
  • Supervised + Reinforcement: Imitation learning
  • All three: World models

Other settings that blur the lines:

  • Semi-supervised learning - supervised training set supplemented with additional unlabeled data from the same distribution
  • Active learning - assumes labeling is expensive; algorithm strategically picks which inputs to request labels for
  • Transfer learning / meta-learning - experience from prior related tasks reduces data requirements for new ones

Topics in order:

  • Intro to ML
  • Regression & Regularization
  • Gradient Descent
  • Linear Classification
  • Features, Neural Networks
  • Neural Networks II (Backpropagation)
  • Convolutional Neural Networks
  • Representation Learning, Transformers
  • Markov Decision Processes
  • Reinforcement Learning
  • Non-parametric Models

The first four weeks cover the linear material, then five weeks on deep neural networks, two on reinforcement learning, and one final week to touch on the remaining pieces.

### Supervised Learning Terminology

**Supervised learning**: Given inputs and correct outputs, learn a function that maps inputs to outputs.

Regression predicts a continuous scalar value, for instance a city's average energy usage. Classification is the other main supervised problem; it predicts a discrete label rather than a continuous value. Classification can be binary (two classes, e.g. spam vs. not spam) or multi-class (e.g. classifying handwritten digits 0–9).

To perform regression (or classification), we first collect training data:

$$
D_"train" = {(x^((1)), y^((1))), ..., (x^((n)), y^((n)))}
$$

Example: Predicting Energy Usage from Temperature

For Chicago (data point 1):

$$
x^((1)) = vec(35) in RR^1, quad y^((1)) = 920
$$

With multiple features (e.g., temperature and humidity):

$$
x^((1)) = vec(35, 65) in RR^2, quad y^((1)) = 920
$$

The full training set with 4 cities:

$$
D_"train" = {(x^((1)), y^((1))), (x^((2)), y^((2))), (x^((3)), y^((3))), (x^((4)), y^((4)))}
$$

Each $x^((i))$ is called the feature vector and each $y^((i))$ the label (a scalar). We collect $n$ data points, each with a $d$-dimensional feature vector and a scalar label. We assume data points are i.i.d. (independent and identically distributed): they all come from the same underlying probability distribution but are otherwise unrelated. Test data is assumed to come from this same distribution.

We represent data in matrix-vector form when we have multiple data points:

$$
X = mat((x^((1)))^top; (x^((2)))^top; dots.v; (x^((n)))^top) in RR^(n times d), quad y = vec(y^((1)), y^((2)), dots.v, y^((n))) in RR^n
$$

The "width" of the matrix is the number of input features (temperature, humidity, etc.): $X$ is $n times d$, i.e. $n$ data points with $d$ features each.
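
As a quick illustration (not from the original notes), here is one way to hold $X$ and $y$ as NumPy arrays; the numbers are the temperature/humidity/energy values used in the worked example later in the section.

```python
import numpy as np

# One row of X per data point, one column per feature (temperature, humidity).
X = np.array([
    [35.0, 70.0],
    [42.0, 65.0],
    [48.0, 60.0],
    [72.0, 50.0],
])                                            # shape (n, d) = (4, 2)
y = np.array([920.0, 850.0, 780.0, 480.0])    # shape (n,) = (4,)

n, d = X.shape
print(n, d)   # 4 data points, 2 features each
```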

In practice, raw inputs (images, songs, people) aren't naturally numerical. A feature representation function $phi$ converts raw inputs into numerical feature vectors: $phi: "raw input" arrow.r RR^d$. This data preprocessing step can significantly affect model performance.
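
A small, hypothetical sketch of such a $phi$: the record field names (`temperature_f`, `humidity_pct`) are made up for illustration, but the idea is simply to turn a raw input into a vector in $RR^d$.

```python
import numpy as np

def phi(raw_city_record: dict) -> np.ndarray:
    """Hypothetical feature map: raw city record -> numeric feature vector in R^2."""
    return np.array([
        float(raw_city_record["temperature_f"]),
        float(raw_city_record["humidity_pct"]),
    ])

x = phi({"name": "Chicago", "temperature_f": 35, "humidity_pct": 70})
print(x)   # [35. 70.]
```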

We put our training data into a regression algorithm to produce a **hypothesis** $h$, which is a function. This function takes new feature vectors as input and can be used to predict their labels.

We want our regression algorithm to be as accurate as possible, i.e. a *good* hypothesis. 

![[attachments/temp-energy-bad-hypothesis.svg]]

A bad hypothesis like the one above is valid but poorly represents the relationship between the variables. The dashed red lines show the **error** between each prediction and actual value; we want to minimize this error. 
- **Loss**: $cal(L)(h(x^((i))), y^((i)))$ — measures how unhappy we are when prediction $h(x^((i)))$ differs from truth $y^((i))$
- **Training Error**: $epsilon_("train")(h) = 1/n sum_(i=1)^(n) cal(L)(h(x^((i))), y^((i)))$
- **Test Error**: $epsilon_("test")(h) = 1/n' sum_(i=n+1)^(n+n') cal(L)(h(x^((i))), y^((i)))$
	- This is most important. We want to minimize this.

Common loss functions (a short code sketch follows this list):
- Squared loss: $cal(L)(g, a) = (g - a)^2$ - standard for regression, penalizes large errors heavily
- Absolute loss: $cal(L)(g, a) = |g - a|$ - more robust to outliers than squared
- 0-1 loss: $cal(L)(g, a) = cases(0 &"if" g = a, 1 &"otherwise")$ - used in classification, counts misclassifications
- Asymmetric loss - when different mistakes carry different costs (e.g. failing to predict a heart attack is far worse than a false alarm)
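
A minimal sketch of these losses and the averaged training error, assuming NumPy and a hypothesis `h` given as a plain Python function:

```python
import numpy as np

def squared_loss(g, a):
    return (g - a) ** 2            # penalizes large errors heavily

def absolute_loss(g, a):
    return abs(g - a)              # more robust to outliers

def zero_one_loss(g, a):
    return 0.0 if g == a else 1.0  # counts misclassifications

def training_error(h, X, y, loss):
    """epsilon_train(h): average loss of h over the n training points."""
    return np.mean([loss(h(x_i), y_i) for x_i, y_i in zip(X, y)])
```
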
A **hypothesis class** $cal(H)$ is the set of $h$ we ask the algorithm to search over.

> [!example] Example
> 
> - $cal(H) = {"constant functions"}$
> - Constant functions (less expressive) $subset$ linear functions (more expressive)

### Parametric vs Non-Parametric Models
- Parametric models learn a fixed set of parameters $Theta$ from training data, then use those to make predictions. Once trained, the original data isn't needed. Linear regression is parametric; we learn $theta$ and predict with $h(x; theta)$.
- Non-parametric models make predictions directly from the training data without learning fixed parameters. Example: *nearest neighbor* methods average answers from the most similar training examples.

This course mostly focuses on parametric models until the non-parametric section near the end.
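
A minimal sketch of the non-parametric idea, assuming a k-nearest-neighbor regressor with Euclidean distance (the specific method is illustrative; the point is that prediction reads the stored training data directly, with no learned parameters):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Average the labels of the k training points closest to x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to each training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    return y_train[nearest].mean()
```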

### Ordinary Least Squares Regression
The historical problem (c. 1800) was to infer orbit parameters from partial measurements (Gauss predicting Ceres' orbit). The idea is to pick parameters that make predictions match observations as closely as possible by minimizing the sum of squared *residuals*. Even today, this is still a fast and reliable baseline.

Despite all the deep learning hype, OLS remains widely used in finance and social sciences because of **interpretability**—the parameters have meaningful interpretations as coefficients/factors explaining how much each feature influences the output. When making decisions that affect people's lives, understanding *why* a model makes a prediction matters. 

![[attachments/piazzi-ceres-measurements.png|400]]
*Piazzi's observations of Ceres (1801)*

$$
h(x; theta) = theta^top x = mat(theta_1, theta_2, ..., theta_d) vec(x_1, x_2, dots.v, x_d)
$$

**Squared Loss**:

$$
cal(L)(h(x^((i))), y^((i))) = (h(x^((i))) - y^((i)))^2
$$


We derive the OLS solution in four steps:
1) Write the training error $J(theta)$ as a scalar
2) Rearrange into matrix-vector form: $J(theta) = 1/n (X theta - y)^top (X theta - y)$
	1) $J(theta)$ usually looks like a bowl
3) Set the gradient $gradient_theta J$ to zero
	1) Expand: $J(theta) = 1/n (theta^top X^top X theta - 2 theta^top X^top y + y^top y)$
	2) Differentiate using matrix calculus rules:
		- $d/(d theta) (theta^top A theta) = 2 A theta$ (A symmetric) $arrow.r d/(d theta) (theta^top X^top X theta) = 2 X^top X theta$
		- $d/(d theta) (theta^top b) = b$ $arrow.r d/(d theta) (2 theta^top X^top y) = 2 X^top y$
		- $d/(d theta) (y^top y) = 0$
	3) Result: $gradient_theta J = 1/n (2 X^top X theta - 2 X^top y) = 2/n X^top (X theta - y)$
	4) Set to zero: $X^top (X theta - y) = 0$
4) Solve for the optimal parameters $theta^*$

> [!example] Example: OLS with City Energy Data (temperature, humidity)
> 
> 
> **Step 1:** Write training error as scalar sum
> $$
> J(theta) = 1/4 sum_(i=1)^4 (theta_1 x_1^((i)) + theta_2 x_2^((i)) - y^((i)))^2
> $$
> $$
> = 1/4 [(35 theta_1 + 70 theta_2 - 920)^2 + (42 theta_1 + 65 theta_2 - 850)^2 + (48 theta_1 + 60 theta_2 - 780)^2 + (72 theta_1 + 50 theta_2 - 480)^2]
> $$
> 
> Two different "linear" things here:
> - Hypothesis is linear in $x$: $h(x) = theta_1 x_1 + theta_2 x_2$ (no $x^2$ terms — gives a flat plane)
> - $J$ is quadratic in $theta$: expanding $(...)^2$ gives $theta_1^2, theta_2^2, theta_1 theta_2$ terms (bowl-shaped — has a single minimum)
> 
> **Step 2:** Matrix-vector form
> $$
> X = mat(35, 70; 42, 65; 48, 60; 72, 50), quad y = vec(920, 850, 780, 480), quad theta = vec(theta_1, theta_2)
> $$
> $$
> J(theta) = 1/4 (X theta - y)^top (X theta - y)
> $$
> 
> **Step 3:** Set gradient to zero
> $$
> gradient_theta J = 2/n X^top (X theta - y) = 0
> $$
> 
> **Step 4:** Solve for $theta^*$
> $$
> X^top X theta = X^top y
> $$
> $$
> theta^* = (X^top X)^(-1) X^top y
> $$
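
A minimal NumPy sketch of Step 4 on the example data above; it solves the normal equations $X^top X theta = X^top y$ directly rather than forming the inverse explicitly.

```python
import numpy as np

X = np.array([[35.0, 70.0],
              [42.0, 65.0],
              [48.0, 60.0],
              [72.0, 50.0]])
y = np.array([920.0, 850.0, 780.0, 480.0])

# Solve X^T X theta = X^T y (numerically preferable to computing the inverse).
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)        # optimal [theta_1, theta_2]
print(X @ theta_star)    # predictions X theta*, to compare against y
```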

> [!info] Why $1/n$?
> 
> We average the loss (divide by $n$) so we're not penalized for having more data points. Without normalization, a dataset with 1000 points would have a much larger error than one with 3 points.

When $theta^*$ is well-defined, it is the unique minimizer of $J(theta)$. This is a closed-form solution and doesn't feel like "training"; it is a very rare case in ML where we get such a clean solution with a nice theoretical guarantee.

> [!info] Pseudo-inverse interpretation
> 
> The matrix $(X^top X)^(-1) X^top$ is called the pseudo-inverse $X^+$. It "pseudo-solves" $X theta = y$ (which is generally overdetermined, i.e. no exact solution). Geometrically, $X (X^top X)^(-1) X^top$ projects $y$ onto the column space of $X$, so OLS finds the $theta$ whose prediction $X theta$ is the closest point to $y$ in that column space.
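
A quick numerical check of this interpretation, assuming NumPy and the example data: for a full-column-rank $X$, `np.linalg.pinv(X)` agrees with $(X^top X)^(-1) X^top$.

```python
import numpy as np

X = np.array([[35.0, 70.0],
              [42.0, 65.0],
              [48.0, 60.0],
              [72.0, 50.0]])
y = np.array([920.0, 850.0, 780.0, 480.0])

theta_formula = np.linalg.inv(X.T @ X) @ X.T @ y   # (X^T X)^{-1} X^T y
theta_pinv = np.linalg.pinv(X) @ y                 # X^+ y
print(np.allclose(theta_formula, theta_pinv))      # True (up to floating-point error)
```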

> [!tip] The "Magic" of Closed-Form Solutions
> 
> Unlike deep learning, where you babysit the system, tune hyperparameters, debug convergence, and still don't know if your parameters are optimal, OLS just requires plugging in data. No iterative training, no convergence issues, and a **guaranteed** optimal result (in the training error sense). This is exceptionally rare in ML.

### Handling the Offset Term

The derivation above assumed $h(x; theta) = theta^top x$, a hyperplane through the origin. In general, we want an offset (intercept):

$$
h(x; theta, theta_0) = theta^top x + theta_0
$$


For $d = 1$ this is the familiar $y = m x + b$. For $d = 2$ it's a plane in 3D that doesn't have to pass through the origin.

The trick is to append a "fake feature" of 1 to every input:

$$
x_"aug" = vec(x_1, dots.v, x_d, 1) in RR^(d+1), quad theta_"aug" = vec(theta_1, dots.v, theta_d, theta_0) in RR^(d+1)
$$


Then $theta_"aug"^top x_"aug" = theta_1 x_1 + ... + theta_d x_d + theta_0 = theta^top x + theta_0$. The augmented data matrix just gets a column of ones:

$$
X_"aug" = [X | bb(1)] in RR^(n times (d+1))
$$


Now we use the same OLS formula with the augmented quantities:

$$
theta_"aug"^* = (X_"aug"^top X_"aug")^(-1) X_"aug"^top y
$$
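
A minimal NumPy sketch of the trick on the example data: append a column of ones to $X$ and reuse the same formula; the last entry of the result is the offset.

```python
import numpy as np

X = np.array([[35.0, 70.0],
              [42.0, 65.0],
              [48.0, 60.0],
              [72.0, 50.0]])
y = np.array([920.0, 850.0, 780.0, 480.0])

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])            # n x (d+1): extra column of ones
theta_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)   # [theta_1, ..., theta_d, theta_0]
theta, theta_0 = theta_aug[:-1], theta_aug[-1]
print(theta, theta_0)
```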


### Centering

An alternative to augmentation: subtract the mean of each feature from all data points, and subtract the mean label from all labels.

After centering:
- Each column of $X$ sums to zero: $X^top bb(1) = 0$
- Label mean is zero: $bb(1)^top y = 0$

Plugging into the augmented OLS formula, the offset $theta_0^*$ naturally becomes zero for centered data. Intuitively, the best-fitting plane goes through the origin when the data is centered around it. So we can just solve the simpler $theta^* = (X^top X)^(-1) X^top y$ without worrying about the offset.
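
A minimal sketch of centering on the same example data. After subtracting the means we can use the no-offset formula; as an aside (a standard OLS identity, not derived in these notes), the offset for the original data can be recovered as $theta_0 = macron(y) - macron(x)^top theta$.

```python
import numpy as np

X = np.array([[35.0, 70.0],
              [42.0, 65.0],
              [48.0, 60.0],
              [72.0, 50.0]])
y = np.array([920.0, 850.0, 780.0, 480.0])

x_bar, y_bar = X.mean(axis=0), y.mean()
X_c, y_c = X - x_bar, y - y_bar                     # centered: columns of X_c sum to zero

theta = np.linalg.solve(X_c.T @ X_c, X_c.T @ y_c)   # offset-free OLS on centered data
theta_0 = y_bar - x_bar @ theta                     # offset for the original (uncentered) data
print(theta, theta_0)
```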

> [!note] Derivation: Centering eliminates the offset
> 
> With augmented data, the normal equations give us a block system:
> $$
> theta_"aug"^* = mat(X^top X, X^top bb(1); bb(1)^top X, bb(1)^top bb(1))^(-1) vec(X^top y, bb(1)^top y)
> $$
> After centering, $X^top bb(1) = 0$ and $bb(1)^top y = 0$, so this simplifies to:
> $$
> theta_"aug"^* = mat((X^top X)^(-1), 0; 0, 1/n) vec(X^top y, 0) = vec((X^top X)^(-1) X^top y, 0)
> $$
> The last entry (the offset) is zero.

### Regularization

If all we cared about was small training loss, we wouldn't need regularization. But our real goal is performing well on *unseen* data.

When features are highly correlated (e.g. $x_2 approx x_1$), many different hyperplanes can achieve similar training MSE. Mathematically, $X^top X$ becomes nearly singular, so $(X^top X)^(-1)$ blows up with huge, unstable parameter values. A tiny change in the data can completely flip the learned model.

The general ML objective adds a regularization term:

$$
J(Theta) = underbrace(1/n sum_(i=1)^n cal(L)(h(x^((i)); Theta), y^((i))), "loss (fit the data)") + underbrace(lambda R(Theta), "regularizer (stay simple)")
$$


The default strategy is to regularize toward zero: $R(Theta) = ||Theta||^2$, which prefers smaller parameter values. $lambda >= 0$ controls the tradeoff between fitting the data and keeping the model simple.

### Ridge Regression

Ridge regression applies $L_2$ regularization to OLS:

$$
J_"ridge" (theta, theta_0) = 1/n sum_(i=1)^n (theta^top x^((i)) + theta_0 - y^((i)))^2 + lambda ||theta||^2
$$


We don't penalize $theta_0$; the offset just "floats" the regression surface to the right level for the data. We shouldn't make it harder to fit a dataset where the $y$ values tend to be around a million vs. around one.

For centered data ($theta_0 = 0$):

$$
J_"ridge" (theta) = 1/n sum_(i=1)^n (theta^top x^((i)) - y^((i)))^2 + lambda ||theta||^2
$$


> [!note] Derivation: Ridge regression closed-form
> 
> Take the gradient and set to zero:
> $$
> gradient_theta J_"ridge" = 2/n X^top (X theta - y) + 2 lambda theta = 0
> $$
> Rearranging:
> $$
> 1/n X^top X theta - 1/n X^top y + lambda theta = 0
> $$
> $$
> X^top X theta + n lambda theta = X^top y
> $$
> $$
> (X^top X + n lambda I) theta = X^top y
> $$
> $$
> theta_"ridge" = (X^top X + n lambda I)^(-1) X^top y
> $$
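
A minimal sketch of this closed form, assuming centered data (so $theta_0 = 0$); `ridge_fit` is a hypothetical helper name, not a library function.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + n*lambda*I)^{-1} X^T y, via a linear solve."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# e.g. theta_ridge = ridge_fit(X_c, y_c, lam=0.1) on centered data X_c, y_c
```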

> [!info] Why is $(X^top X + n lambda I)$ always invertible?
> 
> $X^top X$ is positive semidefinite, so its eigenvalues ${gamma_i} >= 0$. Adding $n lambda I$ shifts every eigenvalue to $gamma_i + n lambda > 0$ (when $lambda > 0$). All eigenvalues strictly positive means $det(X^top X + n lambda I) = product (gamma_i + n lambda) > 0$, so the matrix is invertible. This is the whole point; regularization fixes the invertibility problem that correlated features cause.

Larger $lambda$ pushes $theta$ closer to zero (simpler model, more structural error). Smaller $lambda$ lets parameters fit the data more freely (more estimation error). Finding the right $lambda$ is a model selection problem.

### Sources of Error
When evaluating a hypothesis, two distinct sources of error come into play:
1. Structural error - arises when no hypothesis in $cal(H)$ can represent the true relationship well (e.g. data generated by a sine wave, but we're fitting a line)
2. Estimation error - arises from insufficient data or inability to find the best $h in cal(H)$ given what we have

These are in tension: a more expressive $cal(H)$ reduces structural error but increases estimation error (and vice versa). When we increase $lambda$, we tend to increase structural error but decrease estimation error.

### Evaluating Learning Algorithms

A learning algorithm is a procedure that takes a dataset $D_n$ and returns a hypothesis $h in cal(H)$. Parameters like $lambda$ in ridge regression are called *hyperparameters*; they control the algorithm's behavior but don't appear in the final hypothesis (they're not part of $theta$ or $theta_0$).

#### Validation
Split your data into a training set and a validation set (non-overlapping). Train on the training set, evaluate on the validation set. Repeat multiple times to control for unlucky splits.

#### Cross-Validation
A more data-efficient approach (a code sketch follows the steps):
1. Divide $D$ into $k$ roughly equal chunks $D_1, ..., D_k$
2. For $i = 1$ to $k$:
	- Train $h_i$ on everything except $D_i$
	- Compute validation error $epsilon_i (h_i)$ on the withheld $D_i$
3. Return the average: $1/k sum_(i=1)^k epsilon_i (h_i)$
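
A minimal k-fold cross-validation sketch. Here `fit` (a learning algorithm: data in, hypothesis out) and `loss` are placeholders for whatever algorithm and loss you are evaluating.

```python
import numpy as np

def cross_validation_error(X, y, fit, loss, k=5, seed=0):
    """Average validation error of the learning algorithm `fit` over k folds.

    `fit(X_train, y_train)` must return a hypothesis h; `h(X_val)` returns predictions.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), k)   # k roughly equal chunks
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        h = fit(X[train_idx], y[train_idx])                  # train on everything except fold i
        errors.append(np.mean(loss(h(X[val_idx]), y[val_idx])))
    return float(np.mean(errors))                            # average over the k folds
```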

Cross-validation evaluates the *learning algorithm*, not any single hypothesis. Each fold trains a different $h_i$, so no single model is being scored.

#### Hyperparameter Tuning
To choose a hyperparameter like $lambda$, try a range of values and pick the one with the best cross-validation score. Each value of $lambda$ defines a different learning algorithm, and cross-validation tells us which one generalizes best.
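
A minimal sketch of that search for ridge regression, reusing the hypothetical `ridge_fit` and `cross_validation_error` helpers sketched above (assumed to be in scope; they are illustrative names, not a fixed API).

```python
import numpy as np

def tune_lambda(X, y, lambdas, k=5):
    """Pick the lambda whose learning algorithm has the best cross-validation score."""
    squared_loss = lambda preds, ys: (preds - ys) ** 2
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        # Each lambda defines a different learning algorithm: train ridge, return a predictor h.
        def fit(X_train, y_train, lam=lam):
            theta = ridge_fit(X_train, y_train, lam)
            return lambda X_query: X_query @ theta
        err = cross_validation_error(X, y, fit, squared_loss, k=k)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

# e.g. tune_lambda(X_c, y_c, lambdas=[0.001, 0.01, 0.1, 1.0])
```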