Summary

When closed-form solutions break down (nonlinear models, large data, complex loss functions), we turn to gradient descent. GD iteratively steps in the direction of steepest descent, converging to a global minimum when the objective is convex. SGD approximates the full gradient using a single random data point, trading accuracy per step for much cheaper iterations — and the noise can even help escape shallow local minima.


We have been looking at regression problems. These are great when our matrices behave nicely, but as you can probably guess, they are only a small subset of the problems we can face. There are also cases where 𝜃 is numerically sensitive, which is where regularization and cross-validation came into the equation. And we still haven’t seen a method for finding a specific 𝜃 in the case where there are many correct solutions.

In the real world, we have tons of parameters, tons of training data, the hypothesis class is nonlinear, and the loss function is rarely as simple as squared error. So what the heck do we do? The answer is gradient descent methods.

Gradient Descent (GD)

The Gradient Vector

The gradient of a function $f$ is defined at the point $p = (x_1, \dots, x_m)$ as

$$\nabla f(p) = \begin{pmatrix} \dfrac{\partial f}{\partial x_1}(p) \\ \dfrac{\partial f}{\partial x_2}(p) \\ \vdots \\ \dfrac{\partial f}{\partial x_m}(p) \end{pmatrix}$$

It points in the direction of steepest ascent of 𝑓 at point 𝑝, and its magnitude gives the rate of increase in that direction. In general, the gradient generalizes the single-variable derivative, and with it the notion of steepest direction, to multiple dimensions. The gradient can be computed symbolically or numerically.

Example

$$f(x, y, z) = x^2 + y^3 + z$$

The symbolic gradient is

$$\nabla f(x, y, z) = \begin{pmatrix} 2x \\ 3y^2 \\ 1 \end{pmatrix}$$

Evaluating it at a point gives the numerical gradient.
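The symbolic-versus-numerical distinction can be made concrete with a short sketch: the function names and the central-difference step size below are illustrative choices, not part of the notes.

```python
import numpy as np

def f(p):
    # f(x, y, z) = x^2 + y^3 + z
    x, y, z = p
    return x**2 + y**3 + z

def grad_f(p):
    # Symbolic gradient, evaluated at p: (2x, 3y^2, 1)
    x, y, z = p
    return np.array([2 * x, 3 * y**2, 1.0])

def numerical_gradient(f, p, h=1e-6):
    # Central finite differences: (f(p + h*e_i) - f(p - h*e_i)) / (2h)
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

p = np.array([1.0, 2.0, 3.0])
print(grad_f(p))                 # [ 2. 12.  1.]
print(numerical_gradient(f, p))  # approximately the same
```

Both routes give (essentially) the same vector; the numerical version only needs the ability to evaluate 𝑓, which is why it is useful when no symbolic form is available.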

Note

The gradient of a differentiable function at a minimizer is necessarily zero.

The Gradient Descent Algorithm

The trivial case of gradient descent is to fit a line without an offset to minimize MSE.

Example

$$J(\theta) = (3\theta - 6)^2$$

For our algorithm, we have hyperparameters 𝜃init (initial guess), 𝜂 (learning rate), 𝐽 (objective function), and 𝜀 (precision). We initialize $\theta^{(0)} = \theta_{\text{init}}$ and $t = 0$. We then repeatedly increment 𝑡 and apply the update

$$\theta^{(t)} = \theta^{(t-1)} - \eta \, \nabla_\theta J(\theta^{(t-1)})$$

until $|J(\theta^{(t)}) - J(\theta^{(t-1)})| < \varepsilon$, which means our objective improvement is basically zero. We then return $\theta^{(t)}$. The $\nabla_\theta J(\theta^{(t-1)})$ portion is the gradient of 𝐽 at $\theta^{(t-1)}$, and 𝜂 determines by how much we adjust it to get our new $\theta^{(t)}$. The lower the 𝜂, the more precisely we perform gradient descent, but this comes at the cost of more compute time.
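The loop above can be sketched directly; the function names and the hyperparameter values for the toy objective $J(\theta) = (3\theta - 6)^2$ are illustrative choices.

```python
def gradient_descent(grad_J, J, theta_init, eta, eps, max_iters=10_000):
    """Plain gradient descent with the objective-improvement stopping rule."""
    theta = theta_init
    for _ in range(max_iters):
        # theta(t) = theta(t-1) - eta * grad_J(theta(t-1))
        theta_new = theta - eta * grad_J(theta)
        # Stop when the objective improvement is basically zero
        if abs(J(theta_new) - J(theta)) < eps:
            return theta_new
        theta = theta_new
    return theta

# J(theta) = (3*theta - 6)^2 has gradient 6*(3*theta - 6) and minimizer theta = 2
J = lambda th: (3 * th - 6) ** 2
grad_J = lambda th: 6 * (3 * th - 6)
theta_star = gradient_descent(grad_J, J, theta_init=0.0, eta=0.05, eps=1e-12)
print(theta_star)  # close to 2.0
```

With 𝜂 = 0.05 the update is a contraction toward 𝜃 = 2, so the loop terminates quickly; a much smaller 𝜂 would take more iterations to trip the same stopping rule.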

Note

There are other stopping conditions we could use:

  • Small gradient norm: $\|\nabla_\theta J(\theta^{(t-1)})\| < \varepsilon$
  • Small parameter change: $\|\theta^{(t)} - \theta^{(t-1)}\| < \varepsilon$

Properties of Gradient Descent

We want to find the global minimizer, which is what our closed-form 𝜃 gives us. Unfortunately, gradient descent can only get us to a point where the gradient vector is zero. This could be a local minimum, not necessarily the global minimizer. With that said, if the objective function is convex, we will always converge to the global minimum.

Definition

A function 𝑓 is convex if any line segment connecting two points of the graph of 𝑓 lies above or on the graph.

Corollary

𝑓 is concave if $-f$ is convex.

Note

𝐽ridge with 𝜆>0 is always convex (see ridge regression).

With convexity in place, gradient descent converges arbitrarily close to a global minimizer of 𝑓.

There are some concerns to keep in mind. We need 𝑓 to be pretty smooth; if it’s not, gradient descent probably doesn’t work. Similarly, if 𝑓 is not convex (as in neural networks), we might get stuck at a saddle point or a local minimum. Further, if we don’t even have a global minimum, we may never terminate. To a similar end, if we make 𝜂 too big, the iterates might oscillate or diverge, so we never reach the minimum.
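The learning-rate concern is easy to see on the toy objective $J(\theta) = (3\theta - 6)^2$ from before; the threshold 𝜂 = 1/18 for the step factor and the two sample rates below are worked out for this specific objective, not general rules.

```python
def gd_path(theta0, eta, steps=20):
    # Iterates of GD on J(theta) = (3*theta - 6)^2, whose gradient is 6*(3*theta - 6)
    thetas = [theta0]
    for _ in range(steps):
        thetas.append(thetas[-1] - eta * 6 * (3 * thetas[-1] - 6))
    return thetas

# Each step multiplies the error (theta - 2) by (1 - 18*eta), so we need |1 - 18*eta| < 1
small = gd_path(0.0, eta=0.05)   # factor 0.1: converges to 2
large = gd_path(0.0, eta=0.12)   # factor -1.16: oscillates and diverges
print(small[-1])                 # close to 2.0
print(abs(large[-1] - 2))        # distance from the minimizer keeps growing
```

The sign flip in the factor $-1.16$ is exactly the oscillation: each step overshoots the minimum to the other side, by more than it started with.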

Stochastic Gradient Descent (SGD)

What if our data becomes more complex and multi-dimensional? The gradient of the total objective is the average of the per-point gradients, just as the total MSE is the average of the individual errors. Since each data point contributes its own gradient, we can decompose them and examine them individually. Remember, using squared error with a linear hypothesis, the loss on the 𝑖-th data point is:

$$J_i(\theta) = \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

and its gradient with respect to 𝜃:

$$\nabla_\theta J_i(\theta) = -2\left(y^{(i)} - \theta^T x^{(i)}\right) x^{(i)}$$

In general, an ML objective function is a sum

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} J_i(\theta)$$

where each 𝐽𝑖 is the loss on a single data point. This has gradient

$$\nabla_\theta J(\theta) = \nabla_\theta \left( \frac{1}{n} \sum_{i=1}^{n} J_i(\theta) \right) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta J_i(\theta)$$
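That the full gradient really is the average of the per-point gradients can be checked numerically; the synthetic data below (sizes, true parameters, noise scale) is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                  # rows are the x^(i)
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)
theta = np.zeros(d)                          # any point works; the identity is exact

# Per-point gradients: -2 * (y^(i) - theta . x^(i)) * x^(i)
per_point = np.array([-2 * (y[i] - theta @ X[i]) * X[i] for i in range(n)])

# Full gradient of the mean objective, computed in one shot: -(2/n) X^T (y - X theta)
full = -2 * (X.T @ (y - X @ theta)) / n

print(np.allclose(per_point.mean(axis=0), full))  # True
```

The vectorized form and the averaged per-point form agree exactly (up to floating point), since the identity is just linearity of the gradient.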

where each $\nabla_\theta J_i(\theta)$ is the gradient contribution from the 𝑖-th data point’s loss. In order to reduce computational cost, we make this approximation:

$$\frac{1}{n} \sum_{i=1}^{n} \nabla_\theta J_i(\theta) \approx \nabla_\theta J_i(\theta)$$

In other words, we choose a random 𝑖 and approximate the full gradient using just that one data point.
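Putting this together gives the SGD loop for linear regression under MSE; the function name, step count, and noiseless synthetic data below are illustrative assumptions, and a fixed 𝜂 is used only to keep the sketch short.

```python
import numpy as np

def sgd_linear(X, y, eta=0.01, epochs=100, seed=0):
    """SGD sketch for MSE linear regression: one random point per update."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs * n):
        i = rng.integers(n)                            # pick a random data point
        grad_i = -2 * (y[i] - theta @ X[i]) * X[i]     # gradient of J_i at theta
        theta -= eta * grad_i                          # step on just that point
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
theta_true = np.array([3.0, -1.0])
y = X @ theta_true                                     # noiseless, so the optimum is exact
theta_hat = sgd_linear(X, y)
print(theta_hat)  # close to [3.0, -1.0]
```

Each update costs $O(d)$ instead of $O(nd)$, which is the efficiency gain the next paragraph describes; the price is that individual steps wander around the true descent direction.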

Compared with GD, each SGD step is significantly cheaper. However, it is also noisier, since we’re estimating the gradient from a single point rather than the full dataset. This isn’t necessarily a problem, but it makes our descent look a lot more “drunk.” The upside is that this noise can actually help us escape shallow local minima that full-batch GD would get stuck in. We still get arbitrarily close to a global minimum under the same conditions (convexity, an appropriately chosen and typically decaying 𝜂) with probability 1.