Summary
When closed-form solutions break down (nonlinear models, large data, complex loss functions), we turn to gradient descent. GD iteratively steps in the direction of steepest descent, converging to a global minimum when the objective is convex. SGD approximates the full gradient using a single random data point, trading accuracy per step for much cheaper iterations — and the noise can even help escape shallow local minima.
We have been looking at regression problems. These are great when our matrices behave nicely, but as you can probably guess, this is only a small subset of the problems we can face. There are also cases where $X^\top X$ is very numerically sensitive, in which case we added Regularization & Cross-Validation into the equation. We also haven't mentioned a method to find a specific $\theta$ in the case where there are many correct solutions.
In the real world, we have tons of parameters, tons of training data, the hypothesis class is nonlinear, and the loss function is rarely as simple as squared error. So what the heck do we do? The answer is gradient descent methods.
Gradient Descent (GD)
The Gradient Vector
The gradient of $f : \mathbb{R}^m \to \mathbb{R}$ is defined at the point $\theta$ as

$$\nabla_\theta f(\theta) = \begin{bmatrix} \frac{\partial f}{\partial \theta_1} \\ \vdots \\ \frac{\partial f}{\partial \theta_m} \end{bmatrix}$$

It points in the direction of steepest ascent of $f$ at point $\theta$, and its magnitude gives the rate of increase in that direction. In general, the gradient generalizes the derivative to multiple dimensions. The gradient can be symbolic or numerical.
Example
Take $f(\theta) = \theta_1^2 + \theta_2^2$ (a concrete function chosen for illustration). The symbolic gradient is

$$\nabla_\theta f(\theta) = \begin{bmatrix} 2\theta_1 \\ 2\theta_2 \end{bmatrix}$$

Evaluating it at a point, say $\theta = (1, 2)$, gives the numerical gradient $(2, 4)$.
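A quick sanity check in code: a numerical gradient from central finite differences should match a hand-derived symbolic gradient. The function below, $f(\theta) = \theta_1^2 + \theta_2^2$, is just an illustrative choice, not tied to any particular dataset.

```python
import numpy as np

def f(theta):
    # Illustrative objective (an assumption): f(theta) = theta_1^2 + theta_2^2
    return theta[0] ** 2 + theta[1] ** 2

def symbolic_grad(theta):
    # Gradient derived by hand: [2*theta_1, 2*theta_2]
    return np.array([2.0 * theta[0], 2.0 * theta[1]])

def numerical_grad(f, theta, h=1e-6):
    # Central finite differences approximate each partial derivative
    grad = np.zeros_like(theta, dtype=float)
    for i in range(len(theta)):
        step = np.zeros_like(theta, dtype=float)
        step[i] = h
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

theta = np.array([1.0, 2.0])
print(symbolic_grad(theta))      # [2. 4.]
print(numerical_grad(f, theta))  # very close to [2. 4.]
```

The finite-difference version needs no calculus, which is handy for checking a hand-derived gradient for bugs.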
Note
The gradient of a differentiable function at an (interior) minimizer is necessarily zero.
The Gradient Descent Algorithm
The trivial case of gradient descent is to fit a line without an offset to minimize MSE.
Example
For our algorithm, we have hyperparameters $\theta_{\text{init}}$ (initial guess), $\eta$ (learning rate), $f$ (objective function), and $\epsilon$ (precision). We initialize $\theta^{(0)} = \theta_{\text{init}}$ and $t = 0$. We repeatedly step, incrementing $t$ and computing

$$\theta^{(t)} = \theta^{(t-1)} - \eta \, \nabla_\theta f\big(\theta^{(t-1)}\big)$$

until $\big|f(\theta^{(t)}) - f(\theta^{(t-1)})\big| < \epsilon$, which means our objective improvement is basically zero. We then return $\theta^{(t)}$. The $\nabla_\theta f(\theta^{(t-1)})$ portion is the gradient of $f$. $\eta$ exists to determine by how much we will adjust our new $\theta$. The lower the $\epsilon$, the more precisely we will perform gradient descent, but it also comes at the cost of more compute time.
Note
There are other stopping conditions we could use:
- Small gradient norm: $\big\lVert \nabla_\theta f(\theta^{(t)}) \big\rVert < \epsilon$
- Small parameter change: $\big\lVert \theta^{(t)} - \theta^{(t-1)} \big\rVert < \epsilon$
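The loop above can be sketched in a few lines of Python. The objective-improvement stopping rule from the text is used; the quadratic test function and the hyperparameter defaults are assumptions for illustration only.

```python
def gradient_descent(f, grad_f, theta_init, eta=0.1, epsilon=1e-8, max_iters=10_000):
    """Basic gradient descent with the objective-improvement stopping rule."""
    theta = theta_init
    f_prev = f(theta)
    for _ in range(max_iters):
        theta = theta - eta * grad_f(theta)   # step against the gradient
        f_curr = f(theta)
        if abs(f_curr - f_prev) < epsilon:    # improvement is basically zero
            break
        f_prev = f_curr
    return theta

# Illustrative objective: f(theta) = (theta - 3)^2, minimized at theta = 3
f = lambda th: (th - 3.0) ** 2
grad_f = lambda th: 2.0 * (th - 3.0)
theta_star = gradient_descent(f, grad_f, theta_init=0.0)
print(theta_star)  # close to 3.0
```

Swapping the `if` condition for a gradient-norm or parameter-change test gives the alternative stopping conditions listed above.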
Properties of Gradient Descent
We want to find the global minimizer, which is our closed-form $\theta^*$. Unfortunately, gradient descent can only get us to a point where the gradient vector is zero. This could be a local minimum, not necessarily the global minimizer. With that said, if the objective function is convex, we will always converge to the global minimum.
Definition
A function $f$ is convex if any line segment connecting two points of the graph of $f$ lies above or on the graph.
Corollary
$f$ is concave if $-f$ is convex.
Note
The regularized objective $J(\theta) = \frac{1}{n}\lVert X\theta - y \rVert^2 + \lambda \lVert \theta \rVert^2$ with $\lambda > 0$ is always convex (see ridge regression).
With convexity in place, gradient descent converges arbitrarily close to a global minimizer of $f$.
There are some concerns to keep in mind. We need $f$ to be pretty smooth. If it's not, gradient descent probably doesn't work. Similarly, if $f$ is not convex (as in neural networks), we might get stuck at a saddle point or a local minimum. Further, if $f$ doesn't even have a global minimum, we may never terminate. To a similar end, if we make $\eta$ too big, $\theta$ might oscillate, leading us to never reach the minimum.
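The oscillation concern is easy to see on the toy objective $f(\theta) = \theta^2$: since the gradient is $2\theta$, one GD step multiplies $\theta$ by $(1 - 2\eta)$, so any $\eta > 1$ makes $|1 - 2\eta| > 1$ and the iterates overshoot further on every step. A tiny illustration, not tied to any dataset:

```python
def run_gd(eta, steps=20, theta=1.0):
    # GD on f(theta) = theta^2; each step is theta <- theta - eta * 2 * theta
    for _ in range(steps):
        theta = theta - eta * 2 * theta
    return theta

print(abs(run_gd(eta=0.1)))  # small: iterates shrink toward the minimum at 0
print(abs(run_gd(eta=1.5)))  # huge: each step overshoots and the iterates diverge
```

With $\eta = 0.1$ the multiplier is $0.8$ per step; with $\eta = 1.5$ it is $-2$, so after 20 steps the iterate has grown by a factor of $2^{20}$.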
Stochastic Gradient Descent (SGD)
What if our data becomes more complex and multi-dimensional? The gradient of the total objective is the average of the per-point gradients, just as the total MSE is the average of the individual errors. Since each data point contributes its own gradient, we can decompose the objective and examine the contributions individually. Remember, using the MSE of a linear hypothesis, we have this function:

$$f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2$$

and its gradient w.r.t. $\theta$:

$$\nabla_\theta f(\theta) = \frac{2}{n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} - y^{(i)} \right) x^{(i)}$$
In general, an ML objective function is a sum

$$f(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$$

where each $f_i$ is the loss on a single data point. This has gradient

$$\nabla_\theta f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta f_i(\theta)$$

where each $\nabla_\theta f_i(\theta)$ is the gradient contribution from the $i$-th data point's loss. In order to reduce computational cost, we make this approximation:

$$\nabla_\theta f(\theta) \approx \nabla_\theta f_i(\theta)$$

In other words, we choose a random $i$ and approximate the full gradient using just that one data point.
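The decomposition is easy to verify numerically: for the linear-hypothesis MSE, averaging the per-point gradients reproduces the full gradient exactly. The random synthetic data below is purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))    # rows are the data points x^(i)
y = rng.standard_normal(n)
theta = rng.standard_normal(d)

# Full MSE gradient: (2/n) * sum_i (theta . x_i - y_i) * x_i
full_grad = (2.0 / n) * X.T @ (X @ theta - y)

# Per-point gradients: gradient of f_i(theta) = (theta . x_i - y_i)^2
per_point = np.array([2.0 * (X[i] @ theta - y[i]) * X[i] for i in range(n)])

# The full gradient is exactly the average of the per-point gradients
print(np.allclose(full_grad, per_point.mean(axis=0)))  # True
```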
Compared with GD, SGD is significantly more efficient per step. However, it is also noisier, since we're estimating the gradient from a single point rather than the full dataset. This isn't necessarily a problem, but it makes our descent look a lot more "drunk." The upside is that this noise can actually help us escape shallow local minima that full-batch GD would get stuck in. We still get arbitrarily close to a global minimum under the same conditions (convexity, an appropriately decaying $\eta$) with probability 1.
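A minimal SGD sketch for the no-offset linear MSE objective. The synthetic data and the fixed learning rate are assumptions for illustration; in practice $\eta$ is often decayed over time to guarantee convergence.

```python
import numpy as np

def sgd_linear_mse(X, y, eta=0.01, epochs=100, seed=0):
    # SGD: each step uses the gradient of one randomly chosen point's squared error
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs * n):
        i = rng.integers(n)                           # pick a random data point
        grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]   # per-point gradient
        theta -= eta * grad_i                         # noisy step
    return theta

# Synthetic problem with a known parameter vector (illustration only)
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
theta_hat = sgd_linear_mse(X, y)
print(theta_hat)  # close to [1.0, -2.0, 0.5]
```

Each iteration touches a single row of `X`, so the per-step cost is independent of $n$; full-batch GD would pay for all $n$ rows on every step.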