This post is about the basics of the “gradient descent” method for finding the minimum of a function. I started writing it mainly to review the optimization material of lectures by Sébastien Bubeck given in Seattle. All of the material can be found elsewhere (for example, Sébastien’s book), but I can assure you that in terms of the ratio between lame jokes and mathematical formulae, this blogpost is second to none (ah yes, but in which direction?). Hopefully someone might find this slightly-less-technical, slightly-more-nonsensical presentation useful (who knows, maybe even you!).
Due to concerns of oversized luggage, here is a clickable table of contents.
- High dimensional rock bottom
- It’s all downhill from here
- Gradient flow
- Gradient descent
- Gradient descent with boundaries
- Gradient descent without knowing the time
- Gradient descent with randomness
- Gradient descent with curvature
- Gradient descent with momentum (smooth fella)
- Gradient descent with both curvature and smoothness
- Teaser trailer: Gradient descent with different geometry
High dimensional rock bottom
Here is a situation which happens to all of us in our daily lives: We are given a real-valued function and we need to find its minimizer – a point such that for every other . We often write this as . By “find the minimizer”, I mean algorithmically, i.e with a computer program; simply writing “” and calling it a day just won’t do.
This post will deal with solving this problem, starting with very simple methods and a lot of assumptions about our computational power, and then slowly descending into the madness that is convex optimization, describing more elaborate techniques that solve the problem either faster or with less assumptions.
What sort of assumptions do we make? Well, for starters, we’ll assume that we can easily calculate for any we want, and if needed, its derivatives. We’ll also implicitly assume that has a minimum, and, if it helps, that this minimum is found in some bounded ball of radius (so that the distance between any two points inside is bounded by ). Finally, we’ll assume that is what parental mathematicians call “well-behaved” (for some reason, they never seem to say “ isn’t naughty”), i.e it doesn’t change too-wildly when you vary its input by a small amount. This can be formally phrased by saying that is -Lipschitz, meaning that there is a constant such that
for all .
Since we talk about about a computer algorithm, finding a True-Minimizer of is in general impossible, due to finite precision and finite run time. But we can find a near-minimizer, which is often almost as good as a True-Minimizer. Possibly the simplest way to do this is as follows: Divide the ball into a tiny tiny grid , so that any point is at distance at most from a point in the grid; this is called an -net. Here are examples of nets and functions in one and two dimensions (higher dimensions sold separately):
Denote by the minimizer of among all points in the grid. Since is -Lipschitz, the distance between ‘s value on and its value on is small:
If we know , we can choose to be small enough so that is as close to the minimum as we want! Easy!
There are several problems with this approach, the cruellest of which is the sample complexity, or, the time needed to find the minimizer. The problem is that the number of points in an -net in dimensions scales like . For small and even moderately-sized , the number of points that the algorithm samples is enormous. That’s the curse of dimensionality for you. So although you might get away with this sort of thing in one-to-three dimensions, you can forget about it when dealing with functions of say, a thousand variables. And unfortunately, thousand-variabled-functions are everywhere these days, especially in hot buzzword-topics like Machine LearningTM.
It’s all downhill from here
Here is a limerick from the Technion’s second course in calculus for scientists:
There once was a student named Bill,(Originally in Hebrew:
Whose bucket was placed on a hill.
He kicked with all might,
Then ran out of sight;
Which way will the water spill?
סטודנט אחד ושמו חיים / בעט בטעות בדלי מים / הדלי נשפך / הסטודנט ברח / לאן יזלו המים?)The obvious answer is: Opposite the gradient! Indeed, if you ever find yourself stranded on top of a mountain in a bizarre -dimensional landscape, then the fastest way to go downhill is to go opposite the gradient.
The gradient descent algorithm is based on this basic observation, and indeed all it does (essentially) is take small steps in the direction opposite the gradient , until you find a minimizer of . You can think of this like gently rolling a ball downhill:
This assumes, of course, that is differentiable, so it’s possible to even speak of “the gradient of “. There are things you can do in case is not differentiable, but we’ll stick to this assumption for the rest of this post.
The algorithm runs in steps, and at time step remembers the current position of the ball, . The next position is obtained by taking a small step in opposite the gradient:
I haven’t stated precisely in what way these steps are taken, and indeed this is the entire art of gradient descent, but that’s the overall idea. However, the keen-eyed among you will already notice a fatal flaw even in this very general program: A function can have multiple local minima, and going downhill can cause us to become stuck in one of these pits and miss out on the true minimum:
There are various ways to try and fix this (for example, we can repeat the algorithm many times from different starting points), but by far the easiest one is to assume that has only one minimum. Now we’re guaranteed that descending along the gradient will give us a global minimum. Right? Well, no:
Indeed, in flat landscapes, going downhill goes nowhere (but it goes there extremely fast), and multi-dimensional flat landscapes can be pretty tricky. So just to be on the safe side, we’re going to assume that the function is convex. i.e for all and all ,
This is often stated as “the graph of a convex function is always below its chords” (see image above), and rules out the possibility of having such plateaus which are the minimum. Convex functions have only one minimal value, and as we’ll see, also convexity ensures that the various gradient descents indeed reach this minimum (and quickly so!). So from now on and until the end of time, we’ll always assume that is convex.
Convexity lets us bound how close a point is to being a minimizer . Define the “deficit” (or should I say, “eficit”) by
Since is the minimizer of , we always have , and our goal to minimize translates into a goal to find such that is really really small. To this end, we use a common alternative description of convexity, which says that a convex function is always above its tangent:
Mathematically, this is written as
To see this, rewrite the convexity assumption as
Subtract from both sides and divide by to get
Sending , the left-hand side becomes the directional derivative of in direction at the point . Multiplying by then gives the result, since is exactly equal to times the directional derivative of in direction .
Using this result with , we get a bound on the deficit:
Gradient flowLet’s first show that going in the direction opposite the gradient indeed gets you to the minimum. Here, by “going” I actually mean “flowing”, i.e we’ll treat the position of our ball a function of continuous time, and update it according to a differential equation. This is a very physicsy thing to do, and is something that computers have a hard time with, since unlike mathematicians, they cannot make infinitely many infinitesimally small steps in a single time moment. Indeed, a computer trying to solve a differential equation will have to discretize time in some way, and this, as we’ll see, introduces errors. But for now, the Platonic ideal of gradient flow will give us intuition as to how going downhill eventually gets us to the bottom (geez, when you put it this way, it really sounds obvious).
Formally, we say this. Let . The gradient flow is a function which starts at and is governed by the differential equation:
The hope is that if we “run” the differential equation for a long enough time , then will be very close to the minimizer . This is quite true. It can be seen by example in the following two images, which show the path of a gradient flow on both the surface plot and heatmap of a two-dimensional function:
It can also be seen by the following theorem:
Theorem: For every time ,
Proof: Let’s look at the distance between and . The rate of change of the square of that distance is
Since is a constant and does not depend on time, the left term in the inner product is just equal to . So
where the last inequality used the inequality for that we obtained by convexity of . This is a nice intermediate result: We would have really liked it if we could flow directly in the direction . Instead, all we can do is flow in the direction of , but as the above equation shows, these two vectors are positively correlated, since is always positive. So while flowing along the gradient won’t necessarily take us on the shortest route towards the minimizer, at least it’s always going in the right direction.
Anyways, now we can rearrange and integrate both sides from to :
The right-hand side is just equal to , and dividing by gives
Now, is a decreasing function of ; this can be seen, for example, by differentiating:
Thus , which proves the theorem. □
This process actually gave us another point for which the deficit is small: Since is a convex function of , we always have
so the point also satisfies the inequality stated in the theorem. This trick will come in handy in the future, where we won’t be able to guarantee that the last position is the best one.
The term is sort of necessary: If we happen to start very far off from the minimizer, we’re going to have to flow quite a lot. But it’s really not too bad, especially if we assume as before that we know that the minimizer is at some distance from the initial point; if we want an error in the function’s value, we only need to run the gradient flow for time.
Note: When time is continuous, it can always be rescaled: For example, instead of the original differential equation, we could have mandated that evolve according to , where is as large a constant as we want; this corresponds to flowing “faster” by a factor of . But as mentioned above, this continuous-time example is just supposed to serve as intuition; the next algorithm, which is indeed discrete and can actually be implemented by a real bit-chomping computer, won’t indulge in such luxuries.
Gradient descentThe classical way of solving differential equations is to discretize time. Quite informally, instead of writing , we write , and instead of the infinitesimal , we take a small time step which we denote by . Our algorithm then reduces to iterating the following difference equation:
Note: There are fancier ways of numerically solving differential equations, such as Runge-Kutta, which will arguably (and probably provably) produce more accurate discretization results. But I favor this simple single-step method, because its analysis is easier (and more importantly, because it’s the method that was covered in the summer school I attended…)
Choosing the step size is an important part in the design of the algorithm. The gradient changes from place to place, so if is too large, we can easily overstep and miss the optimal, flow-induced trajectory, giving a less-than-optimal trajectory:
Doing this can be dangerous, potentially even forcing the algorithm into an infinite loop. For example, consider the very simple case of the one-dimensional . The negative gradient in this case is just , so the difference equation becomes
The minimum of this function is of course . Suppose that at some point we just happen to arrive at . Then the next position position will be
and from this point on the algorithm will just bounce back and forth between and , missing the minimum entirely! Overshooting the minimizer will happen less often if is small.
Here is a catastrophic failure of the algorithm for the example we used so far (the image shows 100 steps of gradient descent!):
On the other hand, too small an means that we will need to perform many, many steps in order to get close to . Viewed another way, suppose that we decide on the number of steps that we are going to run the algorithm beforehand, in advance, and only then choose . If we choose too small, we aren’t going to move very far from the original starting position.
To conclude this discussion: Choosing too large can cause overstepping, meaning that we take steps which are too large in unwanted directions which do not directly lead us towards . Choosing too small gives a better approximation of the continuous differential equation, but may lead to super-slow convergence, so the final is not even close to . The gradient descent algorithm has to balance between these two types of errors: The “discretization error”, and the “forget the initial condition error”. The formal analysis is given below:
Theorem: Let be convex and -Lipschitz and let . With proper choice of , the gradient descent algorithm can find a point so that the deficit is small, i.e , in just steps.
Proof: Similarly to the gradient flow analysis, let’s look at how the distance between and improves when the algorithm took one step. By the cosine law,
Rearranging, we have
The term on the left-hand side tells us by how much the distance to improved; we want this quantity to be as negative as possible. Now, the first term on the right-hand side is always positive, and works against us; but the second term can make up for it, for as we know, is negatively correlated with . To see how this plays out, multiply by and rearrange again to see how large this correlation is:
Suppose that we run the algorithm for steps. Summing up the above expression for all times , the distances form a telescoping sum, and after a division by we get
The first term on the right-hand side is certainly smaller than , which itself is smaller than , and since is -Lipschitz, the second term is smaller than . Thus
So far we haven’t even used the convexity of , and the sums were valid for any Lipschitz function. But now, define . Then by the basic definition of convexity,
But we also saw that for convex functions, for all and . Thus each term in the right-hand side is smaller than , and combining this with the expression we got before, we reach the stunning result:
As if by divine prophecy, we see here exactly the two types of errors that we promised we would see: The first, , comes from having a step size too small to get from to in any reasonable time. You can see that it grows smaller as both and grow larger. The second, , stems from overshooting. As per our dictated intuition, the larger the step size , the larger the error; but the error also grows with the Lipschitz constant of . In other words, if can have very steep ridges and valleys, i.e it changes too quickly over short distances, these overshoots will cause us to miss the mark.
(The keen-eyed reader will also note another divine prophecy – the algorithm did not eventually take (we have almost no information on how good is), but rather used convexity and took to be the average of all points visited by the algorithm. Now, by the pigeonhole principle, one of the sample points must be better than the average, so in principle if you can easily calculate , you can take the best . But this just serves to highlight an interesting feature of gradient descent: It only needs to calculate in order to work, and not itself!)
Alright! All that’s left to do is optimize over . Choose to get
Setting the eficit to and rearranging gives
as promised by the theorem. □
That’s it for the basic gradient descent algorithm. It’s actually quite decent. The assumptions we made – that is Lipschitz and that you know its Lipschitz constant, for example – aren’t that far-fetched, and in the worst case you can always guess and/or and run the algorithm multiple times. It has the curious property that you don’t even need to be able to calculate to approximate its minimum, but rather only its gradients! This is nice in applications where computing is hard but computing is relatively easy. Finally, in terms of runtime, I think that making gradient calculations is much better than making the function calculations of our naive brute-search algorithm almost any day (well, any day that ).
The brutal efficiency of gradient descent actually caught me off guard when I first saw it. Nowhere at all does the approximation bound or the algorithm itself explicitly mention the dimension; it works just as well for and . In fact, it even works for infinite dimensional spaces, such as function spaces! assuming you have some practical way to calculate gradients and perform addition in those cases, of course. But more on that will come (much) later. For now, let’s see how gradient descent can be tweaked both to work in harsher, less-optimistic situations, and to converge faster in smoother, more forgiving conditions.
Gradient descent with boundariesUnder the Universal Declaration of the United Nations, freedom of movement is a fundamental human right. However, this is not always the case for the points in the gradient descent algorithm. Often we are interested in minimizing a function not over the entire domain , but rather over a smaller set; for example, we might wish to find the minimizing vector of length , or a vector whose sum of coordinates is , or a vector which satisfies a prescribed list of linear inequalities, . These conditions are called constraints. Alas, the original gradient descent algorithm has no knowledge of these constraints, and eagerly carries the points outside the allowed set.
Luckily, this problem has an easy solution, in case the constraint set is convex, i.e for all points and all , the point is also in .
This demand makes sense, since we are dealing with convex functions anyway, and often holds in real life situations (the three constraint examples given above all define convex sets. Actually, they define closed convex sets, and indeed we’ll assume that is closed).
The solution is straightforward: After each gradient step, we project the point back into the set . By projection, we mean that is sent to the point in that is closest to . More formally, for any set , the distance between and is defined as
If is closed then the infimum is actually a minimum (pretty much by definition), so there exists a point such that for all ; and if is convex, this point is unique (if there were two points of minimal distance, say and , then the midpoint between them would be closer to than either!). We then define the projection onto by .
Note that convexity is indeed important for uniqueness:
Our new, boundary-respecting gradient descent algorithm then simply applies at each step:
Now, you might think that with another layer wrapping the and gradient, the analysis of this new algorithm will be harder, or the convergence slower. But this happens not at all; in fact, projecting onto makes us converge faster!
Claim: For any and any , .
Proof: If then and there is nothing to prove, so we assume that is not in . Since we are only interested in distances between points, we may translate and scale both the convex body and together whichever way we want without affecting the relation between the distances in the claim, so for our general convenience, assume that and that . This means that the distance under our scaling. The vector naturally defines a hyperplane by . Like all hyperplanes, this hyperplane divides into two halfspaces: the set of all points with scalar product greater than , and set of all points with scalar product less than . As it turns out, the entire set is contained in this latter halfspace, i.e for all , .
To see this, Let and complete to an orthonormal basis . Any point written in this basis which has positive scalar product with must have . Since is convex and , if there exists such a , then for all the point
is in as well. The squared distance between and is given by
For small enough (but greater than ), all the squared terms are negligible compared to the term ; in fact, we can take so that
But this contradicts the fact that the distance between and is ! Hence all must have a negative component, and so their distance from must be greater than . □
Applying onto therefore takes us closer to every point in , and in particular, it takes us closer to the minimizer . Plugging this fact into the analysis of the previous section, we get exactly the same error bounds as in the unconstrained gradient descent algorithm. In other words, as the locals around here like to say, gradient descent eats convex constraints without salt.
(Of course, we implicitly assume here that we can calculate the projection easily. Sometimes this is easy (for example, when you are dealing with linear constraints, or projecting onto spheres), but I can imagine that in some applications the set is given rather implicitly, and you’d have to go deeper into this rabbit hole to get good approximations.)
Gradient descent without knowing the time
One complaint that often gets filed in the deep archives of the Bureau of Gradient Descent is that is given by
The problem with this expression is that it requires knowing the number of steps that the algorithm is going to take in advance. This is fine in some cases – for example, if you really just want an error guarantee of , take to be and let her rip. But sometimes it’s good to have a more adaptive algorithm, that runs and runs and runs and keeps getting better and better until you tell it to stop. In this case you do not know in advance, so what do you do with ?
Luckily, there is a very simple fix. Instead of having the step-size be constant throughout each step, use a time-dependent step-size:
When you plug this in the gradient descent algorithm, you get
The more pretentious among us might say that it’s an entire art form in itself to pick the right , but actually in this particular simple case, choosing the right is not that difficult. Here are two relatively-easy-to-analyze choices:
- In the first, take .
- In the second, set , and take if . In words, your is a step-function approximation of , with longer and longer steps as time goes on.
The calculations which show that these kind of things work aren’t that friendly, but are certainly doable if you rely on the proof of the basic gradient descent algorithm, so I won’t repeat them here. In any case it should be noted, that in some practical, real-world, actual, useful, non-theoretical applications, the total runtime is not that large anyways since empirically the algorithm may converge very fast. Also, in reactive or real-time systems, who has the patience to run a million steps of gradient descent every time they want to do something?
Gradient descent with randomnessNot surprisingly, the gradient descent algorithm assumes that we can calculate the gradient, and if we want the algorithm to really, actually, work in practice, we should also be able to calculate the gradient quickly. But real-life isn’t always nice that way, and many times calculating can be prohibitive. One real-world example is the parameter optimization of artificial neural networks. I won’t go into the details in this post (for that, you can watch the amazing “neural network” video series by 3blue1brown, found here), but the gist is that neural networks are created by “training” them, a process that involves minimizing a cost function which depends on the neural weights of the network. This cost function usually measures the error of the network on a “training set” of samples :
where is some function that measures how well the network with weights performed on the input . The usual paradigm in artificial neural networks concerning the size of the training set is “the more the merrier, and keep ’em coming!” But if there are hundreds of thousands, or even billions of samples, calculating the gradient of the cost function even at a single point can put a large hole in your time-wallet.
Note: The loss functions in neural networks usually aren’t convex. That really doesn’t stop people from doing gradient descent anyways.
Whatever will the artificial intellimancers do? One common tactic that all of us frequently use when faced with lots of work but little time to do it (apart from procrastinating), is to just not do all the work. So the intellimancers asked themselves, “Instead of going over the entire sample set in the above example, is it possible to approximate the gradient by going over just a small portion of it?”
The answer is, of course, a resounding yes! The “gradient” in “gradient descent” can actually be replaced by any vector, as long as on average that vector is equal to the gradient. Doing this is often called “stochastic gradient descent”, and it deserves its own theorem.
Theorem: For , define the difference process
where are some (random) vectors which can possibly depend on . Let be a convex function, and assume that
Then in expectation, gradient descent works! More precisely, after time steps, the stochastic gradient descent algorithm finds a point such that the expected error is smaller than .
Proof: The calculations are actually almost exactly the same as the ones we did for showing that vanilla gradient descent works. We again look at how the distance between and diminishes as a function of time:
Rearranging, we once again get
So far we haven’t done anything new, and in fact haven’t even used any property of whatsoever; the above formula is true for any set of vectors . But when take the expectation, everything falls into place: On the right-hand side, since , the second term becomes
On the left-hand side, the expectation of each term can be calculated using conditioning:
After doing this, we’ve basically eliminated all the randomness in by putting things inside an expected value, and we get the now-familiar equation
Note that we still need to take expectation in the left-hand side, since is a now a random variable itself. Also note that the role of has been taken by . You might think that this removes the Lipschitzness assumption from , but it doesn’t really; in fact, if is supposed to be a stochastic approximation of , then we really should expect to be greater than (intuitively, any time that , we’d need to have a balancing case where , but this works out in ‘s favor) . Anyway, we can now continue as we did before: By convexity for ,
and also by convexity, the right-hand side is equal to . So in total,
One again, choose to get
Setting the expected eficit to and rearranging gives
The keen-eyed skeptics among you should slightly raise an eyebrow, and wonder about the usefulness of such a bound. After all, the algorithm only says that the expectation of the error is small; how can you know whether an individual run is good or not? How can you be sure that, while the expectation is small, you won’t have to occasionally run the algorithm for a tremendous amount of time?
Luckily for us, the quantity is positive, so we can apply Markov’s inequality to it: For every ,
Choosing , we get that
You can then make sure that you get an actual good with low error by running the stochastic gradient descent algorithm in parallel, using independent randomness:
- Set , so that the expected error is smaller than .
- Run independent copies of stochastic gradient descent, each for time steps, giving final points .
- Return .
Each has a probability of at least of giving an error less than , so the probability that all of the ‘s suck is smaller than . So in order to get an error less than with probability at least , you only need to (collectively) run the algorithm for times. Not too bad!
Exercise for the reader: The above algorithm requires you to calculate , which means being able to calculate . This is something that the deterministic gradient algorithm did not require. Can you find a way to get, with exponentially high probability, a point with good error bound, without ever calculating on any point?
By the way, after all this talk and proof, we can finally see the solution to the gradient computation problem in artificial neural networks. Instead of calculating the entire gradient of the cost function which goes over the entire training set , pick at random some small subset of it, and calculate the gradient of the cost function only on those elements. In expectation, the two behave the same, and if the cost function is not naughty, the bound will not be too large. Then just apply the above theorem. Actually, for not to be large, you might need to take quite a lot of samples. There are techniques for combining the deterministic gradient descent and the stochastic one together, but I won’t talk about them in this post.
It’s been a while since we last saw pretty pictures, so here is a simulation showing the difference between ordinary gradient descent and stochastic gradient descent (the vectors are obtained from by simply adding a Gaussian vector as noise).
Gradient descent with curvatureWhen somebody says “convex function”, what pops up in your head? If you’re anything like me (my condolences), it’s probably something like this:
That’s fine, but you will do well to remember that even a linear function is convex:
The difference between the two images, of course, is curvature, and sometimes it makes sense to distinguish between functions which have curvature (such as or ) and functions which do not, such as . Indeed, by definition, a convex function is always found above its tangent, and in this sense a linear function is the least-convex of all convex functions: The inequality is so poor, it’s actually an equality, since the function is its own tangent. This calls for a stronger notion of convexity, which we will aptly call “strong convexity”.
A function is called -strongly convex if we can write
where itself is a convex function. In words, a strongly-convex function is at least as curved as a parabola. In particular, this means that the function is quadratically greater than its tangent:
Claim: If is an -strongly convex function, then
Proof: Luckily we already did all the work for proving this! Rearranging the definition of strong convexity, we have that
is a convex function. We already proved earlier that convex functions are above their tangent, so
The gradient of at is just equal to , and plugging this in and rearranging yields
The last term on the right-hand side is just , so
By the cosine law, this is exactly what we want! □
Being strongly convex can strongly help us when performing gradient descent, and the stronger, the better. Intuitively, the strong-convexity condition forces the gradient to be very large whenever we are far away from the minimum. This means we can take to be relatively small (so that we don’t overshoot when near the minimizer), and still be able to take large steps when we are far away from from the minimizer. With any luck (or, in this case, prior knowledge of the proofs), this will lead to faster convergence, so that less steps are needed in the algorithm to achieve the same amount of accuracy.
To see just how useful curvature can be, let’s consider first gradient flow, just like we did for the unstrong case. We start at some , and let evolve according to the differential equation
As before, a good way to analyze this equation is to look at the square distance of from the true minimizer .
As before, since is a constant and does not depend on time, the left term in the inner product is just equal to . So
Now we can use the above claim about strong-convexity to bound the size of the scalar product ; after slight rearranging, we get
Since is always positive, we arrive at a very pleasant differential inequality:
In particular, the quantity decays faster than if there was an equality, which corresponds to a decreasing exponential!
Gradient flow converges exponentially quickly to the minimizer, which, if you ask me, is not too shabby. And while this only talks about the distance to the optimum, we can always assume that is -Lipschitz, so being -close to the minimizer means being only close the actual minimum. Thus
which is much, much better than the estimate we had for the non-curvy case.
Unfortunately, not all is perfect in the land of the discrete. We already saw earlier that gradient descent, despite its unbearable adorableness, suffers from discretization errors, and in the non-curved case, only got a eficit bound, whereas the cool and continuous gradient flow got a eficit. Sure, that’s only a square root difference, which in terms of exponentials doesn’t really count for anything, but so far nothing ensures us that gradient descent with curvature will be as good as gradient flow with curvature. Luckily, as always, the calculations aren’t too difficult.
Theorem: Let be -strongly convex and -Lipschitz, and let . With the proper choice of (which will now really depend on ), the gradient descent algorithm can find a point so that the deficit is small, i.e , in just steps!
Before I give the proof, a couple of soothing words are in order, since I’m sure that you are outraged: How can it be that the gradient flow improved from an error of to an error of when introducing curvature, but gradient descent only improved from to ?! Well, such is life. The not-so-spectacular increase stems from the fact that while having large curvature makes sure that we take (relatively) big steps when we are far away from the minimum, it doesn’t really say anything about what happens when we are close to the minimum. In that case, there is still always a looming fear of overshooting due to discretization, which means that needs to be small. This of course doesn’t happen in the continuous gradient flow, which can enjoy the extra-quick velocity unabashed and unabated. Not all is lost for the discrete gradient descent, however, and it turns out that it is possible to get exponentially good convergence, with just one extra assumption. But that’s already peeking into the future. For now, let’s see the proof, which is only a tad more clever than the original gradient descent analysis.
Proof: Way back when we were younger, we found out that
Since is -Lipschitz, this immediately reduces to
Now we use the fact that is -strongly convex: The left-hand side is bounded below by
It’s time to choose . Unlike the vanilla case, we’ll take to depend on time, and set . Plugging this choice in,
Multiplying both sides by , we are nearly there:
Why are we nearly there? Because we are now in a position to sum over all from to . In the right-hand side we (miraculously and serendipitously) get a telescopic sum, leaving only
Luckily, the first term is multiplied by 0, and the second term is negative; we can safely say then that the right-hand side is quite small:
For the left-hand side, we again use convexity: Since , the weighted sum
is a convex combination of vectors, and so
Plugging this into our bound finally gives
Setting this eficit to be equal to and rearranging finally gives
as promised. □
Some final, short remarks before we carry our descent:
- The keen-eyed reader will notice that there is no dependence on in this bound! The initial conditions have been forgotten. Technically, this is because we tailored so that the term containing goes away; intuitively, it’s because if we start far away from the minimum then we take a large step.
- There is a dependence on both in and in . This is not surprising: The larger the , the more curvature the function has, and the smaller we can take . Of course, for depressingly tiny, perhaps it’s better to use the ordinary gradient descent instead. Please do not plug in .
- Do real problems on the street actually have curvature? They might. Quadratics are abundant both in nature and in Nature. Also, the intrepid intellimancer can always artificially add to whatever function she originally intended to compute, and hope that minimizers of the modified function are at least marginally related to minimizers of the original function. All is fair in love and optimization.
Gradient descent with momentum (smooth fella)Being strongly-convex, i.e having curvature, isn’t enough to guarantee really fast convergence. To avoid overstepping, we must take small, which sort of ruins all the fun of taking large steps. One possible solution to this problem (and an interesting property regardless of strong convexity) is to assume that the function is smooth. This is the formal definition:
A function is called -smooth if is -Lipschitz. In other words, for all ,
Intuitively, this means that the gradient can’t change by too much when performing gradient descent, even if we take have a relatively large stepsize. This might let us take larger steps, at least in some situations.
Technically, we’ll need the following consequence of -smoothness:
Proposition: If is a convex, -smooth function, then
In other words, if we take a gradient descent step with stepsize , then we reduce our value (i.e get closer to the minimum) by at least . That’s not too bad!
Proof: The proof is technical, and in my opinion, not all that interesting, so you can skip it. It goes about as straightforwardly as you think it would: We express the difference of ‘s values as an integral over its derivative, and use the smoothness property (invoking Cauchy-Schwartz) to say that this must be large.
We first show that for any two points ,
This may seem at first glance to be quite identical to the inequality we had for -strongly convex functions, but don’t be fooled – the inequalities go the opposite way here (isn’t it amazing that we can extract good stuff from two opposing inequalities?). To prove the above formula, note that
thus, by the fundamental theorem of calculus,
By the Cauchy-Schwartz inequality,
and since is -smooth, the first term is also small:
Combining the two inequalities, we get
All that remains is rearrange and plug in the gradient step, using as , and as in the above inequality:
Being -smooth is a second order phenomenon – it talks about how quickly the gradient changes, not how quickly the function itself changes. Exploiting this property to directly to get a really good time bound is therefore a tad harder than exploiting properties relating to explicitly; the algorithm and proof we will see below will be the most complicated in this entire post. On the plus side, this will be a great opportunity to introduce a totally new technique to the our repertoire.
So far, the gradient descent algorithm was very dumb, and in particular, very memoryless: The new position depended only on calculations performed at , and totally disregarded the possibly rich and exciting history of the trajectory. This short-term approach will not do when trying to utilize the -smoothness of a function (well, it won’t work naively, anyway). But it turns out that there is a way to meaningfully combine knowledge about previously taken steps together with the current position, and get some strong results.
Let’s think back to the (epic) limerick that started all of these gradient shenanigans, about water flowing down a slope. It is true that water, when situated on a slope, flows downwards. However, it is also true that water, once it starts flowing, has a hard time stopping, and often splashes and crashes in torrents and waves around rocks, trees, buildings, or whatever else stands in its way. This is because it has momentum. Having momentum allows it to cross and gloss over some of the tiny details and landscape features that would otherwise cause it to detour and take a longer path towards the glorious minimum.
Computer scientists, being awed by nature as they are, have tried to imitate this behavior by running gradient descent algorithms which tend to keep going in the same direction that they moved up till now. Intuitively, if we think in physical terms, instead of having the gradient be the instantaneous velocity of a particle, we can use it as an instantaneous acceleration instead. Here is one such algorithm:
where is some time-dependent coefficient. The difference is basically the instantaneous velocity of a particle whose position is , and if we view as a (time-dependent) mass, then can be seen as the monementum of the particle. With this definition, the step of “gradient descent with momentum” is
Theorem: Let be a -smooth convex function, and let . With the right choice of , the gradient descent with momentum algorithm can find a point so that the deficit is small, i.e , in just steps!
Proof: For notational convenience (and because that’s how it’s written in my notes), denote and . Under this notation, we just have
Since is -smooth, we can use the proposition that we proved above applied to instead of . The proposition gives
and plugging this into the difference gives
By good ol’, standard, run-of-the-mill convexity, this is bounded by
That’s that for the difference between two consecutive ‘s. We can also bound a single eficit in a similar fashion, eventually yielding
Now things get a little tricky (but only a little). When we run this momentous algorithm, say for time steps, what we eventually care about is the final deficit . But the expression above contains lots of ‘s and ‘s and whatnots. Our only hope for salvation is if we can somehow get some telescopic sum concerning both and the differences . Since the terms on the right-hand side do not automatically telescope on their own, we’re going to have to help them a bit. For this purpose, let be some real numbers that we’ll choose later. Let’s look at the expression :
Multiplying both sides by and rearranging the left-hand side a bit, we get
Using the Pythagorean fact that , the scalar product term in the right-hand side can be written as
(we took and ). This has the benefit of cancelling with the norm of when plugging it into the deficit difference, so that
We’re getting somewhere! (in fact, we are nearly there). In both the left-hand side and the right-hand side, we have the differences of two terms which are very much similar. We’d be the happiest creatures on Earth if, just like in the analysis of vanilla gradient descent, they could somehow transform into telescopic sums, so that each side would sum up and practically cancel itself out, leaving us with only in the left-hand side and some not-too-large expression in the right-hand side. Since we didn’t choose yet, we even have hope for making this happen. But alas, there are two different expressions we need to telescopize, but only one degree freedom in choosing the variables . Even if we fix so that the left-hand side cancels out, won’t the right-hand side be a party-pooper and spoil our cancellations?
The situation seems grim, but do not despair. When all seems lost, that is when the Deus comes out of its Machina. For remember, our notation of ‘s and ‘s hides the fact that underneath all the calculations floating about, we still haven’t chosen , the mass at step ! We actually have two free variables to juggle, and this will give us exactly the extra leeway that we need. We’ll first choose so that the left-hand side telescopizes. When summing two consecutive time steps, say and , the coefficients are
So in order to cancel out the in the middle, we’ll take . This is just a quadratic equation, which we definitely know how to solve:
So are increasing and of the order (we basically increase the value by every time). If we set (and start counting only from onwards), then we even have , and so our promise that holds true.
As for the right-hand side, in order to get a telescope, all we have to do is to make sure that . We already consumed our freedom of , but using the fact that , the desired equality reduces to
So all we need to do is pick so that , i.e
I’ll admit, it might seem a bit counterintuitive to choose based on , since “in principle” the weights are chosen first, as part of the algorithm, while the coefficients are just a construct for our analysis of the algorithm. But that’s how we get good weights! (or, at least, good theoretical bounds for this detached analysis. As for the real world, well, real-life people making real-life choices can pick to be something else entirely, if it suits them).
Anyway, having set in stone both and , all the complicated equations above reduce to the simple inequality:
Summing from to and remembering that (very suspicious, isn’t it?), we end up with
Since (proof by Mathematica), then
Equating this to as in previous proofs gives as needed. □
That was quite a ride! Of course, this is just the tip of the iceberg, both for -smooth functions, and for gradient descent with momentum. The worst-case theoretical analysis often differs from real life, and it turns out you can be pretty smart with choices of time steps. Check out, for example, this impressive cool webpage.
Gradient descent with both curvature and smoothnessOur descent to madness is nearing its end; there’s only one more rabbit hole I’d like to peek into today. But before that, let’s recap:
- Without any assumptions, gradient descent gets close to the minimum in time .
- For -strongly convex functions, this is improved to .
- For -smooth convex functions, this is even more improved to .
- However, none of this gets close to the mighty exponential accuracy of strongly convex gradient flow.
It turns out that by combining the assumptions, gradient descent can actually be as good as gradient flow!
Theorem: If is both -strongly convex and -smooth, then there is a gradient descent algorithm which gets -close to the minimum in time .
Proof: You’re on your own here, kiddo 😉 .
Teaser trailer: Gradient descent with different geometryIf you reached this point, you must have already converted to the gradient descent deity, offering pious gifts for quick, dimension-free optimization. But it is time to reveal our darkest secret: Gradient descent is merely but a false idol of the Pure True Algorithm, an imitation, a shadow! Here are two harsh examples showcasing its imperfectness:
1) Although none of the error bounds we proved above explicitly depend on the dimension of the space we are exploring, the dimension can sometimes implicitly sneak in through one of the properties of such as the Lipschitz constant. Even a seemingly innocuous function can carry quite a heavy Lipschitz constant, as this sort-of-real-world example reveals:
The expert advice problem: Imagine that you are a layman who is clueless about what to do in life (aren’t we all?). In front of you are experts, labeled to , who are eager to help you and give you advice. At each time step , you hear out what they have to say, then choose one expert to follow. After you have made your decision, you can retroactively see which expert gave good advice and which expert gave bad advice, in the form of a vector , where if expert number was correct at time , and if expert number was wrong at time . A strainingly-contrived scenario which this sort of thing can represent is betting on who will be the winner at the Olympic games. At the beginning of the day you pick up the newspaper to read the opinion of renown sports experts; you then bet with your friends who will win the 100-meter dash; at the end of the day, you either go to sleep satisfied and (moderately) richer, or curse the person who invented physical activities.
Obviously, your goal is to minimize the number times you followed bad advice. Machine learners like to define the average regret at time , which is how much you are sorry for your choices when compared to following the advice of any one particular expert throughout your life:
Inddeed, for any particular , the expression is if the expert you followed (which is ) gave bad advice while expert number gave good advice. Just like in regular life, you want to minimize the regret.
The question is, then, how do you choose ?
A decent descent solution: Since you want to minimize something (the regret), gradient descent may be the way to go! For , define . This is a convex function, being just a scalar product between and a fixed vector. Using the projection methods we described above, we’ll restrict ourselves to the simplex, which is the convex set of all probability distributions on items:
Even though the function now depends on time, the gradient descent algorithm works just as well (you can check it in the proof way way back). Starting with , we’ll then get a series of points such that
where is the minimizer of in the set . The vector can be viewed as a convex combination of experts – it tells you how much you should listen to each expert, relative to one another. Being the minimizer of , it is actually the maximizer of , so the bound above only gets better if we replace it by a fixed expert.
So how to pick the expert ? Run the gradient descent, and at each time step, treat the vector as a probability distribution on the experts and pick one of the experts at random according to this distribution. Our bounds then show that in expectation, the regret is small:
Everything seems fine and dandy, until we remember that we finally have a concerete problem, and we have to plug in the values of and . The distance between points doesn’t bother us: The simplex is contained in a ball of radius , so we certainly have . It’s the Lipschitz constant that aches: Since is linear, we just have ; in the worst case (the best case, actually), when for all , we’ll have !
This can really, really suck, especially if there is a plethora of experts to choose from (which is true in the sports industry, but also in more specialized problems in machine learning). What’s worse, is that there are algorithms for the expert advice problem which get bounds that scale as the of the dimension! (for example, the well-known (among its friends) exponential weights algorithm). How has the gradient forsaken us?
2) We stated earlier that all gradient descent needs is a way to calculate the gradient and a way to add vectors, so that it can potentially work when the objects you are trying to minimize over are infinite-dimensional, like continuous functions. Indeed, addition and calculating the gradient is all that the basic gradient descent requires:
But this is a bit misleading: The innards of all our proofs also relied on scalar products, which, while very friendly and nice, are not a general property of infinite spaces. If the vectors / functions / objects live in weird normed spaces, whose norm does not originate from an inner product, then all the precious geometric structure on which our arguments rested is lost. Even worse, in such spaces the gradient of a function is no longer a vector, so the very sum itself: doesn’t really immediately make sense in such spaces. (For the mathematically inclined: In general normed spaces, scalar products are replaced by linear functionals, and the space of functionals is not necessarily isomorphic to original space.)
As allured, these nearly-fatal flaws can be fixed, resulting in a very general, very powerful, and (eventually) much deeper algorithm, mirror descent. The concept is actually simple: Rather than try to conduct all calculations in the original space, define a “mirror map” which carries you over to a benign, peace-loving workable space (the dual space), perform all calculations there, and then map back to the original space. So in addition to all the technical tricks and schemes we utilized in ordinary gradient descent (momentum, stochastics, strong convexity…), there is an added layer of complexity in choosing and analyzing the mirror map . The deep ends of this technique can lead to some beautiful mathematics, and also some strong results (at the very least, mirror descent can solve the expert advice problem with a bound of , which is much better than ). But, I think, perhaps this can wait for another post.