Descent into madness

This post is about the basics of the “gradient descent” method for finding the minimum of a function. I started writing it mainly to review the optimization material of lectures by Sébastien Bubeck given in Seattle. All of the material can be found elsewhere (for example, Sébastien’s book), but I can assure you that in terms of the ratio between lame jokes and mathematical formulae, this blogpost is second to none (ah yes, but in which direction?). Hopefully someone might find this slightly-less-technical, slightly-more-nonsensical presentation useful (who knows, maybe even you!).

Due to concerns of oversized luggage, here is a clickable table of contents.

  1. High dimensional rock bottom
  2. It’s all downhill from here
  3. Gradient flow
  4. Gradient descent
  5. Gradient descent with boundaries
  6. Gradient descent without knowing the time
  7. Gradient descent with randomness
  8. Gradient descent with curvature
  9. Gradient descent with momentum (smooth fella)
  10. Gradient descent with both curvature and smoothness
  11. Teaser trailer: Gradient descent with different geometry

High dimensional rock bottom

Here is a situation which happens to all of us in our daily lives: We are given a real-valued function f : \mathbb{R}^n \to \mathbb{R} and we need to find its minimizer – a point x^* \in \mathbb{R}^n such that f(x^*) \leq f(x) for every other x \in \mathbb{R}^n . We often write this as x^* = \mathrm{argmin}_{x\in \mathbb{R}^n }f(x) . By “find the minimizer”, I mean algorithmically, i.e with a computer program; simply writing “x^* = \mathrm{argmin} f(x) ” and calling it a day just won’t do.

This post will deal with solving this problem, starting with very simple methods and a lot of assumptions about our computational power, and then slowly descending into the madness that is convex optimization, describing more elaborate techniques that solve the problem either faster or with less assumptions.

What sort of assumptions do we make? Well, for starters, we’ll assume that we can easily calculate f(x) for any x we want, and if needed, its derivatives. We’ll also implicitly assume that f has a minimum, and, if it helps, that this minimum is found in some bounded ball B(0,\frac{1}{2}R) of radius \frac{1}{2}R (so that the distance between any two points inside is bounded by R ). Finally, we’ll assume that f is what parental mathematicians call “well-behaved” (for some reason, they never seem to say “f isn’t naughty”), i.e it doesn’t change too wildly when you vary its input by a small amount. This can be formally phrased by saying that f is L -Lipschitz, meaning that there is a constant L > 0 such that

|f(x) - f(y)| \leq L \|x-y\|

for all x,y \in \mathbb{R}^n .

Since we are talking about a computer algorithm, finding a True-Minimizer of f is in general impossible, due to finite precision and finite run time. But we can find a near-minimizer, which is often almost as good as a True-Minimizer. Possibly the simplest way to do this is as follows: Divide the ball B(0,\frac{1}{2}R) into a tiny tiny grid G , so that any point x \in B(0,\frac{1}{2}R) is at distance at most \varepsilon from a point in the grid; this is called an \varepsilon -net. Here are examples of nets and functions in one and two dimensions (higher dimensions sold separately):

Denote by z = \mathrm{argmin}_{x \in G}f(x) the minimizer of f among all points in the grid, and let g \in G be the grid point closest to the true minimizer x^* , so that \|g - x^*\| \leq \varepsilon . Since z is the best grid point, f(z) \leq f(g) , and since f is L -Lipschitz, f ‘s value on z is close to its value on x^* :

f(z) - f(x^*) \leq f(g) - f(x^*) \leq L \|g - x^*\| \leq L \varepsilon.

If we know L , we can choose \varepsilon to be small enough so that f(z) is as close to the minimum as we want! Easy!

There are several problems with this approach, the cruellest of which is the sample complexity – the number of points the algorithm must inspect, and hence the time needed to find the minimizer. The problem is that the number of points in an \varepsilon -net in n dimensions scales like (R/\varepsilon)^n . For small \varepsilon and even moderately-sized n , the number of points that the algorithm samples is enormous. That’s the curse of dimensionality for you. So although you might get away with this sort of thing in one-to-three dimensions, you can forget about it when dealing with functions of say, a thousand variables. And unfortunately, thousand-variabled-functions are everywhere these days, especially in hot buzzword-topics like Machine LearningTM.
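To make the curse concrete, here is a minimal sketch of this brute-force search; the toy function, radius, and grid spacing are all made up for illustration:

```python
import itertools
import numpy as np

def grid_minimize(f, R, eps, n):
    """Brute-force search over an eps-net of the cube [-R/2, R/2]^n.
    The grid has about (R/eps)^n points -- the curse of dimensionality."""
    ticks = np.arange(-R / 2, R / 2 + eps / 2, eps)
    best_x, best_val = None, np.inf
    for point in itertools.product(ticks, repeat=n):
        x = np.array(point)
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Toy convex function with minimizer (0.25, -0.25), chosen for illustration.
f = lambda x: np.sum((x - np.array([0.25, -0.25])) ** 2)
x, val = grid_minimize(f, R=2.0, eps=0.05, n=2)
```

With n=2 this checks 41^2 = 1{,}681 grid points; the same settings with n=10 would need 41^{10} , roughly 10^{16} points. Hence the luggage concerns.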

It’s all downhill from here

Here is a limerick from the Technion’s second course in calculus for scientists:

There once was a student named Bill,
Whose bucket was placed on a hill.
He kicked with all might,
Then ran out of sight;
Which way will the water spill?
(Originally in Hebrew:

סטודנט אחד ושמו חיים / בעט בטעות בדלי מים / הדלי נשפך / הסטודנט ברח / לאן יזלו המים?)

The obvious answer is: Opposite the gradient! Indeed, if you ever find yourself stranded on top of a mountain in a bizarre n -dimensional landscape, then the fastest way to go downhill is to go opposite the gradient.

The gradient descent algorithm is based on this basic observation, and indeed all it does (essentially) is take small steps in the direction opposite the gradient \nabla f(x) , until you find a minimizer x^* of f . You can think of this like gently rolling a ball downhill:

This assumes, of course, that f is differentiable, so it’s possible to even speak of “the gradient \nabla f of f “. There are things you can do in case f is not differentiable, but we’ll stick to this assumption for the rest of this post.

The algorithm runs in steps, and at time step t remembers the current position of the ball, x_t . The next position is obtained by taking a small step in the direction opposite the gradient:

x_{t+1} = x_t - \eta \nabla f(x_t).

I haven’t stated precisely in what way these steps are taken, and indeed this is the entire art of gradient descent, but that’s the overall idea. However, the keen-eyed among you will already notice a fatal flaw even in this very general program: A function can have multiple local minima, and going downhill can cause us to become stuck in one of these pits and miss out on the true minimum:

There are various ways to try and fix this (for example, we can repeat the algorithm many times from different starting points), but by far the easiest one is to assume that f has only one minimum. Now we’re guaranteed that descending along the gradient will give us a global minimum. Right? Well, no:

Indeed, in flat landscapes, going downhill goes nowhere (but it goes there extremely fast), and multi-dimensional flat landscapes can be pretty tricky. So just to be on the safe side, we’re going to assume that the function f is convex, i.e for all x,y\in \mathbb{R}^n and all t\in \left[0,1\right] ,

f((1-t)x + ty) \leq (1-t)f(x) + t f(y).

This is often stated as “the graph of a convex function is always below its chords” (see image above), and it rules out plateaus which are only locally minimal. Convex functions have only one minimal value, and as we’ll see, convexity also ensures that the various gradient descents indeed reach this minimum (and quickly so!). So from now on and until the end of time, we’ll always assume that f is convex.

Convexity lets us bound how close a point x is to being a minimizer x^* . Define the “deficit” (or should I say, “\delta eficit”) by

\delta(x) = f(x) - f(x^*).

Since x^* is the minimizer of f , we always have \delta(x) \geq 0 , and our goal to minimize f(x) translates into a goal to find x such that \delta(x) is really really small. To this end, we use a common alternative description of convexity, which says that a convex function is always above its tangent:

Mathematically, this is written as

f(x) - f(y) \leq \nabla f(x) \cdot (x - y).

To see this, rewrite the convexity assumption as

f(x +t(y-x)) \leq (1-t)f(x) + t f(y).

Subtract f(x) from both sides and divide by t\|y-x\| to get

\frac{f(x +t(y-x))-f(x)}{t\|y-x\|} \leq \frac{f(y)-f(x)}{\|y-x\|}.

Sending t \to 0 , the left-hand side becomes the directional derivative of f in direction y-x at the point x . Multiplying by -\|y-x\| then gives the result, since \nabla f(x) \cdot (y-x) is exactly equal to \|y-x\| times the directional derivative of f in direction y-x .

Using this result with y = x^* , we get a bound on the deficit:

\delta(x) \leq \nabla f(x) \cdot (x - x^*) = -\nabla f(x) \cdot (x^* - x).

Gradient flow

Let’s first show that going in the direction opposite the gradient indeed gets you to the minimum. Here, by “going” I actually mean “flowing”, i.e we’ll treat the position of our ball as a function of continuous time, and update it according to a differential equation. This is a very physicsy thing to do, and is something that computers have a hard time with, since unlike mathematicians, they cannot make infinitely many infinitesimally small steps in a single time moment. Indeed, a computer trying to solve a differential equation will have to discretize time in some way, and this, as we’ll see, introduces errors. But for now, the Platonic ideal of gradient flow will give us intuition as to how going downhill eventually gets us to the bottom (geez, when you put it this way, it really sounds obvious).

Formally, let x_0\in \mathbb{R}^n . The gradient flow is a function x(t) which starts at x(0) = x_0 and is governed by the differential equation:

\frac{d}{dt}x(t) = - \nabla f(x(t)).

The hope is that if we “run” the differential equation for a long enough time \tau , then x(\tau) will be very close to the minimizer x^* . This is quite true. It can be seen by example in the following two images, which show the path of a gradient flow on both the surface plot and heatmap of a two-dimensional function:

It can also be seen by the following theorem:

Theorem: For every time \tau > 0 ,

\delta(x(\tau)) \leq \frac{\|x_0 - x^*\|^2}{2\tau }.

Proof: Let’s look at the distance between x(t) and x^* . The rate of change of the square of that distance is

\frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 = \left(\frac{d}{dt}(x(t) - x^*)\right)\cdot \left(x(t) - x^* \right) .

Since x^* is a constant and does not depend on time, the left term in the inner product is just equal to -\nabla f(x(t)) . So

\frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 = -\nabla f(x(t)) \cdot (x(t) - x^*) \leq -\delta(x(t)),

where the last inequality used the inequality for \delta(x) that we obtained by convexity of f . This is a nice intermediate result: We would have really liked it if we could flow directly in the direction x^* - x(t) . Instead, all we can do is flow in the direction of -\nabla f(x(t)) , but as the above equation shows, these two vectors are positively correlated, since \delta(x) is always positive. So while flowing along the gradient won’t necessarily take us on the shortest route towards the minimizer, at least it’s always going in the right direction.

Anyways, now we can rearrange and integrate both sides from 0 to \tau :

\int_0^\tau \delta(x(t))dt \leq  -\int_0^\tau \frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 dt.

The right-hand side is just equal to \frac{1}{2}\|x(0)-x^*\|^2 - \frac{1}{2}\|x(\tau)-x^*\|^2 \leq \frac{1}{2}\|x_0 - x^*\|^2 , and dividing by \tau gives

\frac{1}{\tau} \int_0^\tau \delta(x(t))dt \leq \frac{\|x_0 - x^*\|^2}{2\tau}.

Now, \delta(x(t)) is a decreasing function of t ; this can be seen, for example, by differentiating:

\frac{d}{dt}\delta(x(t)) = \frac{d}{dt} (f(x(t))-f(x^*)) = \nabla f(x(t)) \cdot \frac{d}{dt}(x(t))= - \|\nabla f(x(t)) \|^2 \leq 0.

Thus \delta(x(\tau)) \leq \frac{1}{\tau} \int_0^\tau \delta(x(t))dt , which proves the theorem. □

This process actually gave us another point x for which the deficit \delta is small: Since \delta(x) is a convex function of x , we always have

\delta\left(\frac{1}{\tau} \int_0^\tau x(t)dt\right) \leq \frac{1}{\tau}\int_0^\tau \delta(x(t))dt,

so the point x = \frac{1}{\tau}\int_0^\tau x(t)dt also satisfies the inequality stated in the theorem. This trick will come in handy in the future, where we won’t be able to guarantee that the last position is the best one.

The term \|x_0 -x^*\|^2 is sort of necessary: If we happen to start very far off from the minimizer, we’re going to have to flow quite a lot. But it’s really not too bad, especially if we assume as before that the minimizer is at distance at most R from the initial point; if we want an \varepsilon error in the function’s value, we only need to run the gradient flow for R^2 / 2\varepsilon time.
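A computer cannot truly flow, but it can fake it with very small Euler steps; here is a quick numerical sanity check of the theorem’s bound (the function f(x) = \|x\|^2 , the starting point, and the flow time are illustrative choices):

```python
import numpy as np

# Illustrative convex function f(x) = ||x||^2, with minimizer x* = 0.
f = lambda x: np.dot(x, x)
grad = lambda x: 2 * x

x0 = np.array([3.0, -4.0])            # ||x0 - x*|| = 5
tau, dt = 2.0, 1e-4                   # total flow time, and a tiny Euler step
x = x0.copy()
for _ in range(int(tau / dt)):        # crude approximation of dx/dt = -grad f(x)
    x = x - dt * grad(x)

delta = f(x)                          # deficit at time tau, since f(x*) = 0
bound = np.dot(x0, x0) / (2 * tau)    # theorem: delta <= ||x0 - x*||^2 / (2*tau)
```

On this friendly function the flow does much better than the bound (it decays exponentially fast); the R^2/2\tau rate is what the theorem can guarantee for every convex f .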

Note: When time is continuous, it can always be rescaled: For example, instead of the original differential equation, we could have mandated that x(t) evolve according to \frac{d}{dt}x(t) = -K \nabla f(x(t)) , where K is as large a constant as we want; this corresponds to flowing “faster” by a factor of K . But as mentioned above, this continuous-time example is just supposed to serve as intuition; the next algorithm, which is indeed discrete and can actually be implemented by a real bit-chomping computer, won’t indulge in such luxuries.

Gradient descent

The classical way of solving differential equations is to discretize time. Quite informally, instead of writing \frac{d}{dt}x(t) = - \nabla f(x(t)) , we write dx(t) = - \nabla f(x(t))dt , and instead of the infinitesimal dt , we take a small time step which we denote by \eta . Our algorithm then reduces to iterating the following difference equation:

x_{t+1} = x_t - \eta \nabla f(x_t)

Note: There are fancier ways of numerically solving differential equations, such as Runge-Kutta, which will arguably (and probably provably) produce more accurate discretization results. But I favor this simple single-step method, because its analysis is easier (and more importantly, because it’s the method that was covered in the summer school I attended…)

Choosing the step size \eta is an important part of the design of the algorithm. The gradient changes from place to place, so if \eta is too large, we can easily overstep and stray from the optimal, flow-induced trajectory:

Doing this can be dangerous, potentially even forcing the algorithm into an infinite loop. For example, consider the very simple case of the one-dimensional f(x) = x^4/4 . The negative gradient in this case is just - \nabla f(x) = -x^3 , so the difference equation becomes

x_{t+1} = x_t - \eta x_t^3.

The minimum of this function is of course x^* = 0 . Suppose that at some point we just happen to arrive at x_t = \sqrt{2/\eta} . Then the next position will be

x_{t+1}= \sqrt{2/\eta} - \eta \left(\sqrt{2/\eta}\right)^3 = \sqrt{2/\eta} - 2 \sqrt{2/\eta} = -\sqrt{2/\eta} = -x_t,

and from this point on the algorithm will just bounce back and forth between \sqrt{2/\eta} and -\sqrt{2/\eta} , missing the minimum 0 entirely! Overshooting the minimizer will happen less often if \eta is small.
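This two-point dance is easy to reproduce in a few lines (taking, for illustration, \eta = 0.5 , for which the unlucky starting point is \sqrt{2/\eta} = 2 ):

```python
# Gradient descent on f(x) = x**4 / 4, whose gradient is x**3.
eta = 0.5
x = (2 / eta) ** 0.5              # the unlucky starting point, sqrt(2/eta) = 2.0
trajectory = [x]
for _ in range(10):
    x = x - eta * x ** 3          # 2 -> -2 -> 2 -> -2 -> ... forever
    trajectory.append(x)
```

The iterates bounce between 2 and -2 indefinitely, never getting any closer to the minimizer 0 .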

Here is a catastrophic failure of the algorithm for the example we used so far (the image shows 100 steps of gradient descent!):

On the other hand, too small an \eta means that we will need to perform many, many steps in order to get close to x^* . Viewed another way, suppose that we decide on the number of steps that we are going to run the algorithm in advance, and only then choose \eta . If we choose \eta too small, we aren’t going to move very far from the original starting position.

To conclude this discussion: Choosing \eta too large can cause overstepping, meaning that we take steps which are too large in unwanted directions which do not directly lead us towards x^* . Choosing \eta too small gives a better approximation of the continuous differential equation, but may lead to super-slow convergence, so the final x_T is not even close to x^* . The gradient descent algorithm has to balance between these two types of errors: The “discretization error”, and the “forget the initial condition error”. The formal analysis is given below:

Theorem: Let f be convex and L -Lipschitz and let \varepsilon > 0 . With proper choice of \eta , the gradient descent algorithm can find a point x so that the deficit is small, i.e \delta(x) \leq \varepsilon , in just T = \frac{R^2 L^2}{\varepsilon^2} steps.

Proof: Similarly to the gradient flow analysis, let’s look at how the distance between x_t and x^* changes when the algorithm takes one step. By the cosine law,

\| x^* - x_{t+1}\|^2 = \| x^* - x_{t} + \eta \nabla f(x_t)\|^2 = \| x^* - x_{t}\|^2 + \|\eta \nabla f(x_t) \|^2 + 2(x^* - x_t)\cdot (\eta \nabla f(x_t)).

Rearranging, we have

\| x^* - x_{t+1}\|^2 - \| x^* - x_{t}\|^2 = \|\eta \nabla f(x_t) \|^2 + 2(x^* - x_t)\cdot (\eta \nabla f(x_t)).

The term on the left-hand side tells us by how much the distance to x^* improved; we want this quantity to be as negative as possible. Now, the first term on the right-hand side is always positive, and works against us; but the second term can make up for it, for as we know, \nabla f(x_t) is negatively correlated with x^* - x_t . To see how this plays out, multiply by \left(-1\right) and rearrange again to see how large this correlation is:

\nabla f(x_t) \cdot (x_t - x^*) = \frac{\| x^* - x_{t}\|^2 - \| x^* - x_{t+1}\|^2}{2\eta} + \frac{\eta}{2}\|\nabla f(x_t)\|^2.

Suppose that we run the algorithm for T steps. Summing up the above expression for all times t=0,\ldots,T-1 , the distances \|x^* - x_t\|^2 form a telescoping sum, and after a division by T we get

\frac{1}{T}\sum_{t=0}^{T-1}  \nabla f(x_t) \cdot (x_t -x^*) = \frac{\| x^* - x_{0}\|^2 - \| x^* - x_{T}\|^2}{2\eta T} + \frac{\eta}{2T} \sum_{t=0}^{T-1}\|\nabla f(x_t) \|^2.

The first term on the right-hand side is certainly smaller than \| x^* - x_{0}\|^2/2\eta T , which itself is smaller than R^2 / 2\eta T , and since f is L -Lipschitz, the second term is smaller than \eta L^2 /2 . Thus

\frac{1}{T}\sum_{t=0}^{T-1}  \nabla f(x_t) \cdot (x_t -x^*) \leq \frac{1}{2}\left( \frac{R^2}{\eta T} + \eta L^2\right).

So far we haven’t even used the convexity of f , and the sums were valid for any Lipschitz function. But now, define x = \frac{1}{T} \sum_{t=0}^{T-1}x_t . Then by the basic definition of convexity,

f(x) - f(x^*) = f\left(\frac{1}{T} \sum_{t=0}^{T-1}x_t\right) - f(x^*) \leq \frac{1}{T}\sum_{t=0}^{T-1}f(x_t)-f(x^*).

But we also saw that for convex functions, f(x) - f(y) \leq \nabla f(x) \cdot (x - y) for all x and y . Thus each term in the right-hand side is smaller than \nabla f(x_t) \cdot (x_t -x^*) , and combining this with the expression we got before, we reach the stunning result:

\delta(x) = f(x) - f(x^*) \leq \frac{1}{2}\left( \frac{R^2}{\eta T} + \eta L^2\right).

As if by divine prophecy, we see here exactly the two types of errors that we promised we would see: The first, R^2 / \eta T , comes from having a step size too small to get from x_0 to x^* in any reasonable time. You can see that it grows smaller as both T and \eta grow larger. The second, \eta L^2 , stems from overshooting. As per our dictated intuition, the larger the step size \eta , the larger the error; but the error also grows with the Lipschitz constant of f . In other words, if f can have very steep ridges and valleys, i.e it changes too quickly over short distances, these overshoots will cause us to miss the mark.

(The keen-eyed reader will also note another divine prophecy – the algorithm did not eventually take x = x_T (we have almost no information on how good x_T is), but rather used convexity and took x to be the average of all points visited by the algorithm. Now, by the pigeonhole principle, one of the sample points must be better than the average, so in principle if you can easily calculate f , you can take the best x_t . But this just serves to highlight an interesting feature of gradient descent: It only needs to calculate \nabla f(x) in order to work, and not f(x) itself!)

Alright! All that’s left to do is optimize over \eta . Choose \eta = \frac{R}{L \sqrt{T}} to get

\delta(x) \leq \frac{RL}{\sqrt{T}}.

Setting the \delta eficit to \varepsilon and rearranging gives

T = \frac{R^2 L^2}{\varepsilon^2}

as promised by the theorem. □
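The whole algorithm, including the theorem’s step size and the averaging trick from the proof, fits comfortably in a few lines of code. The quadratic test function below is an illustrative choice (its gradient 2x has norm at most 2R on the ball B(0,R) , so we may take L = 2R there):

```python
import numpy as np

def gradient_descent(grad, x0, R, L, T):
    """T steps of gradient descent with the fixed step size eta = R/(L*sqrt(T)),
    returning the average of the visited points, as in the proof."""
    eta = R / (L * np.sqrt(T))
    x = x0.copy()
    points = []
    for _ in range(T):
        points.append(x)
        x = x - eta * grad(x)
    return np.mean(points, axis=0)

# Illustrative example: f(x) = ||x||^2 with minimizer x* = 0.
R = 1.0
L = 2 * R                              # Lipschitz constant of f on B(0, R)
T = 400
x0 = np.full(5, 1 / np.sqrt(5))        # ||x0 - x*|| = 1 <= R
x_bar = gradient_descent(lambda x: 2 * x, x0, R, L, T)
deficit = np.dot(x_bar, x_bar)         # f(x_bar) - f(x*)
```

The theorem promises \delta(x) \leq RL/\sqrt{T} = 0.1 here; on this friendly function the actual deficit comes out far smaller.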
That’s it for the basic gradient descent algorithm. It’s actually quite decent. The assumptions we made – that f is Lipschitz and that you know its Lipschitz constant, for example – aren’t that far-fetched, and in the worst case you can always guess L and/or R and run the algorithm multiple times. It has the curious property that you don’t even need to be able to calculate f to approximate its minimum, but rather only its gradients! This is nice in applications where computing f is hard but computing \nabla f is relatively easy. Finally, in terms of runtime, I think that making R^2 L^2/\varepsilon^2 gradient calculations is much better than making the (R/\varepsilon)^n function calculations of our naive brute-search algorithm almost any day (well, any day that n>2 ).

The brutal efficiency of gradient descent actually caught me off guard when I first saw it. Nowhere at all does the approximation bound or the algorithm itself explicitly mention the dimension; it works just as well for n=2 and n=1000 . In fact, it even works for infinite dimensional spaces, such as function spaces! (Assuming, of course, that you have some practical way to calculate gradients and perform addition in those cases.) But more on that will come (much) later. For now, let’s see how gradient descent can be tweaked both to work in harsher, less-optimistic situations, and to converge faster in smoother, more forgiving conditions.

Gradient descent with boundaries

Under the United Nations’ Universal Declaration of Human Rights, freedom of movement is a fundamental human right. However, this is not always the case for the points x_t in the gradient descent algorithm. Often we are interested in minimizing a function not over the entire domain \mathbb{R}^n , but rather over a smaller set; for example, we might wish to find the minimizing vector of length \|x\| \leq 1 , or a vector whose sum of coordinates is \sum_{i=1}^{n}x_i = 0 , or a vector which satisfies a prescribed list of linear inequalities, Ax \geq \mathbf{b} . These conditions are called constraints. Alas, the original gradient descent algorithm has no knowledge of these constraints, and eagerly carries the points x_t outside the allowed set.

Luckily, this problem has an easy solution in case the constraint set K is convex, i.e for all points x,y \in K and all t \in \left[0,1\right] , the point \left(1-t\right)x + ty is also in K .

This demand makes sense, since we are dealing with convex functions anyway, and often holds in real life situations (the three constraint examples given above all define convex sets. Actually, they define closed convex sets, and indeed we’ll assume that K is closed).

The solution is straightforward: After each gradient step, we project the point x_{t+1} back into the set K . By projection, we mean that x is sent to the point y in K that is closest to x . More formally, for any set K \subseteq \mathbb{R}^n , the distance between K and x is defined as

d(x,K) = \inf_{y\in K}\|x-y\|.

If K is closed then the infimum is actually a minimum (pretty much by definition), so there exists a point y_0 \in K such that d(x,y_0) \leq d(x,y) for all y\in K ; and if K is convex, this point y_0 is unique (if there were two points of minimal distance, say y_0 and y_1 , then the midpoint between them would be closer to x than either!). We then define the projection onto K by P_K(x) = y_x = \mathrm{argmin}_{y\in K}\|x-y\| .

Note that convexity is indeed important for uniqueness:

Our new, boundary-respecting gradient descent algorithm then simply applies P_K at each step:

x_{t+1} = P_K(x_t - \eta \nabla f(x_t)).

Now, you might think that with another layer wrapping the x_t and the gradient, the analysis of this new algorithm will be harder, or the convergence slower. Not at all; in fact, projecting onto K can only bring us closer to the minimizer!

Claim: For any x \in \mathbb{R}^n and any z\in K , \|P_K(x)-z\| \leq \|x-z\| .

Proof: If x\in K then P_K(x) = x and there is nothing to prove, so we assume that x is not in K . Since we are only interested in distances between points, we may translate and scale both the convex body K and x together whichever way we want without affecting the relation between the distances in the claim, so for our general convenience, assume that P_K(x) = 0 and that \|x\| = 1 . This means that the distance d(x,K)=1 under our scaling. The vector x naturally defines a hyperplane H = \{y\in \mathbb{R}^n : x\cdot y = 0\} . Like all hyperplanes, this hyperplane divides \mathbb{R}^n into two halfspaces: the set of all points with scalar product greater than 0 , and the set of all points with scalar product less than 0 . As it turns out, the entire set K is contained in this latter halfspace, i.e for all z \in K , x \cdot z \leq 0 .

To see this, let v_1 = x and complete it to an orthonormal basis B=\{v_1,\ldots,v_n\} . Any point y = (\alpha_1,\ldots,\alpha_n) written in this basis which has positive scalar product with x must have \alpha_1 > 0 . Since K is convex and 0 \in K , if there exists such a y \in K , then for all t \in \left[0,1\right] the point

y_t = (t\alpha _1, t\alpha_2,\ldots,t\alpha_n)

is in K as well. The squared distance between y_t and x is given by

\|x -y_t \|^2 = \left(1 - t \alpha_1\right)^2 + \left(t\alpha_2 \right)^2 + \ldots + \left(t\alpha_n \right)^2

= 1 - 2t\alpha_1 + (t\alpha_1)^2 + \left(t\alpha_2 \right)^2 + \ldots + \left(t\alpha_n \right)^2.

For t small enough (but greater than 0 ), all the squared terms are negligible compared to the term 2t\alpha_1 ; in fact, we can take t so that

\|x -y_t \|^2 \leq 1- t\alpha_1.

But this contradicts the fact that the distance between x and K is 1 ! Hence every z \in K must have a non-positive v_1 component, i.e x \cdot z \leq 0 . The claim now follows: for every z \in K ,

\|x - z\|^2 = \|x\|^2 - 2x\cdot z + \|z\|^2 \geq \|z\|^2 = \|P_K(x) - z\|^2. □

Applying P_K onto x_{t+1} therefore takes us closer to every point in K , and in particular, it takes us closer to the minimizer x^* . Plugging this fact into the analysis of the previous section, we get exactly the same error bounds as in the unconstrained gradient descent algorithm. In other words, as the locals around here like to say, gradient descent eats convex constraints without salt.

(Of course, we implicitly assume here that we can calculate the projection P_K easily. Sometimes this is easy (for example, when you are dealing with linear constraints, or projecting onto spheres), but I can imagine that in some applications the set K is given rather implicitly, and you’d have to go deeper into this rabbit hole to get good approximations.)
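As a concrete illustration, here is projected gradient descent with K taken to be the unit ball, one of the lucky cases where the projection has a simple closed form, P_K(x) = x/\max(1,\|x\|) . The objective and step size below are made-up toy choices:

```python
import numpy as np

def project_unit_ball(x):
    """Projection onto K = {x : ||x|| <= 1}: scale x back if it lies outside."""
    norm = np.linalg.norm(x)
    return x if norm <= 1 else x / norm

# Minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball.
# The constrained minimizer is the boundary point c / ||c||.
c = np.array([3.0, 4.0])
grad = lambda x: 2 * (x - c)

x = np.zeros(2)
eta = 0.05
for _ in range(500):
    x = project_unit_ball(x - eta * grad(x))   # gradient step, then project
```

The iterates march toward c , hit the boundary, and settle on c/\|c\| = (0.6, 0.8) .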

Gradient descent without knowing the time

One complaint that often gets filed in the deep archives of the Bureau of Gradient Descent is that \eta is given by

\eta = \frac{R}{L \sqrt{T}}.

The problem with this expression is that it requires knowing the number of steps T that the algorithm is going to take in advance. This is fine in some cases – for example, if you really just want an error guarantee of \varepsilon , take T to be T = \frac{R^2 L^2}{\varepsilon^2} and let her rip. But sometimes it’s good to have a more adaptive algorithm, that runs and runs and runs and keeps getting better and better until you tell it to stop. In this case you do not know T in advance, so what do you do with \eta ?

Luckily, there is a very simple fix. Instead of having the step-size \eta be constant throughout each step, use a time-dependent step-size:

x_{t+1} = x_t - \eta_t \nabla f(x_t).

When you plug this in the gradient descent algorithm, you get

\frac{1}{T}\sum_{t=0}^{T-1}  \nabla f(x_t) \cdot (x_t -x^*) = \frac{1}{T} \sum_{t=0}^{T-1} \frac{\| x^* - x_{t}\|^2 - \| x^* - x_{t+1}\|^2}{2\eta_t} + \frac{1}{2T} \sum_{t=0}^{T-1}\eta_t\|\nabla f(x_t) \|^2.

The more pretentious among us might say that it’s an entire art form in itself to pick the right \eta_t , but actually in this particular simple case, choosing the right \eta is not that difficult. Here are two relatively-easy-to-analyze choices:

  • In the first, take \eta_t = R/\left(L\sqrt{t+1}\right) .
  • In the second, set t_i = 2^i , and take \eta_t = \frac{R}{L}\cdot 1/\sqrt{t_i} if t_i \leq t < t_{i+1} . In words, your \eta_t is a step-function approximation of R/\left(L\sqrt{t}\right) , with longer and longer constant pieces as time goes on.

The calculations which show that these kinds of things work aren’t that friendly, but are certainly doable if you rely on the proof of the basic gradient descent algorithm, so I won’t repeat them here. In any case, it should be noted that in some practical, real-world, actual, useful, non-theoretical applications, the total runtime T is not that large anyways, since empirically the algorithm may converge very fast. Also, in reactive or real-time systems, who has the patience to run a million steps of gradient descent every time they want to do something?
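Here is a sketch of the first choice, \eta_t = R/\left(L\sqrt{t+1}\right) , run on a made-up Lipschitz objective f(x) = \sum_i |x_i| ; the point is that you can read off the running average and stop whenever you feel like it:

```python
import numpy as np

def anytime_gd(grad, f, x0, R, L, steps):
    """Gradient descent with decaying step size eta_t = R/(L*sqrt(t+1)).
    No horizon T is fixed in advance; we track f at the running average."""
    x = x0.copy()
    running_sum = np.zeros_like(x0)
    vals = []
    for t in range(steps):
        running_sum += x
        vals.append(f(running_sum / (t + 1)))   # value at the average so far
        eta_t = R / (L * np.sqrt(t + 1))
        x = x - eta_t * grad(x)
    return vals

# Illustrative objective: f(x) = sum |x_i|, gradient sign(x), so L = 2 here.
R, L = 1.0, 2.0
x0 = np.full(4, 0.4)                            # ||x0 - x*|| = 0.8 <= R
vals = anytime_gd(lambda x: np.sign(x),
                  lambda x: np.sum(np.abs(x)),
                  x0, R, L, steps=200)
```

The reported values trend downward for as long as you care to run it, with no T baked in anywhere.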

Gradient descent with randomness

Not surprisingly, the gradient descent algorithm assumes that we can calculate the gradient, and if we want the algorithm to really, actually, work in practice, we should also be able to calculate the gradient quickly. But real-life isn’t always nice that way, and many times calculating \nabla f can be prohibitive. One real-world example is the parameter optimization of artificial neural networks. I won’t go into the details in this post (for that, you can watch the amazing “neural network” video series by 3blue1brown, found here), but the gist is that neural networks are created by “training” them, a process that involves minimizing a cost function f which depends on the neural weights w of the network. This cost function usually measures the error of the network on a “training set” of samples S :

f(w) =  \sum_{s\in S} \mathrm{error}(w,s),

where \mathrm{error}(w,s) is some function that measures how well the network with weights w performed on the input s . The usual paradigm in artificial neural networks concerning the size of the training set is “the more the merrier, and keep ’em coming!” But if there are hundreds of thousands, or even billions of samples, calculating the gradient of the cost function even at a single point can put a large hole in your time-wallet.

Note: The loss functions in neural networks usually aren’t convex. That really doesn’t stop people from doing gradient descent anyways.

Whatever will the artificial intellimancers do? One common tactic that all of us frequently use when faced with lots of work but little time to do it (apart from procrastinating), is to just not do all the work. So the intellimancers asked themselves, “Instead of going over the entire sample set S in the above example, is it possible to approximate the gradient by going over just a small portion of it?”

The answer is, of course, a resounding yes! The “gradient” in “gradient descent” can actually be replaced by any vector, as long as on average that vector is equal to the gradient. Doing this is often called “stochastic gradient descent”, and it deserves its own theorem.
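Before the theorem, here is what this looks like for a sum-structured cost like the one above: sample a small random batch of terms and rescale, so that the estimate is correct on average. The linear least-squares model and all the numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "training set": 1000 samples for a 5-parameter linear model.
A = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
b = A @ w_true

def full_gradient(w):
    """Gradient of f(w) = sum_s (w . a_s - b_s)^2 over the whole training set."""
    return 2 * A.T @ (A @ w - b)

def stochastic_gradient(w, batch=10):
    """Gradient over a random mini-batch, rescaled so that its expectation
    (over the random choice of samples) equals full_gradient(w)."""
    idx = rng.integers(0, len(A), size=batch)
    return 2 * (len(A) / batch) * (A[idx].T @ (A[idx] @ w - b[idx]))

# Stochastic gradient descent: each step costs 100x less than the full gradient.
w = np.zeros(5)
eta = 1e-4
for _ in range(2000):
    w = w - eta * stochastic_gradient(w)
```

Each step touches only 10 of the 1000 samples, yet w still homes in on the minimizer.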

Theorem: For t=0,1,\ldots , define the difference process

x_{t+1} = x_t - \eta g_t,

where g_t are some (random) vectors which can possibly depend on x_t . Let f be a convex function, and assume that

\mathbb{E}\left[g_t \mid x_t \right] = \nabla f\left(x_t\right)

and

\mathbb{E} \|g_t\|^2 \leq B^2.

Then in expectation, gradient descent works! More precisely, after T = \frac{R^2B^2}{\varepsilon^2} time steps, the stochastic gradient descent algorithm finds a point x such that the expected error \mathbb{E}\left[\delta\left(x\right) \right] is smaller than \varepsilon .

Proof: The calculations are actually almost exactly the same as the ones we did for showing that vanilla gradient descent works. We again look at how the distance between x_t and x^* diminishes as a function of time:

\| x^* - x_{t+1}\|^2 = \| x^* - x_{t} + \eta g_t\|^2 = \| x^* - x_{t}\|^2 + \|\eta g_t \|^2 + 2\left(x^* - x_t\right)\cdot \left(\eta g_t\right).

Rearranging, summing over t=0,\ldots,T-1 , and telescoping just as before, we once again get

\frac{1}{T}\sum_{t=0}^{T-1}  g_t \cdot (x_t -x^*) = \frac{\| x^* - x_{0}\|^2 - \| x^* - x_{T}\|^2}{2\eta T} + \frac{\eta}{2T} \sum_{t=0}^{T-1}\|g_t \|^2.

So far we haven’t done anything new, and in fact haven’t even used any property of g_t whatsoever; the above formula is true for any set of vectors g_t . But when we take the expectation, everything falls into place: On the right-hand side, since \mathbb{E}\left[\|g_t\|^2\right] \leq B^2 , the second term becomes

\mathbb{E}\left[\frac{\eta}{2T}  \sum_{t=0}^{T-1}\|g_t \|^2\right] \leq \frac{\eta}{2T}  \sum_{t=0}^{T-1}B^2 = \frac{\eta B^2}{2}.

On the left-hand side, the expectation of each term can be calculated using conditioning:

\mathbb{E}\left[g_t \cdot \left(x_t -x^*\right)\right] = \mathbb{E}\left[\mathbb{E}\left[g_t \cdot \left(x_t -x^*\right) \mid x_t \right]\right]

= \mathbb{E}\left[\mathbb{E}\left[g_t \mid x_t \right] \cdot \left(x_t -x^*\right)\right] = \mathbb{E}\left[\nabla f\left(x_t\right) \cdot \left(x_t -x^* \right)\right].

After doing this, we’ve basically eliminated all the randomness in g_t by putting things inside an expected value, and we get the now-familiar equation

\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}  \nabla f(x_t) \cdot (x_t -x^*)\right] \leq \frac{1}{2}\left( \frac{R^2}{\eta T} + \eta B^2\right).

Note that we still need to take the expectation on the left-hand side, since x_t is now a random variable itself. Also note that the role of L has been taken by B . You might think that this removes the Lipschitzness assumption from f , but it doesn’t really; in fact, if g_t is supposed to be a stochastic approximation of \nabla f\left(x_t\right) , then we really should expect B to be greater than L (intuitively, any time that \|g_t\| \leq \|\nabla f\left(x_t\right)\| , there must be a balancing case where \|g_t\| \geq \|\nabla f\left(x_t\right)\| , and squaring the norms tips the balance so that B comes out at least as large as L ). Anyway, we can now continue as we did before: By convexity, for x = \frac{1}{T}\sum_{t=0}^{T-1}x_t ,

\mathbb{E}\left[f\left(x\right) - f\left(x^*\right)\right] = \mathbb{E}\left[f\left(\frac{1}{T} \sum_{t=0}^{T-1}x_t\right) - f\left(x^*\right)\right] \leq \mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}f\left(x_t\right)-f\left(x^*\right)\right],

and also by convexity, the right-hand side is at most \mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}  \nabla f\left(x_t\right) \cdot \left(x_t -x^*\right)\right] . So in total,

\mathbb{E}\left[\delta\left(x\right)\right] = \mathbb{E}\left[f(x) - f(x^*)\right] \leq \frac{1}{2}\left( \frac{R^2}{\eta T} + \eta B^2\right).

Once again, choose \eta = \frac{R}{B \sqrt{T}} to get

\mathbb{E}\left[\delta\left(x\right)\right] \leq \frac{RB}{ \sqrt{T}}.

Setting the expected \delta eficit to \varepsilon and rearranging gives

T \leq \frac{R^2B^2}{\varepsilon^2}.

The keen-eyed skeptics among you should slightly raise an eyebrow, and wonder about the usefulness of such a bound. After all, the algorithm only says that the expectation of the error is small; how can you know whether an individual run is good or not? How can you be sure that, while the expectation is small, you won’t have to occasionally run the algorithm for a tremendous amount of time?

Luckily for us, the quantity \delta\left(x\right) is positive, so we can apply Markov’s inequality to it: For every \alpha > 0 ,

\mathbb{P}\left[\delta\left(x\right) \geq \alpha \right] \leq \frac{\mathbb{E}\left[\delta\left(x\right)\right]}{\alpha} \leq \frac{\frac{RB}{ \sqrt{T}}}{\alpha}.

Choosing \alpha = \frac{2RB}{ \sqrt{T}} , we get that

\mathbb{P}\left[\delta\left(x\right) \geq \frac{2RB}{ \sqrt{T}} \right] \leq \frac{1}{2}.

You can then make sure that you get an actual good x with low error by running the stochastic gradient descent algorithm in parallel, using independent randomness:
  1. Set T = \frac{R^2B^2}{\varepsilon^2} , so that the expected error is smaller than \varepsilon .
  2. Run K independent copies of stochastic gradient descent, each for T time steps, giving final points x^{\left(1\right)}, \ldots, x^{\left(K\right)} .
  3. Return the point x^{\left(i\right)} with the smallest value f\left(x^{\left(i\right)}\right) .

Each x^{\left(i\right)} has a probability of at least 1/2 of giving an error less than 2\varepsilon , so the probability that all of the x^{\left(i\right)} ‘s suck is smaller than 2^{-K} . So in order to get an error less than 2\varepsilon with probability at least 1-p , you only need to run the algorithm for a (collective) total of \frac{R^2B^2}{\varepsilon^2} \cdot \log_2\left(1/p\right) time steps. Not too bad!
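To make the boosting step concrete, here is a minimal sketch of the three steps above. The objective ( f(x) = \|x\|^2 ), the noise model, and the stepsize are all illustrative choices of mine, not from the post:

```python
import numpy as np

# Toy setup: f(x) = ||x||^2 with a noisy gradient oracle
# g = grad f(x) + Gaussian noise, so that E[g | x] = grad f(x).
rng = np.random.default_rng(0)
f = lambda x: float(np.dot(x, x))
grad = lambda x: 2 * x

def sgd_run(x0, eta, T):
    x, x_avg = x0.copy(), np.zeros_like(x0)
    for _ in range(T):
        g = grad(x) + rng.normal(size=x.shape)  # unbiased gradient estimate
        x_avg += x / T                          # average iterate, as in the proof
        x = x - eta * g
    return x_avg

# Steps 1-3: run K independent copies and return the best of them.
K, T, eta = 8, 2000, 0.01
x0 = np.array([3.0, -4.0])
candidates = [sgd_run(x0, eta, T) for _ in range(K)]
best = min(candidates, key=f)
```

Each copy alone only lands near the minimum in expectation; keeping the best of K drives the failure probability down geometrically.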

Exercise for the reader: The above algorithm requires you to calculate f\left(x^{\left(i\right)}\right) , which means being able to calculate f . This is something that the deterministic gradient algorithm did not require. Can you find a way to get, with exponentially high probability, a point x with good error bound, without ever calculating f on any point?

By the way, after all this talk and proof, we can finally see the solution to the gradient computation problem in artificial neural networks. Instead of calculating the entire gradient of the cost function which goes over the entire training set S , pick at random some small subset of it, and calculate the gradient of the cost function only on those elements. In expectation, the two behave the same, and if the cost function is not naughty, the bound B will not be too large. Then just apply the above theorem. Actually, for B not to be large, you might need to take quite a lot of samples. There are techniques for combining the deterministic gradient descent and the stochastic one together, but I won’t talk about them in this post.

It’s been a while since we last saw pretty pictures, so here is a simulation showing the difference between ordinary gradient descent and stochastic gradient descent (the vectors g_t are obtained from \nabla f\left(x_t\right) by simply adding a Gaussian vector as noise).
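In lieu of the pretty pictures, here is a tiny version of that simulation (the quadratic objective and the noise scale are my own stand-ins): with exact gradients the iterates converge to the minimizer, while with noisy gradients they settle into a noise ball around it.

```python
import numpy as np

# Gradient descent vs. stochastic gradient descent on f(x) = ||x||^2,
# where g_t = grad f(x_t) + Gaussian noise.
rng = np.random.default_rng(1)
grad = lambda x: 2 * x

def descend(x0, eta, T, noise=0.0):
    x, dist = x0.copy(), []
    for _ in range(T):
        g = grad(x) + noise * rng.normal(size=x.shape)
        x = x - eta * g
        dist.append(float(np.linalg.norm(x)))  # distance to the minimizer 0
    return dist

x0 = np.array([5.0, 5.0])
gd = descend(x0, eta=0.05, T=200)              # exact gradients: converges
sgd = descend(x0, eta=0.05, T=200, noise=2.0)  # noisy gradients: noise ball
```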

Gradient descent with curvature

When somebody says “convex function”, what pops up in your head? If you’re anything like me (my condolences), it’s probably something like this:

That’s fine, but you would do well to remember that even a linear function is convex:

The difference between the two images, of course, is curvature, and sometimes it makes sense to distinguish between functions which have curvature (such as x^2 or e^x ) and functions which do not, such as 3x+1 . Indeed, by definition, a convex function is always found above its tangent, and in this sense a linear function is the least-convex of all convex functions: The inequality is so poor, it’s actually an equality, since the function is its own tangent. This calls for a stronger notion of convexity, which we will aptly call “strong convexity”.

A function f is called \alpha -strongly convex if we can write

f(x) = g(x) + \frac{\alpha}{2}\|x\|^2,

where g itself is a convex function. In words, a strongly-convex function is at least as curved as a parabola. In particular, this means that the function is quadratically greater than its tangent:

Claim: If f is an \alpha -strongly convex function, then

f(y) \geq f(x) + \nabla f(x)\cdot (y-x) + \frac{\alpha}{2}\|x-y\|^2.

Proof: Luckily we already did all the work for proving this! Rearranging the definition of strong convexity, we have that

g(x) = f(x) - \frac{\alpha}{2}\|x\|^2

is a convex function. We already proved earlier that convex functions are above their tangent, so

g(x) - g(y) \leq \nabla g(x) \cdot (x-y).

The gradient of g at x is just equal to \nabla f(x) - \alpha x , and plugging this in and rearranging yields

f(y) - f(x) \geq \nabla f(x) \cdot (y-x) - \frac{\alpha}{2}\|x\|^2 + \frac{\alpha}{2}\|y\|^2 + \alpha x \cdot (x-y).

The last term on the right-hand side is just \alpha \|x\|^2 - \alpha x\cdot y , so

f(y) - f(x) \geq \nabla f(x) \cdot (y-x) + \frac{\alpha}{2}\left(\|x\|^2 + \|y\|^2 - 2 (x \cdot y)\right).

By the cosine law, this is exactly what we want! □
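As a quick sanity check of the claim, one can test the inequality numerically on a concrete strongly convex function. My own pick: f(x) = x^2 + e^x , which is 2 -strongly convex, since it is the convex function e^x plus \frac{2}{2}x^2 .

```python
import math
import itertools

# f(x) = x^2 + e^x is alpha-strongly convex with alpha = 2:
# it is the convex function e^x plus (alpha/2) x^2.
alpha = 2.0
f = lambda x: x * x + math.exp(x)
df = lambda x: 2 * x + math.exp(x)   # f'(x)

pts = [-2.0, -0.5, 0.0, 1.0, 2.5]
checks = []
for x, y in itertools.product(pts, pts):
    # the claim: f(y) >= f(x) + f'(x)(y - x) + (alpha/2)(y - x)^2
    tangent_plus = f(x) + df(x) * (y - x) + (alpha / 2) * (y - x) ** 2
    checks.append(f(y) >= tangent_plus - 1e-9)  # up to rounding error
```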

Being strongly convex can strongly help us when performing gradient descent, and the stronger, the better. Intuitively, the strong-convexity condition forces the gradient to be very large whenever we are far away from the minimum. This means we can take \eta to be relatively small (so that we don’t overshoot when near the minimizer), and still be able to take large steps when we are far away from the minimizer. With any luck (or, in this case, prior knowledge of the proofs), this will lead to faster convergence, so that fewer steps are needed in the algorithm to achieve the same amount of accuracy.

To see just how useful curvature can be, let’s first consider gradient flow, just like we did for the unstrong case. We start at some x(0) = x_0 \in \mathbb{R}^n , and let x(t) evolve according to the differential equation

\frac{d}{dt}x(t) = -\nabla f(x(t)).

As before, a good way to analyze this equation is to look at the square distance of x(t) from the true minimizer x^* .

\frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 = \left(\frac{d}{dt}(x(t) - x^*)\right)\cdot \left(x(t) - x^* \right) .

As before, since x^* is a constant and does not depend on time, the left term in the inner product is just equal to -\nabla f(x(t)) . So

\frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 = -\nabla f(x(t)) \cdot \left(x(t) - x^* \right) .

Now we can use the above claim about strong-convexity to bound the size of the scalar product \nabla f(x(t)) \cdot \left(x(t) - x^* \right) ; after slight rearranging, we get

\frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 \leq -\delta\left(x(t)\right) -\frac{\alpha}{2}\|x(t) -x^*\|^2.

Since \delta(x) is always positive, we arrive at a very pleasant differential inequality:

\frac{d}{dt}~\|x(t) - x^*\|^2 \leq -\alpha \|x(t) -x^*\|^2.

In particular, the quantity \|x(t) - x^*\|^2 decays at least as fast as it would if there were an equality, and equality corresponds to a decreasing exponential!

\|x(t) - x^*\|^2 \leq \|x_0 - x^*\|^2 \cdot e^{-\alpha t} \leq R^2 e^{-\alpha t}.

Gradient flow converges exponentially quickly to the minimizer, which, if you ask me, is not too shabby. And while this only talks about the distance to the optimum, we can always assume that f is L -Lipschitz, so being \varepsilon -close to the minimizer means being only \varepsilon L -close to the actual minimum. Since \|x(\tau) - x^*\| \leq R e^{-\alpha \tau / 2} , we get

\delta(x(\tau)) \leq L R e^{-\alpha \tau / 2},

which is much, much better than the R^2/\tau estimate we had for the non-curvy case.
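The exponential decay is easy to watch numerically; here is a sketch that integrates the flow with tiny Euler steps (the 1 -strongly convex quadratic and the step size are my own choices):

```python
import numpy as np

# Approximate gradient flow dx/dt = -grad f(x) for f(x) = ||x||^2 / 2
# (alpha-strongly convex with alpha = 1, minimizer x* = 0) by tiny Euler
# steps, and compare with the bound ||x(t) - x*||^2 <= R^2 e^{-alpha t}.
alpha = 1.0
grad = lambda x: alpha * x
dt, steps = 1e-3, 5000
t_end = dt * steps                         # total flow time, here 5.0
x = np.array([2.0, -1.0])
start_sq = float(np.dot(x, x))             # ||x(0) - x*||^2
for _ in range(steps):
    x = x - dt * grad(x)                   # one tiny flow step
end_sq = float(np.dot(x, x))               # ||x(t_end) - x*||^2
bound = start_sq * float(np.exp(-alpha * t_end))
```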

Unfortunately, not all is perfect in the land of the discrete. We already saw earlier that gradient descent, despite its unbearable adorableness, suffers from discretization errors: in the non-curved case it only got a 1/\sqrt{T} \delta eficit bound, whereas the cool and continuous gradient flow got a 1/T \delta eficit. Sure, that’s only a square root difference, which in terms of exponentials doesn’t really count for anything, but so far nothing ensures that gradient descent with curvature will be as good as gradient flow with curvature. Luckily, as always, the calculations aren’t too difficult.

Theorem: Let f be \alpha -strongly convex and L -Lipschitz, and let \varepsilon > 0 . With the proper choice of \eta_t (which will now really depend on t ), the gradient descent algorithm can find a point x so that the deficit is small, i.e \delta(x) \leq \varepsilon , in just T \leq \frac{2L^2}{\alpha \varepsilon} +1 steps!

Before I give the proof, a couple of soothing words are in order, since I’m sure that you are outraged: How can it be that the gradient flow improved from an error of 1/T to an error of e^{-T} when introducing curvature, but gradient descent only improved from 1/\sqrt{T} to 1/T ?! Well, such is life. The not-so-spectacular increase stems from the fact that while having large curvature makes sure that we take (relatively) big steps when we are far away from the minimum, it doesn’t really say anything about what happens when we are close to the minimum. In that case, there is still always a looming fear of overshooting due to discretization, which means that \eta_t needs to be small. This of course doesn’t happen in the continuous gradient flow, which can enjoy the extra-quick velocity unabashed and unabated. Not all is lost for the discrete gradient descent, however, and it turns out that it is possible to get exponentially good convergence, with just one extra assumption. But that’s already peeking into the future. For now, let’s see the proof, which is only a tad more clever than the original gradient descent analysis.

Proof: Way back when we were younger, we found out that

\nabla f(x_t) \cdot (x_t - x^*) = \frac{\| x^* - x_{t}\|^2 - \| x^* - x_{t+1}\|^2}{2\eta_t} + \frac{\eta_t}{2}\|\nabla f(x_t)\|^2.

Since f is L -Lipschitz, this immediately gives

\nabla f(x_t) \cdot (x_t - x^*) \leq \frac{\| x^* - x_{t}\|^2 - \| x^* - x_{t+1}\|^2}{2\eta_t} + \frac{\eta_t L^2}{2}.

Now we use the fact that f is \alpha -strongly convex: The left-hand side is bounded below by

\nabla f(x_t) \cdot (x_t - x^*) \geq f(x_t) - f(x^*) + \frac{\alpha}{2}\|x_t -x^*\|^2,

and so

f(x_t) - f(x^*) \leq \left(\frac{1}{2\eta_t} - \frac{\alpha}{2} \right) \| x^* - x_{t}\|^2 - \frac{1}{2\eta_t} \| x^* - x_{t+1}\|^2 + \frac{\eta_t L^2}{2}.

It’s time to choose \eta . Unlike the vanilla case, we’ll take \eta_t to depend on time, and set \eta_t = \frac{2}{\alpha(t+2)} . Plugging this choice in,

f(x_t) - f(x^*) \leq \left(\frac{\alpha(t+2)}{4} - \frac{\alpha}{2} \right) \| x^* - x_{t}\|^2 - \frac{\alpha(t+2)}{4} \| x^* - x_{t+1}\|^2 + \frac{L^2}{\alpha (t+2)}.

= \frac{\alpha}{4} \left(t  \| x^* - x_{t}\|^2 - (t+2) \| x^* - x_{t+1}\|^2\right) + \frac{L^2}{\alpha (t+2)}.

Multiplying both sides by t+1 (and using \frac{t+1}{t+2} \leq 1 in the last term), we are nearly there:

(t+1)\left(f(x_t) - f(x^*)\right) \leq \frac{\alpha}{4} \left(t(t+1)  \| x^* - x_{t}\|^2 - (t+1)(t+2) \| x^* - x_{t+1}\|^2\right) + \frac{L^2}{\alpha}.

Why are we nearly there? Because we are now in a position to sum over all t from 0 to T-1 . In the right-hand side we (miraculously and serendipitously) get a telescopic sum, leaving only

\frac{\alpha}{4}\left( 0\cdot(0+1)\|x_0 - x^*\|^2 - T(T+1)\|x_T - x^*\|^2 \right) + \frac{L^2T}{\alpha}.

Luckily, the first term is multiplied by 0, and the second term is negative; we can safely say then that the right-hand side is quite small:

\mathrm{RHS} \leq \frac{L^2 T}{\alpha}.

For the left-hand side, we again use convexity: Since \sum_{t=0}^{T-1}(t+1) = \frac{T(T+1)}{2} , the weighted sum

x = \frac{2}{T(T+1)}\sum_{t=0}^{T-1}(t+1)x_t

is a convex combination of vectors, and so

f\left( \frac{2}{T(T+1)}\sum_{t=0}^{T-1}(t+1)x_t\right) \leq \frac{2}{T(T+1)}\sum_{t=0}^{T-1}(t+1) f\left( x_t\right).

Plugging this into our bound finally gives

f(x) - f(x^*) \leq \frac{L^2 T}{\alpha} \cdot \frac{2}{T(T+1)} = \frac{2L^2}{\alpha (T+1)}.

Setting this \delta eficit to be equal to \varepsilon and rearranging finally gives

T = \frac{2L^2}{\alpha \varepsilon} - 1 \leq \frac{2L^2}{\alpha \varepsilon} +1

as promised. □
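Here is a minimal sketch of the scheme from the proof, using the stepsizes \eta_t = \frac{2}{\alpha(t+2)} and the weighted average of the iterates. The toy 1 -strongly convex objective is my own choosing: \frac{x^2}{2} plus the convex function \log(1+e^x) .

```python
import math

# f(x) = x^2/2 + log(1 + e^x) is 1-strongly convex: a convex function
# plus (alpha/2) x^2 with alpha = 1. Its derivative is x + sigmoid(x).
alpha = 1.0
grad = lambda x: x + 1.0 / (1.0 + math.exp(-x))

T = 500
x = 5.0
weighted_sum, total_weight = 0.0, 0.0
for t in range(T):
    weighted_sum += (t + 1) * x          # weight iterate x_t by t+1
    total_weight += t + 1
    eta_t = 2.0 / (alpha * (t + 2))      # decreasing stepsize from the proof
    x -= eta_t * grad(x)
x_bar = weighted_sum / total_weight      # the weighted average of the iterates
# x_bar sits near the minimizer, where the gradient vanishes
```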

Some final, short remarks before we carry on with our descent:

  • The keen-eyed reader will notice that there is no dependence on R in this bound! The initial conditions have been forgotten. Technically, this is because we tailored \eta so that the term containing \|x_0 - x^*\| goes away; intuitively, it’s because if we start far away from the minimum then we take a large step.
  • There is a dependence on \alpha both in \eta and in T . This is not surprising: The larger the \alpha , the more curvature the function has, and the smaller we can take T . Of course, for \alpha depressingly tiny, perhaps it’s better to use the ordinary gradient descent instead. Please do not plug in \alpha = 0 .
  • Do real problems on the street actually have curvature? They might. Quadratics are abundant both in nature and in Nature. Also, the intrepid intellimancer can always artificially add \frac{\alpha}{2}\|x\|^2 to whatever function she originally intended to compute, and hope that minimizers of the modified function are at least marginally related to minimizers of the original function. All is fair in love and optimization.

Gradient descent with momentum (smooth fella)

Being strongly-convex, i.e having curvature, isn’t enough to guarantee really fast convergence. To avoid overstepping, we must take \eta_t small, which sort of ruins all the fun of taking large steps. One possible solution to this problem (and an interesting property regardless of strong convexity) is to assume that the function f is smooth. This is the formal definition:

A function f is called \beta -smooth if \nabla f is \beta -Lipschitz. In other words, for all x,y\in \mathbb{R}^n ,

\|\nabla f(x) - \nabla f(y) \| \leq \beta \|x - y\|.

Intuitively, this means that the gradient \nabla f can’t change by too much when performing gradient descent, even if we take a relatively large stepsize. This might let us take larger steps, at least in some situations.

Technically, we’ll need the following consequence of \beta -smoothness:

Proposition: If f is a convex, \beta -smooth function, then

f\left(x - \frac{1}{\beta}\nabla f(x)\right) - f(x) \leq -\frac{1}{2\beta}\|\nabla f(x)\|^2.

In other words, if we take a gradient descent step with stepsize 1/\beta , then we reduce our value (i.e get closer to the minimum) by at least \frac{1}{2\beta}\|\nabla f(x)\|^2 . That’s not too bad!

Proof: The proof is technical, and in my opinion, not all that interesting, so you can skip it. It goes about as straightforwardly as you think it would: We express the difference of f ‘s values as an integral over its derivative, and use the smoothness property (invoking Cauchy-Schwarz) to bound how far f can stray from its tangent.

We first show that for any two points x,y \in \mathbb{R}^n ,

f(x) - f(y) \leq \nabla f(y)(x -y) + \frac{\beta}{2}\|x-y\|^2.

This may seem at first glance to be quite identical to the inequality we had for \alpha -strongly convex functions, but don’t be fooled – the inequalities go the opposite way here (isn’t it amazing that we can extract good stuff from two opposing inequalities?). To prove the above formula, note that

\frac{d}{dt} f(y+t(x-y)) = \nabla f(y+t(x-y))\cdot(x-y);

thus, by the fundamental theorem of calculus,

f(x) - f(y) - \nabla f(y)(x-y)= f(y+1(x-y)) - f(y+0(x-y)) - \nabla f(y)(x-y)

= \int_0^1 \nabla f\left(y + t(x-y)\right)\cdot(x-y) dt - \nabla f(y)(x-y)

= \int_0^1 \left(\nabla f\left(y + t(x-y)\right) - \nabla f(y) \right) \cdot(x-y) dt.

By the Cauchy-Schwarz inequality,

\int_0^1 \left(\nabla f\left(y + t(x-y)\right) - \nabla f(y) \right) \cdot(x-y) dt \leq \int_0^1 \| \nabla f\left(y + t(x-y)\right) - \nabla f(y) \| \cdot \|x-y\| dt,

and since f is \beta -smooth, the first term is also small:

\| \nabla f\left(y + t(x-y)\right) - \nabla f(y)\| \leq \beta \|t(x-y)\|.

Combining the two inequalities, we get

f(x) - f(y) - \nabla f(y)(x-y) \leq \int_0^1 \beta t\|x-y\|^2 \, dt = \frac{\beta }{2}\|x-y\|^2.

All that remains is to rearrange and plug in the gradient step, using x -\frac{1}{\beta}\nabla f (x) as x , and x as y in the above inequality:

f\left(x - \frac{1}{\beta}\nabla f(x)\right) - f(x) \leq \nabla f(x) \left( x -\frac{1}{\beta}\nabla f(x) - x\right) + \frac{\beta }{2}\| x - \frac{1}{\beta}\nabla f(x) - x\|^2

= -\frac{1}{\beta}\|\nabla f(x) \|^2 + \frac{1}{2\beta}\|\nabla f(x) \|^2 = -\frac{1}{2\beta}\|\nabla f(x) \|^2.

Neat! □
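Before moving on, a quick numeric sanity check of the proposition, on a smooth convex function of my own picking: f(x) = \log(1+e^x) , whose second derivative never exceeds 1/4 , so it is \beta -smooth with \beta = 1/4 .

```python
import math

# f(x) = log(1 + e^x) is beta-smooth with beta = 1/4: its second
# derivative sigma(x)(1 - sigma(x)) is at most 1/4.
beta = 0.25
f = lambda x: math.log1p(math.exp(x))
grad = lambda x: 1.0 / (1.0 + math.exp(-x))      # f'(x) = sigmoid(x)

results = []
for x in [-3.0, -0.5, 0.0, 1.0, 4.0]:
    x_next = x - grad(x) / beta                  # gradient step of size 1/beta
    drop = f(x_next) - f(x)                      # actual decrease in value
    promised = -grad(x) ** 2 / (2 * beta)        # guaranteed decrease
    results.append(drop <= promised + 1e-12)     # up to rounding error
```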

Being \beta -smooth is a second order phenomenon – it talks about how quickly the gradient changes, not how quickly the function itself changes. Exploiting this property directly to get a really good time bound is therefore a tad harder than exploiting properties relating to f explicitly; the algorithm and proof we will see below will be the most complicated in this entire post. On the plus side, this will be a great opportunity to introduce a totally new technique to our repertoire.

So far, the gradient descent algorithm was very dumb, and in particular, very memoryless: The new position x_{t+1} depended only on calculations performed at x_t , and totally disregarded the possibly rich and exciting history of the trajectory. This short-term approach will not do when trying to utilize the \beta -smoothness of a function (well, it won’t work naively, anyway). But it turns out that there is a way to meaningfully combine knowledge about previously taken steps together with the current position, and get some strong results.

Let’s think back to the (epic) limerick that started all of these gradient shenanigans, about water flowing down a slope. It is true that water, when situated on a slope, flows downwards. However, it is also true that water, once it starts flowing, has a hard time stopping, and often splashes and crashes in torrents and waves around rocks, trees, buildings, or whatever else stands in its way. This is because it has momentum. Having momentum allows it to cross and gloss over some of the tiny details and landscape features that would otherwise cause it to detour and take a longer path towards the glorious minimum.

Computer scientists, being awed by nature as they are, have tried to imitate this behavior by running gradient descent algorithms which tend to keep going in the same direction that they moved up till now. Intuitively, if we think in physical terms, instead of having the gradient be the instantaneous velocity of a particle, we can use it as an instantaneous acceleration instead. Here is one such algorithm:


d_t = \gamma_t(x_t -x_{t-1}),

where \gamma_t is some time-dependent coefficient. The difference x_t -x_{t-1} is basically the instantaneous velocity of a particle whose position is x_t , and if we view \gamma_t as a (time-dependent) mass, then d_t can be seen as the momentum of the particle. With this definition, the step of “gradient descent with momentum” is

x_{t+1} = x_t+d_t - \frac{1}{\beta} \nabla f(x_t +d_t).

Theorem: Let f be a \beta -smooth convex function, and let \varepsilon > 0 . With the right choice of \gamma_t , the gradient descent with momentum algorithm can find a point x so that the deficit is small, i.e \delta(x) \leq \varepsilon , in just T \leq \sqrt{2\beta R^2 / \varepsilon} + 1 steps!

Proof: For notational convenience (and because that’s how it’s written in my notes), denote \delta_t = \delta(x_t) and g_t = -\frac{1}{\beta}\nabla f(x_t + d_t) . Under this notation, we just have

x_{t+1} = x_t + d_t + g_t,


\delta_{t+1} - \delta_t = f(x_t + d_t + g_t) - f(x_t).

Since f is \beta -smooth, we can use the proposition that we proved above applied to x_t + d_t instead of x . The proposition gives

f\left(x_t +d_t + g_t \right) - f(x_t +d_t) \leq -\frac{1}{2\beta}\|\nabla f(x_t +d_t)\|^2,

and plugging this into the difference \delta_{t+1}-\delta_t gives

\delta_{t+1} - \delta_t \leq f(x_t +d_t) -f(x_t) - \frac{1}{2\beta}\|\nabla f(x_t + d_t)\|^2.

By good ol’, standard, run-of-the-mill convexity, this is bounded by

\leq \nabla f (x_t +d_t)\cdot d_t -\frac{1}{2\beta}\|\nabla f(x_t + d_t)\|^2 = -\beta g_t \cdot d_t - \frac{\beta }{2}\|g_t\|^2.

That’s that for the difference between two consecutive \delta_t ‘s. We can also bound a single \delta eficit in a similar fashion, eventually yielding

\delta_{t+1} = f(x_t+d_t+g_t) - f(x^*) \leq \ldots \leq -\beta g_t \cdot (x_t +d_t -x^*) - \frac{\beta }{2}\|g_t\|^2.

Now things get a little tricky (but only a little). When we run this momentous algorithm, say for T time steps, what we eventually care about is the final deficit \delta_{T} . But the expression above contains lots of x_t ‘s and d_t ‘s and whatnots. Our only hope for salvation is if we can somehow get some telescopic sum concerning both \delta_{t} and the differences \delta_{t+1} - \delta_t . Since the terms on the right-hand side do not automatically telescope on their own, we’re going to have to help them a bit. For this purpose, let \lambda_t \geq 1 be some real numbers that we’ll choose later. Let’s look at the expression (\lambda_t -1 )(\delta_{t+1} - \delta_t) + \delta_{t+1} :

(\lambda_t -1 )(\delta_{t+1} - \delta_t) + \delta_{t+1} \leq (\lambda_t-1)\left(-\beta g_t \cdot d_t -\frac{\beta}{2}\|g_t\|^2\right) - \beta g_t \cdot (x_t +d_t -x^*) - \frac{\beta }{2}\|g_t\|^2

= -\frac{\beta}{2}\lambda_t \|g_t\|^2 - \beta g_t \left(\lambda_t d_t + x_t -x^* \right).

Multiplying both sides by \lambda_t and rearranging the left-hand side a bit, we get

\lambda_t^2 \delta_{t+1} - \left(\lambda_t^2 - \lambda_t \right)\delta_t \leq -\frac{\beta}{2}\lambda_t^2 \|g_t\|^2 - \beta \lambda_t g_t \left(\lambda_t d_t + x_t -x^* \right).

Using the identity \|a+b\|^2 = \|a\|^2 + \|b\|^2 + 2(a\cdot b) , the scalar product term in the right-hand side can be written as

\beta \lambda_t g_t \left(\lambda_t d_t + x_t -x^* \right) = \frac{\beta}{2}\left(\|x_t +\lambda_t d_t +\lambda_t g_t -x^* \|^2 -\|\lambda_t d_t +x_t -x^*\|^2 - \lambda_t ^2 \|g_t\|^2 \right)

(we took a = \lambda_t g_t and b = \lambda_t d_t + x_t -x^* ). This has the benefit of cancelling with the norm of g_t when plugging it into the deficit difference, so that

\lambda_t^2 \delta_{t+1} - \left(\lambda_t^2 - \lambda_t \right)\delta_t \leq -\frac{\beta}{2}\left(\|x_t + \lambda_t d_t +\lambda_t g_t - x^* \|^2 - \|x_t +\lambda_t d_t -x^* \|^2\right).

We’re getting somewhere! (in fact, we are nearly there). In both the left-hand side and the right-hand side, we have the differences of two terms which are very much similar. We’d be the happiest creatures on Earth if, just like in the analysis of vanilla gradient descent, they could somehow transform into telescopic sums, so that each side would sum up and practically cancel itself out, leaving us with only \delta_T in the left-hand side and some not-too-large expression in the right-hand side. Since we didn’t choose \lambda_t yet, we even have hope for making this happen. But alas, there are two different expressions we need to telescopize, but only one degree of freedom in choosing the variables \lambda_t . Even if we fix \lambda_t so that the left-hand side cancels out, won’t the right-hand side be a party-pooper and spoil our cancellations?

The situation seems grim, but do not despair. When all seems lost, that is when the Deus comes out of its Machina. For remember, our notation of g_t ‘s and d_t ‘s hides the fact that underneath all the calculations floating about, we still haven’t chosen \gamma_t , the mass at step t ! We actually have two free variables to juggle, and this will give us exactly the extra leeway that we need. We’ll first choose \lambda_t so that the left-hand side telescopizes. When summing two consecutive time steps, say t and t+1 , the coefficients are

\lambda_t^2 \delta_{t+1} - (\lambda_t^2 -\lambda_t )\delta_t +    \lambda_{t+1}^2 \delta_{t+2} - (\lambda_{t+1}^2 -\lambda_{t+1} )\delta_{t+1}

= \lambda_{t+1}^2 \delta_{t+2}  + (\lambda_t^2 - \lambda_{t+1}^2 + \lambda_{t+1})\delta_{t+1} - (\ldots)\delta_t

So in order to cancel out the \delta_{t+1} in the middle, we’ll take \lambda_{t+1}^2 - \lambda_{t+1} = \lambda_t^2 . This is just a quadratic equation, which we definitely know how to solve:

\lambda_t = \frac{1 + \sqrt{1+4\lambda_{t-1}^2}}{2}.

So \lambda_t are increasing and of the order O(t) (we basically increase the value by 1/2 every time). If we set \lambda_0 = 0 (and start counting only from t=1 onwards), then we even have \lambda_1 = 1 , and so our promise that \lambda_t \geq 1 holds true.

As for the right-hand side, in order to get a telescope, all we have to do is to make sure that x_t +\lambda_t d_t + \lambda_t g_t = x_{t+1} + \lambda_{t+1} d_{t+1} . We already consumed our freedom of \lambda_t , but using the fact that d_{t+1} = \gamma_{t+1}(x_{t+1} - x_t) , the desired equality reduces to

\lambda_t (d_t +g_t) = (1+\lambda_{t+1} \gamma_{t+1})(d_t +g_t).

So all we need to do is pick \gamma_t so that \lambda_t = 1+\lambda_{t+1} \gamma_{t+1} , i.e

\gamma_{t+1} = \frac{\lambda_t - 1}{\lambda_{t+1}}.

I’ll admit, it might seem a bit counterintuitive to choose \gamma_t based on \lambda_t , since “in principle” the weights \gamma_t are chosen first, as part of the algorithm, while the coefficients \lambda_t are just a construct for our analysis of the algorithm. But that’s how we get good weights! (or, at least, good theoretical bounds for this detached analysis. As for the real world, well, real-life people making real-life choices can pick \gamma_t to be something else entirely, if it suits them).

Anyway, having set in stone both \gamma_t and \lambda_t , all the complicated equations above reduce to the simple inequality:

\lambda_t^2 \delta_{t+1} - \lambda_{t-1}^2 \delta_t \leq -\frac{\beta}{2}\left(\|x_{t+1} + \lambda_{t+1} d_{t+1} - x^* \|^2 - \|x_t +\lambda_t d_t -x^* \|^2\right).

Summing from t=1 to t=T and remembering that \lambda_{0} = 0 (very suspicious, isn’t it?), we end up with

\lambda_T^2 \delta_{T+1} \leq \frac{\beta}{2}\|x_1 + \lambda_1 d_1 - x^*\|^2 \leq \frac{\beta}{2}R^2.

Since \lambda_T \geq T/2 (proof by Mathematica), then

\delta_{T+1} \leq \frac{2\beta R^2}{T^2}.

Equating this to \varepsilon as in previous proofs gives T \leq \sqrt{2\beta R^2 / \varepsilon} + 1 as needed. □
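Putting the pieces together, here is a sketch of the whole scheme. The ill-conditioned quadratic is a toy of my own choosing; the \lambda recursion and the momentum coefficients are the ones derived above.

```python
import math
import numpy as np

# Gradient descent with momentum on the beta-smooth quadratic
# f(x) = (x1^2 + 0.01 x2^2) / 2, a toy ill-conditioned objective.
beta = 1.0
H = np.array([1.0, 0.01])                 # diagonal Hessian, largest entry beta
f = lambda x: 0.5 * float(np.dot(H * x, x))
grad = lambda x: H * x

T = 200
x_prev = x = np.array([1.0, 1.0])
lam = 1.0                                 # lambda_1 = 1
for _ in range(T):
    lam_next = (1 + math.sqrt(1 + 4 * lam * lam)) / 2  # lambda recursion
    gamma = (lam - 1) / lam_next          # momentum coefficient (lambda_t - 1) / lambda_{t+1}
    d = gamma * (x - x_prev)              # momentum term d_t
    x_prev, x = x, x + d - grad(x + d) / beta          # the momentum step
    lam = lam_next
# the deficit f(x) - f(x*) decays like 1/T^2, as the theorem promises
```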

That was quite a ride! Of course, this is just the tip of the iceberg, both for \beta -smooth functions, and for gradient descent with momentum. The worst-case theoretical analysis often differs from real life, and it turns out you can be pretty smart with choices of time steps. Check out, for example, this impressively cool webpage.

Gradient descent with both curvature and smoothness

Our descent to madness is nearing its end; there’s only one more rabbit hole I’d like to peek into today. But before that, let’s recap:
  • Without any assumptions, gradient descent gets \varepsilon close to the minimum in time O(1/\varepsilon^2) .
  • For \alpha -strongly convex functions, this is improved to O(1/\varepsilon) .
  • For \beta -smooth convex functions, this is even more improved to O(1/\sqrt{\varepsilon}) .
  • However, none of this gets close to the mighty exponential accuracy of strongly convex gradient flow.

It turns out that by combining the assumptions, gradient descent can actually be as good as gradient flow!

Theorem: If f is both \alpha -strongly convex and \beta -smooth, then there is a gradient descent algorithm which gets \varepsilon -close to the minimum in time \approx \log(1/\varepsilon) .

Proof: You’re on your own here, kiddo 😉 .
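I won’t spoil the proof either, but here is a sketch of the phenomenon on a toy quadratic of my own choosing: for a function that is both \alpha -strongly convex and \beta -smooth, plain gradient descent with the fixed stepsize 1/\beta already contracts the error by a constant factor every step, so \log(1/\varepsilon) steps suffice.

```python
import numpy as np

# f(x) = (beta x1^2 + alpha x2^2) / 2 is alpha-strongly convex and
# beta-smooth; with the fixed stepsize 1/beta, each gradient step
# multiplies the error in each coordinate by at most (1 - alpha/beta).
alpha, beta = 0.1, 1.0
H = np.array([beta, alpha])
grad = lambda x: H * x                   # minimizer is x* = 0

x = np.array([1.0, 1.0])
T = 100
for _ in range(T):
    x = x - grad(x) / beta
final_dist = float(np.linalg.norm(x))
rate_bound = (1 - alpha / beta) ** T * np.sqrt(2)   # geometric decay bound
```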

Teaser trailer: Gradient descent with different geometry

If you reached this point, you must have already converted to the gradient descent deity, offering pious gifts for quick, dimension-free optimization. But it is time to reveal our darkest secret: Gradient descent is merely a false idol of the Pure True Algorithm, an imitation, a shadow! Here are two harsh examples showcasing its imperfection:

1) Although none of the error bounds we proved above explicitly depend on the dimension n of the space we are exploring, the dimension can sometimes implicitly sneak in through one of the properties of f such as the Lipschitz constant. Even a seemingly innocuous function can carry quite a heavy Lipschitz constant, as this sort-of-real-world example reveals:

The expert advice problem: Imagine that you are a layman who is clueless about what to do in life (aren’t we all?). In front of you are n experts, labeled 1 to n , who are eager to help you and give you advice. At each time step t , you hear out what they have to say, then choose one expert I_t \in \{1, \ldots, n\} to follow. After you have made your decision, you can retroactively see which expert gave good advice and which expert gave bad advice, in the form of a vector \ell_t \in \{0,1\}^n , where \ell_t(i) = 1 if expert number i was correct at time t , and \ell_t(i) = 0 if expert number i was wrong at time t . A strainingly-contrived scenario which this sort of thing can represent is betting on the winner at the Olympic games. At the beginning of the day you pick up the newspaper to read the opinions of renowned sports experts; you then bet with your friends on who will win the 100-meter dash; at the end of the day, you either go to sleep satisfied and (moderately) richer, or curse the person who invented physical activities.

Obviously, your goal is to minimize the number of times you followed bad advice. Machine learners like to define the average regret at time T , which is how much you are sorry for your choices when compared to following the advice of any one particular expert throughout your life:

R_T = \frac{1}{T}\max_i \sum_{t=1}^T \left(\ell_t(i) - \ell_t(I_t) \right).

Indeed, for any particular i , the expression \ell_t(i) - \ell_t(I_t)  is 1 if the expert you followed (which is I_t ) gave bad advice while expert number i gave good advice. Just like in regular life, you want to minimize the regret.

The question is, then, how do you choose I_t ?

A decent descent solution: Since you want to minimize something (the regret), gradient descent may be the way to go! For x \in \mathbb{R}^n , define f_t(x) = -\ell_t \cdot x . This is a convex function, being just a scalar product between x and a fixed vector. Using the projection methods we described above, we’ll restrict ourselves to the simplex, which is the convex set of all probability distributions on n items:

K = \{x\in \mathbb{R}^n | \sum_i x_i = 1, x_i \geq 0\}.

Even though the function f now depends on time, the gradient descent algorithm works just as well (you can check it in the proof way way back). Starting with x_0 = (1/n,1/n, \ldots, 1/n) , we’ll then get a series of points x_t such that

\frac{1}{T}\sum_{t=1}^T \left(\ell_t \cdot x^* - \ell_t \cdot x_t \right) \leq \frac{RL}{\sqrt{T}},

where x^* is the minimizer of -\sum_{t=1}^T \ell_t \cdot x in the set K . The vector x^* can be viewed as a convex combination of experts – it tells you how much you should listen to each expert, relative to one another. Being the minimizer of -\sum_{t=1}^T \ell_t \cdot x , it is actually the maximizer of \sum_{t=1}^T \ell_t \cdot x , so the bound above only gets better if we replace it by a fixed expert.

So how to pick the expert I_t ? Run the gradient descent, and at each time step, treat the vector x_t as a probability distribution on the n experts and pick one of the experts at random according to this distribution. Our bounds then show that in expectation, the regret is small:

\mathbb{E}[R_T] \leq \frac{RL}{\sqrt{T}}.
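To make this concrete, here is a minimal Python sketch of the procedure just described. All the names, the step size eta = 0.1, and the choice of the standard sort-based projection onto the simplex are my own illustrative choices, not anything prescribed by the argument above:

```python
import random

def project_simplex(v):
    # Euclidean projection onto the probability simplex,
    # via the standard sort-and-threshold method.
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        t = (css - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def follow_experts(rewards, eta=0.1, seed=0):
    # rewards: list of vectors ell_t in {0,1}^n (1 = expert i was right).
    rng = random.Random(seed)
    n = len(rewards[0])
    x = [1.0 / n] * n              # start at the uniform distribution
    choices = []
    for ell in rewards:
        # sample expert I_t according to the current distribution x_t
        choices.append(rng.choices(range(n), weights=x)[0])
        # gradient of f_t(x) = -ell . x is -ell, so the descent step
        # *adds* eta * ell, then projects back onto the simplex
        x = project_simplex([xi + eta * li for xi, li in zip(x, ell)])
    return x, choices
```

Running this on a stream where one expert is always right, the distribution x quickly concentrates on that expert, as the regret bound suggests it should.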

Everything seems fine and dandy, until we remember that we finally have a concrete problem, and we have to plug in the values of R and L . The distance between points R doesn’t bother us: The simplex K is contained in a ball of radius 1 , so we certainly have R<2 . It’s the Lipschitz constant that aches: Since f_t is linear, we just have \|\nabla f_t(x) \| = \|\ell_t\| ; in the worst case (the best case, actually), when \ell_t(i) = 1 for all i , we’ll have \|\nabla f_t(x)\| = \sqrt{n} !

This can really, really suck, especially if there is a plethora of experts to choose from (which is true in the sports industry, but also in more specialized problems in machine learning). What’s worse is that there are algorithms for the expert advice problem which get bounds that scale as the \log of the dimension! (for example, the well-known (among its friends) exponential weights algorithm). How has the gradient forsaken us?
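For comparison, here is a rough sketch of the exponential weights algorithm just mentioned: keep a weight per expert, multiplicatively boost the experts that were right, and sample proportionally to the weights. The learning rate eta = 0.5 and all names are my own illustrative choices:

```python
import math
import random

def exponential_weights(rewards, eta=0.5, seed=0):
    # rewards: list of vectors ell_t in {0,1}^n (1 = expert i was right).
    rng = random.Random(seed)
    n = len(rewards[0])
    w = [1.0] * n                        # uniform initial weights
    choices = []
    for ell in rewards:
        total = sum(w)
        probs = [wi / total for wi in w]
        # sample expert I_t proportionally to the current weights
        choices.append(rng.choices(range(n), weights=probs)[0])
        # multiplicatively boost experts that were right this round
        w = [wi * math.exp(eta * li) for wi, li in zip(w, ell)]
    return w, choices
```

The multiplicative update is exactly what lets the dependence on n shrink from \sqrt{n} to logarithmic, and, as hinted at below, it secretly is a mirror descent step in disguise.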

2) We stated earlier that all gradient descent needs is a way to calculate the gradient and a way to add vectors, so that it can potentially work when the objects you are trying to minimize over are infinite-dimensional, like continuous functions. Indeed, addition and calculating the gradient is all that the basic gradient descent requires:

x_{t+1} = x_t - \eta \nabla f(x_t).
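Indeed, a minimal implementation needs nothing beyond those two operations: a gradient oracle and vector addition. The quadratic test function and step size below are just illustrative choices:

```python
def gradient_descent(grad, x0, eta=0.1, steps=100):
    # plain gradient descent: x_{t+1} = x_t - eta * grad(x_t)
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# minimize f(x) = ||x||^2, whose gradient is 2x; the iterates shrink to 0
x = gradient_descent(lambda v: [2.0 * vi for vi in v], [3.0, -4.0])
```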

But this is a bit misleading: The innards of all our proofs also relied on scalar products, which, while very friendly and nice, are not a general property of infinite spaces. If the vectors / functions / objects live in weird normed spaces, whose norm does not originate from an inner product, then all the precious geometric structure on which our arguments rested is lost. Even worse, in such spaces the gradient of a function is no longer a vector, so the update x_t - \eta \nabla f(x_t) doesn’t even immediately make sense. (For the mathematically inclined: In general normed spaces, scalar products are replaced by linear functionals, and the space of functionals is not necessarily isomorphic to the original space.)

As alluded to, these nearly-fatal flaws can be fixed, resulting in a very general, very powerful, and (eventually) much deeper algorithm, mirror descent. The concept is actually simple: Rather than try to conduct all calculations in the original space, define a “mirror map” \Phi which carries you over to a benign, peace-loving workable space (the dual space), perform all calculations there, and then map back to the original space. So in addition to all the technical tricks and schemes we utilized in ordinary gradient descent (momentum, stochastics, strong convexity…), there is an added layer of complexity in choosing and analyzing the mirror map \Phi . The deep ends of this technique can lead to some beautiful mathematics, and also some strong results (at the very least, mirror descent can solve the expert advice problem with a bound of \sqrt{\log n} , which is much better than \sqrt{n} ). But, I think, perhaps this can wait for another post.
