This post is about the basics of the “gradient descent” method for finding the minimum of a function. I started writing it mainly to review the optimization material of lectures by Sébastien Bubeck given in Seattle. All of the material can be found elsewhere (for example, Sébastien’s book), but I can assure you that in terms of the ratio between lame jokes and mathematical formulae, this blogpost is second to none (ah yes, but in which direction?). Hopefully someone might find this slightly-less-technical, slightly-more-nonsensical presentation useful (who knows, maybe even you!).

### High dimensional rock bottom

Here is a situation which happens to all of us in our daily lives: We are given a real-valued function $f : \mathbb{R}^n \to \mathbb{R}$ and we need to find its minimizer – a point $x^* \in \mathbb{R}^n$ such that $f(x^*) \leq f(x)$ for every other $x \in \mathbb{R}^n$. We often write this as $x^* = \mathrm{argmin}_{x\in \mathbb{R}^n }f(x)$. By “find the minimizer”, I mean algorithmically, i.e with a computer program; simply writing “$x^* = \mathrm{argmin} f(x)$” and calling it a day just won’t do.

This post will deal with solving this problem, starting with very simple methods and a lot of assumptions about our computational power, and then slowly descending into the madness that is convex optimization, describing more elaborate techniques that solve the problem either faster or with fewer assumptions.

What sort of assumptions do we make? Well, for starters, we’ll assume that we can easily calculate $f(x)$ for any $x$ we want, and if needed, its derivatives. We’ll also implicitly assume that $f$ has a minimum, and, if it helps, that this minimum is found in some bounded ball $B(0,\frac{1}{2}R)$ of radius $\frac{1}{2}R$ (so that the distance between any two points inside is bounded by $R$). Finally, we’ll assume that $f$ is what parental mathematicians call “well-behaved” (for some reason, they never seem to say “$f$ isn’t naughty”), i.e it doesn’t change too wildly when you vary its input by a small amount. This can be formally phrased by saying that $f$ is $L$-Lipschitz, meaning that there is a constant $L > 0$ such that

$|f(x) - f(y)| \leq L \|x-y\|$

for all $x,y \in \mathbb{R}^n$.

Since we are talking about a computer algorithm, finding a True-Minimizer of $f$ is in general impossible, due to finite precision and finite run time. But we can find a near-minimizer, which is often almost as good as a True-Minimizer. Possibly the simplest way to do this is as follows: Divide the ball $B(0,\frac{1}{2}R)$ into a tiny tiny grid $G$, so that any point $x \in B(0,\frac{1}{2}R)$ is at distance at most $\varepsilon$ from a point in the grid; this is called an $\varepsilon$-net. Here are examples of nets and functions in one and two dimensions (higher dimensions sold separately):

Denote by $z = \mathrm{argmin}_{x \in G}f(x)$ the minimizer of $f$ among all points in the grid, and let $z'$ be the grid point closest to $x^*$. Since $z$ beats every grid point, and since $f$ is $L$-Lipschitz, the value $f(z)$ is not much larger than the true minimum:

$f(z) - f(x^*) \leq f(z') - f(x^*) \leq L \|x^* - z'\| \leq L \varepsilon.$

If we know $L$, we can choose $\varepsilon$ to be small enough so that $f(z)$ is as close to the minimum as we want! Easy!

There are several problems with this approach, the cruellest of which is the sample complexity – the number of points we must evaluate, and hence the time needed to find the minimizer. The problem is that the number of points in an $\varepsilon$-net in $n$ dimensions scales like $(R/\varepsilon)^n$. For small $\varepsilon$ and even moderately-sized $n$, the number of points that the algorithm samples is enormous. That’s the curse of dimensionality for you. So although you might get away with this sort of thing in one-to-three dimensions, you can forget about it when dealing with functions of, say, a thousand variables. And unfortunately, thousand-variabled-functions are everywhere these days, especially in hot buzzword-topics like Machine Learning™.
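To see the curse in action, here is a minimal sketch of the brute-force $\varepsilon$-net search (the test function and all names are my own illustrative choices, not anything from the lectures):

```python
import itertools
import numpy as np

def grid_minimize(f, R=1.0, eps=0.1, n=2):
    """Brute-force search over an eps-grid of the cube [-R/2, R/2]^n.
    The grid has about (R/eps)^n points -- hello, curse of dimensionality."""
    axis = np.arange(-R / 2, R / 2 + eps, eps)
    best_x, best_val = None, np.inf
    for point in itertools.product(axis, repeat=n):
        point = np.array(point)
        val = f(point)
        if val < best_val:
            best_x, best_val = point, val
    return best_x, best_val

# Toy example: f(x) = ||x||^2, minimized at the origin.
x, v = grid_minimize(lambda p: np.dot(p, p), R=2.0, eps=0.05, n=2)
```

Already at $n = 2$ and $\varepsilon = 0.05$ the loop visits $41^2 = 1681$ points; at $n = 10$ it would visit $41^{10} \approx 10^{16}$, which is where we politely excuse ourselves.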

### It’s all downhill from here

Here is a limerick from the Technion’s second course in calculus for scientists:

There once was a student named Bill,
Whose bucket was placed on a hill.
He kicked with all might,
Then ran out of sight;
Which way will the water spill?
(Originally in Hebrew:

סטודנט אחד ושמו חיים / בעט בטעות בדלי מים / הדלי נשפך / הסטודנט ברח / לאן יזלו המים?)

The obvious answer is: Opposite the gradient! Indeed, if you ever find yourself stranded on top of a mountain in a bizarre $n$-dimensional landscape, then the fastest way to go downhill is to go opposite the gradient.

The gradient descent algorithm is based on this basic observation, and indeed all it does (essentially) is take small steps in the direction opposite the gradient $\nabla f(x)$, until you find a minimizer $x^*$ of $f$. You can think of this like gently rolling a ball downhill:

This assumes, of course, that $f$ is differentiable, so it’s possible to even speak of “the gradient $\nabla f$ of $f$“. There are things you can do in case $f$ is not differentiable, but we’ll stick to this assumption for the rest of this post.

The algorithm runs in steps, and at time step $t$ remembers the current position of the ball, $x_t$. The next position is obtained by taking a small step in the direction opposite the gradient:

$x_{t+1} = x_t - \eta \nabla f(x_t).$

I haven’t stated precisely in what way these steps are taken, and indeed this is the entire art of gradient descent, but that’s the overall idea. However, the keen-eyed among you will already notice a fatal flaw even in this very general program: A function can have multiple local minima, and going downhill can cause us to become stuck in one of these pits and miss out on the true minimum:

There are various ways to try and fix this (for example, we can repeat the algorithm many times from different starting points), but by far the easiest one is to assume that $f$ has only one minimum. Now we’re guaranteed that descending along the gradient will give us a global minimum. Right? Well, no:

Indeed, in flat landscapes, going downhill goes nowhere (but it goes there extremely fast), and multi-dimensional flat landscapes can be pretty tricky. So just to be on the safe side, we’re going to assume that the function $f$ is convex, i.e for all $x,y\in \mathbb{R}^n$ and all $t\in \left[0,1\right]$,

$f((1-t)x + ty) \leq (1-t)f(x) + t f(y).$

This is often stated as “the graph of a convex function is always below its chords” (see image above), and rules out the possibility of having flat plateaus at any level other than the minimum. Convex functions have only one minimal value, and as we’ll see, convexity also ensures that the various gradient descents indeed reach this minimum (and quickly so!). So from now on and until the end of time, we’ll always assume that $f$ is convex.

Convexity lets us bound how close a point $x$ is to being a minimizer $x^*$. Define the “deficit” (or should I say, “$\delta$eficit”) by

$\delta(x) = f(x) - f(x^*).$

Since $x^*$ is the minimizer of $f$, we always have $\delta(x) \geq 0$, and our goal to minimize $f(x)$ translates into a goal to find $x$ such that $\delta(x)$ is really really small. To this end, we use a common alternative description of convexity, which says that a convex function is always above its tangent:

Mathematically, this is written as

$f(x) - f(y) \leq \nabla f(x) \cdot (x - y).$

To see this, rewrite the convexity assumption as

$f(x +t(y-x)) \leq (1-t)f(x) + t f(y).$

Subtract $f(x)$ from both sides and divide by $t\|y-x\|$ to get

$\frac{f(x +t(y-x))-f(x)}{t\|y-x\|} \leq \frac{f(y)-f(x)}{\|y-x\|}.$

Sending $t \to 0$, the left-hand side becomes the directional derivative of $f$ in direction $y-x$ at the point $x$. Multiplying by $-\|y-x\|$ then gives the result, since $\nabla f(x) \cdot (y-x)$ is exactly equal to $\|y-x\|$ times the directional derivative of $f$ in direction $y-x$.

Using this result with $y = x^*$, we get a bound on the deficit:

$\delta(x) \leq \nabla f(x) \cdot (x - x^*) = -\nabla f(x) \cdot (x^* - x).$

Let’s first show that going in the direction opposite the gradient indeed gets you to the minimum. Here, by “going” I actually mean “flowing”, i.e we’ll treat the position of our ball as a function of continuous time, and update it according to a differential equation. This is a very physicsy thing to do, and is something that computers have a hard time with, since unlike mathematicians, they cannot make infinitely many infinitesimally small steps in a single time moment. Indeed, a computer trying to solve a differential equation will have to discretize time in some way, and this, as we’ll see, introduces errors. But for now, the Platonic ideal of gradient flow will give us intuition as to how going downhill eventually gets us to the bottom (geez, when you put it this way, it really sounds obvious).

Formally, it goes like this. Let $x_0\in \mathbb{R}^n$. The gradient flow is a function $x(t)$ which starts at $x(0) = x_0$ and is governed by the differential equation:

$\frac{d}{dt}x(t) = - \nabla f(x(t)).$

The hope is that if we “run” the differential equation for a long enough time $\tau$, then $x(\tau)$ will be very close to the minimizer $x^*$. This is quite true. It can be seen by example in the following two images, which show the path of a gradient flow on both the surface plot and heatmap of a two-dimensional function:

It can also be seen by the following theorem:

Theorem: For every time $\tau > 0$,

$\delta(x(\tau)) \leq \frac{\|x_0 - x^*\|^2}{2\tau }.$

Proof: Let’s look at the distance between $x(t)$ and $x^*$. The rate of change of the square of that distance is

$\frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 = \left(\frac{d}{dt}(x(t) - x^*)\right)\cdot \left(x(t) - x^* \right) .$

Since $x^*$ is a constant and does not depend on time, the left term in the inner product is just equal to $-\nabla f(x(t))$. So

$\frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 = -\nabla f(x(t)) \cdot (x(t) - x^*) \leq -\delta(x(t)),$

where the last inequality used the inequality for $\delta(x)$ that we obtained by convexity of $f$. This is a nice intermediate result: We would have really liked it if we could flow directly in the direction $x^* - x(t)$. Instead, all we can do is flow in the direction of $-\nabla f(x(t))$, but as the above equation shows, these two vectors are positively correlated, since $\delta(x)$ is always positive. So while flowing along the gradient won’t necessarily take us on the shortest route towards the minimizer, at least it’s always going in the right direction.

Anyways, now we can rearrange and integrate both sides from $0$ to $\tau$:

$\int_0^\tau \delta(x(t))dt \leq -\int_0^\tau \frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 dt.$

The right-hand side is just equal to $\frac{1}{2}\|x(0)-x^*\|^2 - \frac{1}{2}\|x(\tau)-x^*\|^2 \leq \frac{1}{2}\|x_0 - x^*\|^2$, and dividing by $\tau$ gives

$\frac{1}{\tau} \int_0^\tau \delta(x(t))dt \leq \frac{\|x_0 - x^*\|^2}{2\tau}.$

Now, $\delta(x(t))$ is a decreasing function of $t$; this can be seen, for example, by differentiating:

$\frac{d}{dt}\delta(x(t)) = \frac{d}{dt} (f(x(t))-f(x^*)) = \nabla f(x(t)) \cdot \frac{d}{dt}(x(t))= - \|\nabla f(x(t)) \|^2 \leq 0.$

Thus $\delta(x(\tau)) \leq \frac{1}{\tau} \int_0^\tau \delta(x(t))dt$, which proves the theorem. □

This process actually gave us another point $x$ for which the deficit $\delta$ is small: Since $\delta(x)$ is a convex function of $x$, we always have

$\delta\left(\frac{1}{\tau} \int_0^\tau x(t)dt\right) \leq \frac{1}{\tau}\int_0^\tau \delta(x(t))dt,$

so the point $x = \frac{1}{\tau}\int_0^\tau x(t)dt$ also satisfies the inequality stated in the theorem. This trick will come in handy in the future, where we won’t be able to guarantee that the last position is the best one.

The term $\|x_0 -x^*\|^2$ is sort of necessary: If we happen to start very far off from the minimizer, we’re going to have to flow quite a lot. But it’s really not too bad, especially if we assume as before that we know that the minimizer is at some distance $R$ from the initial point; if we want an $\varepsilon$ error in the function’s value, we only need to run the gradient flow for $R^2 / 2\varepsilon$ time.
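We can’t truly flow on a computer, but a crude Euler discretization with a very small time step lets us sanity-check the theorem numerically (the quadratic test function and all constants here are my own toy choices):

```python
import numpy as np

def euler_flow(grad, x0, tau, dt=1e-4):
    """Crude approximation of the gradient flow dx/dt = -grad f(x(t))."""
    x = np.array(x0, dtype=float)
    for _ in range(int(tau / dt)):
        x = x - dt * grad(x)
    return x

# Test function f(x) = ||x||^2, minimized at x* = 0 with f(x*) = 0.
f = lambda x: np.dot(x, x)
grad = lambda x: 2 * x

x0 = np.array([3.0, -4.0])          # so ||x0 - x*|| = 5
tau = 2.0
deficit = f(euler_flow(grad, x0, tau))
bound = np.dot(x0, x0) / (2 * tau)  # theorem: deficit <= ||x0 - x*||^2 / (2 tau)
assert deficit <= bound
```

For this particular function the flow actually converges exponentially fast, so the $R^2/2\tau$ bound is very loose here; it is the worst case over all convex functions that makes it tight.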

Note: When time is continuous, it can always be rescaled: For example, instead of the original differential equation, we could have mandated that $x(t)$ evolve according to $\frac{d}{dt}x(t) = -K \nabla f(x(t))$, where $K$ is as large a constant as we want; this corresponds to flowing “faster” by a factor of $K$. But as mentioned above, this continuous-time example is just supposed to serve as intuition; the next algorithm, which is indeed discrete and can actually be implemented by a real bit-chomping computer, won’t indulge in such luxuries.

The classical way of solving differential equations is to discretize time. Quite informally, instead of writing $\frac{d}{dt}x(t) = - \nabla f(x(t))$, we write $dx(t) = - \nabla f(x(t))dt$, and instead of the infinitesimal $dt$, we take a small time step which we denote by $\eta$. Our algorithm then reduces to iterating the following difference equation:

$x_{t+1} = x_t - \eta \nabla f(x_t)$

Note: There are fancier ways of numerically solving differential equations, such as Runge-Kutta, which will arguably (and probably provably) produce more accurate discretization results. But I favor this simple single-step method, because its analysis is easier (and more importantly, because it’s the method that was covered in the summer school I attended…)

Choosing the step size $\eta$ is an important part in the design of the algorithm. The gradient changes from place to place, so if $\eta$ is too large, we can easily overstep and miss the optimal, flow-induced trajectory, giving a less-than-optimal trajectory:

Doing this can be dangerous, potentially even forcing the algorithm into an infinite loop. For example, consider the very simple case of the one-dimensional $f(x) = x^4/4$. The negative gradient in this case is just $- \nabla f(x) = -x^3$, so the difference equation becomes

$x_{t+1} = x_t - \eta x_t^3.$

The minimum of this function is of course $x^* = 0$. Suppose that at some point we just happen to arrive at $x_t = \sqrt{2/\eta}$. Then the next position will be

$x_{t+1}= \sqrt{2/\eta} - \eta \left(\sqrt{2/\eta}\right)^3 = \sqrt{2/\eta} - 2 \sqrt{2/\eta} = -\sqrt{2/\eta} = -x_t,$

and from this point on the algorithm will just bounce back and forth between $\sqrt{2/\eta}$ and $-\sqrt{2/\eta}$, missing the minimum $0$ entirely! Overshooting the minimizer will happen less often if $\eta$ is small.
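This bouncing is easy to reproduce in a few lines of throwaway code (the step count and value of $\eta$ are arbitrary choices of mine):

```python
# f(x) = x**4 / 4, so the gradient step is x -> x - eta * x**3.
eta = 0.1
x = (2 / eta) ** 0.5      # the magic starting point sqrt(2/eta)
trajectory = [x]
for _ in range(6):
    x = x - eta * x ** 3
    trajectory.append(x)
# The iterates alternate between +sqrt(2/eta) and -sqrt(2/eta),
# never getting any closer to the minimizer 0.
```

(Floating-point roundoff eventually perturbs the perfect cycle, and since the cycle is unstable the perturbation grows; over these few steps, though, the bounce is clean to many decimal places.)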

Here is a catastrophic failure of the algorithm for the example we used so far (the image shows 100 steps of gradient descent!):

On the other hand, too small an $\eta$ means that we will need to perform many, many steps in order to get close to $x^*$. Viewed another way, suppose that we decide on the number of steps that we are going to run the algorithm in advance, and only then choose $\eta$. If we choose $\eta$ too small, we aren’t going to move very far from the original starting position.

To conclude this discussion: Choosing $\eta$ too large can cause overstepping, meaning that we take steps which are too large in unwanted directions which do not directly lead us towards $x^*$. Choosing $\eta$ too small gives a better approximation of the continuous differential equation, but may lead to super-slow convergence, so the final $x_T$ is not even close to $x^*$. The gradient descent algorithm has to balance between these two types of errors: The “discretization error”, and the “forget the initial condition error”. The formal analysis is given below:

Theorem: Let $f$ be convex and $L$-Lipschitz and let $\varepsilon > 0$. With proper choice of $\eta$, the gradient descent algorithm can find a point $x$ so that the deficit is small, i.e $\delta(x) \leq \varepsilon$, in just $T = \frac{R^2 L^2}{\varepsilon^2}$ steps.

Proof: Similarly to the gradient flow analysis, let’s look at how the distance between $x_t$ and $x^*$ improves when the algorithm takes one step. By the cosine law,

$\| x^* - x_{t+1}\|^2 = \| x^* - x_{t} + \eta \nabla f(x_t)\|^2 = \| x^* - x_{t}\|^2 + \|\eta \nabla f(x_t) \|^2 + 2(x^* - x_t)\cdot (\eta \nabla f(x_t)).$

Rearranging, we have

$\| x^* - x_{t+1}\|^2 - \| x^* - x_{t}\|^2 = \|\eta \nabla f(x_t) \|^2 + 2(x^* - x_t)\cdot (\eta \nabla f(x_t)).$

The term on the left-hand side tells us by how much the distance to $x^*$ improved; we want this quantity to be as negative as possible. Now, the first term on the right-hand side is always positive, and works against us; but the second term can make up for it, for as we know, $\nabla f(x_t)$ is negatively correlated with $x^* - x_t$. To see how this plays out, multiply by $-1$, divide by $2\eta$, and rearrange again to see how large this correlation is:

$\nabla f(x_t) \cdot (x_t - x^*) = \frac{\| x^* - x_{t}\|^2 - \| x^* - x_{t+1}\|^2}{2\eta} + \frac{\eta}{2}\|\nabla f(x_t)\|^2.$

Suppose that we run the algorithm for $T$ steps. Summing up the above expression for all times $t=0,\ldots,T-1$, the distances $\|x^* - x_t\|^2$ form a telescoping sum, and after a division by $T$ we get

$\frac{1}{T}\sum_{t=0}^{T-1} \nabla f(x_t) \cdot (x_t -x^*) = \frac{\| x^* - x_{0}\|^2 - \| x^* - x_{T}\|^2}{2\eta T} + \frac{\eta}{2T} \sum_{t=0}^{T-1}\|\nabla f(x_t) \|^2.$

The first term on the right-hand side is certainly smaller than $\| x^* - x_{0}\|^2/2\eta T$, which itself is smaller than $R^2 / 2\eta T$, and since $f$ is $L$-Lipschitz, the second term is smaller than $\eta L^2 /2$. Thus

$\frac{1}{T}\sum_{t=0}^{T-1} \nabla f(x_t) \cdot (x_t -x^*) \leq \frac{1}{2}\left( \frac{R^2}{\eta T} + \eta L^2\right).$

So far we haven’t even used the convexity of $f$, and the sums were valid for any Lipschitz function. But now, define $x = \frac{1}{T} \sum_{t=0}^{T-1}x_t$. Then by the basic definition of convexity,

$f(x) - f(x^*) = f\left(\frac{1}{T} \sum_{t=0}^{T-1}x_t\right) - f(x^*) \leq \frac{1}{T}\sum_{t=0}^{T-1}f(x_t)-f(x^*).$

But we also saw that for convex functions, $f(x) - f(y) \leq \nabla f(x) \cdot (x - y)$ for all $x$ and $y$. Thus each term in the right-hand side is smaller than $\nabla f(x_t) \cdot (x_t -x^*)$, and combining this with the expression we got before, we reach the stunning result:

$\delta(x) = f(x) - f(x^*) \leq \frac{1}{2}\left( \frac{R^2}{\eta T} + \eta L^2\right).$

As if by divine prophecy, we see here exactly the two types of errors that we promised we would see: The first, $R^2 / \eta T$, comes from having a step size too small to get from $x_0$ to $x^*$ in any reasonable time. You can see that it grows smaller as both $T$ and $\eta$ grow larger. The second, $\eta L^2$, stems from overshooting. As per our dictated intuition, the larger the step size $\eta$, the larger the error; but the error also grows with the Lipschitz constant of $f$. In other words, if $f$ can have very steep ridges and valleys, i.e it changes too quickly over short distances, these overshoots will cause us to miss the mark.

(The keen-eyed reader will also note another divine prophecy – the algorithm did not eventually take $x = x_T$ (we have almost no information on how good $x_T$ is), but rather used convexity and took $x$ to be the average of all points visited by the algorithm. Now, by the pigeonhole principle, one of the sample points must be better than the average, so in principle if you can easily calculate $f$, you can take the best $x_t$. But this just serves to highlight an interesting feature of gradient descent: It only needs to calculate $\nabla f(x)$ in order to work, and not $f(x)$ itself!)

Alright! All that’s left to do is optimize over $\eta$. Choose $\eta = \frac{R}{L \sqrt{T}}$ to get

$\delta(x) \leq \frac{RL}{\sqrt{T}}.$

Setting the $\delta$eficit to $\varepsilon$ and rearranging gives

$T \leq \frac{R^2 L^2}{\varepsilon^2}$

as promised by the theorem. □

That’s it for the basic gradient descent algorithm. It’s actually quite decent. The assumptions we made – that $f$ is Lipschitz and that you know its Lipschitz constant, for example – aren’t that far-fetched, and in the worst case you can always guess $L$ and/or $R$ and run the algorithm multiple times. It has the curious property that you don’t even need to be able to calculate $f$ to approximate its minimum, but rather only its gradients! This is nice in applications where computing $f$ is hard but computing $\nabla f$ is relatively easy. Finally, in terms of runtime, I think that making $R^2 L^2/\varepsilon^2$ gradient calculations is much better than making the $(R/\varepsilon)^n$ function calculations of our naive brute-search algorithm almost any day (well, any day that $n>2$).
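Here is the whole algorithm, averaging trick included, as a short sketch. The 1-Lipschitz test function $f(x) = \sqrt{1 + \|x\|^2}$ and all parameter choices are mine, purely for illustration:

```python
import numpy as np

def gradient_descent_avg(grad, x0, eta, T):
    """Run T steps of x_{t+1} = x_t - eta * grad(x_t) and
    return the average of the visited points, as in the proof."""
    x = np.array(x0, dtype=float)
    total = np.zeros_like(x)
    for _ in range(T):
        total += x
        x = x - eta * grad(x)
    return total / T

# f(x) = sqrt(1 + ||x||^2): convex, 1-Lipschitz, minimized at 0 with value 1.
f = lambda x: np.sqrt(1 + np.dot(x, x))
grad = lambda x: x / np.sqrt(1 + np.dot(x, x))

R, L, T = 2.0, 1.0, 400
eta = R / (L * np.sqrt(T))                 # the step size from the proof
x_bar = gradient_descent_avg(grad, np.array([2.0, 0.0]), eta, T)
deficit = f(x_bar) - 1.0
assert 0 <= deficit <= R * L / np.sqrt(T)  # theorem: deficit <= RL / sqrt(T)
```

Note that the algorithm only ever calls `grad`, never `f` itself; `f` appears at the end just to measure the deficit.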

The brutal efficiency of gradient descent actually caught me off guard when I first saw it. Nowhere at all does the approximation bound or the algorithm itself explicitly mention the dimension; it works just as well for $n=2$ and $n=1000$. In fact, it even works for infinite dimensional spaces, such as function spaces! (Assuming you have some practical way to calculate gradients and perform addition in those cases, of course.) But more on that will come (much) later. For now, let’s see how gradient descent can be tweaked both to work in harsher, less-optimistic situations, and to converge faster in smoother, more forgiving conditions.

Under the Universal Declaration of the United Nations, freedom of movement is a fundamental human right. However, this is not always the case for the points $x_t$ in the gradient descent algorithm. Often we are interested in minimizing a function not over the entire domain $\mathbb{R}^n$, but rather over a smaller set; for example, we might wish to find the minimizing vector of length $\|x\| \leq 1$, or a vector whose sum of coordinates is $\sum_{i=1}^{n}x_i = 0$, or a vector which satisfies a prescribed list of linear inequalities, $Ax \geq \mathbf{b}$. These conditions are called constraints. Alas, the original gradient descent algorithm has no knowledge of these constraints, and eagerly carries the points $x_t$ outside the allowed set.

Luckily, this problem has an easy solution, in case the constraint set $K$ is convex, i.e for all points $x,y \in K$ and all $t \in \left[0,1\right]$, the point $\left(1-t\right)x + ty$ is also in $K$.

This demand makes sense, since we are dealing with convex functions anyway, and often holds in real life situations (the three constraint examples given above all define convex sets. Actually, they define closed convex sets, and indeed we’ll assume that $K$ is closed).

The solution is straightforward: After each gradient step, we project the point $x_{t+1}$ back into the set $K$. By projection, we mean that $x$ is sent to the point $y$ in $K$ that is closest to $x$. More formally, for any set $K \subseteq \mathbb{R}^n$, the distance between $K$ and $x$ is defined as

$d(x,K) = \inf_{y\in K}\|x-y\|.$

If $K$ is closed then the infimum is actually a minimum (pretty much by definition), so there exists a point $y_0 \in K$ such that $\|x-y_0\| \leq \|x-y\|$ for all $y\in K$; and if $K$ is convex, this point $y_0$ is unique (if there were two points of minimal distance, say $y_0$ and $y_1$, then the midpoint between them would be closer to $x$ than either!). We then define the projection onto $K$ by $P_K(x) = \mathrm{argmin}_{y\in K}\|x-y\|$.

Note that convexity is indeed important for uniqueness:

Our new, boundary-respecting gradient descent algorithm then simply applies $P_K$ at each step:

$x_{t+1} = P_K(x_t - \eta \nabla f(x_t)).$

Now, you might think that with another layer wrapping the $x_t$ and gradient, the analysis of this new algorithm will be harder, or the convergence slower. But not at all; in fact, projecting onto $K$ can only bring us closer to the minimizer!

Claim: For any $x \in \mathbb{R}^n$ and any $z\in K$, $\|P_K(x)-z\| \leq \|x-z\|$.

Proof: If $x\in K$ then $P_K(x) = x$ and there is nothing to prove, so we assume that $x$ is not in $K$. Since we are only interested in distances between points, we may translate and scale both the convex body $K$ and $x$ together whichever way we want without affecting the relation between the distances in the claim, so for our general convenience, assume that $P_K(x) = 0$ and that $\|x\| = 1$. This means that the distance $d(x,K)=1$ under our scaling. The vector $x$ naturally defines a hyperplane $H = \{y\in \mathbb{R}^n : x\cdot y = 0\}$. Like all hyperplanes, this hyperplane divides $\mathbb{R}^n$ into two halfspaces: the set of all points with scalar product greater than $0$, and the set of all points with scalar product less than $0$. As it turns out, the entire set $K$ is contained in the closure of this latter halfspace, i.e for all $z \in K$, $x \cdot z \leq 0$.

To see this, let $v_1 = x$ and complete $v_1$ to an orthonormal basis $B=\{v_1,\ldots,v_n\}$. Any point $y = (\alpha_1,\ldots,\alpha_n)$ written in this basis which has positive scalar product with $x$ must have $\alpha_1 > 0$. Since $K$ is convex and $0 \in K$, if there exists such a $y \in K$, then for all $t \in \left[0,1\right]$ the point

$y_t = (t\alpha _1, t\alpha_2,\ldots,t\alpha_n)$

is in $K$ as well. The squared distance between $y_t$ and $x$ is given by

$\|x -y_t \|^2 = \left(1 - t \alpha_1\right)^2 + \left(t\alpha_2 \right)^2 + \ldots + \left(t\alpha_n \right)^2$

$= 1 - 2t\alpha_1 + (t\alpha_1)^2 + \left(t\alpha_2 \right)^2 + \ldots + \left(t\alpha_n \right)^2.$

For $t$ small enough (but greater than $0$), all the squared terms are negligible compared to the term $2t\alpha_1$; in fact, we can take $t$ so that

$\|x -y_t \|^2 \leq 1- t\alpha_1.$

But this contradicts the fact that the distance between $x$ and $K$ is $1$! Hence every $z \in K$ satisfies $x \cdot z \leq 0$, and so

$\|x - z\|^2 = \|x\|^2 - 2x\cdot z + \|z\|^2 \geq \|z\|^2 = \|P_K(x) - z\|^2,$

which is exactly the claim. □

Applying $P_K$ onto $x_{t+1}$ therefore takes us closer to every point in $K$, and in particular, it takes us closer to the minimizer $x^*$. Plugging this fact into the analysis of the previous section, we get exactly the same error bounds as in the unconstrained gradient descent algorithm. In other words, as the locals around here like to say, gradient descent eats convex constraints without salt.

(Of course, we implicitly assume here that we can calculate the projection $P_K$ easily. Sometimes this is easy (for example, when you are dealing with linear constraints, or projecting onto spheres), but I can imagine that in some applications the set $K$ is given rather implicitly, and you’d have to go deeper into this rabbit hole to get good approximations.)
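For the easy case of the unit ball, where $P_K$ is just radial shrinking, projected gradient descent fits in a few lines. The quadratic objective and all constants below are my own toy choices, not anything from the lectures:

```python
import numpy as np

def project_unit_ball(x):
    """P_K for K = {x : ||x|| <= 1}: shrink x radially if it lies outside."""
    norm = np.linalg.norm(x)
    return x if norm <= 1 else x / norm

def projected_gd(grad, x0, eta, T, proj):
    x = np.array(x0, dtype=float)
    total = np.zeros_like(x)
    for _ in range(T):
        total += x
        x = proj(x - eta * grad(x))   # gradient step, then project back into K
    return total / T

# Minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball;
# the constrained minimizer is then c / ||c||.
c = np.array([3.0, 4.0])
grad = lambda x: 2 * (x - c)
x_bar = projected_gd(grad, np.zeros(2), eta=0.05, T=500, proj=project_unit_ball)
```

On this example `x_bar` ends up very close to $c/\|c\| = (0.6, 0.8)$, and by convexity of the ball the averaged point stays inside the constraint set too.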

### Gradient descent without knowing the time

One complaint that often gets filed in the deep archives of the Bureau of Gradient Descent is that $\eta$ is given by

$\eta = \frac{R}{L \sqrt{T}}.$

The problem with this expression is that it requires knowing the number of steps $T$ that the algorithm is going to take in advance. This is fine in some cases – for example, if you really just want an error guarantee of $\varepsilon$, take $T$ to be $T = \frac{R^2 L^2}{\varepsilon^2}$ and let her rip. But sometimes it’s good to have a more adaptive algorithm, that runs and runs and runs and keeps getting better and better until you tell it to stop. In this case you do not know $T$ in advance, so what do you do with $\eta$?

Luckily, there is a very simple fix. Instead of having the step-size $\eta$ be constant throughout each step, use a time-dependent step-size:

$x_{t+1} = x_t - \eta_t \nabla f(x_t).$

When you plug this into the gradient descent analysis, you get

$\frac{1}{T}\sum_{t=0}^{T-1} \nabla f(x_t) \cdot (x_t -x^*) = \frac{1}{T} \sum_{t=0}^{T-1} \frac{\| x^* - x_{t}\|^2 - \| x^* - x_{t+1}\|^2}{2\eta_t} + \frac{1}{2T} \sum_{t=0}^{T-1}\eta_t\|\nabla f(x_t) \|^2.$

The more pretentious among us might say that it’s an entire art form in itself to pick the right $\eta_t$, but actually in this particular simple case, choosing the right $\eta$ is not that difficult. Here are two relatively-easy-to-analyze choices:

• In the first, take $\eta_t = R/\left(L\sqrt{t+1}\right)$.
• In the second, set $t_i = 2^i$, and take $\eta_t = \frac{R}{L}\cdot 1/\sqrt{t_i}$ if $t_i \leq t < t_{i+1}$. In words, $\eta_t$ is a step-function approximation of $R/(L\sqrt{t})$, held constant on longer and longer stretches as time goes on.
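The first choice is a one-line change on top of the vanilla algorithm; here is a sketch (the test function and constants are again my own illustrative picks):

```python
import numpy as np

def gd_adaptive(grad, x0, R, L, T):
    """Gradient descent with the time-dependent step eta_t = R / (L * sqrt(t+1)).
    The step-size rule never looks at the total number of steps T."""
    x = np.array(x0, dtype=float)
    total = np.zeros_like(x)
    for t in range(T):
        total += x
        x = x - (R / (L * np.sqrt(t + 1))) * grad(x)
    return total / T

# Toy 1-Lipschitz convex function f(x) = sqrt(1 + x^2), minimum value 1 at x = 0.
f = lambda x: float(np.sqrt(1 + x * x))
grad = lambda x: x / np.sqrt(1 + x * x)
x_bar = gd_adaptive(grad, np.array(2.0), R=2.0, L=1.0, T=1000)
deficit = f(x_bar) - 1.0   # small, even though T never entered the step sizes
```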

The calculations which show that these kinds of things work aren’t that friendly, but are certainly doable if you rely on the proof of the basic gradient descent algorithm, so I won’t repeat them here. In any case, it should be noted that in some practical, real-world, actual, useful, non-theoretical applications, the total runtime $T$ is not that large anyway, since empirically the algorithm may converge very fast. Also, in reactive or real-time systems, who has the patience to run a million steps of gradient descent every time they want to do something?

Not surprisingly, the gradient descent algorithm assumes that we can calculate the gradient, and if we want the algorithm to really, actually, work in practice, we should also be able to calculate the gradient quickly. But real-life isn’t always nice that way, and many times calculating $\nabla f$ can be prohibitive. One real-world example is the parameter optimization of artificial neural networks. I won’t go into the details in this post (for that, you can watch the amazing “neural network” video series by 3blue1brown, found here), but the gist is that neural networks are created by “training” them, a process that involves minimizing a cost function $f$ which depends on the neural weights $w$ of the network. This cost function usually measures the error of the network on a “training set” of samples $S$:

$f(w) = \sum_{s\in S} \mathrm{error}(w,s),$

where $\mathrm{error}(w,s)$ is some function that measures how well the network with weights $w$ performed on the input $s$. The usual paradigm in artificial neural networks concerning the size of the training set is “the more the merrier, and keep ’em coming!” But if there are hundreds of thousands, or even billions of samples, calculating the gradient of the cost function even at a single point can put a large hole in your time-wallet.

Note: The loss functions in neural networks usually aren’t convex. That really doesn’t stop people from doing gradient descent anyways.
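To make the sum-over-samples structure concrete, here is a sketch of estimating the gradient of such a cost from a small random batch rather than the full training set. The least-squares setup and every name below are my own toy assumptions, not anything from the lectures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training set": 10,000 samples, cost f(w) = (1/|S|) sum_s (x_s . w - y_s)^2.
X = rng.normal(size=(10_000, 5))
w_true = rng.normal(size=5)
y = X @ w_true

def minibatch_grad(w, batch_size=32):
    """Unbiased estimate of grad f(w): average the per-sample gradients over
    a random minibatch instead of summing over all 10,000 samples."""
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(5)
for _ in range(2000):
    w -= 0.01 * minibatch_grad(w)
# w ends up very close to w_true, at a fraction of the full-gradient cost.
```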

Whatever will the artificial intellimancers do? One common tactic that all of us frequently use when faced with lots of work but little time to do it (apart from procrastinating), is to just not do all the work. So the intellimancers asked themselves, “Instead of going over the entire sample set $S$ in the above example, is it possible to approximate the gradient by going over just a small portion of it?”

The answer is, of course, a resounding yes! The “gradient” in “gradient descent” can actually be replaced by any vector, as long as on average that vector is equal to the gradient. Doing this is often called “stochastic gradient descent”, and it deserves its own theorem.

Theorem: For $t=0,1,\ldots$, define the difference process

$x_{t+1} = x_t - \eta g_t,$

where $g_t$ are some (random) vectors which can possibly depend on $x_t$. Let $f$ be a convex function, and assume that

$\mathbb{E}\left[g_t \mid x_t \right] = \nabla f\left(x_t\right)$

and

$\mathbb{E} \|g_t\|^2 \leq B^2.$

Then in expectation, gradient descent works! More precisely, after $T = \frac{R^2B^2}{\varepsilon^2}$ time steps, the stochastic gradient descent algorithm finds a point $x$ such that the expected error $\mathbb{E}\left[\delta\left(x\right) \right]$ is smaller than $\varepsilon$.

Proof: The calculations are actually almost exactly the same as the ones we did for showing that vanilla gradient descent works. We again look at how the distance between $x_t$ and $x^*$ diminishes as a function of time:

$\| x^* - x_{t+1}\|^2 = \| x^* - x_{t} + \eta g_t\|^2 = \| x^* - x_{t}\|^2 + \|\eta g_t \|^2 + 2\left(x^* - x_t\right)\cdot \left(\eta g_t\right).$

Rearranging, summing over $t=0,\ldots,T-1$, and dividing by $T$, we once again get

$\frac{1}{T}\sum_{t=0}^{T-1} g_t \cdot (x_t -x^*) = \frac{\| x^* - x_{0}\|^2 - \| x^* - x_{T}\|^2}{2\eta T} + \frac{\eta}{2T} \sum_{t=0}^{T-1}\|g_t \|^2.$

So far we haven’t done anything new, and in fact haven’t even used any property of $g$ whatsoever; the above formula is true for any set of vectors $g_t$. But when we take the expectation, everything falls into place: On the right-hand side, since $\mathbb{E}\left[\|g_t\|^2\right] \leq B^2$, the second term becomes

$\mathbb{E}\left[\frac{\eta}{2T} \sum_{t=0}^{T-1}\|g_t \|^2\right] \leq \frac{\eta}{2T} \sum_{t=0}^{T-1}B^2 = \frac{\eta B^2}{2}.$

On the left-hand side, the expectation of each term can be calculated using conditioning:

$\mathbb{E}\left[g_t \cdot \left(x_t -x^*\right)\right] = \mathbb{E}\left[\mathbb{E}\left[g_t \cdot \left(x_t -x^*\right) \mid x_t \right]\right]$

$= \mathbb{E}\left[\mathbb{E}\left[g_t \mid x_t \right] \cdot \left(x_t -x^*\right)\right] = \mathbb{E}\left[\nabla f\left(x_t\right) \cdot \left(x_t -x^* \right)\right].$

After doing this, we’ve basically eliminated all the randomness in $g_t$ by putting things inside an expected value, and we get the now-familiar equation

$\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} \nabla f(x_t) \cdot (x_t -x^*)\right] \leq \frac{1}{2}\left( \frac{R^2}{\eta T} + \eta B^2\right).$

Note that we still need to take expectation in the left-hand side, since $x_t$ is now a random variable itself. Also note that the role of $L$ has been taken by $B$. You might think that this removes the Lipschitzness assumption from $f$, but it doesn’t really; in fact, if $g_t$ is supposed to be a stochastic approximation of $\nabla f\left(x_t\right)$, then we really should expect $B$ to be greater than $L$ (intuitively, any time that $\|g_t\| \leq \|\nabla f\left(x_t\right)\|$, we’d need to have a balancing case where $\|g_t\| \geq \|\nabla f\left(x_t\right)\|$, but this works out in $L$‘s favor). Anyway, we can now continue as we did before: By convexity, for $x = \frac{1}{T}\sum_{t=0}^{T-1}x_t$,

$\mathbb{E}\left[f\left(x\right) - f\left(x^*\right)\right] = \mathbb{E}\left[f\left(\frac{1}{T} \sum_{t=0}^{T-1}x_t\right) - f\left(x^*\right)\right] \leq \mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}f\left(x_t\right)-f\left(x^*\right)\right],$

and also by convexity, the right-hand side is equal to $\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} \nabla f\left(x_t\right) \cdot \left(x_t -x^*\right)\right]$. So in total,

$\mathbb{E}\left[\delta\left(x\right)\right] = \mathbb{E}\left[f(x) - f(x^*)\right] \leq \frac{1}{2}\left( \frac{R^2}{\eta T} + \eta B^2\right).$

Once again, choose $\eta = \frac{R}{B \sqrt{T}}$ to get

$\mathbb{E}\left[\delta\left(x\right)\right] \leq \frac{RB}{ \sqrt{T}}.$

Setting the expected $\delta$eficit to $\varepsilon$ and rearranging gives

$T \leq \frac{R^2B^2}{\varepsilon^2}.$

The keen-eyed skeptics among you should slightly raise an eyebrow, and wonder about the usefulness of such a bound. After all, the theorem only says that the expectation of the error is small; how can you know whether an individual run is good or not? How can you be sure that, while the expectation is small, you won’t have to occasionally run the algorithm for a tremendous amount of time?

Luckily for us, the quantity $\delta\left(x\right)$ is positive, so we can apply Markov’s inequality to it: For every $\alpha > 0$,

$\mathbb{P}\left[\delta\left(x\right) \geq \alpha \right] \leq \frac{\mathbb{E}\left[\delta\left(x\right)\right]}{\alpha} \leq \frac{\frac{RB}{ \sqrt{T}}}{\alpha}.$

Choosing $\alpha = \frac{2RB}{ \sqrt{T}}$, we get that

$\mathbb{P}\left[\delta\left(x\right) \geq \frac{2RB}{ \sqrt{T}} \right] \leq \frac{1}{2}.$

You can then make sure that you get an actual good $x$ with low error by running the stochastic gradient descent algorithm in parallel, using independent randomness:
1. Set $T = \frac{R^2B^2}{\varepsilon^2}$, so that the expected error is smaller than $\varepsilon$.
2. Run $K$ independent copies of stochastic gradient descent, each for $T$ time steps, giving final points $x^{\left(1\right)}, \ldots, x^{\left(K\right)}$.
3. Return the point $x^{\left(i\right)}$ attaining $\min\{f\left(x^{\left(1\right)}\right),\ldots, f\left(x^{\left(K\right)}\right)\}$.

Each $x^{\left(i\right)}$ has a probability of at least $1/2$ of giving an error less than $2\varepsilon$, so the probability that all of the $x^{\left(i\right)}$‘s suck is smaller than $2^{-K}$. So in order to get an error less than $2\varepsilon$ with probability at least $1-p$, you only need to (collectively) run the algorithm for $\frac{R^2B^2}{\varepsilon^2} \cdot \log_2\left(1/p\right)$ time steps. Not too bad!
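As a sketch, the amplification trick might look like this in code (the objective, the noise model, and all the constants here are illustrative toy choices, not anything from the theorem):

```python
import numpy as np

# A sketch of the amplification trick, with illustrative choices of f,
# noise, step size, and T: run K independent copies of stochastic
# gradient descent and keep the candidate with the smallest value.
rng = np.random.default_rng(1)

def f(x):
    return 0.5 * np.dot(x, x)           # toy convex objective, minimizer 0

def sgd_run(T=200, eta=0.1):
    x = np.array([3.0, 4.0])
    avg = np.zeros_like(x)
    for _ in range(T):
        g = x + rng.standard_normal(2)  # unbiased noisy gradient of f
        x = x - eta * g
        avg += x
    return avg / T                      # average iterate of this run

K = 8
candidates = [sgd_run() for _ in range(K)]
best = min(candidates, key=f)           # return the point attaining the min
print(f(best))
```

Note that the last line keeps the *point*, not just the value — that is the $x$ you actually care about.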

Exercise for the reader: The above algorithm requires you to calculate $f\left(x^{\left(i\right)}\right)$, which means being able to calculate $f$. This is something that the deterministic gradient algorithm did not require. Can you find a way to get, with exponentially high probability, a point $x$ with good error bound, without ever calculating $f$ on any point?

By the way, after all this talk and proof, we can finally see the solution to the gradient computation problem in artificial neural networks. Instead of calculating the entire gradient of the cost function which goes over the entire training set $S$, pick at random some small subset of it, and calculate the gradient of the cost function only on those elements. In expectation, the two behave the same, and if the cost function is not naughty, the bound $B$ will not be too large. Then just apply the above theorem. Actually, for $B$ not to be large, you might need to take quite a lot of samples. There are techniques for combining the deterministic gradient descent and the stochastic one together, but I won’t talk about them in this post.
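Here is a sketch of that idea on a hypothetical least-squares cost (the data, batch size, and step size are all made up for illustration): the gradient over a small random batch is an unbiased estimate of the full gradient, and descending with it still finds the minimizer.

```python
import numpy as np

# A sketch of the minibatch idea on a hypothetical least-squares cost
# C(w) = (1/|S|) sum_{(s,y) in S} (w . s - y)^2 : the gradient over a
# random small batch is an unbiased estimate of the full gradient.
rng = np.random.default_rng(2)
n_samples, dim = 10_000, 5
X = rng.standard_normal((n_samples, dim))
w_true = rng.standard_normal(dim)
y = X @ w_true                          # synthetic "training set"

def full_gradient(w):
    return 2 * X.T @ (X @ w - y) / n_samples

def minibatch_gradient(w, batch_size=64):
    idx = rng.integers(0, n_samples, size=batch_size)
    return 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size

# In expectation the two gradients agree (here, averaged over 500 batches):
w0 = np.zeros(dim)
g_est = np.mean([minibatch_gradient(w0) for _ in range(500)], axis=0)
print(np.linalg.norm(g_est - full_gradient(w0)))    # small

# Stochastic gradient descent using only minibatches:
w = np.zeros(dim)
for _ in range(2000):
    w -= 0.05 * minibatch_gradient(w)
print(np.linalg.norm(w - w_true))       # close to the true weights
```

Each step touches only $64$ samples instead of all $10{,}000$, which is exactly the time-wallet savings we were after.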

It’s been a while since we last saw pretty pictures, so here is a simulation showing the difference between ordinary gradient descent and stochastic gradient descent (the vectors $g_t$ are obtained from $\nabla f\left(x_t\right)$ by simply adding a Gaussian vector as noise).
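The animation itself isn’t reproduced here, but a minimal sketch of such a simulation (with an illustrative quadratic $f$ and noise level; this is not the post’s actual code) might look like:

```python
import numpy as np

# A minimal sketch of the simulation: stochastic gradient descent on the
# toy convex function f(x) = ||x||^2 / 2, where g_t is the true gradient
# plus a Gaussian vector, so that E[g_t | x_t] = grad f(x_t).
rng = np.random.default_rng(0)

def f(x):
    return 0.5 * np.dot(x, x)

def grad_f(x):
    return x

def descend(x0, eta, T, noise):
    x = x0.copy()
    iterates = [x.copy()]
    for _ in range(T):
        g = grad_f(x) + noise * rng.standard_normal(x.shape)
        x = x - eta * g
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)    # average iterate, as in the theorem

x0 = np.array([3.0, -4.0])              # so R = ||x0 - x*|| = 5
x_gd = descend(x0, eta=0.1, T=500, noise=0.0)    # ordinary gradient descent
x_sgd = descend(x0, eta=0.1, T=500, noise=1.0)   # stochastic version

print(f(x_gd), f(x_sgd))                # both near the minimum value 0
```

The noisy run wanders around much more, but its averaged iterate still lands close to the minimizer, just as the theorem promises.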

When somebody says “convex function”, what pops up in your head? If you’re anything like me (my condolences), it’s probably something like this:

That’s fine, but you will do well to remember that even a linear function is convex:

The difference between the two images, of course, is curvature, and sometimes it makes sense to distinguish between functions which have curvature (such as $x^2$ or $e^x$) and functions which do not, such as $3x+1$. Indeed, by definition, a convex function is always found above its tangent, and in this sense a linear function is the least-convex of all convex functions: The inequality is so poor, it’s actually an equality, since the function is its own tangent. This calls for a stronger notion of convexity, which we will aptly call “strong convexity”.

A function $f$ is called $\alpha$-strongly convex if we can write

$f(x) = g(x) + \frac{\alpha}{2}\|x\|^2,$

where $g$ itself is a convex function. In words, a strongly-convex function is at least as curved as a parabola. In particular, this means that the function is quadratically greater than its tangent:

Claim: If $f$ is an $\alpha$-strongly convex function, then

$f(y) \geq f(x) + \nabla f(x)\cdot (y-x) + \frac{\alpha}{2}\|x-y\|^2.$

Proof: Luckily we already did all the work for proving this! Rearranging the definition of strong convexity, we have that

$g(x) = f(x) - \frac{\alpha}{2}\|x\|^2$

is a convex function. We already proved earlier that convex functions are above their tangent, so

$g(x) - g(y) \leq \nabla g(x) \cdot (x-y).$

The gradient of $g$ at $x$ is just equal to $\nabla f(x) - \alpha x$, and plugging this in and rearranging yields

$f(y) - f(x) \geq \nabla f(x) \cdot (y-x) - \frac{\alpha}{2}\|x\|^2 + \frac{\alpha}{2}\|y\|^2 + \alpha x \cdot (x-y).$

The last term on the right-hand side is just $\alpha \|x\|^2 - \alpha x\cdot y$, so

$f(y) - f(x) \geq \nabla f(x) \cdot (y-x) + \frac{\alpha}{2}\left(\|x\|^2 + \|y\|^2 - 2 (x \cdot y)\right).$

By the cosine law, this is exactly what we want! □
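Since we’ll lean on this claim repeatedly, here is a quick numerical sanity check (a sketch, with the illustrative choice $f(x) = e^{x_1} + \frac{1}{2}\|x\|^2$, which is $1$-strongly convex since $e^{x_1}$ is convex):

```python
import numpy as np

# Numerically sanity-check the claim for the 1-strongly convex function
# f(x) = exp(x[0]) + 0.5 * ||x||^2  (alpha = 1; exp(x[0]) is convex).
rng = np.random.default_rng(3)
alpha = 1.0

def f(x):
    return np.exp(x[0]) + 0.5 * alpha * np.dot(x, x)

def grad_f(x):
    g = alpha * x.copy()
    g[0] += np.exp(x[0])
    return g

ok = True
for _ in range(1000):
    x, y = rng.uniform(-2, 2, size=3), rng.uniform(-2, 2, size=3)
    lhs = f(y)
    rhs = f(x) + grad_f(x) @ (y - x) + 0.5 * alpha * np.dot(x - y, x - y)
    ok = ok and lhs >= rhs - 1e-12      # quadratic lower bound holds
print(ok)
```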

Being strongly convex can strongly help us when performing gradient descent, and the stronger, the better. Intuitively, the strong-convexity condition forces the gradient to be very large whenever we are far away from the minimum. This means we can take $\eta$ to be relatively small (so that we don’t overshoot when near the minimizer), and still be able to take large steps when we are far away from the minimizer. With any luck (or, in this case, prior knowledge of the proofs), this will lead to faster convergence, so that fewer steps are needed in the algorithm to achieve the same amount of accuracy.

To see just how useful curvature can be, let’s consider first gradient flow, just like we did for the unstrong case. We start at some $x(0) = x_0 \in \mathbb{R}^n$, and let $x(t)$ evolve according to the differential equation

$\frac{d}{dt}x(t) = -\nabla f(x(t)).$

As before, a good way to analyze this equation is to look at the square distance of $x(t)$ from the true minimizer $x^*$.

$\frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 = \left(\frac{d}{dt}(x(t) - x^*)\right)\cdot \left(x(t) - x^* \right) .$

As before, since $x^*$ is a constant and does not depend on time, the left term in the inner product is just equal to $-\nabla f(x(t))$. So

$\frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 = -\nabla f(x(t)) \cdot \left(x(t) - x^* \right) .$

Now we can use the above claim about strong-convexity to bound the size of the scalar product $\nabla f(x(t)) \cdot \left(x(t) - x^* \right)$; after slight rearranging, we get

$\frac{d}{dt}~\frac{1}{2}\|x(t) - x^*\|^2 \leq -\delta\left(x(t)\right) -\frac{\alpha}{2}\|x(t) -x^*\|^2.$

Since $\delta(x)$ is always positive, we arrive at a very pleasant differential inequality:

$\frac{d}{dt}~\|x(t) - x^*\|^2 \leq -\alpha \|x(t) -x^*\|^2.$

In particular, the quantity $\|x(t) - x^*\|^2$ decays faster than if there were an equality, which corresponds to a decreasing exponential!

$\|x(t) - x^*\|^2 \leq \|x_0 - x^*\|^2 \cdot e^{-\alpha t} \leq R^2 e^{-\alpha t}.$

Gradient flow converges exponentially quickly to the minimizer, which, if you ask me, is not too shabby. And while this only talks about the distance to the optimum, we can always assume that $f$ is $L$-Lipschitz, so being $\varepsilon$-close to the minimizer means being only $\varepsilon L$ close to the actual minimum. Thus

$\delta(x(\tau)) \leq L R e^{-\alpha \tau / 2},$

which is much, much better than the $R^2/\tau$ estimate we had for the non-curvy case.
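We can crudely verify this exponential decay by discretizing the flow with a tiny Euler step (an illustrative sketch; the function, horizon, and constants are toy choices):

```python
import numpy as np

# Euler-discretized gradient flow (tiny step dt) on the 1-strongly convex
# f(x) = 0.5 * ||x||^2, checking the bound ||x(t)-x*||^2 <= R^2 exp(-t).
alpha, dt = 1.0, 1e-3
x = np.array([3.0, 4.0])            # R = 5, minimizer x* = 0
R2 = np.dot(x, x)
t = 0.0
bound_holds = True
for _ in range(5000):               # integrate up to t = 5
    x = x - dt * x                  # dx/dt = -grad f(x) = -x
    t += dt
    bound_holds = bound_holds and np.dot(x, x) <= R2 * np.exp(-alpha * t) + 1e-9
print(bound_holds, np.dot(x, x))
```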

Unfortunately, not all is perfect in the land of the discrete. We already saw earlier that gradient descent, despite its unbearable adorableness, suffers from discretization errors, and in the non-curved case, only got a $1/\sqrt{T}$ $\delta$eficit bound, whereas the cool and continuous gradient flow got a $1/T$ $\delta$eficit. Sure, that’s only a square root difference, which in terms of exponentials doesn’t really count for anything, but so far nothing ensures us that gradient descent with curvature will be as good as gradient flow with curvature. Luckily, as always, the calculations aren’t too difficult.

Theorem: Let $f$ be $\alpha$-strongly convex and $L$-Lipschitz, and let $\varepsilon > 0$. With the proper choice of $\eta_t$ (which will now really depend on $t$), the gradient descent algorithm can find a point $x$ so that the deficit is small, i.e $\delta(x) \leq \varepsilon$, in just $T \leq \frac{2L^2}{\alpha \varepsilon} +1$ steps!

Before I give the proof, a couple of soothing words are in order, since I’m sure that you are outraged: How can it be that the gradient flow improved from an error of $1/T$ to an error of $e^{-T}$ when introducing curvature, but gradient descent only improved from $1/\sqrt{T}$ to $1/T$?! Well, such is life. The not-so-spectacular increase stems from the fact that while having large curvature makes sure that we take (relatively) big steps when we are far away from the minimum, it doesn’t really say anything about what happens when we are close to the minimum. In that case, there is still always a looming fear of overshooting due to discretization, which means that $\eta_t$ needs to be small. This of course doesn’t happen in the continuous gradient flow, which can enjoy the extra-quick velocity unabashed and unabated. Not all is lost for the discrete gradient descent, however, and it turns out that it is possible to get exponentially good convergence, with just one extra assumption. But that’s already peeking into the future. For now, let’s see the proof, which is only a tad more clever than the original gradient descent analysis.

Proof: Way back when we were younger, we found out that

$\nabla f(x_t) \cdot (x_t - x^*) = \frac{\| x^* - x_{t}\|^2 - \| x^* - x_{t+1}\|^2}{2\eta_t} + \frac{\eta_t}{2}\|\nabla f(x_t)\|^2.$

Since $f$ is $L$-Lipschitz, this immediately reduces to

$\nabla f(x_t) \cdot (x_t - x^*) \leq \frac{\| x^* - x_{t}\|^2 - \| x^* - x_{t+1}\|^2}{2\eta_t} + \frac{\eta_t L^2}{2}.$

Now we use the fact that $f$ is $\alpha$-strongly convex: The left-hand side is bounded below by

$\nabla f(x_t) \cdot (x_t - x^*) \geq f(x_t) - f(x^*) + \frac{\alpha}{2}\|x_t -x^*\|^2,$

and so

$f(x_t) - f(x^*) \leq \left(\frac{1}{2\eta_t} - \frac{\alpha}{2} \right) \| x^* - x_{t}\|^2 - \frac{1}{2\eta_t} \| x^* - x_{t+1}\|^2 + \frac{\eta_t L^2}{2}.$

It’s time to choose $\eta$. Unlike the vanilla case, we’ll take $\eta_t$ to depend on time, and set $\eta_t = \frac{2}{\alpha(t+2)}$. Plugging this choice in (and loosening the last term slightly, from $\frac{\eta_t L^2}{2} = \frac{L^2}{\alpha(t+2)}$ to $\frac{L^2}{\alpha(t+1)}$, which will cancel nicely in a moment),

$f(x_t) - f(x^*) \leq \left(\frac{\alpha(t+2)}{4} - \frac{\alpha}{2} \right) \| x^* - x_{t}\|^2 - \frac{\alpha(t+2)}{4} \| x^* - x_{t+1}\|^2 + \frac{L^2}{\alpha (t+1)}.$

$= \frac{\alpha}{4} \left(t \| x^* - x_{t}\|^2 - (t+2) \| x^* - x_{t+1}\|^2\right) + \frac{L^2}{\alpha (t+1)}.$

Multiplying both sides by $t+1$, we are nearly there:

$(t+1)\left(f(x_t) - f(x^*)\right) \leq \frac{\alpha}{4} \left(t(t+1) \| x^* - x_{t}\|^2 - (t+1)(t+2) \| x^* - x_{t+1}\|^2\right) + \frac{L^2}{\alpha}.$

Why are we nearly there? Because we are now in a position to sum over all $t$ from $0$ to $T-1$. In the right-hand side we (miraculously and serendipitously) get a telescopic sum, leaving only

$\frac{\alpha}{4}\left( 0\cdot(0+1)\|x_0 - x^*\|^2 - T(T+1)\|x_T - x^*\|^2 \right) + \frac{L^2T}{\alpha}.$

Luckily, the first term is multiplied by 0, and the second term is negative; we can safely say then that the right-hand side is quite small:

$\mathrm{RHS} \leq \frac{L^2 T}{\alpha}.$

For the left-hand side, we again use convexity: Since $\sum_{t=0}^{T-1}(t+1) = \frac{T(T+1)}{2}$, the weighted sum

$x = \frac{2}{T(T+1)}\sum_{t=0}^{T-1}(t+1)x_t$

is a convex combination of vectors, and so

$f\left( \frac{2}{T(T+1)}\sum_{t=0}^{T-1}(t+1)x_t\right) \leq \frac{2}{T(T+1)}\sum_{t=0}^{T-1}(t+1) f\left( x_t\right).$

Plugging this into our bound finally gives

$f(x) - f(x^*) \leq \frac{L^2 T}{\alpha} \cdot \frac{2}{T(T+1)} = \frac{2L^2}{\alpha (T+1)} \leq \frac{2L^2}{\alpha (T-1)}.$

Setting this $\delta$eficit to be equal to $\varepsilon$ and rearranging finally gives

$T \leq \frac{2L^2}{\alpha \varepsilon} +1$

as promised. □
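As a sketch, here is what the theorem’s algorithm looks like in code, on an illustrative $1$-strongly convex quadratic (the step sizes $\eta_t = \frac{2}{\alpha(t+2)}$ and the $(t+1)$-weighted averaging follow the proof; everything else is a toy choice):

```python
import numpy as np

# Sketch of the theorem's algorithm: gradient descent with time-dependent
# step eta_t = 2 / (alpha * (t+2)) on the 1-strongly convex toy function
# f(x) = 0.5 * ||x - c||^2, returning the (t+1)-weighted average iterate.
alpha = 1.0
c = np.array([1.0, 2.0])            # minimizer of f

def grad_f(x):
    return x - c

def strongly_convex_gd(x0, T):
    x = x0.copy()
    weighted_sum = np.zeros_like(x0)
    total_weight = 0.0
    for t in range(T):
        weighted_sum += (t + 1) * x
        total_weight += t + 1
        eta_t = 2.0 / (alpha * (t + 2))
        x = x - eta_t * grad_f(x)
    return weighted_sum / total_weight

x = strongly_convex_gd(np.array([10.0, -10.0]), T=200)
print(np.dot(x - c, x - c))         # squared distance to the minimizer
```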

Some final, short remarks before we carry on with our descent:

• The keen-eyed reader will notice that there is no dependence on $R$ in this bound! The initial conditions have been forgotten. Technically, this is because we tailored $\eta$ so that the term containing $\|x_0 - x^*\|$ goes away; intuitively, it’s because if we start far away from the minimum then we take a large step.
• There is a dependence on $\alpha$ both in $\eta$ and in $T$. This is not surprising: The larger the $\alpha$, the more curvature the function has, and the smaller we can take $T$. Of course, for $\alpha$ depressingly tiny, perhaps it’s better to use the ordinary gradient descent instead. Please do not plug in $\alpha = 0$.
• Do real problems on the street actually have curvature? They might. Quadratics are abundant both in nature and in Nature. Also, the intrepid intellimancer can always artificially add $\frac{\alpha}{2}\|x\|^2$ to whatever function she originally intended to compute, and hope that minimizers of the modified function are at least marginally related to minimizers of the original function. All is fair in love and optimization.

### Gradient descent with momentum (smooth fella)

Being strongly-convex, i.e having curvature, isn’t enough to guarantee really fast convergence. To avoid overstepping, we must take $\eta_t$ small, which sort of ruins all the fun of taking large steps. One possible solution to this problem (and an interesting property regardless of strong convexity) is to assume that the function $f$ is smooth. This is the formal definition:

A function $f$ is called $\beta$-smooth if $\nabla f$ is $\beta$-Lipschitz. In other words, for all $x,y\in \mathbb{R}^n$,

$\|\nabla f(x) - \nabla f(y) \| \leq \beta \|x - y\|.$

Intuitively, this means that the gradient $\nabla f$ can’t change by too much when performing gradient descent, even if we take a relatively large stepsize. This might let us take larger steps, at least in some situations.

Technically, we’ll need the following consequence of $\beta$-smoothness:

Proposition: If $f$ is a convex, $\beta$-smooth function, then

$f\left(x - \frac{1}{\beta}\nabla f(x)\right) - f(x) \leq -\frac{1}{2\beta}\|\nabla f(x)\|^2.$

In other words, if we take a gradient descent step with stepsize $1/\beta$, then we reduce our value (i.e get closer to the minimum) by at least $\frac{1}{2\beta}\|\nabla f(x)\|^2$. That’s not too bad!
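Before diving into the proof, here is a quick numerical sanity check of the proposition (a sketch, on an illustrative quadratic whose smoothness constant $\beta$ is just the top eigenvalue of its Hessian):

```python
import numpy as np

# Numeric check of the proposition on the beta-smooth convex quadratic
# f(x) = 0.5 * x . A x, where beta is the largest eigenvalue of A.
rng = np.random.default_rng(4)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)             # symmetric positive definite
beta = np.linalg.eigvalsh(A).max()  # Lipschitz constant of the gradient

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

ok = True
for _ in range(1000):
    x = rng.uniform(-5, 5, size=3)
    g = grad_f(x)
    decrease = f(x - g / beta) - f(x)
    ok = ok and decrease <= -np.dot(g, g) / (2 * beta) + 1e-9
print(ok)
```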

Proof: The proof is technical, and in my opinion, not all that interesting, so you can skip it. It goes about as straightforwardly as you think it would: We express the difference of $f$‘s values as an integral over its derivative, and use the smoothness property (invoking Cauchy–Schwarz) to say that this cannot be too large.

We first show that for any two points $x,y \in \mathbb{R}^n$,

$f(x) - f(y) \leq \nabla f(y)(x -y) + \frac{\beta}{2}\|x-y\|^2.$

This may seem at first glance to be quite identical to the inequality we had for $\alpha$-strongly convex functions, but don’t be fooled – the inequalities go the opposite way here (isn’t it amazing that we can extract good stuff from two opposing inequalities?). To prove the above formula, note that

$\frac{d}{dt} f(y+t(x-y)) = \nabla f(y+t(x-y))\cdot(x-y);$

thus, by the fundamental theorem of calculus,

$f(x) - f(y) - \nabla f(y)(x-y)= f(y+1(x-y)) - f(y+0(x-y)) - \nabla f(y)(x-y)$

$= \int_0^1 \nabla f\left(y + t(x-y)\right)\cdot(x-y) dt - \nabla f(y)(x-y)$

$= \int_0^1 \left(\nabla f\left(y + t(x-y)\right) - \nabla f(y) \right) \cdot(x-y) dt.$

By the Cauchy–Schwarz inequality,

$\int_0^1 \left(\nabla f\left(y + t(x-y)\right) - \nabla f(y) \right) \cdot(x-y) dt \leq \int_0^1 \| \nabla f\left(y + t(x-y)\right) - \nabla f(y) \| \cdot \|x-y\| dt,$

and since $f$ is $\beta$-smooth, the first term is also small:

$\| \nabla f\left(y + t(x-y)\right) - \nabla f(y)\| \leq \beta \|t(x-y)\|.$

Combining the two inequalities, we get

$f(x) - f(y) - \nabla f(y)(x-y) \leq \int_0^1 \beta t\|x-y\|^2 \, dt = \frac{\beta }{2}\|x-y\|^2.$

All that remains is to rearrange and plug in the gradient step, using $x -\frac{1}{\beta}\nabla f (x)$ as $x$, and $x$ as $y$ in the above inequality:

$f\left(x - \frac{1}{\beta}\nabla f(x)\right) - f(x) \leq \nabla f(x) \left( x -\frac{1}{\beta}\nabla f(x) - x\right) + \frac{\beta }{2}\| x - \frac{1}{\beta}\nabla f(x) - x\|^2$

$= -\frac{1}{\beta}\|\nabla f(x) \|^2 + \frac{1}{2\beta}\|\nabla f(x) \|^2 = -\frac{1}{2\beta}\|\nabla f(x) \|^2.$

Neat! □

Being $\beta$-smooth is a second order phenomenon – it talks about how quickly the gradient changes, not how quickly the function itself changes. Exploiting this property directly to get a really good time bound is therefore a tad harder than exploiting properties relating to $f$ explicitly; the algorithm and proof we will see below will be the most complicated in this entire post. On the plus side, this will be a great opportunity to introduce a totally new technique to our repertoire.

So far, the gradient descent algorithm was very dumb, and in particular, very memoryless: The new position $x_{t+1}$ depended only on calculations performed at $x_t$, and totally disregarded the possibly rich and exciting history of the trajectory. This short-term approach will not do when trying to utilize the $\beta$-smoothness of a function (well, it won’t work naively, anyway). But it turns out that there is a way to meaningfully combine knowledge about previously taken steps together with the current position, and get some strong results.

Let’s think back to the (epic) limerick that started all of these gradient shenanigans, about water flowing down a slope. It is true that water, when situated on a slope, flows downwards. However, it is also true that water, once it starts flowing, has a hard time stopping, and often splashes and crashes in torrents and waves around rocks, trees, buildings, or whatever else stands in its way. This is because it has momentum. Having momentum allows it to cross and gloss over some of the tiny details and landscape features that would otherwise cause it to detour and take a longer path towards the glorious minimum.

Computer scientists, being awed by nature as they are, have tried to imitate this behavior by running gradient descent algorithms which tend to keep going in the same direction that they moved up till now. Intuitively, if we think in physical terms, instead of having the gradient be the instantaneous velocity of a particle, we can use it as an instantaneous acceleration instead. Here is one such algorithm:

Let

$d_t = \gamma_t(x_t -x_{t-1}),$

where $\gamma_t$ is some time-dependent coefficient. The difference $x_t -x_{t-1}$ is basically the instantaneous velocity of a particle whose position is $x_t$, and if we view $\gamma_t$ as a (time-dependent) mass, then $d_t$ can be seen as the momentum of the particle. With this definition, the step of “gradient descent with momentum” is

$x_{t+1} = x_t+d_t - \frac{1}{\beta} \nabla f(x_t +d_t).$

Theorem: Let $f$ be a $\beta$-smooth convex function, and let $\varepsilon > 0$. With the right choice of $\gamma_t$, the gradient descent with momentum algorithm can find a point $x$ so that the deficit is small, i.e $\delta(x) \leq \varepsilon$, in just $T < \sqrt{2\beta R^2 / \varepsilon} + 1$ steps!

Proof: For notational convenience (and because that’s how it’s written in my notes), denote $\delta_t = \delta(x_t)$ and $g_t = -\frac{1}{\beta}\nabla f(x_t + d_t)$. Under this notation, we just have

$x_{t+1} = x_t + d_t + g_t,$

and

$\delta_{t+1} - \delta_t = f(x_t + d_t + g_t) - f(x_t).$

Since $f$ is $\beta$-smooth, we can use the proposition that we proved above applied to $x_t + d_t$ instead of $x$. The proposition gives

$f\left(x_t +d_t + g_t \right) - f(x_t +d_t) \leq -\frac{1}{2\beta}\|\nabla f(x_t +d_t)\|^2,$

and plugging this into the difference $\delta_{t+1}-\delta_t$ gives

$\delta_{t+1} - \delta_t \leq f(x_t +d_t) -f(x_t) - \frac{1}{2\beta}\|\nabla f(x_t + d_t)\|^2.$

By good ol’, standard, run-of-the-mill convexity, this is bounded by

$\leq \nabla f (x_t +d_t)\cdot d_t -\frac{1}{2\beta}\|\nabla f(x_t + d_t)\|^2 = -\beta g_t \cdot d_t - \frac{\beta }{2}\|g_t\|^2.$

That’s that for the difference between two consecutive $\delta_t$‘s. We can also bound a single $\delta$eficit in a similar fashion, eventually yielding

$\delta_{t+1} = f(x_t+d_t+g_t) - f(x^*) \leq \ldots \leq -\beta g_t \cdot (x_t +d_t -x^*) - \frac{\beta }{2}\|g_t\|^2.$

Now things get a little tricky (but only a little). When we run this momentous algorithm, say for $T$ time steps, what we eventually care about is the final deficit $\delta_{T}$. But the expression above contains lots of $x_t$‘s and $d_t$‘s and whatnots. Our only hope for salvation is if we can somehow get some telescopic sum concerning both $\delta_{t}$ and the differences $\delta_{t+1} - \delta_t$. Since the terms on the right-hand side do not automatically telescope on their own, we’re going to have to help them a bit. For this purpose, let $\lambda_t \geq 1$ be some real numbers that we’ll choose later. Let’s look at the expression $(\lambda_t -1 )(\delta_{t+1} - \delta_t) + \delta_{t+1}$:

$(\lambda_t -1 )(\delta_{t+1} - \delta_t) + \delta_{t+1} \leq (\lambda_t-1)\left(-\beta g_t \cdot d_t -\frac{\beta}{2}\|g_t\|^2\right) - \beta g_t \cdot (x_t +d_t -x^*) - \frac{\beta }{2}\|g_t\|^2$

$= -\frac{\beta}{2}\lambda_t \|g_t\|^2 - \beta g_t \left(\lambda_t d_t + x_t -x^* \right).$

Multiplying both sides by $\lambda_t$ and rearranging the left-hand side a bit, we get

$\lambda_t^2 \delta_{t+1} - \left(\lambda_t^2 - \lambda_t \right)\delta_t \leq -\frac{\beta}{2}\lambda_t^2 \|g_t\|^2 - \beta \lambda_t g_t \left(\lambda_t d_t + x_t -x^* \right).$

Using the Pythagorean fact that $\|a+b\|^2 = \|a\|^2 + \|b\|^2 + 2(a\cdot b)$, the scalar product term in the right-hand side can be written as

$\beta \lambda_t g_t \left(\lambda_t d_t + x_t -x^* \right) = \frac{\beta}{2}\left(\|x_t +\lambda_t d_t +\lambda_t g_t -x^* \|^2 -\|\lambda_t d_t +x_t -x^*\|^2 - \lambda_t ^2 \|g_t\|^2 \right)$

(we took $a = \lambda_t g_t$ and $b = \lambda_t d_t + x_t -x^*$). This has the benefit of cancelling with the norm of $g_t$ when plugging it into the deficit difference, so that

$\lambda_t^2 \delta_{t+1} - \left(\lambda_t^2 - \lambda_t \right)\delta_t \leq -\frac{\beta}{2}\left(\|x_t + \lambda_t d_t +\lambda_t g_t - x^* \|^2 - \|x_t +\lambda_t d_t -x^* \|^2\right).$

We’re getting somewhere! (in fact, we are nearly there). In both the left-hand side and the right-hand side, we have the differences of two terms which are very much similar. We’d be the happiest creatures on Earth if, just like in the analysis of vanilla gradient descent, they could somehow transform into telescopic sums, so that each side would sum up and practically cancel itself out, leaving us with only $\delta_T$ in the left-hand side and some not-too-large expression in the right-hand side. Since we didn’t choose $\lambda_t$ yet, we even have hope for making this happen. But alas, there are two different expressions we need to telescopize, but only one degree of freedom in choosing the variables $\lambda_t$. Even if we fix $\lambda_t$ so that the left-hand side cancels out, won’t the right-hand side be a party-pooper and spoil our cancellations?

The situation seems grim, but do not despair. When all seems lost, that is when the Deus comes out of its Machina. For remember, our notation of $g_t$‘s and $d_t$‘s hides the fact that underneath all the calculations floating about, we still haven’t chosen $\gamma_t$, the mass at step $t$! We actually have two free variables to juggle, and this will give us exactly the extra leeway that we need. We’ll first choose $\lambda_t$ so that the left-hand side telescopizes. When summing two consecutive time steps, say $t$ and $t+1$, the coefficients are

$\lambda_t^2 \delta_{t+1} - (\lambda_t^2 -\lambda_t )\delta_t + \lambda_{t+1}^2 \delta_{t+2} - (\lambda_{t+1}^2 -\lambda_{t+1} )\delta_{t+1}$

$= \lambda_{t+1}^2 \delta_{t+2} + (\lambda_t^2 - \lambda_{t+1}^2 + \lambda_{t+1})\delta_{t+1} - (\ldots)\delta_t$

So in order to cancel out the $\delta_{t+1}$ in the middle, we’ll take $\lambda_{t+1}^2 - \lambda_{t+1} = \lambda_t^2$. This is just a quadratic equation, which we definitely know how to solve:

$\lambda_t = \frac{1 + \sqrt{1+4\lambda_{t-1}^2}}{2}.$

So $\lambda_t$ are increasing and of the order $O(t)$ (we basically increase the value by $1/2$ every time). If we set $\lambda_0 = 0$ (and start counting only from $t=1$ onwards), then we even have $\lambda_1 = 1$, and so our promise that $\lambda_t \geq 1$ holds true.
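As a quick sketch, we can iterate the recursion numerically and watch the linear growth:

```python
# A quick numerical sanity check of the lambda recursion: starting from
# lambda_0 = 0, the values grow roughly linearly, by about 1/2 per step.
lams = [0.0]
for t in range(1, 101):
    prev = lams[-1]
    lams.append((1 + (1 + 4 * prev**2) ** 0.5) / 2)

print(lams[1], lams[100] / 100)     # lambda_1 = 1; the ratio is near 1/2
```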

As for the right-hand side, in order to get a telescope, all we have to do is to make sure that $x_t +\lambda_t d_t + \lambda_t g_t = x_{t+1} + \lambda_{t+1} d_{t+1}$. We already consumed our freedom of $\lambda_t$, but using the fact that $d_{t+1} = \gamma_{t+1}(x_{t+1} - x_t)$, the desired equality reduces to

$\lambda_t (d_t +g_t) = (1+\lambda_{t+1} \gamma_{t+1})(d_t +g_t).$

So all we need to do is pick $\gamma_t$ so that $\lambda_t = 1+\lambda_{t+1} \gamma_{t+1}$, i.e

$\gamma_{t+1} = \frac{\lambda_t - 1}{\lambda_{t+1}}.$

I’ll admit, it might seem a bit counterintuitive to choose $\gamma_t$ based on $\lambda_t$, since “in principle” the weights $\gamma_t$ are chosen first, as part of the algorithm, while the coefficients $\lambda_t$ are just a construct for our analysis of the algorithm. But that’s how we get good weights! (or, at least, good theoretical bounds for this detached analysis. As for the real world, well, real-life people making real-life choices can pick $\gamma_t$ to be something else entirely, if it suits them).

Anyway, having set in stone both $\gamma_t$ and $\lambda_t$, all the complicated equations above reduce to the simple inequality:

$\lambda_t^2 \delta_{t+1} - \lambda_{t-1}^2 \delta_t \leq -\frac{\beta}{2}\left(\|x_{t+1} + \lambda_{t+1} d_{t+1} - x^* \|^2 - \|x_t +\lambda_t d_t -x^* \|^2\right).$

Summing from $t=1$ to $t=T$ and remembering that $\lambda_{0} = 0$ (very suspicious, isn’t it?), we end up with

$\lambda_T^2 \delta_{T+1} \leq \frac{\beta}{2}\|x_1 + \lambda_1 d_1 - x^*\|^2 \leq \frac{\beta}{2}R^2.$

Since $\lambda_T \geq T/2$ (proof by Mathematica), then

$\delta_{T+1} \leq \frac{2\beta R^2}{T^2}.$

Equating this to $\varepsilon$ as in previous proofs gives $T \leq \sqrt{2\beta R^2 / \varepsilon} + 1$ as needed. □
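Putting the pieces together, a sketch of the full algorithm might look like this (the $\lambda_t$ and $\gamma_t$ recursions are the ones chosen in the proof; the $\beta$-smooth quadratic and all constants are illustrative toy choices):

```python
import numpy as np

# A sketch of gradient descent with momentum, using the lambda/gamma
# recursions from the proof, on the beta-smooth convex quadratic
# f(x) = 0.5 * x . A x with A = diag(2, 1/2), so that beta = 2.
A = np.diag([2.0, 0.5])
beta = 2.0

def grad_f(x):
    return A @ x

def momentum_gd(x0, T):
    x_prev, x = x0.copy(), x0.copy()
    lam_prev, lam = 0.0, 1.0             # lambda_0 = 0, lambda_1 = 1
    for _ in range(T):
        gamma = (lam_prev - 1) / lam     # mass gamma_t from the proof
        d = gamma * (x - x_prev)         # momentum term d_t
        x_prev, x = x, x + d - grad_f(x + d) / beta
        lam_prev, lam = lam, (1 + np.sqrt(1 + 4 * lam**2)) / 2
    return x

x = momentum_gd(np.array([5.0, -3.0]), T=100)
print(np.dot(x, x))                      # close to the minimizer x* = 0
```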

That was quite a ride! Of course, this is just the tip of the iceberg, both for $\beta$-smooth functions, and for gradient descent with momentum. The worst-case theoretical analysis often differs from real life, and it turns out you can be pretty smart with choices of time steps. Check out, for example, this impressively cool webpage.

### Gradient descent with both curvature and smoothness

Our descent to madness is nearing its end; there’s only one more rabbit hole I’d like to peek into today. But before that, let’s recap:
• Without any assumptions, gradient descent gets $\varepsilon$ close to the minimum in time $O(1/\varepsilon^2)$.
• For $\alpha$-strongly convex functions, this is improved to $O(1/\varepsilon)$.
• For $\beta$-smooth convex functions, this is improved further to $O(1/\sqrt{\varepsilon})$.
• However, none of this gets close to the mighty exponential accuracy of strongly convex gradient flow.

It turns out that by combining the assumptions, gradient descent can actually be as good as gradient flow!

Theorem: If $f$ is both $\alpha$-strongly convex and $\beta$-smooth, then there is a gradient descent algorithm which gets $\varepsilon$-close to the minimum in time $\approx \log(1/\varepsilon)$.

Proof: You’re on your own here, kiddo 😉 .

### Teaser trailer: Gradient descent with different geometry

If you reached this point, you must have already converted to the gradient descent deity, offering pious gifts for quick, dimension-free optimization. But it is time to reveal our darkest secret: Gradient descent is merely but a false idol of the Pure True Algorithm, an imitation, a shadow! Here are two harsh examples showcasing its imperfectness:

1) Although none of the error bounds we proved above explicitly depend on the dimension $n$ of the space we are exploring, the dimension can sometimes implicitly sneak in through one of the properties of $f$ such as the Lipschitz constant. Even a seemingly innocuous function can carry quite a heavy Lipschitz constant, as this sort-of-real-world example reveals:

The expert advice problem: Imagine that you are a layman who is clueless about what to do in life (aren’t we all?). In front of you are $n$ experts, labeled $1$ to $n$, who are eager to help you and give you advice. At each time step $t$, you hear out what they have to say, then choose one expert $I_t \in \{1, \ldots, n\}$ to follow. After you have made your decision, you can retroactively see which expert gave good advice and which expert gave bad advice, in the form of a vector $\ell_t \in \{0,1\}^n$, where $\ell_t(i) = 1$ if expert number $i$ was correct at time $t$, and $\ell_t(i) = 0$ if expert number $i$ was wrong at time $t$. A strainingly-contrived scenario which this sort of thing can represent is betting on who will be the winner at the Olympic games. At the beginning of the day you pick up the newspaper to read the opinion of renowned sports experts; you then bet with your friends who will win the 100-meter dash; at the end of the day, you either go to sleep satisfied and (moderately) richer, or curse the person who invented physical activities.

Obviously, your goal is to minimize the number of times you followed bad advice. Machine learners like to define the average regret at time $T$, which measures how sorry you are for your choices compared to having followed the advice of any one particular expert throughout your life:

$R_T = \frac{1}{T}\max_i \sum_{t=1}^T \left(\ell_t(i) - \ell_t(I_t) \right).$

Indeed, for any particular $i$, the expression $\ell_t(i) - \ell_t(I_t)$ equals $1$ exactly when the expert you followed (namely $I_t$) gave bad advice while expert number $i$ gave good advice. Just like in regular life, you want to minimize the regret.

The question is, then, how do you choose $I_t$?

A decent descent solution: Since you want to minimize something (the regret), gradient descent may be the way to go! For $x \in \mathbb{R}^n$, define $f_t(x) = -\ell_t \cdot x$. This is a convex function, being just a scalar product between $x$ and a fixed vector. Using the projection methods we described above, we’ll restrict ourselves to the simplex, which is the convex set of all probability distributions on $n$ items:

$K = \{x\in \mathbb{R}^n | \sum_i x_i = 1, x_i \geq 0\}.$
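For concreteness, the Euclidean projection onto the simplex $K$ can be computed exactly with a standard sort-based routine. (This snippet is my own addition — any correct projection onto $K$ would do just as well for what follows.)

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto K = {x : sum(x) = 1, x >= 0},
    via the standard sort-and-threshold method."""
    n = len(v)
    u = np.sort(v)[::-1]                    # sort coordinates descending
    cssv = np.cumsum(u)
    # largest index k (0-based) with u_k * (k+1) > cumsum_k - 1
    rho = np.nonzero(u * np.arange(1, n + 1) > (cssv - 1))[0][-1]
    theta = (cssv[rho] - 1) / (rho + 1.0)   # shift that makes the sum 1
    return np.maximum(v - theta, 0.0)

p = project_simplex(np.array([0.5, 0.2, -0.1]))
print(p)          # a valid probability vector, summing to 1
```

The idea: subtract a common shift $\theta$ from every coordinate and clip at zero, choosing $\theta$ so the result sums to one; sorting tells us which coordinates survive the clipping.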

Even though the function $f_t$ now changes with time, the gradient descent algorithm works just as well (you can check this in the proof way, way back). Starting with $x_0 = (1/n,1/n, \ldots, 1/n)$, we’ll then get a series of points $x_t$ such that

$\frac{1}{T}\sum_{t=1}^T \left(\ell_t \cdot x^* - \ell_t \cdot x_t\right) \leq \frac{RL}{\sqrt{T}},$

where $x^*$ is the minimizer of $-\sum_{t=1}^T \ell_t \cdot x$ in the set $K$. The vector $x^*$ can be viewed as a convex combination of experts – it tells you how much you should listen to each expert, relative to one another. Being the minimizer of $-\sum_{t=1}^T \ell_t \cdot x$, it is actually the maximizer of $\sum_{t=1}^T \ell_t \cdot x$ over $K$; and since a linear function on the simplex attains its maximum at a vertex, $x^*$ may as well be a single expert, so the bound above in particular holds against the best fixed expert.

So how to pick the expert $I_t$? Run the gradient descent, and at each time step, treat the vector $x_t$ as a probability distribution on the $n$ experts and pick one of the experts at random according to this distribution. Our bounds then show that in expectation, the regret is small:

$\mathbb{E}[R_T] \leq \frac{RL}{\sqrt{T}}.$
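The whole scheme fits in a few lines. Here is a sketch with synthetic losses (in this made-up world, expert 0 is right 80% of the time and everyone else 20% — purely my invention for demonstration); the simplex projection is the standard sort-based routine, repeated here so the snippet is self-contained:

```python
import numpy as np

def project_simplex(v):
    # standard sort-and-threshold Euclidean projection onto the simplex
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (cssv - 1))[0][-1]
    theta = (cssv[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

rng = np.random.default_rng(0)
n, T = 10, 2000
eta = 1.0 / np.sqrt(T)            # step size on the order of R/(L*sqrt(T))

x = np.full(n, 1.0 / n)           # x_0: uniform over the experts
total = np.zeros(n)               # cumulative correctness of each expert
followed = 0.0                    # cumulative correctness of our choices
for t in range(T):
    I = rng.choice(n, p=x)                        # sample an expert from x_t
    ell = (rng.random(n) < 0.2).astype(float)     # synthetic correctness vector
    ell[0] = float(rng.random() < 0.8)            # expert 0 is usually right
    followed += ell[I]
    total += ell
    # descent step on f_t(x) = -ell . x, i.e. ascent on ell . x, then project
    x = project_simplex(x + eta * ell)

avg_regret = (total.max() - followed) / T
print(avg_regret)   # small: the weights quickly concentrate on expert 0
```

With these synthetic losses the iterates concentrate on the good expert within a couple of hundred rounds, and the average regret comes out well below the $RL/\sqrt{T}$ bound.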

Everything seems fine and dandy, until we remember that we finally have a concrete problem, and we have to plug in the values of $R$ and $L$. The distance between points $R$ doesn’t bother us: The simplex $K$ is contained in a ball of radius $1$, so we certainly have $R<2$. It’s the Lipschitz constant that aches: Since $f_t$ is linear, we just have $\|\nabla f_t(x) \| = \|\ell_t\|$; in the worst case (the best case, actually), when $\ell_t(i) = 1$ for all $i$, we’ll have $\|\nabla f_t(x)\| = \sqrt{n}$!

This can really, really suck, especially if there is a plethora of experts to choose from (which is true in the sports industry, but also in more specialized problems in machine learning). What’s worse is that there are algorithms for the expert advice problem whose bounds scale only as $\sqrt{\log n}$ in the dimension (for example, the well-known (among its friends) exponential weights algorithm). How has the gradient forsaken us?

2) We stated earlier that all gradient descent needs is a way to calculate the gradient and a way to add vectors, so it can potentially work even when the objects you are minimizing over are infinite-dimensional, like continuous functions. Indeed, addition and gradient evaluation are all that the basic gradient descent update requires:

$x_{t+1} = x_t - \eta \nabla f(x_t).$

But this is a bit misleading: The innards of all our proofs also relied on scalar products, which, while very friendly and nice, are not a general feature of infinite-dimensional spaces. If the vectors / functions / objects live in weird normed spaces, whose norm does not originate from an inner product, then all the precious geometric structure on which our arguments rested is lost. Even worse, in such spaces the gradient of a function is no longer a vector, so the sum $x_t - \eta \nabla f(x_t)$ doesn’t even immediately make sense. (For the mathematically inclined: In general normed spaces, scalar products are replaced by linear functionals, and the space of functionals is not necessarily isomorphic to the original space.)

As alluded to, these nearly-fatal flaws can be fixed, resulting in a very general, very powerful, and (eventually) much deeper algorithm, mirror descent. The concept is actually simple: Rather than conduct all calculations in the original space, define a “mirror map” $\Phi$ which carries you over to a benign, peace-loving, workable space (the dual space), perform all calculations there, and then map back to the original space. So in addition to all the technical tricks and schemes we utilized in ordinary gradient descent (momentum, stochastics, strong convexity…), there is an added layer of complexity in choosing and analyzing the mirror map $\Phi$. The deep ends of this technique can lead to some beautiful mathematics, and also some strong results (at the very least, mirror descent can solve the expert advice problem with a bound of $\sqrt{\log n}$, which is much better than $\sqrt{n}$). But, I think, perhaps this can wait for another post.
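As a tiny teaser of things to come: when the mirror map is the negative entropy, mirror descent on the simplex turns out to be none other than the exponential weights algorithm, whose update is multiplicative rather than additive. (The derivation and the $\sqrt{\log n}$ analysis are for the next post; this sketch, with parameters of my own choosing, is only meant to show how simple the update itself is.)

```python
import numpy as np

def exp_weights_update(x, ell, eta):
    """One step of exponential weights: multiplicatively up-weight the
    experts that were correct (ell(i) = 1), then renormalize. In the
    entropic geometry, renormalizing IS the projection onto the simplex."""
    w = x * np.exp(eta * ell)
    return w / w.sum()

x = np.full(4, 0.25)                       # start uniform over 4 experts
ell = np.array([1.0, 0.0, 0.0, 1.0])       # experts 0 and 3 were correct
x = exp_weights_update(x, ell, eta=0.5)
print(x)   # mass shifts toward experts 0 and 3
```

Compare this with the additive step-then-project of ordinary gradient descent above: same gradient information, completely different geometry.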