New paper on arXiv: A conformal Skorokhod embedding

I’m happy to say that I recently uploaded a paper to the arXiv under the title “A conformal Skorokhod embedding” (https://arxiv.org/abs/1905.00852). In this post, I’d like to explain what the Skorokhod embedding problem is, how I came across it, some previously known solutions, and my own tiny contribution. But really, all of this is just a sorry excuse to post some pretty images which come up along the way. I’ll try to write a fairly self-contained exposition, leaving the sketch of the technical details to the very end.

Anatoliy Skorokhod, looking down upon my work.

Brownian motion and the Skorokhod embedding

Our main player in all of this is Brownian motion. Named after Robert Brown, it is the probabilistic community’s industry-approved gold-standard for diffusions, Markov chains, and random walks. Basically, a Brownian motion path X_t is the position X of a particle as a function of time t , where at each moment the particle’s direction and speed is random and uncorrelated with whatever direction and speed it had before.

If the particle is travelling in one dimension, its path as a function of time might look like this:

In two dimensions, the path might look like this (now it is more difficult to draw the time axis, so different times are represented by color, with blue at the beginning and purple at the end):

One mathematical way of describing Brownian motion is as follows. Suppose that a particle starts at position X_0 = 0 and performs the following simple random walk: Once every second, it jumps 1 unit to the left with probability 0.5, and 1 unit to the right with probability 0.5. Now consider a second particle, which randomly jumps twice per second, but each time it makes a jump of size 1/ \sqrt{2} . We can can similarly imagine a series of particles, with the n -th particle jumping left or right n times per second, and each jump is of size 1/\sqrt{n} . As n\to\infty , the path we see converges to a Brownian motion (sort of).

This random-walk description can also be done in two or three or any number of dimensions. A very nice property of mutli-dimensional Brownian motion is that it is just composed of a bunch of individual, independent one-dimensional Brownian motions – one for each direction.

Mathematicians take quite a fancy to Brownian motion, and for many reasons: Because it is simple (so far as stochastic processes go), because it appears almost everywhere, and because it is the natural scaling limit of random walks, to name a few. But mostly, they like it because they can prove things about it, and the mathematical literature is filled with questions (and sometimes, even answers) about how many times Brownian motions hit a point, or how two such motions intersect each other, or how to integrate, solve differential equations, and make coffee using Brownian motion.

Skorokhod, being a mathematician, was no exception. He was interested in the following question. Suppose you wanted to flip a coin, that is, randomly get either Heads (which we denote by the number “1”) or Tails (which we denote by the number “-1”). You reach into your pocket searching for a coin, but are both amazed and abashed to discover that you only have a Brownian motion X_t on you. Could you somehow simulate a coin flip using X_t ?

The answer is: Yes! Here is Skorokhod’s way of doing it: Since Brownian motion is symmetric, meaning it has equal probability of being at either x or -x at time t , we can wait until |X_t| = 1 , that is, until the motion “hits” either +1 or -1 . By symmetry, there is a 50% chance of getting either result!

Doing something like the above is called stopping the Brownian motion, and the time at which we stop it is called a stopping time. Note that this is a random time, which actually depends on the randomness imbued in the Brownian motion: For one particular run it may be that X_t has magnitude 1 quite quickly, and for other runs it may be that X_t takes a long while to get there.

Skorokhod was interested in a generalization: Suppose that instead of just flipping a coin, you wanted to simulate some other, general, probability distribution. Can you simulate a continuous distribution, such as a Gaussian distribution? How about a discrete one, like a geometric distribution? Or what about nasty, everywhere-non-differentiable yet non-atomic distributions, demons from beyond the depth of Hell for which humankind has no name? (I’m looking at you, Mr. Every-Rational-Number-Has-Positive-Weight-Distribution)

The cumulative distribution functions of Gaussian and Cantor distributions. Not for naught, the latter is called the “Devil’s staircase”

The answer is: Yes! (with one small caveat on the variance) And you can do it by using stopping times, too:

Theorem: Let \mu be a probability distribution with zero mean and finite variance. Then, given Brownian motion X_t , there exists a (random) stopping time T so that X_T distributes like \mu .

This theorem is often called Skorokhod’s embedding theorem (which to me always sounded like a Dungeons and Dragons spell), probably because it embeds some other distribution inside Brownian motion.

Over the years there have been many proofs and generalizations of this theorem. Below are my two favorites (which you can (but shouldn’t) skip if you accidentally just want to see my solution (gosh, I’m blushing!))

Two cool solutions

First cool solution

The first cool solution is by Dubins. We already know when to stop Brownian motion when we want to get the \pm 1 Bernoulli distribution, i.e when we want to get 1 with probability 1/2 and -1 with probability 1/2 : You just run the Brownian motion until |X_t| = 1 . But what would you do if you wanted (say) to get a distribution which attains the values -3,-1,1,3 , each with probability 1/4 ? Well, Dubins has an answer for you!

Dubins says this: First, run the Brownian motion until |X_t| = 2 , and mark the time you stopped as T_1 . By the same reasoning as the \pm 1 coin flip, we will have that X_{T_1} is 2 with probability 1/2 , and -2 with probability 1/2 . Now run the Brownian motion again, from wherever it is that you stopped it, and notice this: If at time T_1 it started at position 2 , then 1 and 3 surround it just like +1 and -1 surround 0 , and we already know how to solve that problem: Just run the motion again until it hits either 1 or 3 , and denote the time it took by T_2 . Similarly, if at time T_1 the motion was at -2 , then stop it once it gets to either -1 or -3 .

Actually, since 1<2<3 and -3<-2<-1 , we can phrase our Skorokhod stopping time as follows: First, let the Brownian motion run until it hits either 2 or -2 . After it does this (and only after it does this), run it again until it hits one of the numbers {-3,-1,1,3} . So the overall stopping time is the sum of two individual stopping times: T = T_1 + T_2 .

This argument generalizes quite naturally to almost any probability distribution. Here I’ll show a continuous example. The following image is the density function of some distribution (a continuous histogram of the distribution, if you will):

The distribution can be divided into two parts: The positive numbers, and the negatives. We can compute the normalized mean of each part, that is, compute \mathbb{E}[X | X > 0] and \mathbb{E}[X | X < 0] . Overlaid on the density function, they lie somewhere like this:

So as a first step, run the Brownian motion until it hits one of the pink lines. For the sake of this example, let’s suppose that you hit the positive side.

Having done that, we now restrict ourselves, and say: Wherever else the Brownian motion will go, we will never return to the negative number range. The negative numbers are dead to us! (because we will make sure to always stop before we reach them). The remaining probability mass can be normalized, giving us a new distribution:

We can repeat this process again. If we hit the left barrier this time,

then we are left with the following:

Repeat at leisure. You can leave it to Dubins to prove that this process converges, and the final stopping time will be equal to infinitely many, shorter and shorter stopping times,

T = \sum_{i=1}^{\infty} T_i.

You can read Dubins’ original paper here.

Second cool solution

The second cool solution is by Root. Root also generalizes the \pm 1 Bernoulli solution, but in a different way. He basically says the following: Let’s look at the graph (t,X_t) . Since we stop the Brownian motion the moment its magnitude becomes 1 , we can think that the two infinite lines ((0,\infty), 1) and ((0,\infty), -1) form a “barrier” to the Brownian motion:

The stopping time is then just first time that the Brownian motion (t,X_t) hits the barrier lines. Root then had the idea that perhaps every distribution can be obtained by looking at the first time that the graph of the Brownian motion hits some (perhaps very contorted) barrier.

And he was right! I won’t go into the details, but the gist of Root’s argument is this. First he showed that such barriers exist for every atomic distribution, that is, every distribution which is supported on a finite number of numbers:

Once he had his finite collection of lines, Root proved that when taking a limit of finite atomic distributions which converge to a general distribution, the barrier converges as well.

You can read Root’s original paper here.

How I stumbled upon the problem

To be frank, a couple of months ago I was just an ordinary bloke, completely unaware that Skorokhod even had a problem with embeddings, not to mention that he solved it. I was working at the time on generalizing an approach used by my advisor, Ronen Eldan.

You see, Eldan, being a mathematician, also loves Brownian motion, and so much so that he uses it to manufacture practically all his other distributions, often with great success. One distribution which repeatedly comes up is the “uniform distribution on the Boolean hypercube”. This is a fancy name for the distribution on vectors of the form \{-1,1\}^n = (\pm 1, \pm 1, \ldots, \pm 1) , where each vector has the same probability, namely, 2^{-n} . Geometrically, each such vector x = (x_1,\ldots, x_n) is actually a vertex in the n -dimensional Boolean hypercube. You can think of the uniform distribution on these vertices as a sequence of n individual and independent coin flips, where each flip is either +1 or -1 with probability 1/2 . Sounds familiar? Sure does.

So, naturally, Eldan generates the n numbers by running n independent Brownian motions (X_t^1,X_t^2,\ldots,X_t^n) ; whenever one of them reaches either 1 or -1 , he stops it. When the last Brownian motion is stopped, then all coordinates together are a uniform point on the hypercube. The nice thing about this construction is that it has a pretty geometric interpretation; like many things in math, you have to draw it to appreciate it.

It all relies on a fact we stated at the beginning, that n -dimensional Brownian motion is actually comprised of n independent, 1 -dimensional Brownian motions. So we may treat Eldan’s n independent Brownian motions as one multi-dimensional motion, living in \mathbb{R}^n . This Brownian motion starts at 0 and wanders about until it hits one of the faces of the hypercube (i.e one of its coordinates is either 1 or -1 ). The moment it does that, it “sticks” to that face, since that particular coordinate doesn’t move anymore. It then continues its journey, confined to a space of one dimension less.

Here’s a representation of this process in three dimensions (so we are going to generate a random vector with three components of either +1 s or -1 s). We start the motion in the three dimensional hypercube (which, in this case, is just the good-ol’ ordinary cube):

After the Brownian motion hits the rightmost face, it is confined to it and continues its journey on it:

Finally, after hitting one the bottom edge, we are left with a one-dimensional Brownian motion, with which we are quite familiar:

Note that unlike before, the Brownian motion doesn’t have to start at 0 at each successive stage. For example, in this particular case, the Brownian motion hit the face of the cube at the point (0.47, 1, 0.32) , so the two-dimensional motion effectively started at (0.47, 0.32) .

Here is the entire trajectory:

The reason that Eldan even deals with all of this high-dimensional geometry stuff, and doesn’t just take the naive perspective of “n -independent tiny-weeny Skorokhod embeddings”, is that he actually takes this simple Brownian motion and adds drift to it, and then looks at geometric properties of that drift and where it makes the motion go. So geometry is crucial here. I wanted to generalize Eldan’s work so that adding this drift term would make sense not only in the discrete Boolean hypercube situation, but for more general distributions, namely: a) any discrete distribution on a finite number of elements; b) uniform / continuous distributions on [-1,1] . I thus needed to find a geometric structure which, when Brownian motion hit it, would give me some particular distribution that I desired. And so, as it turns out, Skorokhod’s embedding problem arose quite naturally (although I didn’t know it by its name when I started).

My solution

Like Root’s solution, my approach also involves looking at some geometric barrier which the Brownian motion is supposed to hit. But while Root barrier inspects when the graph (t,X_t) hits the barrier, my solution instead looks at when a two-dimensional Brownian motion hits a barrier. As mentioned above, two-dimensional motion is made up of two independent one-dimensional motions, (X_t, Y_t) , so we actually have to use another process, Y_t along the way. In other words, my solution requires having additional randomness laying about. This can be seen as a slight negative – we have a natural desire for nice, clean solutions which do not use external resources. But my randomness dealers are going to have to live with it, I guess. On the plus side, my method can give some explicit examples of barriers which Root could not do, using entirely different tools.

There are two main ingredients in my construction. The first is a practical result from probability and statistics, called the “inverse sampling method”, which can be seen as a baby, non-time-dependent version of the Skorokhod embedding problem. Suppose once more that you want to flip a coin (getting the value 1 and -1 both with probability 1/2 ). You reach into your pocket, but are again disappointed when all you come up with is the uniform distribution U on [0,1] . But once again, not all is lost, and you can easily simulate a coin using a random variable U : If U \leq 0.5 , treat it as if you flipped -1 , and if U > 0.5 , treat it as if you flipped 1 . Since U is uniform, both events are equally likely.

As (almost) always in mathematics, this method can be applied to general distributions \mu , not just to silly coin flips. The key is to look at the inverse cumulative distribution. A quick reminder: If X is a random variable distributed according to \mu , i.e X \sim \mu , then it has a cumulative distribution function F_\mu , which describes the probability of being smaller than some number:

F_\mu(x) = \mathbb{P}[x \leq X].

For a coin flip, the cumulative distribution is quite simple: The probability of being strictly less than -1 is always 0 , the probability of being between -1 and 1 is 0.5 , and the probability of being at least 1 is always 1 . The cumulative distribution function looks something like this:

This is an increasing function, so we can take what is called its “generalized inverse”, F^{-1}_\mu , which is obtained by rotating the graph of the function about the diagonal (and ignoring vertical lines stretching out to infinity):

You must be careful with possible discontinuities and infinities, of course, but those are rather technical details which we will not touch.

If you stare long enough at the inverse of the cumulative distribution function of the coin flip, you might notice something peculiar: It attains the value -1 for half of the possible inputs, and the value 1 for half of its possible inputs (we ignore single, isolated points). It essentially encodes the coin-flip-simulation scheme we described earlier! Instead of “looking at when U is smaller than 0.5 or larger than 0.5 ”, we can just plug it in to F^{-1}{\mu} and output the result F^{-1}\mu(U) .

The statement of the inverse sampling method is that this trick works for all distributions \mu , and not just coin flips. (Slightly) more formally, for every distribution X \sim \mu , if U is a uniform random variable on [0,1] , then

X \sim F^{-1}_\mu(U).

The second ingredient in my construction is called conformal invariance, which is a fancy name for “if you rotate something that is rotation invariant, it stays the same”. I’ll explain (but won’t go too much into the details). The plane \mathbb{R}^2 is often identified with the complex plane \mathbb{C} = \mathbb{R}+i\mathbb{R} , so that planar Brownian motion (X_t, Y_t) can be thought of as a complex process X_t + i Y_t . One common thing mathematicians like to do with the complex plane is deform it, i.e apply a transformation f:\mathbb{C} \to \mathbb{C} . The well-behaving transformations are called “conformal”, and they have the peculiar property that locally, all they do to a small square is rotate it and scale it a bit.

From Needham’s fantastic book, “Visual Complex Analysis”

Planar Brownian motion is rotationally invariant – it is equally likely to go in any one direction as the other. So it really doesn’t care if you rotate it at all. And since Brownian motion is self-similar (in that, if you zoom in on it, it looks the same), multiplying the motion itself by a constant is the same as “speeding up or slowing down time”: If you blow up the motion so that points that were very close to the origin are now very far, it’s equivalent to boosting up the speed of the motion, so that in a small amount of time it can zip across the plane.

Summing things up, since conformal transformations locally only scale and rotate, they don’t really do anything to Brownian motion: The rotation is shrugged off, and at most, the scaling can make it go faster or slower at some point. That’s the content of the phrase “Brownian motion is conformally invariant”: If X_t is a Brownian motion and f is a conformal complex function, then f(X_t) is also a Brownian motion Y_{s(t)} , where s(t) is a monotone function of t .


Alright! It’s time to combine the pieces. We start with the observation that if we run Brownian motion until it hits a circle of radius 1 around the origin, it will be uniformly distributed on it. This is a direct consequence of the fact that Brownian motion is rotationally invariant. Thus, letting T be the hitting time of the circle, we have that the angle of X_T is uniform in the interval [0,2\pi] .

The inverse sampling method says that if U is uniform in [0,1] , then F^{-1}{\mu}(U) distributes as \mu . So if we apply a function f_x: \mathbb{C} \to \mathbb{R} which takes a complex number z as input and returns f_x(z) = F^{-1}_{\mu}(\mathrm{angle}(z)/2\pi) for z on the unit circle, we would have that

f_x(B_T) \sim \mu,

which is the distribution we want to get!

However, these are not the rules of the game. We can’t just apply any function we want to the point after we have stopped – the whole point of Skorokhod’s theorem is to find a time T so that no further modifications to the result will be needed. We want X_T itself to distribute as \mu , not some function of it.

Here steps in the conformal invariance. Instead of applying just the function f_x (and getting a real number), we’ll try to find a complex-valued conformal function f(z) = f_x(z) + i f_y(z) . The function f should be defined on all of the unit disc D = \{z \in \mathbb{C} \mid |z| < 1\} , and satisfy that on the unit circle itself, the real part of the complex number that it returns is equal to f_x(z) . The imaginary part, f_y(z) , is basically just there in order to make the function conformal. If we apply this new f to X_T , we’ll get that the real part (i.e, the x component) of f(X_T) distributes as \mu by construction. The imaginary part (i.e the y component) does not, but we don’t really care about it. In general, such an f will map the unit disc D into some other, perhaps complicated shape:

Now we are in good shape! Since f is conformal, by conformal invariance f(X_t) is also a Brownian motion! Hence, instead of looking at when the original X_t hits the circle, we can look at when the new motion Y_{s(t)} = f(X_t) hits the new boundary of f(D) . In this way, we circumvent the fact that we cannot modify the result after we stopped the motion – if we find a function f so that applying it does not change the law of the motion, then looking at its image is the same as looking at the original motion.

That’s basically all there is to my plan of attack: Find a conformal f so that the real part of f(X_T) distributes as \mu . This amounts to finding a conformal f which maps the unit circle to F^{-1}_\mu in some way. Showing that such an f exists is not hard, but does require a tiny bit of work. It’s not very pretty work, but luckily, the images that result from the conformal motions are. Here are examples of the domains for the two cumulative distribution functions we saw way back in the beginning:

Gaussian and Cantor domains. Vertical lines in the Cantor domain should actually stretch to infinity, but your screen is too small to show this.

I won’t go deep into the details of how to obtain the desired conformal function f ; that’s what the paper is for. But here is a rough sketch:

Instead of working with F^{-1}{\mu} directly, I work with a normalized, symmetrized version of it, which in the paper I denote as \varphi(x) = F^{-1}{\mu}(|x|/ \pi) . This is an even function defined on the interval (-\pi, \pi) . Like pretty much all functions you’ll encounter in the wild, it can be decomposed into a Fourier series, and because it’s even, it contains only cosines:

\varphi(x) = \sum_{n=0}^{\infty} \widehat{\varphi}(n) \cos nx.

From the Fourier coefficients \widehat{\varphi}(n) we build our conformal map:

f(z) = \sum_{n=0}^{\infty} \widehat{\varphi}(n) z^n.

This construction takes advantage of the fact that the Fourier series and the power series agree on the unit circle. Indeed, for z = e^{i\theta} on the unit circle, the real part of f satisfies

\mathrm{Real}f(z) = \mathrm{Real} \sum_{n=0}^{\infty} \widehat{\varphi}(n) e^{i n \theta}

= \sum_{n=0}^{\infty} \widehat{\varphi}(n) \cos n \theta

= \varphi(\theta).

So if the angle \theta is distributed uniformly, we indeed get that the real part of f(e^{i\theta}) distributes as \mu (notice that \varphi is not an exact inverse of F_\mu – it actually gives each point “twice the probability”, because of the absolute value we used when defining it, but since this factor of two is the same for all points, it cancels out). A bit of work (with complex analysis) is then required to show that f satisfies all the exact properties we need.


Leave a comment