I’m happy to say that my advisor Ronen Eldan and I somewhat recently uploaded a paper to the arXiv under the title “Concentration on the Boolean hypercube via pathwise stochastic analysis” (https://arxiv.org/abs/1909.12067), wherein we prove inequalities on the Boolean hypercube using a cool continuous-time random process.
In the previous post, I pretended that I had the proper qualifications and gave a brief (if not brave) introduction to classical Boolean analysis. This post will be a lot more stochastic, and will explain the main ideas behind our paper.For your stochastic convenience, here is a small table of contents.
Probability distributions on the hypercube play an important role in Boolean analysis: Many of the properties of Boolean functions are expressed using probabilities and expectations. For example, the influence of coordinate of a function , which is a measure of how important the -th bit is in determining the value of , is defined by
where is uniformly random point on the hypercube. Similarly, the variance of a function is given by
where the symbol indicates expectation over a uniformly random . A lot of theorems can then be thought of probabilistically: The KKL inequality, for example, which says that
certainly has its fair share of uniformly random vectors running around.
When analyzing these types of inequalities, you can approach this use of randomness from several directions. The first, and simplest, is to just bluntly say “Ok, I have a uniformly random vector , and that’s that!” and see what you can learn. After all, generating a uniformly random vector on the hypercube is easy: Just flip independent coins. But as it turns out, sometimes you can learn a lot about the expectations and random quantities you seek to bound, by thinking of randomness as a process which evolves over time, with the uniform measure as the final goal.
For example, instead of flipping all coins at once, you can think of a discrete process where you flip one coin at a time, effectively “revealing” the bits of the uniform vector one by one. At time , you reveal bit , and see what you can learn just from revealing this particular bit. Perhaps you can prove that some quantity can only increase when you reveal a new bit, or that in expectation it increases. Perhaps you can learn something about the individual influences, since now you are in a better position to talk about the effect of individual bits. After steps of doing this, you’ll have the complete, uniformly random vector in your hand, and so you’ll be back to the case described in the previous paragraph, but perhaps you’ll be a little smarter.
This kind of approach is already quite old, and has put much bread on mathematicians’ tables, including my advisor’s table (and also, evidently, mine). Our new paper uses the same basic idea – to generate “more and more randomness” until reaching the final distribution – to learn something about Boolean functions. However, the way we generate this randomness is a bit more sophisticated than simply revealing one coin flip at a time. Whereas the “revealing” process is discrete – it always takes steps to finish – our method uses a continuous time stochastic process, one that depends on a real-valued time parameter , to create the randomness. This random process, which we denote from here till the end of time, is in general a vector in the continuous hypercube . At time , it will always be equal to , and at time , it will always be uniformly distributed on the corners, i.e on the discrete hypercube . And at times in the middle? Well, at those times it will do all sorts of wacky, but analyzable, things. The main idea is that by inspecting the individual paths that the process takes, we will be able to say, “hey! With high probability, somewhere in the middle, the influences are larger than the variance!”, or similar statements about other quantities which relate to . Knowing that something happens at an intermediate time will let us say something meaningful about the same quantities at the end, when and when the process just distributes uniformly; but those are precisely the quantities that theorems like KKL like to bound.
Jumping in the cubeSo how do you create a process that’s always contained in the continuous hypercube , at time it is , and at time it is uniformly distributed on the discrete hypercube ? First, to make life easier, we’ll search for one where the individual coordinates are independent and identically distributed; in other words we’ll set , where the s are each independent copies of some one dimensional process . In this case, the problem reduces to finding a such an , so that , and at time , is just a coin flip:
But again, what happens in between? Generally, we’re allowed to do anything we want in between: We could make just sit there at doing nothing until time , and then spontaneously jump to ; we could make it move about chaotically all over the interval , eventually stitching it so that it reaches its final destination in time. But none of these options are any good for us, if we wish to analyze the behavior of at intermediate times: In the first case, there is nothing at all to learn since the process does nothing, while in the latter case, there is nothing at all to learn since the process does too much. As in all things in life, we must balance randomness and order.
Here is a good way to do this. First, we prevent from whizzing about the interval, by forcing its magnitude to increase; specifically, we demand that for all . This means that “chooses a direction”, either or , and then walks a short bout in that direction. However, it does not mean that it is deterministic: At any given time, could decide to change its mind, and reverse its position, i.e jump from to , or from to . In this way, we remove all randomness from the magnitude ; all the randomness of is then only found in its sign, and when it decides to flip it. There are still infinitely many ways to choose this randomness, of course, but there is one particular way which really, really helps analysis: Forcing to be a martingale.
In general, a martingale is a random process which, on average, doesn’t move. Very informally: If you know exactly where is at some time , then the expected position at all times is still . A bit more formally, we write this as
The theory of martingales is a wide, strong pillar of probability theory, and many problems in life can be solved by “finding the right martingale”. This is because there are many theorems which describe how wildly martingales can fluctuate. If we can manage for to be a martingale, we’d be in a very good position for proving things about it later on.
Of course, I wouldn’t be telling you this if it weren’t possible. It turns out that you can indeed make into a martingale, and in fact, since we demand that , there is only one way to do this!
Theorem: There is a unique stochastic process such that
- for all .
- is a martingale.
The process is unique, because these two conditions tell us exactly what the probability is to jump at any given time. Suppose that is positive at some particular time ; the first item then tells us that . At time , either keeps the same sign, so that , or it flips its sign, so that . If is the probability that the sign changes, then the expected value at time given that we know that , is
But is a martingale, so , which gives
Solving for , the probability of a jump is
So the probability of jumping in a small interval of size is pretty much , and, informally, if we send , we may think that the “rate of jumps” at time is , or better yet, the “density” of jumps is : The probability of jumping at any single particular point is 0, but for every finite interval we have some probability of having a jump. This is very much like what happens for continuous random variables: For example, if is a uniform random variable in , then the probability that is for any given , but has a density, and certainly has a positive probability to appear in any subinterval of . Also, do not be alarmed by the fact that this rate goes to as ; all it means is that in intervals of the form , the process jumps infinitely many times. This is perfectly fine, thank you very much, and is happy to do so. In any case, if you look at intervals of the form , in those intervals the process jumps only finitely many times.
Here are some examples of how the different realizations of might look like. Of course is a random process, and jumps at random times, so I had to use a random-number generator to draw them.
And, just to showcase some pretty colors, here are some examples of the -dimensional process , for : Each color represents a different sample path, and each path is just a collection of three independent copies of .
For the connoisseur: If you know anything about fish, then you might smell a Poisson point process in the above construction. And you’d be on the nose! The process is essentially a Poisson process with jump rate .
You can see that this method of constructing randomness is quite different than revealing coin flips one at a time. If we look at at time , we never see a complete “uniform point on the discrete hypercube”. Rather, we get some point which “hasn’t quite decided” what it’s going to be when it grows up. If a coordinate is equal to , then it is currently “in favor” of eventually evolving to the value . However, there is always a chance that it will flip its value to , in which case it will be in favor of eventually evolving to the value . The larger the value , the more certain each bit is in what its going to be: If we peek at time , there is equal probability that and that . However, if we know that at time , is positive, then with high probability it won’t jump in the remaining seconds, and will continue in its current course until . The analysis we perform in our paper utilizes this fact by looking at the value of on these “haven’t quite decided yet” points and saying something about how behaves when the points reach their final destination. (“Ah!” you say, “but is a Boolean function and expects points on the discrete hypercube, not some hesitant scoundrels”. Read on to see how to get around this!)
Wielding the process
Now that we have a randomness-generating process , the first thing we have to do is connect it somehow to the properties of that we want to bound. To do that, we refer to the ancient mathematics proverb, “Have you tried using Fourier yet?”. Every function can be decomposed into a sum of monomials, known as the Fourier representation:
The basic magic of this formula says that for functions mapping to , even though the Fourier coefficients are generally fractions, whenever you plug in a point , you will always get either or . But this formula is also a great way to interpolate the value of for points in the continuous hypercube: You just plug in the values of into the monomials as usual, and that’s it. In this way, we can compose with the random jumping process described above, giving us a new, random jumping process, which we denote :
Here are some examples of how this behaves for the majority function; each color is a different sample path:
Since , we always have
whereas at time , we always have . Of course, since is uniform on the discrete hypercube, we can also extract the expectation of by
and similarly for the variance:
It is important to distinguish between two different types of expectations here. In Boolean analysis, when we write , we usually mean “sum over all points in the discrete hypercube and divide by “. This is just an expectation over the uniform distribution on the hypercube. But whenever we involve , we actually use an infinite amount of randomness under the hood: There are infinitely many possible paths that can take, and there are always paths with as many jumps as we could possibly want. So expectations of the type actually go over all the infinitely many paths and scattered jumps that can go through, and average those. Luckily for us, we just happened to choose so that is uniform anyway, so all the complexities in the paths are smoothed out when we take ordinary expectations. The real fun will start once we actually look at specific properties of specific paths.
One cool thing about the process is that it can be seen as a sort of “fine grained” noise. Recall that for , the noise operator is defined by
where is -correlated with . We saw in the previous post (or, perhaps, you saw in kindergarten) that the Fourier coefficients of are related to those of by a power of :
If we consider as a “not-quite-sure-where-I’m-gonna-be” point on the hypercube, it too leads to a noising effect: For if we stop to look at at some intermediate time , then is correlated with the signs of at time . In other words, if we know that , then it has a higher probability of reaching than , so is correlated with . The opposite is true if we know that , so taking expectations of the form
gives us exactly the definition of , where , and ! (ok, showing that requires a short calculation, but one that even the faint-of-heart will like). Each path of is like choosing a specific which is -correlated to some , and taking expectations is akin to looking at all -correlated vectors of all possible points , i.e the expectations over the noise operator. So using , we can “simulate” -correlation, and thus replicate hypercontractivity results – just look at the norm and compare it to . I guess some of us would consider it very reassuring that, after 5 pages of definitions and calculations, we finally see that in using we haven’t lost any power over the conventional, hypercontractivity methods. Very reassuring.
But in fact, can help us for more than just noise-operating. Recall that is a martingale, i.e for all ,
Since its entries are independent, and since is just a sum of monomials of , it turns out that is also a martingale:
This is a Good Thing, since for all their randomness, martingales are relatively well behaved creatures. For example, for every path of , we can define something called the quadratic variation, denoted , which in our case is the sum of squares of jumps of until time :
This too is a random process, which also increases in jumps – it increases whenever makes a jump. If the sample path of is extremely benign, and has almost no jumps at all, then the quadratic variation will always be small. But if is a wild beast and leaps all over the place, then will eventually reach great heights. But the shtick is that for martingales, the variance of is always equal to the expected value of the quadratic variation at time :
This is a wonderful theorem. Since the Boolean-function-wise variance is just equal to the stochastic-process-wise variance , this gives us an alternate description of the variance:
But what are the jumps of ? The process can only jump if jumps, and only jumps in one coordinate at a time (I mean, come on – what are the odds that two coordinates jump at exactly simultaneously?). And when coordinate jumps, the value of changes by (it goes from to ). All the other values stay the same, and so the size of the jump is
Plugging this in, the variance of is bounded by
This should smell oddly familiar to all of us (you included) – the field of Boolean analysis is littered with equations comparing the variance to the influences (after all, this is the essence of inequalities like Poincaré and KKL), and the influences are in general given by . So all the components we know and love are here, but in a more continuous form, and involving this weird summation over “jump times” thing.
Luckily, we can get rid of this awkward jump set. The expectation in the expression means that we go over all possible sample-paths of , ask “Please tell me all the points where this particular path jumped”, and then take the squares of the jumps. But we could perform this sum another way: Instead of going over sample paths, we could go over all points , and ask “Please give me all sample-paths which jumped at this point”, and then sum up those jumps. This can be done since we know the density of jumps at a given point – it’s just ! – and we can integrate over this rate. Eventually we get
If you’re an integral-type person, this already must be much nicer for you than that awful sum over jumps. But even if you are not prejudiced towards the continuous, this form certainly has its merits – now we can use all sorts of real-valued analytic tools to bound the right hand side.
Let’s prove something!
The fiery-hearted among you must already be wriggling in their chairs, demanding to see what all this is good for. Let’s prove something!
Boolean analysts are humble; all a Booleanist wants to do in life is to prove inequalities between different properties of Boolean functions. Perhaps the most famous inequality is the KKL inequality, proven by Kahn, Kalai and Linial in 1988:
Theorem (KKL): Let be a Boolean function. Then
Here is some fixed constant, and we don’t really care about its value; in fact, throughout the rest of this post, will represent some universal constant whose value can change from line to line (and even in the same line!), which will always be large enough or small enough to suit whatever needs we have at the moment. They sure don’t make constants like they used to.
The KKL inequality is tight for a function called “Tribes”: This is a balanced function (so ) where every influence is of order . The right hand side of KKL is then
which is indeed times the variance. However, there are other functions for which the two sides of the KKL are very far apart. For example, the majority function, where the influences are all of order . In that case, the right hand side is of order , while the left hand side is still !
Luckily, there are other inequalities in the world, and as every analyst soon enough learns, many of them were discovered by mathematician Michel Talagrand. In a paper from 1993, he showed the following.
For a point , let be the number of neighbors of in the hypercube on which takes a different value (in other words, , where is the same as but with the -th bit flipped).
Theorem: For every ,
I won’t get into the “why did Talagrand even look at ” here. Talagrand’s inequality is incomparable with the KKL inequality: It’s not tight for the Tribes function, but it is tight for majority (this is an exercise for the enthusiastic reader). I’m not getting into the details of all this because what’s important to us is that Talagrand conjectured that an even stronger (and even more weird-looking) version of it holds, one that implies the KKL inequality when the variance is constant:
Conjecture: For every ,
This conjecture is stronger than the original theorem whenever the sum of the squared influences goes to 0; in this case, the can become quite large, and so it’s more marvelous that still manages to be larger than the right-hand side.
As luck would have it, by inspecting the behavior of the process for times , my advisor and I were able to show that Talagrand’s conjecture holds true. Let’s see how (in general strokes).
The first thing to do when presented with a new inequality is to see how the various quantities can expressed using the process . In this case the interesting suspect is , whose definition is
For any given , we get a contribution from coordinate only if ; and since in this case, we can write
Note that the norm here is just the Euclidean -norm of the -dimensional vector . We actually want to bound the expectation of , though:
We already know how to relate expectations of quantities to , and so we define a new process, which we call :
Our mission, should we choose to accept it, is to bound from below!
To do this, we use the fact that is a submartingale. Perhaps a bit confusingly, a submartingale is a process such that for all ,
(You’d think that the “sub” in the “submartingale” means that it’s always smaller than a martingale, but it’s the other way around. Then, when you’re lecturing and want to define it, you get confused if it’s the other way around from the other way around, or if it’s the other way around from the original way around, generally resulting in all-around confusion. But it always clears out, in the end).
To show that is a submartingale, we use use fact that is a martingale (and this is no different from showing that is a martingale, so if you were alright with that, you’ll be alright with this). Then we notice that is a convex function, and so we can apply Jensen’s inequality.
The submartingale property is very useful: If we ever find an explicit time so that is large, then we know that in conditional expectation, will also be large! And if it so happens that with high probability there exists an explicit time so that is large, then we can condition on this event, and show that the overall expectation will be large. More formally, we define a random stopping time:
In words, for every sample path of , we let be the first time that was larger than a constant times the horrendous square root. If for that particular sample path there was no such time, then we just set to be . This is a random quantity: For every sample path it will be different. So if we can show that with probability at least the event occurs, i.e there was a time such that is large, we’d totally win: For then,
and we’d be done!
Note that this technique makes use of the trick that I’ve alluded to throughout this entire post – that using , you can analyze processes like or for times , and in fact look at properties of individual paths in order to say something meaningful about what happens at time . Using the relation of to -correlation, noising and hypercontractivity, one way of looking at this technique is as “different-time noising”. That is, we do not pick just one single and apply it to , but rather, we pick a distribution of ‘s, one for each sample path of , which comes from . The fact that is random effectively means that for every path we apply a different quantity of noise.
I won’t go into too much detail about how to prove the statement that , since the technical details take several pages in the paper, and this post is long enough as it is. Suffice it to say, that the proof also uses path-wise analysis, inspecting how individual paths behave. Very briefly, the argument goes as follows:
- Recall that using the quadratic variation, we showed that
Using similar methods, it’s possible to show that the expected value of the quadratic variation of only those paths which always have small gradient is also equal to a similar integral. This integral is then bounded using various analytic techniques. Eventually, this gives:
- On the other hand, by inspecting how likely large jumps in individual paths are, we show that the quadratic variation has a high probability of being large, regardless of whether or not the gradient was small all the time:
Combining these two facts shows that there cannot be too large a probability for getting paths where the gradient is always small; there is always a non-negligible probability of getting a sample path with a large gradient hiding somewhere. This effectively says that with large probability (here “large probability” is ““).