Catastrophic cubic crash course

I’m happy to say that my advisor Ronen Eldan and I somewhat recently uploaded a paper to the arXiv under the title “Concentration on the Boolean hypercube via pathwise stochastic analysis” (https://arxiv.org/abs/1909.12067), wherein we prove inequalities on the Boolean hypercube using a cool continuous-time random process.

That’s quite a mouthful, I know, and quite unfortunately, Boolean analysis is not a standard part of any math curriculum that I know of, be it at the undergraduate, high school, or kindergarten level. In this post I explain a bit about what Boolean functions are and what sort of games you can play with them; a sort of catastrophic cubic crash course, if you will. The next post will talk about our actual result.

For your Boolean convenience, here is a small table of contents.

George Boole. Luckily for him, he would never dream of our stochastic atrocities.

Boolean functions

A Boolean function is a function which takes as input a vector of $n$ bits $x = \left(x_1, x_2, \ldots, x_n\right) \in \{-1,1\}^n$ and outputs a single bit. That is, it’s just a function $f : \{-1,1\}^n \to \{-1,1\}$ .

That’s it! Being so simple, Boolean functions can represent many things from many different disciplines, and it’s no wonder that they appear everywhere you look (though, as they say, what you see depends on what you are looking for, and this is especially true for mathematicians). To name a few examples:

Social choice: A single bit $x_i \in \{-1,1\}$ can represent the opinion of a voting-age citizen concerning some political matter, e.g whether or not Earth should break from the Intergalactic Federation (Earthxit). The vector $x \in \{-1,1\}^n$ then represents the opinion of all $n$ citizens, and a Boolean function $f(x)$ represents some way of deciding how to act based on those $n$ opinions. There are many ways to do this, as can be seen by the diversity of different voting schemes in the different countries of the world. For most of you, I suppose, the most natural way to aggregate the opinions is via the majority function, which outputs $1$ if $\sum_i x_i > 0$ and outputs $-1$ if $\sum_i x_i < 0$ (i.e outputs $1$ if the majority of the inputs are $1$ and vice versa). For others, perhaps a more natural option is the dictator function, which just outputs $x_j$ for some index $j$ , corresponding to the idea that only the opinion of person $j$ matters.
Algorithms: A very large portion of theoretical computer science and algorithms deals with concrete questions of the form “given some data structure, how do I know if it satisfies some property?”. For example, given a graph on $N$ vertices, does it contain a Hamiltonian path? Is it a tree? Does it contain a clique of size $\log N$ ? Is it 4 colorable? Is it isomorphic to some other graph? Is it planar? And many more. Now, a graph on $N$ vertices has a maximum of $n = {N \choose 2}$ potential edges, and it is completely described by stating which of those edges appear, and which don’t. So a graph can be identified with a vector $x \in \{-1,1\}^n$ where each bit $x_i$ corresponds to some possible edge: $x_i = 1$ if the edge is in the graph, and $x_i = -1$ if it is not. Thus, each “graph property” you can think of gives rise to a Boolean function.

For the connoisseur: Of course, the same is true not only for graphs; any Turing machine (which halts) gives a function on $n$ bits. But I’d like to stay away from the stratosphere, since if we keep things too abstract we lose sight of the road and of practical theorems. Graph theory and graph algorithm-related questions, as history has shown, provide a good supply of Boolean functions to study.
Does this graph contain a clique of size 10?
Percolation: Similarly related to the idea of “graph properties” but totally deserving an entry of its own is the percolation function. In this example, you start with a square which is divided into hexagons. Each hexagon is colored either blue or yellow, and so can be described by a single bit $x_i$ . The percolation function returns $1$ if there is a path of yellow hexagons from the left edge of the square to the right edge, and $-1$ otherwise. The quest to understand percolation has driven forward much of the theory of Boolean functions and their properties, as is explained in the aptly named informative book “Noise sensitivity of Boolean functions and percolation”.
Is there a left-to-right crossing of yellow hexagons?
Combinatorics: A favorite pastime of mathematical combinators is to take an $n$ -element set $S = \{1,2,\ldots,n\}$ , and ask all sorts of questions about how its different subsets interact with each other, such as “what is the largest possible family of subsets of $S$ so that no set contains the other?“. As it turns out, a family $\mathcal{F}$ of subsets of $S$ can actually be viewed as a Boolean function, in the following way. Each $x \in \{-1,1\}^n$ represents a subset of $S$ , with $x_i = 1$ if element number $i$ is in the subset and $x_i = -1$ otherwise. A family of subsets $\mathcal{F} \subseteq 2^S$ then gives rise to a Boolean function $f_\mathcal{F}$ , with $f_\mathcal{F}(x) = 1$ if the subset represented by $x$ is in $\mathcal{F}$ .
And many more: There are a myriad other examples and uses. In fact, Boolean functions have already had a spotlight in this very blog, when I investigated random walks on the hypercube. Indeed, there are just too many to list. But that shouldn’t stop you from trying; as the saying goes, “The world is your oyster, and Boolean functions are the pearl”.

Inequalities

Mathematicians, being what they are, usually don’t care about any particular Boolean function and how to compute it, but rather are interested in general properties of Boolean functions, and how different properties relate to each other. And mathematicians, being what they are, have devised a fairly long list of properties and inequalities between them.

One of the central quantities of a Boolean function is its influence, which is a measure of how important each individual bit $x_i$ is to determining the value of the function $f(x)$ . The definition is this:

Definition: Pick $x$ uniformly at random among all $2^n$ possible vectors. The influence of bit number $i$ on function $f$ , denoted $\mathrm{Inf}_i(f)$ , is the probability that $f(x)$ changes its value when you flip the $i$ -th bit:

$\mathrm{Inf}_i (f) = \mathbb{P}\left[f(x_1,x_2,\ldots,x_i,\ldots,x_n) \neq f(x_1,x_2,\ldots,-x_i,\ldots,x_n)\right].$

The dictator function $f(x) = x_j$ shows well the purpose of this definition. For any input, flipping the $j$ -th bit always changes the output, so the influence of the dictating bit $j$ is $1$ , while flipping any other bit does nothing whatsoever to the output, so the influence of the rest of the puny bits is $0$ . Figuring out the influence of different bits for different functions is a fun game for the whole family. For some functions (such as majority), it is relatively easy; for others (such as percolation), you’re going to have to sweat a little.

The definition of influence uses elementary probability, and this is very common in the analysis of Boolean functions. For example, when faced with a new Boolean function, one of the very first questions you will ask is “is it balanced?”, meaning, does it output $1$ as often as it outputs $-1$ ? You can quantify the bias of a function as

$\mathrm{Bias}(f) = \sum_{x\in \{-1,1\}^n} f(x).$

This will be $0$ if $f$ outputs $1$ as often as it outputs $-1$ , but nonzero otherwise. Since there are $2^n$ possible values of $x$ to go through, the sum can potentially be as large as $2^n$ , and to be able to compare functions for different $n$ values, we usually use a normalized bias, whose range is in $\left[-1,1\right]$ :

$\mathrm{Bias}(f) = \frac{1}{2^n}\sum_{x\in \{-1,1\}^n} f(x).$

Ah, but now this already has a probabilistic flavor, since $1/2^n$ is exactly the probability of picking any particular vector $x \in \{-1,1\}^n$ uniformly at random. The normalized bias can then be written as an expectation:

$\mathrm{Bias} = \mathbb{E}_{x \sim \text{Uniform}\left(\{-1,1\}^n\right)}f(x).$

This sort of thing is so popular that we will often just write $\mathbb{E}f$ and be done with it. The probabilistic notation conveniently leads to the “variance” of a function, which is defined exactly like in probability:

$\mathrm{Var}(f) = \mathbb{E}(f - \mathbb{E}f)^2.$

Since $f(x) = \pm 1$ , a function which is balanced (i.e has $\mathbb{E}f = 0$ ) will always have variance $1$ . However, the more biased a function is towards either $1$ or $-1$ , the smaller its variance.

Having defined two different quantities, the first thing that a mathematician does is try to compare them. Here is a theorem, called the “Poincaré inequality”, which is true for all Boolean functions:

Theorem:

$\mathrm{Var}(f) \leq \sum_{i=1}^n \mathrm{Inf}_i(f).$

Amazing, isn’t it? Poincaré’s inequality is just the first among many inequalities which try to relate between the variance of a function and its influences. Among the Booleaners, it’s actually considered a rather weak inequality, and is not tight for any interesting function, really. For example, if you did your homework with your family above, then you probably discovered that for the majority function, every bit has influence on the order of $\sim 1/\sqrt{n}$ . This means that the term on the right hand side is of size about $\sqrt{n}$ , while the term on the left hand side is a mere puny $1$ ! That’s a lot of slack.

A much more powerful (indeed, revolutionary) inequality is the one proven by Kahn, Kalai and Linial in 1988.

Theorem (KKL): Let $\delta$ be the largest influence of $f$ , i.e $\delta = \max_i \mathrm{Inf}_i(f)$ . Then

$\mathrm{Var}(f) \leq \frac{C}{\log (1/\delta)} \sum_{i=1}^{n}\mathrm{Inf}_i(f).$

Here $C$ is some fixed constant, and we don’t really care about its value.

We are usually interested in the behavior of these types of theorems when the number of bits $n$ goes to infinity (otherwise, there is only a finite number of functions, and so only a finite number of different values that $\mathrm{Var}(f)$ and $\mathrm{Inf}_i(f)$ can get, and we can always pick $C$ to be large enough so that the theorem trivially holds). And when the number of bits goes to infinity, we also want the maximum influence $\delta$ to go to $0$ , for quite the same reason. But as $\delta \to 0$ , the factor $1/\log(1/\delta)$ becomes very small! so for all interesting cases, the quantity in the right hand side of the KKL inequality is smaller than that of the right hand side of the Poincaré inequality; the KKL theorem is indeed the stronger of the two. It is still not tight for the majority function, but there are other interesting functions for which it is tight.

Fourier for all

The KKL inequality is the industry-standard, go-to inequality for Boolean functions; efficient, dependable, and quite importantly, analysable. Its most common proof showcases the mathematical gizmos and gadgets used in Boolean analysis. Let’s take a peek at them.

At the core of everything Boolean lies the Fourier decomposition: Every function $f: \{-1,1\}^n \to \mathbb{R}$ can be written as a multivariate polynomial:

$f(x) = \sum_{S\subseteq [n]} \widehat{f}(S) \prod_{i\in S} x_i,$

where the term $\widehat{f}(S)$ is called the “Fourier coefficient” of the set $S$ . Here are two ways to see why this is true:

The brute force way: One way to “compute” the function $f(x)$ is to go over all possible vectors $y$ , and ask, “is $x$ equal to $y$ ? If so, then return $f(y)$ !” Mathematically, this amounts to writing $f$ as the sum over the indicators of its inputs:
$f(x) = \sum_{y \in \{-1,1\}^n} f(y) \cdot \mathbf{1}_x(y),$
where $\mathbf{1}_x(y)$ is the function which returns $1$ when $x = y$ and $0$ otherwise. Luckily, this function can be written as a polynomial in the entries of $x$ : For every bit $x_i$ , the expression $(1+x_i y_i)/2$ is $1$ if $x_i = y_i$ (i.e they have the same sign) and $0$ if $x_i \neq y_i$ (i.e they have different signs). Since we need every bit $x_i$ to agree with every bit $y_i$ , we just have
$\mathbf{1}_x(y) = \frac{1}{2^n}\prod_{i=1}^{n}(1+x_i y_i).$
Plugging this into the sum over indicators and summing over possible monomials instead of over possible $y$ values then gives the decomposition. Note that some of the $\widehat{f}(S)$ may be $0$ .
The elegant way: Like everything elegant in the world, this way uses linear algebra. For a set $S \subseteq [n]$ , denote $\chi_S(x) = \prod_{\i\in S}x_i$ ; this is called the “character” of $S$ . A simple calculation shows that for $S \neq T$ , the correlation between $\chi_S$ and $\chi_T$ vanishes:
$\mathbb{E}[\chi_S \chi_T] = 0.$
Thus the set of characters $\{\chi_S\}_{S\subseteq [n]}$ constitute a complete orthonormal basis under the scalar product $\langle f, g \rangle = \mathbb{E}[f g]$ , and so every function $f$ (being a member of the $2^n$ -dimensional linear space of all functions on $\{-1,1\}^n$ ) can be written as a linear combination of characters $\chi_S$ . To obtain the coefficients of the combination, all you need to do is take a scalar product again:
$\widehat{f}(S) = \langle f, \chi_S \rangle = \mathbb{E}[f \chi_S].$

The second, elegant way has a few advantages. First, if you took a functional analysis course in kindergarten, it makes you feel proud to finally see it applied in a nontrivial discrete system. Congratulations! Second (and perhaps arguably more important), the connection to classical functional analysis automatically gives us access to a vast repertoire of theorems and results about this decomposition. For example, we immediately know that the decomposition is unique, or that an equality known as “Parseval’s identity” holds:

$\sum_{S \subseteq [n]} \widehat{f}(S)^2 = \langle f, f \rangle.$

By definition of the particular scalar product we use, $\langle f,g \rangle = \mathbb{E}[fg]$ , this actually gives

$\sum_{S \subseteq [n]} \widehat{f}(S)^2 = \mathbb{E}[f^2].$

We can then write the expectation and variance of a function $f$ in terms of its Fourier coefficients: The expectation is just the coefficient of the empty set:

$\mathbb{E}[f] = \langle f, 1 \rangle = \langle f, \chi_{\emptyset} \rangle = \widehat{f}(\emptyset),$

and using this fact and Parseval’s identity, the variance is

$\mathrm{Var}(f) = \mathbb{E}(f - \mathbb{E}[f])^2 = \sum_{S \neq \emptyset} \widehat{f}(S)^2$

(since we subtract the expectation, we take the square of all coefficients except that of the empty set).

This is very neat; we have found a way to express the variance of a function using its Fourier coefficients. Perhaps, if the same thing can be done for the influences $\mathrm{Inf}_i(f)$ , we could compare directly between the two quantities and prove inequalities like Poincaré and KKL. For this, we use derivatives:

Definition: The $i$ -th derivative of a Boolean function $f$ , writen $\partial_i f$ , is given by the difference of $f$ ‘s value when $x_i$ takes the value $1$ and $-1$ :

$\partial_i f(x) = \frac{1}{2}(f(x_1, \ldots,x_{i-1}, 1, x_{i+1},\ldots,x_n) - f(x_1, \ldots,x_{i-1}, -1, x_{i+1},\ldots,x_n)).$

The derivative function can take on three possible values: $\pm 1$ , if $f$ ‘s value changes when flipping bit $i$ , or $0$ if it remains unchanged. So the square of the derivative, $\partial_i f(x)^2$ , has value $1$ if flipping bit $i$ changes the output. Since $\mathrm{Inf}_i(f)$ is just the probability of this thing happening, the influence is equal to

$\mathrm{Inf}_i(f) = \mathbb{E}[\partial_i f(x)^2].$

Luckily, if we know the Fourier representation of $f$ , then we also know the Fourier representation of its derivative: You can differentiate each monomial separately, and find that

$\partial_i f(x) = \sum_{S \text{ s.t } i\in S} \widehat{f}(S) \prod_{j\in S - \{i\}}x_j.$

Again invoking Parseval’s identity,

$\mathbb{E}[\partial_i f(x)^2] = \sum_{S \text{ s.t } i\in S} \widehat{f}(S)^2 .$

If we sum over all indices $i$ , we’ll again get a sum involving the Fourier coefficients of $f$ , but now each coefficient will be multiplied by some factor: Each set $S$ appears once for each $i \in S$ , so each Fourier coefficient $\widehat{f}(S)$ is counted $|S|$ times. The empty set $S=\emptyset$ won’t be counted at all, of course. Thus

$\sum_i \mathrm{Inf}_i(f) = \sum_i \mathbb{E}[\partial_i f(x)^2] = \sum_{S \neq \emptyset} |S| \widehat{f}(S)^2.$

Congratulations! We have just proven Poincaré’s inequality: The variance $\mathrm{Var}(f)$ had exactly the same form, but without the term $|S|$ multiplying each Fourier coefficient. But $|S|\geq1$ always, so the sum of influences is always larger than the variance. And, if your body is willing and your spirit is strong, you can even deduce from these representations for which functions the Poincaré inequality is actually an equality.

Hypercontractivity

Unfortunately, the KKL inequality requires more than basic manipulation of Fourier coefficients (indeed, it requires advanced manipulation of Fourier coefficients). The standard proof uses what is known as the “hypercontractive inequality”. Summed up in one cryptic sentence, this inequality tells you how different ways of sizing up a function (aka different norms) relate to each other when you smooth out the sharp features of the function (aka adding noise). Let’s see what all of that means.

One important way of studying the influences of a function is to see how it changes under noise. You can write entire books on this topic (indeed, people have), so I’ll just present the basics here.

First, what is noise? Suppose that $x \in \{-1,1\}^n$ is a point in the hypercube. A random vector $y \in \{-1,1\}^n$ is said to be $\rho$ -correlated with $x$ if each bit $y_i$ is obtained from $x_i$ by setting $y_i = x_i$ with probability $\frac{1+\rho}{2}$ , and setting $y_i = -x_i$ with probability $\frac{1-\rho}{2}$ . In other words, $y$ is like $x$ , but each bit of has a chance of $1-\rho$ of being “rerandomized”, i.e has a chance of being uniform random. When $\rho = 1$ , we just have $y = x$ , since the probability to be rerandomized is $0$ . When $\rho = 0$ , then $y$ is just uniformly random, since the probability of being rerandomized is $1$ , and when $\rho$ is somewhere in between, then $y$ is somewhere in between; the larger the $\rho$ , the higher the probability that many bits of $y$ agree with the bits of $x$ . We often think of $y$ as a “noised version” of $x$ . This has a geometric interpretation: For small $\rho$ values, most of $y$ ‘s bits should be identical to those of $x$ , and so it is (physically) close to it in the hypercube.

The probability of getting a point y which is correlated with the lower-left corner is given by the color; the brighter the vertex, the higher the probability.

Using this definition of noise, we can look at a “smoothed out” version of $f$ , where instead of looking at $f(x)$ , we look at $f(y)$ where $y$ is a noised version of $x$ . This is called the noise operator.

Definition: The noise operator of $f$ , denoted $T_\rho f : \{-1,1\}\to \mathbb{R}$ , is defined as

$T_\rho f (x) = \mathbb{E}[f(y)],$

where $y$ is $\rho$ -correlated with $x$ .

I say that this is “smoothed” out because what it does in effect is take weighted averages over all the values of $f$ , where larger weights are given to points close to $x$ , and smaller weights are given to points far away from $x$ (since there is a higher probability for $y$ to be close to $x$ than far). For $\rho = 0$ , all the weights in the average are the same, and we just take a simple average of all the points in the hypercube, i.e $T_0f$ is the constant function $T_0f = \mathbb{E}[f]$ . For $\rho = 1$ , we take the average of only one point, and $T_1f = f$ . Another reason that that is $T_\rho f$ is called “smoothed” is that whereas the original function $f$ takes values only in $\{-1,1\}$ , the noised version takes values in the continuous interval $[-1,1]$ . Smooth indeed 😎.

For the connoisseur: If you peek inside textbooks, you might see $T_\rho$ appear under the name “heat operator”. Indeed, you can imagine that a point $x$ in the hypercube is a hot body (say, a hot potato), while all the other vertices are cold (say, cold potatoes). Imagine also that vertices transfer heat to each other via conduction, and every vertex transfers heat equally to all its neighbors in proportion to the temperature difference. In this case, it turns out that the temperature at a vertex $z$ is proportional to the probability that $y = z$ , where $y$ is $e^{-t}$ -correlated with $x$ .

It can sometimes be a bit hard to understand how $T_\rho f$ behaves, but as usual, one way to study it is to see what it does to the Fourier coefficients. Since the expectation is linear, i.e $\mathbb{E}[\alpha f+\beta g] = \alpha \mathbb{E}[f] + \beta \mathbb{E}[g]$ , we can apply $T_\rho f$ to each monomial in the Fourier decomposition of $f$ :

$T_\rho f(x) = \sum_{S \subseteq [n]} \widehat{f}(S) \cdot \left( T_\rho \prod_{\i\in S} x_i \right)(x).$

Luckily, the effect of $T_\rho$ on the monomial $\prod_{\i\in S} x_i$ is relatively easy to understand: Since all the bits $y_i$ are independent, it’s just the product of $T_\rho$ on each individual $x_i$ , and each one of those is equal to

$\left(T_\rho x_i\right)(x) = \mathbb{E}[y_i] = \left(\frac{1+\rho}{2}\right)x_i + \left(\frac{1-\rho}{2}\right)(-x_i)$

$= \rho x_i.$

So all in all,

$T_\rho f(x) = \sum_{S \subseteq [n]} \widehat{f}(S) \cdot \rho^{|S|} \prod_{\i\in S} x_i,$

and if we know the Fourier coefficients of $f$ , then we know the Fourier coefficients of $T_\rho f$ ! Namely,

$\widehat{T_\rho f}(S) = \widehat{f}(S)\cdot \rho^{|S|}.$

This representation is very useful, and Booleanists learn it by heart from an early age.

While the noise operator certainly changes the values of $f$ , it does not change the expectation of $f$ , so if we take a sum of $f(x)$ over all $x$ and a sum of $T_\rho f(x)$ over all $x$ , we get the same result. Of course, this does not hold for averages over other quantities derived from $f$ , for otherwise there would be no difference between $f$ and $T_\rho f$ . For example, in general,

$\mathbb{E}[f ^2] \geq \mathbb{E}[(T_\rho f)^2]$

(since $f^2 = 1$ always and $(T_\rho f)^2$ can be smaller than $1$ ). Many doors now open, and the mathematically-curious can wonder how expectations of quantities of $f$ relate to expectations of quantities of $T_f$ . One class of heavily-investigated such quantity is the $p$ -norm:

Definition: Let $f:\{-1,1\}^n \to \mathbb{R}$ . For $p\geq1$ , the $p$ -norm of a function $f$ , denoted $\|f\|_p$ , is defined as

$\|f\|_p = \left(\mathbb{E}[|f|^p]\right)^{1/p}.$

The $p$ -norm is important in many branches of analysis, and Boolean analysis is no exception. The friendliest norm you can find is the $2$ -norm: Its cheerful disposition arises mainly from Parseval’s inequality, from which we learn that

$\|f\|_2 = \left(\mathbb{E}[f^2]\right)^{1/2} = \left(\sum_{S \subseteq [n]} \widehat{f}(S)^2 \right)^{1/2};$

this connects the study of norms to the study of Fourier coefficients. Still, mathematicians and computer scientists do not take favorites, and apart from the $2$ -norm, other $p$ -norms are uniformly studied (if not loved). A bit of analysis using convexity can show the following inequality between norms: If $1\leq p \leq q$ , then

$\|f\|_p \leq \|f\|_q.$

In other words, taking a larger $p$ only increases the size of the norm. To get a feel of why this is true, think of a function $f$ which outputs $1$ on a single input $x_0$ , and outputs $0$ on all the rest. The expectation $\mathbb{E}[|f|^p]$ will always be $1/2^n$ , but the root $1/p$ on the outside makes the norm larger for larger values of $p$ .

For the connoisseur: The keen-eyed among you will have noticed that the inequality between norms is exactly the opposite of norms for functions on $\mathbb{R}^n$ (and general normed spaces); in those cases, we have $\|f\|_p \geq \|f\|_q$ for $p\leq q$ . The reason for this reversal is the introduction of the $1/2^n$ factor by the expectation for the Boolean case.

The inequality $\|f\|_p \leq \|f\|_q$ can be thought of as follows. As the name suggests, the $p$ -norms are “norms”, as in, they can be used to measure distance. The norm $\|f\|_p$ measures the “distance” of $f$ from the constant function $0$ , and the distance between two functions $f$ and $g$ is given by $\|f - g\|_p$ . The inequality is then equivalent to saying that the identity mapping between the space of all Boolean functions where distances are measured using the norm $\|\cdot\|_q$ and the space of all Boolean functions where distances are measured using the norm $\|\cdot\|_p$ is a contraction. More succinctly: The operator $\mathrm{Id}: \mathrm{L}^q \to \mathrm{L}^p$ is a contraction.

With this notion in mind, we come to perhaps the most famous theorem in the study of norms of Boolean functions, the “Hypercontractive inequality”. This is a really fancy-sounding, sex-appealing name for a theorem which relates the norm of $f$ and the norm of its smoothed version, $T_\rho f$ . It states that:

Theorem: For all $0 \leq \rho \leq 1$ ,

$\|T_\rho f\|_2 \leq \|f\|_{1+\rho^2}.$

Since $\rho \leq 1$ always, the subscript 2 is always larger than the subscript $1+\rho^2$ , which means that the inequality goes in the opposite direction of the “inequality of the contractive identity”. In other words: Although the $2$ -norm of $f$ is always larger than the $1+\rho^2$ -norm of $f$ , the inequality is reversed if we smooth $f$ out by $\rho$ amount of noise. In even other other words: The noise operator $T_\rho: \mathrm{L}^{1+\rho^2} \to \mathrm{L}^2$ is a contraction! This is the origin of the overwhelming “hypercontractive”: The noise operator $T_\rho$ is so good at equalizing and reducing the norm of a function, that not only is it contracting in its original space (i.e $\|T_\rho f\|_2 \leq \|f\|_2$ ), but it is even a contraction when comparing to otherwise smaller norms!

When I just started learning about Boolean functions, I didn’t know what all the fuss was about. After all, it’s just another inequality. But if there’s one thing that the analysts are good at, it is squeezing seemingly obscure inequalities to their very last drop of sweet, theorem-proving juice. Let’s see how that works out for the KKL inequality.

Proving KKL

Armed with the hypercontractive inequality, proving KKL is as easy as a brisk walk in an $n$ -dimensional park; the elegant proof I present here is from the book on noise sensitivity. In essence, the proof uses the Fourier decomposition of $\mathrm{Var}(f)$ to show that much of the contribution to the variance can be bounded by smoothed-out versions of the derivatives of $f$ . These smoothed-out versions can then be bounded using the hypercontractivity theorem. To save you some scrolling, here is the KKL inequality:

Theorem (KKL): Let $\delta$ be the largest influence of $f$ , i.e $\delta = \max_i \mathrm{Inf}_i(f)$ . Then

$\mathrm{Var}(f) \leq \frac{C}{\log (1/\delta)} \sum_{i=1}^{n}\mathrm{Inf}_i(f).$

Assume for now that $\delta \leq 1 / 1000$ (we’ll show what happens when $\delta > 1/1000$ afterwards). We start the same way we proved the Poincaré inequality, by writing $\mathrm{Var}(f)$ as a sum of Fourier coefficients:

$\mathrm{Var}(f) \leq \sum_{1 \leq |S| \leq n} \widehat{f}(S)^2.$

What we did when proving the Poincaré was to multiply each summand in the right hand side by $|S|$ ; since $|S|\geq 1$ , this only made the summands larger. When we did this, the right hand side was equal to the sum of influences, by the Fourier analysis we did before.

This time, we’ll do something similar, but use a trick before multiplying by $|S|$ : We pick some number $K$ (to be defined later) and divide the sum into two parts: All those $|S|$ which are smaller than $K$ , and all those which are larger:

$\mathrm{Var}(f) \leq \sum_{0 \leq |S| \leq K} \widehat{f}(S)^2 + \sum_{K < |S| \leq n} \widehat{f}(S)^2.$

For the first sum, we just multiply every summand by $|S|$ as usual. But for the second sum, since $|S|> K$ there, we actually multiply by $|S|/K$ ; this is still a number larger than $1$ , so we still only increase the size of the summands:

$\mathrm{Var}(f) \leq \sum_{0 \leq |S| \leq K} |S| \widehat{f}(S)^2 + \frac{1}{K}\sum_{K < |S| \leq n} |S| \widehat{f}(S)^2.$

The second sum is certainly smaller than the sum of all influences, again by the Fourier expansion of the sum of influences that we saw above. So

$\mathrm{Var}(f) \leq \sum_{0 \leq |S| \leq K} |S| \widehat{f}(S)^2 + \frac{1}{K}\sum_{i=1}^{n}\mathrm{Inf}_i(f).$

To bound the first sum, we use another clever trick: We multiply each summand by $2^{2K}$ and divide by $2^{2|S|}$ ; this only makes each summand larger, since $|S| \leq K$ . We can also extend the sum to all $S$ , since that only adds more terms, and we arrive at the inequality

$\sum_{0 \leq |S| \leq K} |S| \widehat{f}(S)^2 \leq 2^{2K} \sum_{|S| \geq 1 } (1/2)^{2|S|}|S| \widehat{f}(S)^2.$

Here is where, at last, the heat operator and hypercontractivity come in. Recall what the heat operator $T_\rho$ does to the Fourier representation of a function: It adds a factor of $\rho^{|S|}$ to each Fourier coefficient. So if we look, for example, at the effect of $T_\rho$ on the derivative of $f$ in direction $i$ , we get

$T_\rho \partial_i f(x) = \sum_{S \text{ s.t } i\in S} \rho^{|S|-1}\widehat{f}(S) \prod_{j\in S - \{i\}}x_j,$

and the square of the $2$ -norm of this expression is, by Parseval’s identity, just the sum of square of coefficients:

$\|T_\rho \partial_i f\|_2^2 = \sum_{S \text{ s.t } i\in S} \rho^{2|S|-2}\widehat{f}(S)^2.$

The equations practically (but metaphorically) beg us to sum this expression over all indices $i$ ; if we do that, we’ll count each set $S$ as many times as there are indices in it (very similar to the calculation for the sum of influences), giving

$\sum_{i=1}^{n} \|T_\rho \partial_i f\|_2^2 = \sum_{|S| \geq 1 } \rho^{2|S|-2}|S|\widehat{f}(S)^2.$

All that remains is choose the very down-to-earth value $\rho=1/2$ , and we have a bound for our initial sum:

$\sum_{0 \leq |S| \leq K} |S| \widehat{f}(S)^2 \leq 2^{2K} \sum_{i=1}^{n} \|T_{1/2} \partial_i f\|_2^2.$

By the hypercontractivity theorem, the $2$ -norm of the smoothed $T_{1/2}f$ is smaller than the $1+(1/2)^2 = 5/4$ -norm of the original function $f$ , so

$\sum_{i=1}^{n} \|T_{1/2} \partial_i f\|_2^2 \leq \sum_{i=1}^{n} \|\partial_i f\|_{5/4}^2.$

But the $5/4$ -norm of the derivative to the power of $5/4$ is just the influence of the function $f$ , since the derivative is $\pm 1$ when there is a change in the function’s value and $0$ otherwise! And since every influence is bounded by $\delta$ by definition, we reach the final bound on the first sum:

$\sum_{0 \leq |S| \leq K} |S| \widehat{f}(S)^2 \leq 2^{2K} \sum_{i=1}^{n} \mathrm{Inf}_i(f)^{8/5} \leq 2^{2K} \delta^{3/5} \sum_{i=1}^{n} \mathrm{Inf}_i(f).$

When we put this together with the bound on the second sum, we get a nice bound on the variance:

$\mathrm{Var}(f) \leq \left( 2^{2K} \delta^{3/5} + \frac{1}{K}\right) \sum_{i=1}^{n}\mathrm{Inf}_i(f).$

All that remains is to choose a $K$ which will minimize the expression in the parenthesis. Notice that as $K$ increases, the term $2^{2K}\delta^{3/5}$ increases, but the term $1/K$ decreases. Typically, these types of expressions are minimized when both terms are equal, or at least, nearly equal. To do this, choose

$K = \frac{3}{10}\log(1/\delta) - \frac{1}{2}\log \log (1/\delta),$

with the logs in base 2. Upon plugging this into the first term, the $\delta$ -power vanishes, and we are left with $1/\log(1/\delta)$ . As for the second term, since we assumed way way up in the beginning of this section that $\delta < 1/1000$ , the $- \log \log$ factor doesn't really make much of a dent in the much larger $\log(1/\delta)$ term, and $1/K \leq 10/\log(1/\delta)$ . So all in all,

$\mathrm{Var}(f) \leq \frac{C}{\log(1/\delta)} \sum_{i=1}^{n}\mathrm{Inf}_i(f),$

exactly as prophesied!

(And what if $\delta \geq 1/1000$ ? Well, in that case, $\log(1/\delta) \leq \log(1000)$ , so in this case the KKL theorem is basically the Poincaré inequality in disguise, with a large constant in front of the influence term).

All in all, I think this is a nice proof. The choice $\rho = 1/2$ is of course arbitrary; we could have chosen a different noise level, and gotten other constants in the various calculations. But the point remains the same: We found a suitable fixed noise level (or, “heating time”, if you think about $T_\rho$ as a heat operator) where we could safely bound the smoothed-out versions of the derivatives. In the next post I will talk about my advisor’s and my new work, which shows how finding a random noise level can help you prove even more powerful versions of KKL-like inequalities.

2 thoughts on “Catastrophic cubic crash course”

Gleb says:

May 6, 2020 at 20:36

Very nice blog post, enjoyed reading it! There may be a mistake in the formula after the paragraph that starts with “Recall what the heat operator…”. The rhs of this equation seems to be for \partial_i T_{\rho} f, not T_{\rho}\partial_i f — otherwise I think \rho has to be taken to the power |S|-1, not |S|.

1. Renan says:
  
  May 6, 2020 at 21:21
  
  Seems you are right. Fixed, thanks!

	yellowtypophile on You only need one
	James Martin on You only need one
	paul_wi11iams on Book review: Imperial Ear…
	Renan on The useful Weierstrass in…
	Weierstrass admirer on The useful Weierstrass in…

Sarcastic Resonance