I’m happy to say that my advisor Ronen Eldan and I recently uploaded a paper to the arXiv under the title “Decomposition of mean-fields Gibbs distributions into product measures” (https://arxiv.org/abs/1708.05859). This is a sister paper of the previous one about exponential random graphs: this one presents a “general framework” and only briefly touches on how it can be used, while the latter is an “applications” paper which uses the framework as a black box to show some neat results.
In this post, I’d like to explain what Gibbs distributions are, what we found out about them, and why all of this is interesting.
This will be an exposition, so I’ll try not to show all the technical details and messy calculations, but some will inexorably wriggle their way in. This shouldn’t stop you from reading the rest of the post, though 🙂 .
Our work is heavily based on my advisor’s prior work concerning Gibb’s distributions, titled “Gaussian-width gradient complexity, reverse log-Sobolev inequalities and nonlinear large deviations” (https://arxiv.org/abs/1612.04346). This post can also be used to familiarize yourself with the background of that paper as well.
Our paper deals with random vectors with elements, where each entry is either or . Random vectors with entries are very common in the (mathematical) world. Here are three examples of distributions of such vectors.
The vector can represent independent flips of a coin, where an entry of represents a “heads”, and represents a “tail”. Suppose that the probability that the coin lands on “heads” is some number . Then the probability of obtaining some particular vector that has heads (and therefore tails) in coin flips is
If we wish, we can also write this probability as a slightly-cumbersome exponent, using the fact that :
In technical terms, this is just a “ iid Bernoulli random vector with success probability ”, which is pretty much the simplest random vector there is.
We can complicate this model a bit by saying that each entry represents the outcome of a different coin, and that coin has probability to come out heads. In this case, the probability of obtaining a some particular vector is a product of the form
We call this particular distribution a product measure, and say that it is governed by a probability vector . These (relatively simple) distributions will play a key role in the things to come.
Exponential random graphs
The exponential random graph model is a way of assigning probabilities to graphs.
I have written about it in greater detail in a previous blogpost, but will give a quick recap now.
In this model, denoted , we let be some function on graphs with vertices. For example, for a given graph , may count the number of triangles in ; it may return the size of the largest clique in ; or it may return 1 if is a tree and 17 otherwise. Using , we define the following probability distribution on graphs with vertices:
where is a normalizing constant that is just there to make sure that the probabilities sum up to ; it may be written as
Now, a graph with vertices can have up to possible edges: One for each pair of vertices. We can then represent a graph by a vector with entries which take values in . Each entry represents a possible edge between two vertices in the graph: An entry with the value indicates that the corresponding edge appears in the graph, while the value indicates that the edge does not appear. Thus the probability distribution of the exponential random graph model directly translates into an a probability distribution on vectors.
The Ising model
The Ising model is a model in statistical physics which attempts to capture some of the interesting properties of a magnet. For your leisure and convenience, here is a grossly oversimplified description of the physics of material magnetism; you can skip the following paragraph if all you want is the math.
A solid metal is composed of a bunch of atoms. Each of these atoms has a tiny magnetic moment, which can point either up or down. If all the atoms point upwards together, their magnetic moments add up, and the material is magnetic. The same is true if all the magnetic moments point downwards together. However, if about half of the atoms point upwards and half point downwards, they effectively cancel each other out, and the material is non-magnetic. If this were all there was to it, material-science would be easy; however, the atoms also interact with each other, and this adds a lot of complication to the system. In general, each atom feels the effect of the magnetic moments of its neighbours, and the energy of the system is lower if neighbouring magnetic moments are aligned. Real-life systems tend to be in states with low energy, so a material would “like” to have its all its magnetic moments aligned. But initial conditions + the effects of having a finite temperature usually prevent this.
The Ising model captures the above behaviour in a very simple and elegant way. Our material is modelled as a graph with vertices, where each vertex represents an atom. A common example is a (finite) 2D or 3D lattice. In the 2D case, for example, every atom has 4 neighbours (except at the boundary):
The magnetic moments of the atoms are represented by numbers: each vertex gets either a value of (if the magnetic moment points up) or (if the magnetic moments points down). The energy of the system , or the “Hamiltonian of the system” is then defined as
where is some real number which moderates the strength of interaction between the sites; it is usually controlled by the temperature of the system. If two neighbouring magnetic moments and are of equal sign (either both are up or both are down), then the product is and the energy is reduced (notice the minus sign at the front). But if they are of opposite sign, then , and the energy of the system is increased.
Following the rules of statistical physics, the probability of finding a system with low energy should be much higher than the probability of finding a system with high energy. Specifically, if is an Ising model, the probability of obtaining a particular configuration of the system is
where is some normalizing constant. In words, this formula means that configurations with low energy are exponentially more likely to appear than configurations with high energy.
This translates directly into a probability distribution of vectors , where is the number of vertices in the graph : each entry of corresponds to the direction of the magnetic moment of a vertex in . (Take care not to confuse this with the case of exponential random graphs, where was the number of possible edges in a graph, and each entry of corresponded to whether a particular edge was in or not).
All of the above probability distributions were actually examples of Gibbs distributions. We say that a random vector follows a Gibbs distributions with Hamiltonian if the probability for to be equal to is just
where is a normalizing constant that is just there to make sure that the probabilities sum up to ; it may be written as We denote such a system by .
For the case of independent -biased coin flips, the Hamiltonian was
For the case of exponential random graphs, the Hamiltonian is some graph statistic (e.g the number of triangles in the graph). For the case of the Ising model, .
Technical notes (don’t worry if these don’t bother you) :
1) In most sources, Gibbs distributions are defined to have probability proportional to rather than ; there is a minus sign in the exponent. This is purely a matter of convention, and we choose to omit this minus sign in our work.
2) We will mainly want to to talk about asymptotics in this post. For example, we might want to ask how the Ising model on a grid with atoms behaves as the grid size goes to infinity. Technically, for every there should be a different function , since was defined over vectors in . But in this post we will ignore this and think of as a single function over all possible ‘s. This lets us relate to the size of , so we can say things like “ whenever has elements”.
Admittedly, the definition by itself doesn’t tell you much about the behaviour of Gibbs distributions, since was arbitrary, and if we allow to be arbitrarily complicated, we can get any distribution we desire. That is, for every probability distribution on vectors with -entries, we can always find an so that ; simply take
But that’s ok! Because although an arbitrary distribution can always be obtained by an arbitrary Hamiltonian, we are almost always not interested arbitrary distributions. Rather, the Gibbs distributions that find their way into other applications are usually either simple to state, or simple to calculate, or are considered simple by some other measure. Two of the Hamiltonians given above: the graph triangle-counting function (which is often used in social network theory) and the Ising model (which occupies a large mental space of all statistical physicists) are good examples of this.
Indeed, our main contribution in our paper is to show that there is a notion of complexity, called “gradient complexity”, so that if the Hamiltonian is “simple” under this notion, then the Gibbs distribution defined by it becomes much more tractable. In what sense is it more tractable? In the sense that it can be well approximated by a mixture of product measures, where the probability vector of each product measure obeys some predefined equation (we’ll talk about what we mean by “mixture” soon). This is in fact the source of the paper’s title, “Decomposition of mean-field Gibbs distributions into product measures”. The adjective “mean-field”, which is a term originating from statistical mechanics, is essentially another name for “low gradient complexity”.
Now you may ask, Why is this a good thing? Why would we even want to decompose Gibbs distributions into mixtures of product measures which obey some predefined equation? The answer is that, in general, Gibbs distributions have so far been quite resilient to analysis, and we don’t understand them very well. You can read the post on exponential random graphs to see an example, but to be brief:
- Although we would very much like to do so, we have a hard time calculating the normalizing constant .
- Given a Hamiltonian , we have a hard time creating a random sample from .
- Given a black box that can generate samples from for some unknown Hamiltonian , we have a hard time figuring out what the original was.
On the other hand, product measures are much easier to deal with, especially if we know something about their probability vectors. Of course, as stated above, we don’t claim that the distribution is close to any particular product measure, but rather to a “mixture of product measures”. Suppose that is a probability measure on vectors in . A distribution is called a “-mixture”, denoted , if it can be randomly sampled as follows: first, generate a random vector according to the measure . Then, generate a random vector according to the product measure whose probability vector has expectation . In other words, we first generate random coin-biases using (by randomly generating the coins’ expectations), and then we flip independent coins, one for each generated bias.
Advanced note: As usual, unless we know something about , this definition doesn’t give us a lot of information: every distribution can be expressed as a -mixture, for which mimics the original distribution by simply has all its mass on singletons. Our results are meaningful because they show that has a very specific structure called “small tilt decomposition” which gives it useful properties, but we won’t go into that in this post.
Our main result can thus be very, very, very roughly written as
Theorem 1, rough statement: If the Hamiltonian is nice enough (i.e has low gradient complexity + some other smoothness conditions) then there exists a measure so that:
- can be approximated by a -mixture.
- Most of the mass of rests on vectors which satisfy
as , where the is applied entrywise to the entries of . Here is the one-norm of the vector , and stands for
We have already talked a bit about the first item in the above theorem, so let’s inspect the second one. I haven’t told you yet what means in this context, but for now it’s enough to think of it as an analogue to the real-differentiable gradient of a function: it is a vector with elements, where each coordinate is some simple function of . The important thing to see is that this equation tells us something about how the probability vectors of the individual product measures behave: for almost all the vectors which will appear under , we know that they are close to satisfying the fixed point equation
If we can solve this equation, then we can say something meaningful about how looks like, and hence how the product measure looks like.
I say “are close to satisfying the fixed point equation…”, because there can still be some error: all we require is that is asymptotically smaller than , something which is often written as
This means that almost every entry of is close to the corresponding entry of ; indeed, if the distance between them as vectors is , then almost every individual entry is . Of course, this may not always be an easy equation to solve.
As we will soon see, there is a technical and precise way to define what “nice” means in this context. But before we do that, I’d first like to show some intuition as to what sort of Hamiltonians might be decomposable into product measures, and what sort of Hamiltonians might not.
Intuitively, we might think that since product measures have independent entries, mixtures of product measures also still contain a lot of independence. That is, if I know that a certain element in my vector is, say, 1, this shouldn’t affect the probability of other vertices being 1 by too much. Thus, Hamiltonians for which there is not too strong a dependence between ‘s entries should be decomposable, while Hamiltonians with very strong dependence between entries should not. Of course, for a Gibbs distribution to be interesting it must have at least some dependence between the entries of , otherwise it is just a product measure.
An example for a Hamiltonian which does not satisfy this requirement is the 2D, 4-neighbor Ising model described above. For suppose that I know that the magnetic moment of a certain atom is pointing upwards. This has a great impact on its four neighbors: I am now certain about 1/4th of the forces acting on the neighboring atoms, and can be much more confident that they will point upwards as well. Thus there is a (relatively) strong dependence between adjacent nodes in this model (further, this dependence propagates: since we know that the neighbors are more likely to point upwards, their neighbors are also more likely to point upwards, and so on; but the effect is felt strongest in the immediate vicinity of the initial site).
On the other hand, consider a similar 2D Ising model where each atom is now connected to a large neighborhood: all pairs of sites are considered neighbors if they are within distance from each other, where is not a constant but increases as the size of the lattice grows.
In this model, the effect of any individual atom on its neighbors is minuscule: every magnetic moment is influenced by a large number of other atoms, and the dependence of any atom on any specific individual neighbor decreases to 0 as . We can therefore expect neighboring sites to be “almost independent”, something which hopefully can be captured by a mixture model of product measures.
Another way to see the difference between the models is as follows. In the regular 2D model, where each site only has 4 neighbors, I can pick a site, and fix its 4 neighbors to be either all up, or all down. If I fix them to point upwards, then with relatively high probability, the atom in the middle will also point upwards; likewise if I fix them to point downwards. The whole affair only deals with 5 atoms, so very roughly speaking, I can fix about of the sites to point in any arbitrary directions I wish (there are problems at the interface between such clumps, but let’s not worry about that, this is just for intuition).
On the other hand, if each site is influenced by many other sites, as in the second model we described, things are much “smoother”. If we wish to control the direction of a particular site, we have to fix a lot of magnetic moments in the ball around that site. Since the radius of the ball increases with , this means that we can only arbitrarily control sites.
Thus, for the 4-neighbor model, we can create probable vectors where about of the sites point in arbitrary directions, but for the large-ball model, only of them can be made arbitrary. In this sense, the latter model is much smoother than the former. Our definition of “Gaussian Width” will utilize this observation.
Anyway, back to the theorem statement; to fully understand it, I still owe you several things:
- What does it mean for to be “nice”?
- What does it mean to be approximated by a -mixture?
- What does mean for a function that is normally defined on ?
I’ll now give answers to these three questions. The details will be a bit technical (but only a bit!). If you are fine with the above theorem formulation as it is, though, you can skip on ahead to the next section of this post (titled “An application”).
1) We start with what it means for to be nice, and for that we need define what it means to “differentiate” a function whose domain is . For a coordinate , define the -th partial derivative of as
In other words, we look at the change in the value of when we flip just a single coordinate. This lets us define two important quantities. The first is the discrete gradient of :
which is just a straightforward analog of the real-differentiable gradient. The second is the Lipschitz constant of :
This is reminiscent of the good ol’ Lipschitz constant we are used to having in standard analysis, since by the triangle inequality, for every we have
The Gaussian Width of a set is defined as
where is a standard Gaussian vector in .
In other words, we do the following. First, we generate a random Gaussian vector , that is, a vector where each of its entries is a standard good-ol’-Gaussian random variable. Then we look inside the set for the point with the highest correlation with . We then take an average of this over the distribution of .
The name “Gaussian Width” comes from the fact that this quantity, in a way, represents the “width” of the set , only normalized by the factor of a Gaussian. The gradient complexity of a function is then just defined as
In words, it is the Gaussian Width of the range of the discrete gradient of .
So is the average highest correlation of a random Gaussian with the possible gradients . We can relate this to the intuition we described earlier, when we contrasted the non-smoothness of the 4-neighbor Ising model with the smoothness of the large-ball Ising model.
Suppose you already generated the vector . For the 4-neighbor model, look at the largest entries in absolute value of . About half of them will be positive, and about half of them will be negative. But because we can arbitrarily control of the entries of , there exists a state which highly correlates with those entries of : just make sure that the signs align. Since the entries of are independent, the rest of the entries won’t contribute much either positively or negatively to the correlation, and we get that is of order of magnitude .
On the other hand, this trick cannot be repeated with the large-ball model, since there we can only arbitrarily fix entries. While this doesn’t by itself mean that the large-ball model has low complexity – this still needs to be proven formally – it gives you a hunch of why it doesn’t have high complexity: the gradient is too “smooth” to have a large correlation with random noise (which is just what a Gaussian vector is).
So what is a “nice” function? It is a function for which:
- is asymptotically smaller than , i.e as . Intuitively, this means that the set of possible gradients of doesn’t take up too large a volume.
- has a small Lipschitz constant; e.g does not depend on . This means that doesn’t change by too much as you change its input.
- has a small Lipschitz constant (in the metric-space sense); e.g, one that does not depend on . This means that ‘s norm doesn’t change by too much as you change its input.
As you can see, these conditions are all different ways of saying that basically can’t change too much, or too quickly. In this sense, it is “well behaved”.
2) When we say that is approximated by a -mixture, we mean that we can generate vectors both under the law of and under the law of the -mixture in such a way, so that in expectation, the distance between them is small. The formal definition goes like this: Let . A random variable is called a -mixture if there exists a coupling between the -mixture and the original random vector so that
Being well approximated corresponds to having a -mixture with as a function of ; in other words, the expected distance between vectors of the two distributions is asymptotically smaller than . This implies that almost all of the time, the differences between almost all of the individual coordinates tend to .
3) If you are especially keen eyed, you may have noticed a discrepancy between the equation and the definition of the discrete gradient : In the equation, the entries of take values in the continuous cube , while the discrete gradient is only defined over the discrete cube . How can we calculate for continuous vectors? It turns out that there are two natural ways to extend a discrete function to the entire continuous cube; I’ll describe them here only very briefly without going into the details, so don’t worry if you don’t know all of these terms. The first is to take the discrete Fourier decomposition of into monomials, and just plug in fractional values into the resulting polynomial formula. The second is to treat the discrete values of as boundary conditions for solving Laplace’s partial differential equation, and progressively solve the problem in higher and higher dimensions. As it happens, the two methods are actually equivalent! So you can use your favorite one to extend .
We can now finally state Theorem 1 in a more precise version.
Theorem 1, in almost all its glory: Let be a Hamiltonian and denote
Denote by the set
Then there exists a measure such that is a -mixture, with
Here the set is the set of “good vectors”, which are close to obeying the fixed point equation . If you recall, we said that “nice” functions are those for which , and and are constant. In this case, the error term in the definition of is , the term in definition of the -mixture goes to , and the mass of the good set goes to as . The formulation above also shows that and don’t have to be constant, but are allowed to grow somewhat with while still retaining a good bound.
The above formulation is only in “almost all its glory” because, we actually show that is something stronger than a -mixture – we show that it is a tilt decomposition – but for that you are going to have to read the paper itself :).
Now that we have a hammer, it’s to time start looking for nails. The most straightforward way to apply our theorem is to take your favorite Gibbs distribution and start calculating the gradient complexity and Lipschitz constants for its Hamiltonian. If these are small enough, hurray! You can try to solve the equation in order to say something interesting about the product measures in the decomposition.
Let’s see an example of this: The Curie-Weiss Ising model. In this model, we have a collection of atoms which all interact with each other; that is, every magnetic moment feels the effect of every other magnetic moment. The underlying graph for this model is then the complete graph. Admittedly, the model doesn’t capture physical reality very well: it is not really possible, in large blocks of matter, for all the atoms to interact with one another. However, it is useful to study such simple things, in order to gain insights into the behaviour of more complicated models.
The Hamiltonian is written as
Note that there is a term in front of the sum; the rationale for this is that the force on a single atom cannot explode to infinity, so in fact each atom only feels the average of all its neighbors. The term is often called the “inverse temperature”; if we were to attach a physical meaning to it, it would be proportional to , where is the temperature. A temperature tending to will give a gigantic which puts a strong emphasis on the interaction between atoms, while a temperature tending to will means that the interaction between atoms becomes negligible.
So, the first task is check whether the Curie-Weiss Hamiltonian is “nice”. This is not very hard to do once you treat the problem formally: it turns out that you can express the Hamiltonians of Ising models using matrices: any Hamiltonian of the form
can be written as
where is a symmetric matrix called the “interaction matrix”. Under this notation, the gradient of can be calculated to be
so all we have to do is check the Gaussian Width and Lipschitz constants of a linear operator. For the case of the Curie-Weiss model, this operator is luckily very simple: all the interaction coefficients are the same and are equal to . The interaction matrix is then just
We won’t do the calculations here, but analyzing this matrix shows that the Curie-Weiss Hamiltonian indeed satisfies our conditions for being “nice”. So we can apply our theorem, and try so see how the individual product measures of the decomposition behave.
Now, as it turns out, in this particular case, instead of grinding our gears and trying to solve the approximate fixed point equation , we can opt to solve the exact fixed point equation
The solutions to the exact equation will tell us basically everything we need to know about the approximate ones. Further, it can be shown that not much is lost if we assume that the matrix has on the diagonal as well, instead of ‘s. In this case, the above equation reduces to a scalar equation, involving no matrices at all! This is because then every entry of a solution to the vector equation satisfies
and so all the ‘s are equal to each other! Every solution of this equation is then of the form for some real number . We then only need to solve the real-valued equation
As the following image shows, for , there is only one solution, which is .
Let’s interpret this result. The vectors which solve the fixed point equation represent the expectation of a product measure. If every coordinate of the expectation is 0, then every coordinate of the product measure has a probability of getting the value , and a probability of getting the value . That is, there is absolutely no preference between magnetic moments which point up, and magnetic moments which point down. Whenever we sample a system, we can expect about half the sites to point up, and about half the sites to point down.
This is what is called in physics an “unordered state”, which, magnetism-wise, corresponds to “no magnetism”. Recalling that represents the inverse temperature, a small value of corresponds to a high temperature, so this can also be stated as follows: when the temperature is high, there can be no magnetism. This appeals to the physical intuition that when it is very very hot, the thermal fluctuations which randomly flip the magnetic moments are so large that they do not allow the system to settle down in any ordered state.
On the other hand, when , there can be more than one solution: in fact, two new solutions appear, one for positive , and one for negative :
These solutions tend towards and as . What do these solutions mean? A value of , for example, indicates that for each site, there is a probability of for the atom to point upwards, and a probability of for the atom to point downwards; that way, the expectation is indeed . In this case, about -ths of the magnetic moments will point upwards, and we have an ordered state: a magnet. This happens at low temperatures, and appeals to the physical intuition that when it is very cold, there are almost no thermal fluctuations, and the system can settle down in a state of low energy – that is, one where most of the magnetic moments point in the same direction.
Our paper contains all formalities and proofs and explanations to everything that was left unexplained above (or at least, references to proofs and explanations). I hope that this post sparked your interest enough so that you would take a look at the details. The application of our main theorem to the Curie-Weiss Ising model is only a toy example; perhaps there are many other applications which are waiting to be picked apart. For instance, we ourselves used this framework to say something interesting about exponential random graphs. But that’s already a story for another post…