Inference in Graphical Models-Inference on Chain , Trees and factor Graphs
We turn now to the problem of inference in graphical models, in which some of the nodes in a graph are clamped to observed values, and we wish to compute the posterior distributions of one or more subsets of other nodes. As we shall see, we can exploit the graphical structure both to find efficient algorithms for inference, and to make the structure of those algorithms transparent. Specifically, we shall see that many algorithms can be expressed in terms of the propagation of local messages around the graph.
To start with, let us consider the graphical interpretation of Bayes’ theorem.Suppose we decompose the joint distribution $p(x, y)$ over two variables $x$ and $y$ into a product of factors in the form $p(x, y) = p(x)p(y|x)$. This can be represented by the directed graph shown in Figure 8.37(a). Now suppose we observe the value of $y$, as indicated by the shaded node in Figure 8.37(b). We can view the marginal distribution $p(x)$ as a prior over the latent variable $x$, and our goal is to infer the corresponding posterior distribution over $x$. Using the sum and product rules of probability we can evaluate
$p(y)=\sum_x' p(y|x')p(x')$
which can then be used in Bayes’ theorem to calculate
$p(x|y)=\frac{p(y|x)p(x)}{p(y)}$
Thus the joint distribution is now expressed in terms of $p(y)$ and $p(x|y)$. From a graphical perspective, the joint distribution $p(x, y)$ is now represented by the graph shown in Figure 8.37(c), in which the direction of the arrow is reversed. This is the simplest example of an inference problem for a graphical model.
Inference on a chain
Now consider a more complex problem involving the chain of nodes of the form shown in Figure below. This example will lay the foundation for a discussion of exact inference in more general graphs later in this section.We have already seen that the directed chain can be transformed into an equivalent undirected chain. Because the directed graph does not have any nodes with more than one parent, this does not require the addition of any extra links, and the directed and undirected versions of this graph express exactly the same set of conditional independence statements.
We shall consider the specific case in which the $N$ nodes represent discrete variables each having $K$ states, in which case each potential function $ψ_{n−1,n}(x_{n−1}, x_n)$ comprises an $K ×K$ table, and so the joint distribution has $(N − 1)K^2$ parameters.
Let us consider the inference problem of finding the marginal distribution $p(x_n)$ for a specific node $x_n$ that is part way along the chain. Note that, for the moment, there are no observed nodes. By definition, the required marginal is obtained by summing the joint distribution over all variables except $x_n,$ so that
In a naive implementation, we would first evaluate the joint distribution and then perform the summations explicitly. The joint distribution can be represented as a set of numbers, one for each possible value for $x$. Because there are $N$ variables each with $K$ states, there are $K^N$ values for $x$ and so evaluation and storage of the joint distribution, as well as marginalization to obtain $p(x_n)$, all involve storage and computation that scale exponentially with the length $N$ of the chain.
We can, however, obtain a much more efficient algorithm by exploiting the conditional independence properties of the graphical model. If we substitute the factorized expression for the joint distribution into above eqn , then we can rearrange the order of the summations and the multiplications to allow the required marginal to be evaluated much more efficiently. Consider for instance the summation over $x_N$. Thepotential $ψ_{N−1,N} (x_{N−1}, x_N)$ is the only one that depends on $x_N$, and so we can perform the summation
first to give a function of $x_{N−1}$. We can then use this to perform the summation over $x_{N−1}$, which will involve only this new function together with the potential ψ_{N−2,N−1}(x_{N−2}, x_{N−1})$, because this is the only other place that $x_{N−1}$ appears.Similarly, the summation over $x_1$ involves only the potential $ψ_{1,2}(x_1, x_2)$ and so can be performed separately to give a function of $x_2$, and so on. Because each summation effectively removes a variable from the distribution, this can be viewed as the removal of a node from the graph.
If we group the potentials and summations together in this way, we can express the desired marginal in the form
Here the key concept that we are exploiting is that multiplication is distributive over addition, so that
$ab + ac = a(b + c)$
in which the left-hand side involves three arithmetic operations whereas the righthand side reduces this to two operations.
We now give a powerful interpretation of this calculation in terms of the passing of local messages around on the graph.We see that the expression for the marginal $p(x_n)$ decomposes into the product of two factors times the normalization constant
$p(x_n) =\frac{1}{Z}\mu_α(x_n)\mu_β(x_n).$
We shall interpret $\mu_α(x_n)$ as a message passed forwards along the chain from node $x_{n−1}$ to node $x_n$. Similarly, $μ_β(x_n)$ can be viewed as a message passed backwards along the chain to node $x_n$ from node $x_{n+1}$. Note that each of the messages comprises a set of $K$ values, one for each choice of $x_n$, and so the product of two messages should be interpreted as the point-wise multiplication of the elements of the two messages to give another set of $K$ values.
The message $μ_α(x_n)$ can be evaluated recursively because
and then apply the above equation repeatedly until we reach the desired node.
Similarly, the message $μ_β(x_n)$ can be evaluated recursively by starting with node $x_N$ and using
This recursive message passing is illustrated in Figure below.Graphs of the form shown in Figure 8.38 are called Markov chains, and the corresponding message passing equations represent an example of the Chapman-Kolmogorov equations for Markov processes (Papoulis, 1984).
Now suppose we wish to evaluate the marginals p(xn) for every node $n ∈ \{1, . . . , N\}$ in the chain. Simply applying the above procedure separately for each node will have computational cost that is $O(N^2M^2)$. However, such an approach would be very wasteful of computation. For instance, to find $p(x_1)$ we need to propagate a message $μ_β(·)$ from node $x_N$ back to node $x_2$. Similarly, to evaluate $p(x_2)$ we need to propagate a messages $μ_β(·)$ from node $x_N$ back to node $x_3$. This will involve much duplicated computation because most of the messages will be identical in the two cases.
Suppose instead we first launch a message $μ_β(x_{N−1})$ starting from node $x_N$ and propagate corresponding messages all the way back to node $x_1$, and suppose we similarly launch a message $μ_α(x_2)$ starting from node $x_1$ and propagate the corresponding messages all the way forward to node $x_N$. Provided we store all of the intermediate messages along the way, then any node can evaluate its marginal simply by applying.
$p(x_n) =\frac{1}{Z}\mu_α(x_n)\mu_β(x_n).$
The computational cost is only twice that for finding the marginal of a single node, rather than $N$ times as much. Observe that a message has passed once in each direction across each link in the graph. Note also that the normalization constant $Z$ need be evaluated only once, using any convenient node.
Now suppose we wish to calculate the joint distribution $p(x_{n−1}, x_n)$ for two neighbouring nodes on the chain. This is similar to the evaluation of the marginal for a single node, except that there are now two variables that are not summed out. A few moments thought will show that the required joint distribution can be written in the form
Thus we can obtain the joint distributions over all of the sets of variables in each of the potentials directly once we have completed the message passing required to obtain the marginals.
Trees
We have seen that exact inference on a graph comprising a chain of nodes can be performed efficiently in time that is linear in the number of nodes, using an algorithm that can be interpreted in terms of messages passed along the chain. More generally, inference can be performed efficiently using local message passing on a broader class of graphs called trees.
In the case of an undirected graph, a tree is defined as a graph in which there is one, and only one, path between any pair of nodes. Such graphs therefore do not have loops. In the case of directed graphs, a tree is defined such that there is a single node, called the root, which has no parents, and all other nodes have one parent. If we convert a directed tree into an undirected graph, we see that the moralization step
will not add any links as all nodes have at most one parent, and as a consequence the corresponding moralized graph will be an undirected tree. Examples of undirected and directed trees are shown in Figure 8.39(a) and 8.39(b).Note that a distribution represented as a directed tree can easily be converted into one represented by an undirected tree, and vice versa.
If there are nodes in a directed graph that have more than one parent, but there is still only one path (ignoring the direction of the arrows) between any two nodes, then the graph is a called a polytree, as illustrated in Figure 8.39(c). Such a graph will have more than one node with the property of having no parents, and furthermore,the corresponding moralized undirected graph will have loops.
Factor Graphs(Frey, 1998; Kschischnang et al., 2001).
Both directed and undirected graphs allow a global function of several variables to be expressed as a product of factors over subsets of those variables. Factor graphs make this decomposition explicit by introducing additional nodes for the factors themselves in addition to the nodes representing the variables. They also allow us to be more explicit about the details of the factorization, as we shall see.Let us write the joint distribution over a set of variables in the form of a product of factors
$p(x)=\prod_s f_s(x_s)$
where $x_s$ denotes a subset of the variables. For convenience, we shall denote the individual variables by $x_i$.Each factor $f_s$ is a function of a corresponding set of variables $x_s$.
In a factor graph, there is a node (depicted as usual by a circle) for every variable in the distribution, as was the case for directed and undirected graphs. There are also additional nodes (depicted by small squares) for each factor $f_s(x_s)$ in the joint distribution. Finally, there are undirected links connecting each factor node to all of the variables nodes on which that factor depends. Consider, for example, a distribution that is expressed in terms of the factorization
This can be expressed by the factor graph shown in Figure below. Note that there are two factors $f_a(x_1, x_2)$ and $f_b(x_1, x_2)$ that are defined over the same set of variables.In an undirected graph, the product of two such factors would simply be lumped together into the same clique potential. Similarly, $f_c(x_2, x_3)$ and $f_d(x_3)$ could be combined into a single potential over $x_2$ and $x_3$. The factor graph, however, keeps such factors explicit and so is able to convey more detailed information about the underlying factorization.
Factor graphs are said to be bipartite because they consist of two distinct kinds of nodes, and all links go between nodes of opposite type. In general, factor graphs can therefore always be drawn as two rows of nodes (variable nodes at the top and factor nodes at the bottom) with links between the rows, as shown in the example in above figure. In some situations, however, other ways of laying out the graph may be more intuitive, for example when the factor graph is derived from a directed or undirected graph, as we shall see.
If we are given a distribution that is expressed in terms of an undirected graph, then we can readily convert it to a factor graph. To do this, we create variable nodes corresponding to the nodes in the original undirected graph, and then create additional factor nodes corresponding to the maximal cliques $x_s$. The factors $f_s(x_s)$ are then set equal to the clique potentials. Note that there may be several different factor graphs that correspond to the same undirected graph. These concepts are illustrated in Figure below.
Similarly, to convert a directed graph to a factor graph, we simply create variable nodes in the factor graph corresponding to the nodes of the directed graph, and then create factor nodes corresponding to the conditional distributions, and then finally add the appropriate links. Again, there can be multiple factor graphs all of which correspond to the same directed graph. The conversion of a directed graph to a factor graph is illustrated in Figure 8.42.
Comments
Post a Comment