Sampling Methods


In this section, we consider some simple strategies for generating random samples from a given distribution. Because the samples will be generated by a computer algorithm, they will in fact be pseudo-random numbers; that is, they will be deterministically calculated, but must nevertheless pass appropriate tests for randomness.
Here we shall assume that an algorithm has been provided that generates pseudo-random numbers distributed uniformly over $(0, 1)$, and indeed most software environments have such a facility built in.

Standard distributions
We first consider how to generate random numbers from simple nonuniform distributions, assuming that we already have available a source of uniformly distributed random numbers. Suppose that $z$ is uniformly distributed over the interval $(0, 1)$, and that we transform the values of $z$ using some function $f(·)$ so that $y = f(z)$.
The distribution of $y$ will be governed by
$p(y)=p(z) \left | \frac{dz}{dy}\right|$

where, in this case, $p(z) = 1$. Our goal is to choose the function $f(z)$ such that the resulting values of $y$ have some specific desired distribution $p(y)$. Integrating, we obtain
$z=h(y)=\int_{-\infty}^{y}p(\hat{y})d\hat{y}$

which is the indefinite integral of $p(y)$. Thus, $y = h^{-1}(z)$, and so we have to transform the uniformly distributed random numbers using a function that is the inverse of the indefinite integral of the desired distribution.
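As a concrete illustration, for the exponential distribution $p(y) = \lambda e^{-\lambda y}$ the indefinite integral is $h(y) = 1 - e^{-\lambda y}$, so $y = h^{-1}(z) = -\ln(1-z)/\lambda$. A minimal sketch in Python (the function name is illustrative):

```python
import math
import random

def sample_exponential(lam, rng=random):
    """Draw one exponential(lam) sample by inverting the indefinite integral."""
    z = rng.random()                     # z ~ Uniform(0, 1)
    return -math.log(1.0 - z) / lam      # y = h^{-1}(z)

random.seed(0)
samples = [sample_exponential(2.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)       # should be close to 1/lam = 0.5
```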

Rejection sampling
The rejection sampling framework allows us to sample from relatively complex distributions, subject to certain constraints. We begin by considering univariate distributions and discuss the extension to multiple dimensions subsequently.
Suppose we wish to sample from a distribution $p(z)$ that is not one of the simple, standard distributions considered so far, and that sampling directly from $p(z)$ is difficult. Furthermore, suppose, as is often the case, that we are easily able to evaluate $p(z)$ for any given value of $z$, up to some normalizing constant $Z_p$, so that
$p(z) =\frac{1}{Z_p}\tilde{p}(z)$
where $\tilde{p}(z)$ can readily be evaluated, but $Z_p$ is unknown.
In order to apply rejection sampling, we need some simpler distribution $q(z)$, sometimes called a proposal distribution, from which we can readily draw samples.

We next introduce a constant $k$ whose value is chosen such that $kq(z) \ge \tilde{p}(z)$ for all values of $z$. The function $kq(z)$ is called the comparison function and is illustrated for a univariate distribution in the figure below.

Each step of the rejection sampler involves generating two random numbers. First, we generate a number $z_0$ from the distribution $q(z)$. Next, we generate a number $u_0$ from the uniform distribution over $[0, kq(z_0)]$. This pair of random numbers has uniform distribution under the curve
of the function $kq(z)$. Finally, if $u_0 > \tilde{p}(z_0)$ then the sample is rejected; otherwise $z_0$ is retained. Thus the pair is rejected if it lies in the grey shaded region in the figure above. The remaining pairs then have uniform distribution under the curve of $\tilde{p}(z)$, and hence the corresponding $z$ values are distributed according to $p(z)$, as desired. The original values of $z$ are generated from the distribution $q(z)$, and these samples are then accepted with probability $\tilde{p}(z)/kq(z)$, and so the probability that a sample will be accepted is given by

$p(\text{accept})=\int \frac{\tilde{p}(z)}{kq(z)}q(z)dz=\frac{1}{k}\int \tilde{p}(z)dz$

Thus the fraction of points that are rejected by this method depends on the ratio of the area under the unnormalized distribution $\tilde{p}(z)$ to the area under the curve $kq(z)$. We therefore see that the constant $k$ should be as small as possible subject to the limitation that $kq(z)$ must be nowhere less than $\tilde{p}(z)$.
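To make the procedure concrete, here is a minimal Python sketch that samples from the unnormalized density $\tilde{p}(z) = z(1-z)$ on $(0, 1)$ (a Beta(2, 2) shape) using a uniform proposal $q(z) = 1$ and comparison constant $k = 0.25$, which satisfies $kq(z) \ge \tilde{p}(z)$ everywhere. The target density and constants are illustrative choices, not from the text:

```python
import random

def p_tilde(z):
    """Unnormalized target density on (0, 1): a Beta(2, 2) shape."""
    return z * (1.0 - z)

def rejection_sample(n, k=0.25, rng=random):
    """Draw n samples from p(z) proportional to p_tilde(z) by rejection sampling."""
    samples = []
    while len(samples) < n:
        z0 = rng.random()               # z0 ~ q(z) = Uniform(0, 1)
        u0 = rng.uniform(0.0, k * 1.0)  # u0 ~ Uniform(0, k q(z0))
        if u0 <= p_tilde(z0):           # keep pairs lying under p_tilde
            samples.append(z0)
    return samples

random.seed(0)
s = rejection_sample(50_000)
mean = sum(s) / len(s)                  # Beta(2, 2) has mean 0.5
```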

Adaptive rejection sampling
In many instances where we might wish to apply rejection sampling, it proves difficult to determine a suitable analytic form for the envelope distribution $q(z)$. An alternative approach is to construct the envelope function on the fly based on measured values of the distribution $p(z)$ (Gilks and Wild, 1992). Construction of an envelope function is particularly straightforward for cases in which $p(z)$ is log concave, in other words when $\ln p(z)$ has derivatives that are nonincreasing functions of $z$. The construction of a suitable envelope function is illustrated graphically in the figure below.


The function $\ln p(z)$ and its gradient are evaluated at some initial set of grid points, and the intersections of the resulting tangent lines are used to construct the envelope function. Next a sample value is drawn from the envelope distribution. This is straightforward because the log of the envelope distribution is a succession of linear functions, and hence the envelope distribution itself comprises a piecewise exponential distribution of the form

$q(z)=k_i\lambda_i \exp\{-\lambda_i(z-\hat{z}_{i-1})\}, \qquad \hat{z}_{i-1} < z \le \hat{z}_i$

where $\hat{z}_{i-1}$ and $\hat{z}_i$ are the intersection points of neighbouring tangent lines, $\lambda_i$ is the slope of the corresponding linear segment, and $k_i$ accounts for the probability mass of that segment.
Once a sample has been drawn, the usual rejection criterion can be applied. If the sample is accepted, then it will be a draw from the desired distribution. If, however, the sample is rejected, then it is incorporated into the set of grid points, a new tangent line is computed, and the envelope function is thereby refined. As the number of grid points increases, so the envelope function becomes a better approximation of the desired distribution $p(z)$ and the probability of rejection decreases.
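The procedure can be sketched compactly for a standard Gaussian target, whose log density $\ln \tilde{p}(z) = -z^2/2$ is log concave. The target, the initial grid points, and all function names below are illustrative assumptions, not from the text:

```python
import bisect
import math
import random

def h(z):        # log of the unnormalized target density
    return -0.5 * z * z

def h_prime(z):  # its derivative; nonincreasing, so the target is log concave
    return -z

def ars_sample(n, grid=(-2.0, -0.5, 0.5, 2.0), rng=random):
    """Adaptive rejection sampling: refine the tangent envelope on each rejection."""
    pts = sorted(grid)
    samples = []
    while len(samples) < n:
        # Tangent at each grid point x_j: t_j(z) = m[j] * z + b[j].
        m = [h_prime(x) for x in pts]
        b = [h(x) - h_prime(x) * x for x in pts]
        # Intersections of neighbouring tangents bound the envelope segments.
        cuts = [-math.inf]
        cuts += [(b[j + 1] - b[j]) / (m[j] - m[j + 1]) for j in range(len(pts) - 1)]
        cuts += [math.inf]
        # Mass of each piecewise-exponential segment: integral of exp(m z + b).
        mass = [math.exp(b[j]) * (math.exp(m[j] * cuts[j + 1]) - math.exp(m[j] * cuts[j])) / m[j]
                for j in range(len(pts))]
        # Choose a segment in proportion to its mass, then invert its exponential CDF.
        j = rng.choices(range(len(pts)), weights=mass)[0]
        lo, hi = m[j] * cuts[j], m[j] * cuts[j + 1]
        z = math.log(math.exp(lo) + rng.random() * (math.exp(hi) - math.exp(lo))) / m[j]
        # Usual rejection step against the envelope exp(t_j(z)) >= p_tilde(z).
        if rng.random() <= math.exp(h(z) - (m[j] * z + b[j])):
            samples.append(z)
        else:
            bisect.insort(pts, z)  # a rejected point refines the envelope
    return samples

random.seed(0)
s = ars_sample(3_000)
mean = sum(s) / len(s)
```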

Importance sampling
One of the principal reasons for wishing to sample from complicated probability distributions is to be able to evaluate expectations of the form $E[f]=\int f(z)p(z)dz$. The technique of importance sampling provides a framework for approximating expectations directly but does not itself provide a mechanism for drawing samples from the distribution $p(z)$.
Suppose, however, that it is impractical to sample directly from $p(z)$ but that we can evaluate $p(z)$ easily for any given value of $z$. One simplistic strategy for evaluating expectations would be to discretize $z$-space into a uniform grid and to evaluate the integrand as a sum of the form

$E[f]\simeq \sum_{l=1}^{L} p(z^{(l)})f(z^{(l)})$

An obvious problem with this approach is that the number of terms in the summation grows exponentially with the dimensionality of $z$. Furthermore, as we have already noted, the kinds of probability distributions of interest will often have much of their mass confined to relatively small regions of $z$ space, and so uniform sampling will be very inefficient because, in high-dimensional problems, only a very small proportion of the samples will make a significant contribution to the sum. We would really like to choose the sample points to fall in regions where $p(z)$ is large, or ideally where the product $p(z)f(z)$ is large. As in the case of rejection sampling, importance sampling is based on the use of a proposal distribution $q(z)$ from which it is easy to draw samples, as illustrated in the figure below.

We can then express the expectation in the form of a finite sum over samples $\{z^{(l)}\}$ drawn from $q(z)$

$E[f]=\int f(z)\frac{p(z)}{q(z)}q(z)dz\simeq \frac{1}{L}\sum_{l=1}^{L}\frac{p(z^{(l)})}{q(z^{(l)})}f(z^{(l)})$

The quantities $r_l = p(z^{(l)})/q(z^{(l)})$ are known as importance weights, and they correct for the bias introduced by sampling from the wrong distribution.
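Under the assumption (as in the text) that $p(z)$ can be evaluated exactly, the importance-sampling estimator can be sketched as follows; the target $\mathcal{N}(0,1)$, the proposal $\mathcal{N}(0,2^2)$, and the test function $f(z)=z^2$ are illustrative choices:

```python
import math
import random

def normal_pdf(z, mu=0.0, sigma=1.0):
    """Density of the Gaussian N(mu, sigma^2) at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def importance_estimate(f, n, rng=random):
    """Estimate E_p[f] for p = N(0,1) using samples from the proposal q = N(0, 2^2)."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 2.0)                       # z ~ q(z)
        w = normal_pdf(z) / normal_pdf(z, 0.0, 2.0)   # importance weight p(z)/q(z)
        total += w * f(z)
    return total / n

random.seed(0)
est = importance_estimate(lambda z: z * z, 100_000)   # true value E[z^2] = 1
```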

Markov Chain Monte Carlo (MCMC)
In the previous section, we discussed the rejection sampling and importance sampling strategies for evaluating expectations of functions, and we saw that they suffer from severe limitations, particularly in spaces of high dimensionality. We therefore turn in this section to a very general and powerful framework called Markov chain Monte Carlo (MCMC), which allows sampling from a large class of distributions and which scales well with the dimensionality of the sample space. Markov chain Monte Carlo methods have their origins in physics (Metropolis and Ulam, 1949), and it was only towards the end of the 1980s that they started to have a significant impact in the field of statistics.

As with rejection and importance sampling, we again sample from a proposal distribution. This time, however, we maintain a record of the current state $z^{(τ)}$, and the proposal distribution $q(z|z^{(τ)})$ depends on this current state, and so the sequence of samples $z^{(1)}, z^{(2)}, . . .$ forms a Markov chain. Again, if we write $p(z) = \tilde{p}(z)/Z_p$, we will assume that $\tilde{p}(z)$ can readily be evaluated for any given value of $z$, although the value of $Z_p$ may be unknown. The proposal distribution itself is chosen to be sufficiently simple that it is straightforward to draw samples from it directly. At each cycle of the algorithm, we generate a candidate sample $z^*$ from the proposal distribution and then accept the sample according to an appropriate criterion.
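The acceptance criterion is left unspecified here; a standard choice is the Metropolis rule, which accepts the candidate with probability $\min(1, \tilde{p}(z^*)/\tilde{p}(z^{(\tau)}))$. A minimal sketch for a standard Gaussian target with a Gaussian random-walk proposal (both illustrative choices):

```python
import math
import random

def p_tilde(z):
    """Unnormalized target density: a standard Gaussian up to a constant."""
    return math.exp(-0.5 * z * z)

def metropolis(n, step=1.0, z0=0.0, rng=random):
    """Random-walk Metropolis: propose z* ~ N(z, step^2), accept by the Metropolis rule."""
    z, chain = z0, []
    for _ in range(n):
        z_star = rng.gauss(z, step)                        # candidate from q(z*|z)
        if rng.random() < min(1.0, p_tilde(z_star) / p_tilde(z)):
            z = z_star                                     # accept the candidate
        chain.append(z)                                    # on rejection, repeat z
    return chain

random.seed(0)
chain = metropolis(20_000)[2_000:]  # discard an initial burn-in period
mean = sum(chain) / len(chain)
```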

Markov chains
Before discussing Markov chain Monte Carlo methods, it is useful to study some general properties of Markov chains in more detail. In particular, we ask under what circumstances a Markov chain will converge to the desired distribution. A first-order Markov chain is defined to be a series of random variables $z^{(1)}, . . . , z^{(M)}$ such that the following conditional independence property holds for
$m ∈ \{1, . . . , M − 1\}$

$p(z^{(m+1)}|z^{(1)}, . . . , z^{(m)}) = p(z^{(m+1)}|z^{(m)})$.

This can of course be represented as a directed graph in the form of a chain. We can then specify the Markov chain by giving the probability distribution for the initial variable $p(z^{(0)})$ together with the conditional probabilities for subsequent variables in the form of transition probabilities $T_m(z^{(m)}, z^{(m+1)}) ≡ p(z^{(m+1)}|z^{(m)})$. A Markov chain is called homogeneous if the transition probabilities are the same for all $m$.
The marginal probability for a particular variable can be expressed in terms of the marginal probability for the previous variable in the chain in the form

$p(z^{(m+1)})=\sum_{z^{(m)}}p(z^{(m+1)}|z^{(m)})p(z^{(m)})$
Our goal is to use Markov chains to sample from a given distribution. We can achieve this if we set up a Markov chain such that the desired distribution is invariant. However, we must also require that for $m→∞$, the distribution $p(z^{(m)})$ converges to the required invariant distribution $p^*(z)$, irrespective of the choice of initial distribution $p(z^{(0)})$. This property is called ergodicity, and the invariant distribution is then called the equilibrium distribution. Clearly, an ergodic Markov chain can have only one equilibrium distribution. It can be shown that a homogeneous Markov chain will be ergodic, subject only to weak restrictions on the invariant distribution and the transition probabilities (Neal, 1993).
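These ideas can be illustrated with a two-state homogeneous chain: repeatedly applying the marginal update $p(z^{(m+1)}) = \sum_{z^{(m)}} p(z^{(m+1)}|z^{(m)})p(z^{(m)})$ drives the marginal toward the invariant distribution regardless of the initial distribution. The transition matrix below is an illustrative example:

```python
# Two-state Markov chain: T[i][j] = p(z_next = j | z_current = i).
T = [[0.9, 0.1],
     [0.2, 0.8]]

def step(p, T):
    """One marginal update: p_new(j) = sum_i p(i) * T(i, j)."""
    return [sum(p[i] * T[i][j] for i in range(len(p))) for j in range(len(T[0]))]

p = [1.0, 0.0]           # initial distribution p(z^{(0)})
for _ in range(200):     # iterate the chain's marginal update
    p = step(p, T)
# p converges to the invariant distribution (2/3, 1/3), which satisfies p = p T.
```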

In practice we often construct the transition probabilities from a set of ‘base’ transitions $B_1, . . . , B_K$. This can be achieved through a mixture distribution of the form

$T(z', z)=\sum_{k=1}^{K}\alpha_k B_k(z', z)$

for some set of mixing coefficients $α_1, . . . , α_K$ satisfying $α_k \ge 0$ and $\sum_k α_k = 1$.
Alternatively, the base transitions may be combined through successive application, so that

$T(z', z)=\sum_{z_1}\cdots\sum_{z_{K-1}}B_1(z', z_1)\cdots B_K(z_{K-1}, z)$

If a distribution is invariant with respect to each of the base transitions, then obviously it will also be invariant with respect to either of these forms of $T(z', z)$.


Gibbs Sampling
Gibbs sampling (Geman and Geman, 1984) is a simple and widely applicable Markov chain Monte Carlo algorithm and can be seen as a special case of the Metropolis-Hastings algorithm. Consider the distribution $p(z) = p(z_1, . . . , z_M)$ from which we wish to sample, and suppose that we have chosen some initial state for the Markov chain. Each step of the Gibbs sampling procedure involves replacing the value of one of the variables by a value drawn from the distribution of that variable conditioned on the values of the remaining variables. Thus we replace $z_i$ by a value drawn from the distribution $p(z_i|z_{\setminus i})$, where $z_i$ denotes the $i$th component of $z$, and $z_{\setminus i}$ denotes $z_1, . . . , z_M$ but with $z_i$ omitted. This procedure is repeated either by cycling through the variables in some particular order or by choosing the variable to be updated at each step at random from some distribution.
For example, suppose we have a distribution $p(z_1, z_2, z_3)$ over three variables,and at step $τ$ of the algorithm we have selected values $z^{(τ)}_1 , z^{(τ)}_2$ and $z^{(τ)}_3$ . 

We first replace $z^{(τ)}_1$ by a new value $z^{(τ+1)}_1$  obtained by sampling from the conditional distribution

$p(z_1|z^{(τ)}_2, z^{(τ)}_3)$

We next replace $z^{(τ)}_2$ by a new value $z^{(τ+1)}_2$  obtained by sampling from the conditional distribution

$p(z_2|z^{(τ+1)}_1, z^{(τ)}_3)$

so that the new value for $z_1$ is used straight away in subsequent sampling steps. Then we update $z_3$ with a sample $z^{(τ+1)}_3$ drawn from

$p(z_3|z^{(τ+1)}_1, z^{(τ+1)}_2)$

and so on, cycling through the three variables in turn.
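As a concrete sketch, consider Gibbs sampling from a bivariate Gaussian with zero means, unit variances, and correlation $\rho$: the required conditionals are themselves Gaussian, $p(z_1|z_2) = \mathcal{N}(\rho z_2, 1-\rho^2)$ and symmetrically for $z_2$. This example target is an illustrative choice, not from the text:

```python
import math
import random

def gibbs_bivariate_gaussian(n, rho=0.8, rng=random):
    """Gibbs sampling for a zero-mean bivariate Gaussian with correlation rho."""
    sigma = math.sqrt(1.0 - rho * rho)   # conditional standard deviation
    z1, z2 = 0.0, 0.0                    # initial state of the Markov chain
    chain = []
    for _ in range(n):
        z1 = rng.gauss(rho * z2, sigma)  # sample z1 ~ p(z1 | z2)
        z2 = rng.gauss(rho * z1, sigma)  # sample z2 ~ p(z2 | z1), using the new z1
        chain.append((z1, z2))
    return chain

random.seed(0)
chain = gibbs_bivariate_gaussian(50_000)[5_000:]  # discard burn-in
```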


