PAC Learning



Probably Approximately Correct (PAC) learning framework.

PAC (Probably Approximately Correct) learning is a framework for the mathematical analysis of machine learning. A PAC learner tries to learn a concept approximately correctly by selecting, from a set of candidate hypotheses, one with low generalization error. A problem is PAC-learnable if there is an algorithm $A$ that, given independently drawn samples labeled by any concept $c$ and drawn from any distribution $D$, produces with high probability a hypothesis with small error. Since it may not be possible to find a perfect hypothesis with zero error, the goal is to find a consistent hypothesis that predicts approximately correctly, with an upper bound on the error. The PAC learning framework was introduced by Valiant [1984].
Motivation 
Some computational learning questions 
• What can be learned efficiently? 
• What is inherently hard to learn? 
• Is there a general model of learning? 
Complexity 
• Computational complexity: time and space. 
• Sample complexity: amount of training data needed to learn successfully. 
• Mistake bounds: number of mistakes made before learning successfully.

The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution (the sample complexity) and the time and space complexity of the learning algorithm, which depends on the cost of the computational representation of the concepts.
We first introduce several definitions and the notation needed to present the PAC model. We denote by $X$ the set of all possible examples or instances; $X$ is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by $Y$. For binary classification, $Y$ is reduced to two labels, $Y = \{0, 1\}$.

A concept $c : X \to Y$ is a mapping from $X$ to $Y$. Since $Y = \{0, 1\}$, we can identify $c$ with the subset of $X$ over which it takes the value 1. Thus, in the following, we equivalently refer to a concept to learn as a mapping from $X$ to $\{0, 1\}$, or as a subset of $X$. As an example, a concept may be the set of points inside a triangle or the indicator function of these points. In such cases, we will say in short that the concept to learn is a triangle. A concept class is a set of concepts we may wish to learn and is denoted by $C$. This could, for example, be the set of all triangles in the plane.

We assume that examples are independently and identically distributed (i.i.d.) according to some fixed but unknown distribution $D$. The learning problem is then formulated as follows. The learner considers a fixed set of possible concepts $H$, called a hypothesis set, which might not necessarily coincide with $C$. It receives a sample $S = (x_1,\ldots  ,x_m)$ drawn i.i.d. according to $D$ as well as the labels $(c(x_1), \ldots , c(x_m))$, which are based on a specific target concept $c \in C$ to learn. The task is then to use the labeled sample $S$ to select a hypothesis $h_S \in H$ that has a small generalization error with respect to the concept $c$. The generalization error of a hypothesis $h \in H$, also referred to as the risk or true error (or simply error) of $h$, is denoted by $R(h)$ and defined as follows.
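$R(h) = P_{x\sim D}[h(x) \neq c(x)] = E_{x\sim D}\big[1_{h(x)\neq c(x)}\big]$

where $1_\omega$ is the indicator function of the event $\omega$. In other words, $R(h)$ is the probability that $h$ misclassifies a point drawn according to $D$.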


The following introduces the Probably Approximately Correct (PAC) learning framework. Let $n$ be a number such that the computational cost of representing any element $x \in X$ is at most $O(n)$ and denote by size(c) the maximal cost of the computational representation of $c \in C$. For example, $x$ may be a vector in $R^n$, for which the cost of an array-based representation would be in $O(n)$. In addition, let $h_S$ denote the hypothesis returned by algorithm $A$ after receiving a labeled sample $S$. To keep notation simple, the dependency of $h_S$ on $A$ is not explicitly indicated.
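Definition (PAC-learning). A concept class $C$ is said to be PAC-learnable if there exists an algorithm $A$ and a polynomial function $poly(\cdot,\cdot,\cdot,\cdot)$ such that for any $\epsilon > 0$ and $\delta > 0$, for all distributions $D$ on $X$ and for any target concept $c \in C$, the following holds for any sample size $m \ge poly(1/\epsilon, 1/\delta, n, size(c))$:

$P_{S\sim D^m}[R(h_S) \le \epsilon] \ge 1-\delta.$

If $A$ further runs in time polynomial in $1/\epsilon$, $1/\delta$, $n$ and $size(c)$, then $C$ is said to be efficiently PAC-learnable.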


A concept class $C$ is thus PAC-learnable if the hypothesis returned by the algorithm after observing a number of points polynomial in $1/\epsilon$ and $1/\delta$ is approximately correct (error at most $\epsilon$) with high probability (at least $1-\delta$), which justifies the PAC terminology. The parameter $\delta > 0$ is used to define the confidence $1-\delta$ and $\epsilon > 0$ the accuracy $1 -\epsilon$. Note that if the running time of the algorithm is polynomial in $1/\epsilon$ and $1/\delta$, then the sample size $m$ must also be polynomial if the full sample is received by the algorithm.

Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution $D$ from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution $D$. This is a natural and necessary assumption for generalization to be possible in general. It can be relaxed to include favorable domain adaptation problems. Finally, the PAC framework deals with the question of learnability for a concept class $C$ and not a particular concept. Note that the concept class $C$ is known to the algorithm, but of course the target concept $c \in C$ is unknown. In many cases, in particular when the computational representation of the concepts is not explicitly discussed or is straightforward, we may omit the polynomial dependency on $n$ and size(c) in the PAC definition and focus only on the sample complexity.

Example (Learning axis-aligned rectangles) 
Consider the case where the set of instances are points in the plane, $X = R^2$, and the concept class $C$ is the set of all axis-aligned rectangles lying in $R^2$. Thus, each concept $c$ is the set of points inside a particular axis-aligned rectangle. The learning problem consists of determining with small error a target axis-aligned rectangle using the labeled training sample. We will show that the concept class of axis-aligned rectangles is PAC-learnable.


The figure above illustrates the problem. $R$ represents a target axis-aligned rectangle and $R'$ a hypothesis. As can be seen from the figure, the error regions of $R'$ are formed by the area within the rectangle $R$ but outside the rectangle $R'$ and the area within $R'$ but outside the rectangle $R$. The first area corresponds to false negatives, that is, points that are labeled 0 (negative) by $R'$ but are in fact positive (labeled 1). The second area corresponds to false positives, that is, points labeled positively by $R'$ that are in fact negatively labeled.

To show that the concept class is PAC-learnable, we describe a simple PAC-learning algorithm $A$. Given a labeled sample $S$, the algorithm returns the tightest axis-aligned rectangle $R' = R_S$ containing the points labeled 1. The figure below illustrates the hypothesis returned by the algorithm. By definition, $R_S$ does not produce any false positives, since its points must be included in the target concept $R$. Thus, the error region of $R_S$ is included in $R$.
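To make the algorithm concrete, here is a minimal Python sketch (not from the original notes; the uniform distribution on $[0,1]^2$, the particular target rectangle, and the function names are illustrative assumptions). It draws a labeled sample from a target rectangle, fits the tightest rectangle around the positive points, and estimates the resulting error on fresh points.

```python
# Minimal simulation of the tightest-rectangle learner (illustrative sketch).
import random

def label(point, rect):
    """Return 1 if point lies inside the axis-aligned rectangle rect = (l, r, b, t)."""
    x, y = point
    l, r, b, t = rect
    return int(l <= x <= r and b <= y <= t)

def fit_tightest_rect(sample, labels):
    """Return the smallest axis-aligned rectangle containing all positive points."""
    pos = [p for p, y in zip(sample, labels) if y == 1]
    if not pos:                      # no positive points seen:
        return (0.0, 0.0, 0.0, 0.0)  # degenerate rectangle, predicts 0 almost everywhere
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

# Assumed target concept R and distribution D = uniform on the unit square.
R = (0.2, 0.7, 0.3, 0.9)
m = 500
sample = [(random.random(), random.random()) for _ in range(m)]
labels = [label(p, R) for p in sample]
R_S = fit_tightest_rect(sample, labels)

# Estimate the generalization error of R_S on fresh points drawn from the same D.
test = [(random.random(), random.random()) for _ in range(10000)]
err = sum(label(p, R) != label(p, R_S) for p in test) / len(test)
print(f"estimated error of the tightest rectangle: {err:.4f}")
```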


Let $R \in C$ be a target concept. Fix $\epsilon > 0$. Let $P[R]$ denote the probability mass of the region defined by $R$, that is, the probability that a point randomly drawn according to $D$ falls within $R$. Since errors made by our algorithm can be due only to points falling inside $R$, we can assume that $P[R] > \epsilon$; otherwise, the error of $R_S$ is less than or equal to $\epsilon$ regardless of the training sample $S$ received.
Now, since $P[R] > \epsilon$, we can define four rectangular regions $r_1, r_2, r_3$ and $r_4$ along the sides of $R$, each with probability at least $\epsilon/4$. These regions can be constructed by starting with the full rectangle $R$ and then decreasing the size by moving one side as much as possible while keeping a distribution mass of at least $\epsilon/4$. The figure below illustrates the definition of these regions.


One can show that if $R_S$ meets all four regions $r_i$, then, because it is a rectangle, it has one side in each of them and its error is at most $\epsilon$; by contraposition, if $R(R_S) > \epsilon$, then $R_S$ must miss at least one of the regions $r_i$. Since the $m$ points of $S$ are drawn i.i.d. according to $D$ and each region has probability at least $\epsilon/4$, we can write

$P_{S\sim D^m}[R(R_S) > \epsilon] \le \sum_{i=1}^{4} P_{S\sim D^m}[R_S \cap r_i = \emptyset] \le 4(1-\epsilon/4)^m \le 4\exp(-m\epsilon/4).$

Setting the right-hand side to be equal to $\delta$ and solving for $m$ shows that $P[R(R_S) > \epsilon] \le \delta$ whenever $m \ge \frac{4}{\epsilon}\ln\frac{4}{\delta}$. This sample complexity is polynomial in $1/\epsilon$ and $1/\delta$, which concludes the proof that axis-aligned rectangles are PAC-learnable.
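
For finite hypothesis sets and consistent learners, the following general learning bound holds.

Theorem (learning bound, finite $H$, consistent case). Let $H$ be a finite set of functions mapping $X$ to $Y$, and let $A$ be an algorithm that, for any target concept $c \in H$ and any i.i.d. sample $S$ of size $m$, returns a consistent hypothesis $h_S$, that is, one with zero error on $S$. Then, for any $\epsilon, \delta > 0$, the inequality $P_{S\sim D^m}[R(h_S) \le \epsilon] \ge 1-\delta$ holds whenever

$m \ge \frac{1}{\epsilon}\big(\ln|H| + \ln \frac{1}{\delta}\big).$

Equivalently, with probability at least $1-\delta$, $R(h_S) \le \frac{1}{m}\big(\ln|H| + \ln \frac{1}{\delta}\big).$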

The theorem shows that when the hypothesis set $H$ is finite, a consistent algorithm $A$ is a PAC-learning algorithm, since the sample complexity is dominated by a polynomial in $1/\epsilon$ and $1/\delta.$ The generalization error of consistent hypotheses is upper bounded by a term that decreases as a function of the sample size $m$. This is a general fact: as expected, learning algorithms benefit from larger labeled training samples. The decrease rate of $O(1/m)$ guaranteed by this theorem, however, is particularly favorable.

The price to pay for coming up with a consistent algorithm is the use of a larger hypothesis set $H$ containing the target concepts. Of course, the upper bound increases with $|H|$. However, that dependency is only logarithmic. Note that the term $\ln|H|$, or the related term $\log_2|H|$ from which it differs by a constant factor, can be interpreted as the number of bits needed to represent $H$. Thus, the generalization guarantee of the theorem is controlled by the ratio of this number of bits, $\log_2|H|$, and the sample size $m$.

Consistent Learners 
• A learner $L$ using a hypothesis space $H$ and training data $D$ is said to be a consistent learner if it always outputs a hypothesis with zero error on $D$ whenever $H$ contains such a hypothesis. 
• By definition, a consistent learner must produce a hypothesis in the version space for $H$ given $D$. 
• Therefore, to bound the number of examples needed by a consistent learner, we just need to bound the number of examples needed to ensure that the version space contains no hypotheses with unacceptably high error.

Example: conjunctions of Boolean literals
Consider learning the concept class $C_n$ of conjunctions of at most $n$ Boolean literals $x_1,\ldots, x_n.$ A Boolean literal is either a variable $x_i$, $i \in [n]$, or its negation $\bar{x_i}$. For $n = 4$, an example is the conjunction $x_1\wedge \bar{x_2}\wedge x_4$, where $\bar{x_2}$ denotes the negation of the Boolean literal $x_2$. $(1, 0, 0, 1)$ is a positive example for this concept, while $(1, 0, 0, 0)$ is a negative example.

Observe that for $n = 4$, a positive example $(1, 0, 1, 0)$ implies that the target concept cannot contain the literals $\bar{x_1}$ and $\bar{x_3}$ and that it cannot contain the literals $x_2$ and $x_4$. In contrast, a negative example is not as informative, since it is not known which of its $n$ bits are incorrect. A simple algorithm for finding a consistent hypothesis is thus based on positive examples and consists of the following: for each positive example $(b_1, \ldots ,b_n)$ and $i \in [n]$, if $b_i = 1$ then $\bar{x_i}$ is ruled out as a possible literal in the concept class, and if $b_i = 0$ then $x_i$ is ruled out. The conjunction of all the literals not ruled out is thus a hypothesis consistent with the target. The figure below shows an example training sample as well as a consistent hypothesis for the case $n = 6$.
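Here is a minimal Python sketch of this elimination procedure (illustrative only; the encoding of a literal as an (index, sign) pair and the function name are assumptions, not from the notes):

```python
# Sketch of the elimination algorithm for conjunctions of Boolean literals.
# A literal is encoded as (i, True) for x_i and (i, False) for its negation.
def learn_conjunction(n, examples):
    """examples: list of (bits, label) with bits a tuple of n values in {0, 1}."""
    # Start with all 2n literals as candidates.
    candidates = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
    for bits, y in examples:
        if y != 1:
            continue                            # negative examples are ignored
        for i, b in enumerate(bits):
            if b == 1:
                candidates.discard((i, False))  # rules out the negated literal
            else:
                candidates.discard((i, True))   # rules out the unnegated literal
    return candidates                           # conjunction of the surviving literals

# Example for n = 4 with (assumed) target x_1 AND (NOT x_2) AND x_4, 1-indexed as in the text.
examples = [((1, 0, 0, 1), 1), ((1, 0, 1, 1), 1), ((1, 0, 0, 0), 0)]
h = learn_conjunction(4, examples)
print(sorted(h))  # 0-indexed: [(0, True), (1, False), (3, True)], i.e. x_1, not-x_2, x_4
```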


We have $|H| = |C_n| = 3^n$, since each literal can be included positively, with negation, or not included. Plugging this into the sample complexity bound for consistent hypotheses yields the following sample complexity bound for any $\epsilon> 0$ and $\delta> 0$:

$m \ge \frac{1}{\epsilon}\big( (\ln 3)n + \ln \frac{1}{\delta}\big)$
Thus, the class of conjunctions of at most $n$ Boolean literals is PAC-learnable. Note that the computational complexity is also polynomial, since the training cost per example is in $O(n)$.

Example (Universal concept class) Consider the set $X =\{0,1\}^n$ of all Boolean vectors with $n$ components, and let $U_n$ be the concept class formed by all subsets of $X$. Is this concept class PAC-learnable? To guarantee a consistent hypothesis, the hypothesis class must include the concept class, so $|H| \ge |U_n|=2^{(2^n)}$. The sample complexity bound then requires:

$m \ge \frac{1}{\epsilon}\big( (\ln 2)2^n + \ln \frac{1}{\delta}\big)$

Here, the number of training samples required is exponential in $n$, which is the cost of the representation of a point in $X$. Thus, PAC-learning is not guaranteed by the theorem.

Example: learning conjunctions of Boolean literals 
Each instance has $n$ Boolean features. 
Learned hypotheses are of the form $Y = X_1 \wedge X_2 \wedge \neg X_5$.
How many training examples suffice to ensure that, with probability $\ge 0.99$, a consistent learner will return a hypothesis with error $\le 0.05$?

$m \ge \frac{1}{\epsilon}\big(\ln|H| + \ln \frac{1}{\delta}\big)$

There are $3^n$ hypotheses in $H$ (each variable can be present and unnegated, present and negated, or absent), so

$m \ge \frac{1}{0.05}\big(\ln 3^n + \ln \frac{1}{0.01}\big)$

For $n=10$, $m \ge 312$; for $n=100$, $m \ge 2290$.
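These numbers can be verified with a few lines of Python (a quick sanity check of the bound above, not part of the original notes; the function name is illustrative):

```python
# Sample-complexity bound m >= (1/eps) * (ln|H| + ln(1/delta)) with |H| = 3^n.
import math

def conjunction_bound(n, eps=0.05, delta=0.01):
    return math.ceil((n * math.log(3) + math.log(1 / delta)) / eps)

print(conjunction_bound(10))   # 312
print(conjunction_bound(100))  # 2290
```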

Example: learning decision trees of depth 1



Consider a Boolean classification problem with $n$ binary variables and a hypothesis space $H$, where each hypothesis is a decision tree of depth 1. How many training examples $m$ suffice to ensure that, with probability at least 0.99, any consistent learner using $H$ will output a hypothesis with true error at most 0.05?
$|H| = 4n$ (there are $n$ choices for the variable at the root, and each of the two leaves can be labeled 0 or 1).
$m \ge \frac{1}{0.05}\big(\ln 4n + \ln \frac{1}{0.01}\big)$
For $n=4$, $m \ge 148$; for $n=10$, $m \ge 166$; for $n=100$, $m \ge 212$.
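As before, these figures can be checked with a short Python snippet (illustrative only; it just evaluates the bound with $|H| = 4n$):

```python
# Sample-complexity bound m >= (1/eps) * (ln|H| + ln(1/delta)) with |H| = 4n.
import math

def stump_bound(n, eps=0.05, delta=0.01):
    return math.ceil((math.log(4 * n) + math.log(1 / delta)) / eps)

print(stump_bound(4))    # 148
print(stump_bound(10))   # 166
print(stump_bound(100))  # 212
```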
