Methods for Pattern Classification

Site: Saylor Academy
Course: CS250: Python for Data Science
Book: Methods for Pattern Classification

Description

At the heart of all pattern search or classification problems (either explicitly or implicitly) lies Bayes' Decision Theory. Bayes' decision simply says: given an input observation of unknown classification, make the decision that will minimize the probability of a classification error. For example, in this unit you will be introduced to the k-nearest neighbor algorithm, and it can be demonstrated that this algorithm can approximate Bayes' decision. Read this chapter to familiarize yourself with Bayes' decision.

Introduction

Pattern classification is the task of assigning an object to one of a given set of categories, called classes. For a specific pattern classification problem, a classifier is computer software developed so that objects (x) are classified correctly with reasonably good accuracy. Through training on input-output pairs, a classifier acquires decision functions that assign an input datum to one of the given classes (w_i). In pattern recognition applications we rarely, if ever, have complete knowledge of the probabilistic structure of the problem, such as the prior probabilities P(w_i) or the class-conditional densities p(x | w_i). In a typical case, we merely have some vague, general knowledge about the situation, together with a number of design samples or training data - particular representatives of the patterns we want to classify. The problem, then, is to find some way to use this information to design or train the classifier.

This chapter first addresses those cases where a great deal of information about the models is known and then moves toward problems where the form of the distributions is unknown, and even the category membership of training patterns is unknown. We begin with Bayes decision theory (Sec. 2) by considering the ideal case in which the probability structure underlying the categories is known perfectly. In Sec. 3 (Maximum Likelihood), we address the case when the full probability structure underlying the categories is not known, but the general forms of their distributions are; the uncertainty about a probability distribution is then represented by the values of some unknown parameters, and we seek to determine these parameters to attain the best categorization. In Sec. 4 (Nonparametric Techniques), we move yet further from the Bayesian ideal and assume that we have no prior parameterized knowledge about the underlying probability structure; in essence, our classification will be based on information provided by training samples alone. Classic techniques such as the nearest-neighbor algorithm and potential functions play an important role here. Sec. 5 (Support Vector Machine) turns to linear discriminant functions and the support vector machine. Next, in Sec. 6 (Nonlinear Discriminants and Neural Networks), we see how some of the ideas from such linear discriminants can be extended to a class of very powerful algorithms, such as backpropagation and other methods for multilayer neural networks; these neural techniques have a range of useful properties that have made them a mainstay in contemporary pattern recognition research. In Sec. 7 (Stochastic Methods), we discuss simulated annealing, the Boltzmann learning algorithm, and other stochastic methods, and we explore the behavior of such algorithms with regard to the local minima that can plague other neural methods. Sec. 8 (Unsupervised Learning and Clustering) addresses the case when input training patterns are not labeled and our recognizer must determine the cluster structure.


Source: Yizhang Guan, https://www.intechopen.com/chapters/10687
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

Bayesian Decision Theory

Suppose that we know both the prior probabilities P(w_j) and the conditional densities p(x | w_j). Suppose further that we measure the features of a sample and discover that its value is x. How does this measurement influence our attitude concerning the true state of nature - that is, the category of the input? We note first that the (joint) probability density of finding a pattern that is in category w_j and has feature value x can be written in two ways: p(w_j, x) = P(w_j | x) p(x) = p(x | w_j) P(w_j). Rearranging these leads us to the answer to our question, which is called Bayes' formula:

P\left(\omega_{j} \mid x\right)=\frac{p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)}{p(x)} (1)

where in this case of c categories

p(x)=\sum_{j=1}^{c} p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right) (2)
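
As a concrete illustration of Eqs. (1) and (2), the short Python sketch below computes the posterior probabilities from assumed priors and assumed class-conditional density values evaluated at a single measured x; all numbers are invented purely for illustration.

# Illustrative Bayes' formula computation (Eqs. 1-2); the numbers are assumed.
priors = [0.6, 0.4]        # P(w_1), P(w_2)
likelihoods = [0.3, 0.7]   # p(x | w_1), p(x | w_2) evaluated at the observed x

# Eq. (2): evidence p(x) = sum_j p(x | w_j) P(w_j)
evidence = sum(l * p for l, p in zip(likelihoods, priors))

# Eq. (1): posterior P(w_j | x) = p(x | w_j) P(w_j) / p(x)
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]
print(posteriors)          # [0.391..., 0.608...]; the posteriors sum to 1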


Two-Category Classification

If we have an observation x for which P(w_1 | x) is greater than P(w_2 | x), we would naturally be inclined to decide that the true state of nature is w_1. Similarly, if P(w_2 | x) is greater than P(w_1 | x), we would be inclined to choose w_2. Thus we have justified the following Bayes decision rule for minimizing the probability of error:

Decide w_1 if P\left(\omega_{1} \mid x\right)>P\left(\omega_{2} \mid x\right), otherwise decide w_2 (3)

In Eq. (1), p(x) is a scale factor that is unimportant for our decision. By using Eq. (1), we can instead express the rule in terms of the prior and class-conditional probabilities. We also notice that P(w_1 | x) + P(w_2 | x) = 1. By eliminating this scale factor, we obtain the following completely equivalent decision rule:

Decide w_1 if p\left(x \mid \omega_{1}\right) P\left(\omega_{1}\right)>p\left(x \mid \omega_{2}\right) P\left(\omega_{2}\right) otherwise decide w_2 (4)
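
A minimal sketch of the rule in Eq. (4), assuming, purely for illustration, that each class-conditional density is a one-dimensional Gaussian with known mean and standard deviation and that the priors are equal:

import math

def gaussian_pdf(x, mean, std):
    # Assumed form of p(x | w_j): a 1-D Gaussian density.
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Assumed class models: (mean, std, prior) for w_1 and w_2.
w1 = (0.0, 1.0, 0.5)
w2 = (2.0, 1.0, 0.5)

def decide(x):
    # Eq. (4): decide w_1 if p(x | w_1) P(w_1) > p(x | w_2) P(w_2).
    s1 = gaussian_pdf(x, w1[0], w1[1]) * w1[2]
    s2 = gaussian_pdf(x, w2[0], w2[1]) * w2[2]
    return "w_1" if s1 > s2 else "w_2"

print(decide(0.3))   # nearer the mean of w_1, so "w_1"
print(decide(1.8))   # nearer the mean of w_2, so "w_2"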

While the two-category case is just a special instance of the multi-category case, it has traditionally received separate treatment. Indeed, a classifier that places a pattern in one of only two categories has a special name - a dichotomizer. Instead of using two discriminant functions g_1 and g_2 and assigning x to w_1 if g_1 > g_2, it is more common to define a single discriminant function

g(x)=g_{1}(x)-g_{2}(x) (5)

and to use the following decision rule:

Decide w_1 if g(x) > 0, otherwise decide w_2

Thus, a dichotomizer can be viewed as a machine that computes a single discriminant function g(x) and classifies x according to the algebraic sign of the result. Of the various forms in which the minimum-error-rate discriminant function can be written, the following two (derived from Eqs. (1) and (5)) are particularly convenient:

g(x)=P\left(\omega_{1} \mid x\right)-P\left(\omega_{2} \mid x\right) (6)

g(x)=\ln \frac{p\left(x \mid \omega_{1}\right)}{p\left(x \mid \omega_{2}\right)}+\ln \frac{P\left(\omega_{1}\right)}{P\left(\omega_{2}\right)} (7)
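
The same two-class decision can be phrased through the single discriminant of Eq. (7) and read off from its sign. The sketch below again assumes illustrative one-dimensional Gaussian class-conditional densities; with the parameters chosen here it reproduces the decision of Eq. (4).

import math

def log_gaussian_pdf(x, mean, std):
    # ln p(x | w_j) for an assumed 1-D Gaussian class-conditional density.
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def g(x, m1=0.0, s1=1.0, p1=0.5, m2=2.0, s2=1.0, p2=0.5):
    # Eq. (7): g(x) = ln[p(x | w_1) / p(x | w_2)] + ln[P(w_1) / P(w_2)]
    return log_gaussian_pdf(x, m1, s1) - log_gaussian_pdf(x, m2, s2) + math.log(p1 / p2)

x = 0.3
print("decide w_1" if g(x) > 0 else "decide w_2")   # the sign of g(x) gives the decision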


Multi-Category Classification

Let w_1, ..., w_c be the finite set of c states of nature. Let the feature vector x be a d-component vector-valued random variable, and let p(x | w_j) be the state-conditional probability density function for x - the probability density function for x conditioned on w_j being the true state of nature. As before, P(w_j) describes the prior probability that nature is in state w_j. Then the posterior probability P(w_j | x) can be computed from p(x | w_j) by Bayes' formula:

P\left(\omega_{j} \mid x\right)=\frac{p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)}{p(x)} (8)

where the evidence is now

p(x)=\sum_{j=1}^{c} p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right) (9)

A classifier can be represented by a set of discriminant functions g_i(x), i = 1, ..., c; it assigns a feature vector x to class w_i if g_i(x) > g_j(x) for all j ≠ i. A Bayes classifier is easily and naturally represented in this way. For the minimum-error-rate case, we can simplify things further by taking g_i(x) = P(w_i | x), so that the maximum discriminant function corresponds to the maximum posterior probability.
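
A brief sketch of such a classifier for c classes: it takes g_i(x) = P(w_i | x) computed from Eqs. (8) and (9) and assigns x to the class with the largest discriminant. The likelihood values and priors below are invented for illustration.

def bayes_classify(likelihoods, priors):
    # likelihoods: p(x | w_i) evaluated at the observed x, one value per class
    # priors:      P(w_i), one value per class
    # Returns the index i that maximizes g_i(x) = P(w_i | x)  (Eqs. 8-9).
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]
    return max(range(len(posteriors)), key=lambda i: posteriors[i])

# Invented three-class example; returns 2, i.e. the third class w_3 wins.
print(bayes_classify(likelihoods=[0.1, 0.2, 0.5], priors=[0.5, 0.3, 0.2]))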

Clearly, the choice of discriminant functions is not unique. We can always multiply all the discriminant functions by the same positive constant or shift them by the same additive constant without influencing the decision. More generally, if we replace every g_i(x) by f(g_i(x)), where f(.) is a monotonically increasing function, the resulting classification is unchanged. This observation can lead to significant analytical and computational simplifications. In particular, for minimum-error-rate classification, any of the following choices gives identical classification results, but some can be much simpler to understand or compute than others:

g_{i}(x)=P\left(\omega_{i} \mid x\right)=\frac{p\left(x \mid \omega_{i}\right) P\left(\omega_{i}\right)}{\sum_{j=1}^{c} p\left(x \mid \omega_{j}\right) P\left(\omega_{j}\right)} (10)

g_{i}(x)=p\left(x \mid \omega_{i}\right) P\left(\omega_{i}\right) (11)

g_{i}(x)=\ln p\left(x \mid \omega_{i}\right)+\ln P\left(\omega_{i}\right) (12)

where ln denotes the natural logarithm.
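
This invariance can be checked numerically: with any illustrative priors and likelihood values, the three discriminant choices of Eqs. (10)-(12) select the same class, because dividing by p(x) and taking the logarithm are both monotonically increasing operations.

import math

likelihoods = [0.1, 0.2, 0.5]   # assumed p(x | w_i) at the observed x
priors = [0.5, 0.3, 0.2]        # assumed P(w_i)
evidence = sum(l * p for l, p in zip(likelihoods, priors))

g10 = [l * p / evidence for l, p in zip(likelihoods, priors)]           # Eq. (10)
g11 = [l * p for l, p in zip(likelihoods, priors)]                      # Eq. (11)
g12 = [math.log(l) + math.log(p) for l, p in zip(likelihoods, priors)]  # Eq. (12)

argmax = lambda g: max(range(len(g)), key=lambda i: g[i])
assert argmax(g10) == argmax(g11) == argmax(g12)   # identical classification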