

Properties of Naive Bayes

To gain a better understanding of the two models and the assumptions they make, let us go back and examine how we derived their classification rules in the preceding two sections. We decide class membership of a document by assigning it to the class with the maximum a posteriori (MAP) probability, which we compute as follows:
$c_{map} = \argmax_{c \in \mathbb{C}} \ P(c\vert d)$   (121)
$\phantom{c_{map}} = \argmax_{c \in \mathbb{C}} \ \frac{P(d\vert c)\,P(c)}{P(d)}$   (122)
$\phantom{c_{map}} = \argmax_{c \in \mathbb{C}} \ P(d\vert c)\,P(c),$   (123)

where Bayes' rule (Equation 59) is applied in (122), and we drop the denominator in the last step because $P(d)$ is the same for all classes and does not affect the argmax.
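In code, the MAP decision rule of Equation 123 amounts to taking a maximum over classes of the (log) product of prior and likelihood. The following Python sketch is illustrative only; the prior and likelihood functions are assumed to have been estimated elsewhere, and we work in log space to avoid numerical underflow:

import math

def map_class(d, classes, prior, likelihood):
    # Pick argmax_c P(c|d) via Bayes' rule, dropping the constant P(d).
    #   prior(c)         -> P(c)      (assumed to be already estimated)
    #   likelihood(d, c) -> P(d|c)    (model specific; see Equations 124 and 125 below)
    return max(classes, key=lambda c: math.log(prior(c)) + math.log(likelihood(d, c)))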

We can interpret Equation 123 as a description of the generative process we assume in Bayesian text classification. To generate a document, we first choose class $c$ with probability $P(c)$ (the top nodes in Figures 13.4 and 13.5). The two models differ in the formalization of the second step, the generation of the document given the class, corresponding to the conditional distribution $P(d\vert c)$:

${\bf Multinomial} \quad P(d\vert c) = P(\langle t_1,\ldots,t_k,\ldots,t_{n_d}\rangle \vert c)$   (124)
${\bf Bernoulli} \quad\;\;\;\, P(d\vert c) = P(\langle e_1,\ldots,e_i,\ldots,e_M \rangle \vert c),$   (125)

where $\langle t_1,\ldots,t_{n_d}\rangle$ is the sequence of terms as it occurs in $d$ (minus terms that were excluded from the vocabulary) and $\langle e_1,\ldots,e_i,\ldots,e_M \rangle$ is a binary vector of dimensionality $M$ that indicates for each term whether it occurs in $d$ or not.

It should now be clearer why we introduced the document space $\mathbb{X}$ in Equation 112 when we defined the classification problem. A critical step in solving a text classification problem is to choose the document representation. $\langle t_1,\ldots,t_{n_d}\rangle$ and $\langle e_1,\ldots,e_M \rangle$ are two different document representations. In the first case, $\mathbb{X}$ is the set of all term sequences (or, more precisely, sequences of term tokens). In the second case, $\mathbb{X}$ is $\{0,1\}^M$.
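The difference between the two representations is easy to see in a few lines of Python. The toy vocabulary and the whitespace tokenizer below are illustrative assumptions, not part of the models themselves:

vocabulary = ["beijing", "chinese", "japan", "macao", "shanghai", "tokyo"]  # M = 6

def multinomial_representation(text):
    # Sequence of term tokens <t_1, ..., t_nd>, dropping out-of-vocabulary tokens.
    return [tok for tok in text.lower().split() if tok in vocabulary]

def bernoulli_representation(text):
    # Binary vector <e_1, ..., e_M> over the vocabulary, i.e. an element of {0,1}^M.
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]

doc = "Chinese Beijing Chinese"
print(multinomial_representation(doc))  # ['chinese', 'beijing', 'chinese']
print(bernoulli_representation(doc))    # [1, 1, 0, 0, 0, 0]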

We cannot use Equations 124 and 125 for text classification directly. For the Bernoulli model, we would have to estimate $2^M \vert\mathbb{C}\vert$ different parameters, one for each possible combination of $M$ values $e_i$ and a class. The number of parameters in the multinomial case has the same order of magnitude. Because this is a very large quantity, estimating these parameters reliably is infeasible.
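To get a feel for the size of this number, consider purely illustrative values of $M$ and $\vert\mathbb{C}\vert$:
\begin{displaymath}
M=10,\ \vert\mathbb{C}\vert=2 \quad\Rightarrow\quad 2^M\,\vert\mathbb{C}\vert = 2^{10}\cdot 2 = 2048 .
\end{displaymath}
For a realistic vocabulary of, say, $M=100{,}000$ terms, $2^M\,\vert\mathbb{C}\vert$ is astronomically larger than any training set we could hope to collect.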

To reduce the number of parameters, we make the Naive Bayes conditional independence assumption: we assume that attribute values are independent of each other given the class:

${\bf Multinomial} \quad P(d\vert c) = P(\langle t_1,\ldots,t_{n_d}\rangle \vert c) = \prod_{1 \leq k \leq n_d} P(X_k=t_k\vert c)$   (126)
${\bf Bernoulli} \quad\;\;\;\, P(d\vert c) = P(\langle e_1,\ldots,e_M \rangle \vert c) = \prod_{1 \leq i \leq M} P(U_i=e_i\vert c).$   (127)

We have introduced two random variables here to make the two different generative models explicit. $X_k$ is the random variable for position $k$ in the document and takes as values terms from the vocabulary. $P(X_k=t\vert c)$ is the probability that in a document of class $c$ the term $t$ will occur in position $k$. $U_i$ is the random variable for vocabulary term $i$ and takes as values 0 (absence) and 1 (presence). $P(U_i=1\vert c)$ is the probability that in a document of class $c$ the term $t_i$ will occur, in any position and possibly multiple times.
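Under the conditional independence assumption, the two likelihoods in Equations 126 and 127 are simple products, or sums in log space. A minimal Python sketch, assuming the conditional probability tables cond_prob[c][t] have already been estimated:

import math

def multinomial_log_likelihood(tokens, c, cond_prob):
    # log P(d|c) = sum over positions k of log P(X_k = t_k | c)   (Equation 126)
    return sum(math.log(cond_prob[c][t]) for t in tokens)

def bernoulli_log_likelihood(e, c, cond_prob, vocabulary):
    # log P(d|c) = sum over vocabulary terms i of log P(U_i = e_i | c)   (Equation 127)
    # Absent terms contribute log(1 - P(U_i = 1 | c)), which is characteristic
    # of the Bernoulli model.
    return sum(
        math.log(cond_prob[c][t] if e_i == 1 else 1.0 - cond_prob[c][t])
        for t, e_i in zip(vocabulary, e)
    )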

Figure 13.4: The multinomial NB model.

Figure 13.5: The Bernoulli NB model.

We illustrate the conditional independence assumption in Figures 13.4 and 13.5. The class China generates values for each of the five term attributes (multinomial) or six binary attributes (Bernoulli) with a certain probability, independent of the values of the other attributes. The fact that a document in the class China contains the term Taipei does not make it more likely or less likely that it also contains Beijing.

In reality, the conditional independence assumption does not hold for text data. Terms are conditionally dependent on each other. But as we will discuss shortly, NB models perform well despite the conditional independence assumption.

Even when assuming conditional independence, we still have too many parameters for the multinomial model if we assume a different probability distribution for each position $k$ in the document. The position of a term in a document by itself does not carry information about the class. Although there is a difference between China sues France and France sues China, the occurrence of China in position 1 versus position 3 of the document is not useful in NB classification because we look at each term separately. The conditional independence assumption commits us to this way of processing the evidence.

Also, if we assumed different term distributions for each position $k$, we would have to estimate a different set of parameters for each $k$. The probability of bean appearing as the first term of a coffee document could be different from it appearing as the second term, and so on. This again causes problems in estimation owing to data sparseness.

For these reasons, we make a second independence assumption for the multinomial model, positional independence: the conditional probabilities for a term are the same independent of position in the document,

\begin{displaymath}
P(X_{k_1}=t \vert c) = P(X_{k_2} = t\vert c)
\end{displaymath}   (128)

for all positions $k_1, k_2$, terms $t$, and classes $c$. Thus, we have a single distribution of terms that is valid for all positions $k_i$ and we can use $X$ as its symbol. Positional independence is equivalent to adopting the bag of words model, which we introduced in the context of ad hoc retrieval in Chapter 6.

With conditional and positional independence assumptions, we only need to estimate $\Theta(M \vert\mathbb{C}\vert)$ parameters $P(t_k\vert c)$ (multinomial model) or $P(e_i\vert c)$ (Bernoulli model), one for each term-class combination, rather than a number that is at least exponential in $M$, the size of the vocabulary. The independence assumptions reduce the number of parameters to be estimated by several orders of magnitude.
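In terms of counting, positional independence means that we pool counts over all positions instead of keeping a separate table per position. The following sketch uses made-up toy data to show the difference in the number of tables (and hence parameters) we would need:

from collections import Counter, defaultdict

docs = [(["beijing", "joins", "wto"], "china"),
        (["taipei", "joins", "wto"], "china")]   # toy training data, made up

per_position = defaultdict(Counter)   # (class, k) -> term counts: one table per position
pooled = defaultdict(Counter)         # class -> term counts: a single table per class

for tokens, c in docs:
    for k, t in enumerate(tokens):
        per_position[(c, k)][t] += 1  # number of tables grows with document length n_d
    pooled[c].update(tokens)          # one distribution, valid for all positions

print(len(per_position))  # 3 tables for this toy data (one per position)
print(len(pooled))        # 1 table per class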

To summarize, we generate a document in the multinomial model (Figure 13.4) by first picking a class $C=c$ with probability $P(c)$, where $C$ is a random variable taking values from $\mathbb{C}$. Next we generate term $t_k$ in position $k$ with probability $P(X_k=t_k\vert c)$ for each of the $n_d$ positions of the document. The $X_k$ all have the same distribution over terms for a given $c$. In the example in Figure 13.4, we show the generation of $\langle t_1,t_2,t_3,t_4,t_5\rangle = \langle \mbox{Beijing}, \mbox{and}, \mbox{Taipei}, \mbox{join}, \mbox{WTO}\rangle$, corresponding to the one-sentence document Beijing and Taipei join WTO.

For a completely specified document generation model, we would also have to define a distribution $P(n_d\vert\tcjclass)$ over lengths. Without it, the multinomial model is a token generation model rather than a document generation model.
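The generative story for the multinomial model can be written down directly in Python. This is a sketch under the assumption that the prior and conditional probability tables exist, and that the document length n_d is supplied from outside, since the model as stated has no length distribution $P(n_d\vert c)$:

import random

def generate_multinomial_doc(priors, cond_prob, n_d):
    # Pick a class C = c with probability P(c), then draw n_d tokens i.i.d.
    # from the single term distribution P(X = t | c).
    c = random.choices(list(priors), weights=list(priors.values()))[0]
    terms = list(cond_prob[c])
    weights = [cond_prob[c][t] for t in terms]
    tokens = random.choices(terms, weights=weights, k=n_d)
    return c, tokens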

We generate a document in the Bernoulli model (Figure 13.5) by first picking a class $C=c$ with probability $P(c)$ and then generating a binary indicator $e_i$ for each term $t_i$ of the vocabulary ($1 \leq i \leq M$). In the example in Figure 13.5, we show the generation of $\langle e_1,e_2,e_3,e_4,e_5,e_6\rangle = \langle 0,1,0,1,1,1\rangle$, corresponding, again, to the one-sentence document Beijing and Taipei join WTO, where we have assumed that and is a stop word.
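The Bernoulli generative story is one coin flip per vocabulary term. Again a sketch with assumed, pre-estimated probability tables:

import random

def generate_bernoulli_doc(priors, cond_prob, vocabulary):
    # Pick a class C = c with probability P(c), then generate
    # e_i ~ Bernoulli(P(U_i = 1 | c)) for every term of the vocabulary,
    # whether or not it ends up in the document.
    c = random.choices(list(priors), weights=list(priors.values()))[0]
    e = [1 if random.random() < cond_prob[c][t] else 0 for t in vocabulary]
    return c, e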


Table 13.3: Multinomial versus Bernoulli model.

                            multinomial model                                                Bernoulli model
  event model               generation of token                                              generation of document
  random variable(s)        $X=t$ iff $t$ occurs at given position                           $U_t=1$ iff $t$ occurs in doc
  document representation   $d=\langle t_1,\ldots,t_k,\ldots,t_{n_d}\rangle$, $t_k \in V$     $d=\langle e_1,\ldots,e_i,\ldots,e_M\rangle$, $e_i \in \{0,1\}$
  parameter estimation      $\hat{P}(X=t\vert c)$                                            $\hat{P}(U_i=e\vert c)$
  decision rule: maximize   $\hat{P}(c)\prod_{1 \leq k \leq n_d}\hat{P}(X=t_k\vert c)$       $\hat{P}(c)\prod_{t_i \in V}\hat{P}(U_i=e_i\vert c)$
  multiple occurrences      taken into account                                               ignored
  length of docs            can handle longer docs                                           works best for short docs
  # features                can handle more                                                  works best with fewer
  estimate for term the     $\hat{P}(X=\mbox{the}\vert c) \approx 0.05$                      $\hat{P}(U_{\mbox{the}}=1\vert c) \approx 1.0$

We compare the two models in Table 13.3 , including estimation equations and decision rules.
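One row of Table 13.3, the treatment of multiple occurrences, is easy to verify numerically. The probability value below is made up purely for illustration:

import math

p_chinese = 0.2   # assumed P(chinese|c) under both models, for illustration only

# Multinomial: every token position contributes a factor, so repetitions matter.
doc_once  = ["chinese"]
doc_three = ["chinese", "chinese", "chinese"]
print(math.prod(p_chinese for _ in doc_once))    # 0.2
print(math.prod(p_chinese for _ in doc_three))   # ~0.008, a different score

# Bernoulli: only presence or absence enters, so both documents get the same factor.
e_once  = 1 if "chinese" in doc_once else 0
e_three = 1 if "chinese" in doc_three else 0
print(p_chinese if e_once else 1 - p_chinese)    # 0.2
print(p_chinese if e_three else 1 - p_chinese)   # 0.2, identical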

Naive Bayes is so called because the independence assumptions we have just made are indeed very naive for a model of natural language. The conditional independence assumption states that features are independent of each other given the class. This is hardly ever true for terms in documents. In many cases, the opposite is true. The pairs hong and kong or london and english in Figure 13.7 are examples of highly dependent terms. In addition, the multinomial model makes an assumption of positional independence. The Bernoulli model ignores positions in documents altogether because it only cares about absence or presence. This bag-of-words model discards all information that is communicated by the order of words in natural language sentences. How can NB be a good text classifier when its model of natural language is so oversimplified?


Table 13.4: Correct estimation implies accurate prediction, but accurate prediction does not imply correct estimation.

                                                                             $c_1$      $c_2$      class selected
  true probability $P(c\vert d)$                                             0.6        0.4        $c_1$
  $\hat{P}(c)\prod_{1 \leq k \leq n_d} \hat{P}(t_k\vert c)$ (Equation 126)   0.00099    0.00001
  NB estimate $\hat{P}(c\vert d)$                                            0.99       0.01       $c_1$

The answer is that even though the probability estimates of NB are of low quality, its classification decisions are surprisingly good. Consider a document $d$ with true probabilities $P(c_1\vert d)= 0.6$ and $P(c_2\vert d)= 0.4$ as shown in Table 13.4. Assume that $d$ contains many terms that are positive indicators for $c_1$ and many terms that are negative indicators for $c_2$. Thus, when using the multinomial model in Equation 126, $\hat{P}(c_1)\prod_{1 \leq k \leq n_d} \hat{P}(t_k\vert c_1)$ will be much larger than $\hat{P}(c_2)\prod_{1 \leq k \leq n_d} \hat{P}(t_k\vert c_2)$ (0.00099 vs. 0.00001 in the table). After division by 0.001 to get well-formed probabilities for $P(c\vert d)$, we end up with one estimate that is close to 1.0 and one that is close to 0.0. This is common: the winning class in NB classification usually has a much larger probability than the other classes, and the estimates diverge very significantly from the true probabilities. But the classification decision is based on which class gets the highest score. It does not matter how accurate the estimates are. Despite the bad estimates, NB estimates a higher probability for $c_1$ and therefore assigns $d$ to the correct class in Table 13.4. Correct estimation implies accurate prediction, but accurate prediction does not imply correct estimation. NB classifiers estimate badly, but often classify well.
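The division step works out as follows with the numbers from Table 13.4:
\begin{displaymath}
\hat{P}(c_1\vert d) = \frac{0.00099}{0.00099+0.00001} = 0.99, \qquad
\hat{P}(c_2\vert d) = \frac{0.00001}{0.00099+0.00001} = 0.01 .
\end{displaymath}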

Even if it is not the method with the highest accuracy for text, NB has many virtues that make it a strong contender for text classification. It excels if there are many equally important features that jointly contribute to the classification decision. It is also somewhat robust to noise features (as defined in the next section) and concept drift, the gradual change over time of the concept underlying a class like US president from Bill Clinton to George W. Bush (see Section 13.7). A classifier like kNN can be carefully tuned to idiosyncratic properties of a particular time period, which then hurts it when documents in the following time period have slightly different properties.

The Bernoulli model is particularly robust with respect to concept drift. We will see in Figure 13.8 that it can have decent performance when using fewer than a dozen terms. Because the most important indicators for a class are less likely to change, a model that relies only on these features is more likely to maintain a certain level of accuracy under concept drift.

NB's main strength is its efficiency: training and classification can be accomplished with one pass over the data. Because it combines efficiency with good accuracy, it is often used as a baseline in text classification research. It is often the method of choice if (i) squeezing out a few extra percentage points of accuracy is not worth the trouble in a text classification application, (ii) a very large amount of training data is available and there is more to be gained from training on a lot of data than from using a better classifier on a smaller training set, or (iii) its robustness to concept drift can be exploited.
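The one-pass claim is easy to see: all the statistics the multinomial estimates need are counts that can be accumulated in a single sweep over the training documents. A Python sketch with assumed inputs (token lists paired with class labels) and simple add-one smoothing:

from collections import Counter, defaultdict

def train_nb_one_pass(labeled_docs, vocabulary):
    class_counts = Counter()             # number of documents per class
    term_counts = defaultdict(Counter)   # per class: token counts of each term
    n_docs = 0
    for tokens, c in labeled_docs:       # the only pass over the training data
        n_docs += 1
        class_counts[c] += 1
        term_counts[c].update(t for t in tokens if t in vocabulary)

    priors = {c: class_counts[c] / n_docs for c in class_counts}
    cond_prob = {}
    for c in class_counts:
        denom = sum(term_counts[c].values()) + len(vocabulary)   # add-one smoothing
        cond_prob[c] = {t: (term_counts[c][t] + 1) / denom for t in vocabulary}
    return priors, cond_prob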


Table 13.5: A set of documents for which the NB independence assumptions are problematic.
 (1) He moved from London, Ontario, to London, England.  
 (2) He moved from London, England, to London, Ontario.  
 (3) He moved from England to London, Ontario.  

In this book, we discuss NB as a classifier for text. The independence assumptions do not hold for text. However, it can be shown that NB is an optimal classifier (in the sense of minimal error rate on new data) for data where the independence assumptions do hold.


