Probability for Machine Learning: Naive Bayes Classifier Algorithm
The naive Bayes classifier is a supervised learning algorithm that is based on Bayes' theorem and used for solving classification problems. It is mainly used in text classification, which typically involves a high-dimensional training dataset. It is one of the simplest and most effective classification algorithms, and it helps build fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it makes predictions based on the probability that an object belongs to each class.
Naive Bayes Assumption
The naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. That means that the algorithm assumes that the presence or absence of a feature does not affect the presence or absence of any other feature. For example, if we want to classify an email as spam or not spam based on its words, we assume that each word is independent of any other word in the email.
This assumption is often unrealistic in real-world data, but it simplifies the computation and often works well in practice.
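In symbols, for a data point with n features and a class C, the assumption means the likelihood of all the features together factors into a product of per-feature probabilities:
$$P(X|C) = P(x_1, x_2, \dots, x_n|C) = \prod_{i=1}^{n} P(x_i|C)$$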
Naive Bayes Classifier
A naive Bayes classifier is a machine learning model that applies Bayes' theorem with the naive assumption to predict the class of a given data point. It can be represented as:
$$P(C|X) = \frac{P(X|C)P(C)}{P(X)}$$
Where,
P(C|X) is the posterior probability: the probability of class C given the features X.
P(X|C) is the likelihood probability: the probability of observing features X given class C.
P(C) is the prior probability: the probability of class C in the data.
P(X) is the marginal probability: the probability of features X in the data.
The naive Bayes classifier predicts the class that has the highest posterior probability for a given data point. To do so, it needs to estimate the likelihood and prior probabilities from the training data.
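Dropping the denominator P(X), which is the same for every class, and applying the naive independence assumption, the predicted class can be written as:
$$\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i|C)$$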
There are different types of naive Bayes classifiers depending on how they estimate the likelihood probabilities. Some common types are:
Gaussian naive Bayes: assumes that each feature follows a normal distribution.
Multinomial naive Bayes: assumes that each feature follows a multinomial distribution, suitable for discrete counts such as word frequencies.
Bernoulli naive Bayes: assumes that each feature follows a Bernoulli distribution, suitable for binary features such as presence or absence of a word.
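As a quick illustration, the scikit-learn library provides these three variants in its `sklearn.naive_bayes` module. The snippet below is a minimal sketch, with a made-up toy count matrix, of fitting the multinomial variant; it is not part of the worked example that follows.

```python
# Minimal sketch of scikit-learn's naive Bayes variants; the toy data is made up.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_counts = np.array([[2, 1, 0], [0, 1, 3], [1, 0, 0], [0, 2, 2]])  # word counts
y = np.array([0, 1, 0, 1])  # class labels

# MultinomialNB suits count features; GaussianNB suits continuous features
# and BernoulliNB binary presence/absence features.
clf = MultinomialNB(alpha=1.0)  # alpha=1.0 corresponds to Laplace smoothing
clf.fit(X_counts, y)
print(clf.predict(np.array([[1, 1, 0]])))  # -> [0] for this toy input
```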
Example
Let us consider an example of text classification using multinomial naive Bayes. Suppose we have a dataset of movie reviews and their sentiments (positive or negative). We want to build a classifier that can predict the sentiment of a new review based on its words.
Here is a sample of our dataset:
| Review | Sentiment |
|---|---|
| I loved this movie | Positive |
| This movie was boring | Negative |
| It was an average movie | Negative |
| The movie had great actors | Positive |
| I hated this movie | Negative |
To apply naive Bayes, we need to convert each review into a vector of word frequencies. For example, if we use only four words (loved, boring, great, hated) as our features, then we can represent each review as:
| Review | loved | boring | great | hated | Sentiment |
|---|---|---|---|---|---|
| I loved this movie | 1 | 0 | 0 | 0 | Positive |
| This movie was boring | 0 | 1 | 0 | 0 | Negative |
| It was an average movie | 0 | 0 | 0 | 0 | Negative |
| The movie had great actors | 0 | 0 | 1 | 0 | Positive |
| I hated this movie | 0 | 0 | 0 | 1 | Negative |
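As a rough sketch, the counting step above can be reproduced in plain Python (the function and variable names here are just illustrative):

```python
# Turn each review into a vector of counts over a fixed four-word vocabulary.
vocabulary = ["loved", "boring", "great", "hated"]

reviews = [
    ("I loved this movie", "Positive"),
    ("This movie was boring", "Negative"),
    ("It was an average movie", "Negative"),
    ("The movie had great actors", "Positive"),
    ("I hated this movie", "Negative"),
]

def vectorize(text):
    """Count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return [words.count(word) for word in vocabulary]

for text, sentiment in reviews:
    print(vectorize(text), sentiment)
# [1, 0, 0, 0] Positive
# [0, 1, 0, 0] Negative
# [0, 0, 0, 0] Negative
# [0, 0, 1, 0] Positive
# [0, 0, 0, 1] Negative
```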
Now, we can estimate the prior and likelihood probabilities from the data. The prior probabilities are simply the proportions of each class in the data. For example, the probability of positive sentiment is:
$$P(Positive) = \frac{2}{5} = 0.4$$
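Similarly, the probability of negative sentiment, which we will need later, is:
$$P(Negative) = \frac{3}{5} = 0.6$$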
The likelihood probabilities are the conditional probabilities of each feature given each class. For multinomial naive Bayes, they are estimated from word counts: the count of a word in reviews of a class divided by the total count of vocabulary words in that class. For example, the word "loved" accounts for one of the two vocabulary-word occurrences in positive reviews, so:
$$P(loved|Positive) = \frac{1}{2} = 0.5$$
We can use Laplace smoothing to avoid zero probabilities for unseen features. Laplace smoothing adds a small constant (usually 1) to each count and adds the constant times the vocabulary size (here 4) to the denominator. For example, the probability of the word "boring" given positive sentiment with Laplace smoothing is:
$$P(boring|Positive) = \frac{0 + 1}{2 + 4} = \frac{1}{6}$$
We can calculate all the likelihood probabilities in a similar way and store them in a table. Here is the table for our example:
| Word | P(Word \| Positive) | P(Word \| Negative) |
|---|---|---|
| loved | 2/6 | 1/6 |
| boring | 1/6 | 2/6 |
| great | 2/6 | 1/6 |
| hated | 1/6 | 2/6 |
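The same estimates can be reproduced with a short script. This is a minimal sketch under the assumptions used above (the four-word vocabulary, the count vectors from the earlier table, and a Laplace constant of 1), not a general-purpose implementation:

```python
# Estimate priors and Laplace-smoothed likelihoods from the count vectors above.
from fractions import Fraction

vocabulary = ["loved", "boring", "great", "hated"]
alpha = 1  # Laplace smoothing constant

X = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
y = ["Positive", "Negative", "Negative", "Positive", "Negative"]

classes = sorted(set(y))
priors = {c: Fraction(y.count(c), len(y)) for c in classes}

likelihoods = {}
for c in classes:
    rows = [x for x, label in zip(X, y) if label == c]
    counts = [sum(column) for column in zip(*rows)]  # per-word counts in class c
    total = sum(counts)
    likelihoods[c] = {
        word: Fraction(counts[i] + alpha, total + alpha * len(vocabulary))
        for i, word in enumerate(vocabulary)
    }

print(priors)                   # Negative 3/5, Positive 2/5
print(likelihoods["Positive"])  # loved 1/3, boring 1/6, great 1/3, hated 1/6 (i.e. 2/6, 1/6, 2/6, 1/6)
print(likelihoods["Negative"])  # loved 1/6, boring 1/3, great 1/6, hated 1/3
```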
Now, we are ready to make predictions for new reviews. Suppose we have a new review: "This movie was great". To predict its sentiment, we need to calculate the posterior probabilities for both classes and compare them. We can use the naive Bayes formula as:
$$P(Positive|Review) = \frac{P(Review|Positive)P(Positive)}{P(Review)} $$
$$P(Negative|Review) = \frac{P(Review|Negative)P(Negative)}{P(Review)}$$
We can ignore the marginal probability P(Review) since it is the same for both classes. We can also work with logarithms to avoid underflow from multiplying many small probabilities; the logarithm turns products into sums. Dropping P(Review) and taking logs, we compare the two class scores:
$$score(Positive) = \log P(Review|Positive) + \log P(Positive)$$
$$score(Negative) = \log P(Review|Negative) + \log P(Negative)$$
Under the naive assumption, P(Review|Class) is the product of the likelihoods of the vocabulary words that appear in the review. The only vocabulary word in "This movie was great" is "great", so plugging in the values from our data and table, we get:
$$score(Positive) = \log P(great|Positive) + \log P(Positive) = \log\frac{2}{6} + \log 0.4 \approx -1.10 - 0.92 = -2.02$$
$$score(Negative) = \log P(great|Negative) + \log P(Negative) = \log\frac{1}{6} + \log 0.6 \approx -1.79 - 0.51 = -2.30$$
Since the score for Positive (-2.02) is greater than the score for Negative (-2.30), we predict that the review has a positive sentiment.
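Putting the pieces together, here is a sketch of the scoring step using the priors and likelihoods estimated above (the `log_score` helper is just an illustrative name):

```python
# Score a new review for each sentiment using unnormalized log-posteriors.
import math

vocabulary = ["loved", "boring", "great", "hated"]
priors = {"Positive": 2 / 5, "Negative": 3 / 5}
likelihoods = {
    "Positive": {"loved": 2 / 6, "boring": 1 / 6, "great": 2 / 6, "hated": 1 / 6},
    "Negative": {"loved": 1 / 6, "boring": 2 / 6, "great": 1 / 6, "hated": 2 / 6},
}

def log_score(review, sentiment):
    """Log prior plus the log likelihood of every vocabulary word in the review."""
    score = math.log(priors[sentiment])
    for word in review.lower().split():
        if word in vocabulary:
            score += math.log(likelihoods[sentiment][word])
    return score

review = "This movie was great"
scores = {c: log_score(review, c) for c in priors}
print(scores)                       # Positive ≈ -2.02, Negative ≈ -2.30
print(max(scores, key=scores.get))  # Positive
```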
Conclusion
The naive Bayes classifier is a simple yet powerful machine learning algorithm that can be used for various classification tasks. It is easy to implement and fast to train and test, it handles high-dimensional data well, and it can cope with missing values and noise. However, it also has some limitations, such as the naive assumption of feature independence and sensitivity to imbalanced data and the zero-frequency problem.
The naive Bayes classifier is a good choice for text classification, spam filtering, sentiment analysis, document categorization, and many other applications that involve discrete features and categorical outcomes.