Naive Bayes Machine Learning Algorithm Explained

Naive Bayes is a classification algorithm based on Bayes’ theorem that naively assumes independence between features and gives the same weight (degree of significance) to every feature in a dataset. In other words, the algorithm is built on the idea that no feature in the dataset is related to, or has an influence on, any other feature. Even though weight and height are somewhat linked when predicting age, for example, the algorithm treats each feature as if it were completely independent. It also considers all features to be equally important, giving the same weight or preference to each one, which means it assumes every feature has an equal influence on the target feature (which may or may not actually be the case).

For example, even if a person’s education level has a bigger impact on their wages than the number of children they have, the algorithm still treats both variables (Education and Number of Children) as equally relevant for predicting that person’s income.

Real-world datasets, however, contain features that may or may not be equally significant or independent of each other. Even so, the Naive Bayes Algorithm remains quite popular among Data Scientists because it performs very well on large datasets compared to other Machine Learning algorithms. Moreover, since the algorithm is simple to implement and runs reasonably fast, it can be used in situations such as real-time forecasting. Naive Bayes is also frequently used for text classification because competing Machine Learning algorithms are often harder to implement, require a significant amount of time to train, and cannot make predictions quickly enough for real-time use.
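To illustrate the text-classification use case, here is a minimal sketch using scikit-learn’s MultinomialNB together with CountVectorizer. The tiny spam/ham messages and labels below are made up purely for illustration; only the general pattern (vectorize the text, fit, predict) is the point.

```python
# Minimal sketch: Naive Bayes for text classification with scikit-learn.
# The toy messages and labels are hypothetical, purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win a free prize now",                # spam
    "limited offer click here",            # spam
    "meeting rescheduled to monday",       # not spam
    "please review the attached report",   # not spam
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each message into word-count features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Training is essentially counting word frequencies per class, so it is very fast.
model = MultinomialNB()
model.fit(X, labels)

# Predicting a new message is a handful of multiplications, fast enough for real time.
new_message = vectorizer.transform(["free prize offer"])
print(model.predict(new_message))        # e.g. ['spam']
print(model.predict_proba(new_message))  # class probabilities
```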

Working of Naive Bayes Algorithm Explained

  1. The Naive Bayes Algorithm starts by generating a summary of how often each class label of every feature occurs together with each class label of the target feature.
  2. This summary is then used to calculate the likelihood of one target class label occurring given a certain combination of feature class labels.
  3. That likelihood is normalised against the likelihoods of all the other class labels of the target feature, which gives the probability of a target class label occurring given a set of feature class labels.
  4. The probabilities of all target class labels sum to one, and the algorithm selects the target class label with the highest probability as its prediction for the given set of feature class labels.

That’s quite tricky to follow in the abstract, so let me show you how each of the above four points can actually be applied to a dataset.

Example Showing How the Naive Bayes Algorithm Works

Let’s take the example of a bank that needs to decide whether a person should get a loan or not. For making this decision, the bank already has details about the person’s Education Level, Income Level, and whether their previous loan was paid or not.

Based on these details, the bank wants to decide whether the person will be able to pay back the loan or not.

DataSet Table containing four columns - Number, Education Level, Income Level, Previous Loan Paid used as an example for showing working of Naive Bayes Algorithm

Let’s now see, step by step, how the Naive Bayes Algorithm can be applied to the above dataset.

Features in the DataSet and their Class Labels - Naive Bayes Algorithm
💡 Step 1
The Naive Bayes Algorithm starts by generating a summary of how often each class label of every feature occurs together with each class label of the target feature.

Now that we have identified the class labels in the dataset, the next task is to count the occurrences of each class label of a feature with respect to the class labels of the target feature.

Let’s start with the “Education Level” feature, which has four class labels – Undergraduate, Postgraduate, High School, None. Since the occurrences need to be counted with respect to the class labels of the target feature “Previous Loan Paid”, we also need to consider its class labels, which are Yes and No.

So we have 4 class labels of the “Education Level” feature whose occurrences need to be counted against the 2 class labels (Yes, No) of “Previous Loan Paid”. That means taking one class label of “Education Level”, counting how many times it occurs with Yes and how many times with No, doing the same for every class label of “Education Level”, and putting the results together in a table like the one shown below.

Occurrences of class labels(Undergraduate, Postgraduate, High School, None) in "Education Level" feature of dataset - Naive Bayes Algorithm

Do this for all of the features in the dataset, including the target feature.

Occurrences of class labels(Low, Medium, High) in "Income Level" feature of dataset - Naive Bayes Algorithm
Occurrences of class labels(Yes, No) in "Previous Loan Paid" feature of dataset which is Target Feature for given dataset - Naive Bayes Algorithm
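These occurrence tables are simple frequency counts, so they can be generated in a couple of lines with pandas. The sketch below runs on a tiny hypothetical dataset (the rows are made up and do not reproduce the article’s table); the same `pd.crosstab` call would work on any loan DataFrame with these columns.

```python
# Step 1 sketch: count occurrences of each feature class label per target class label.
# The rows below are hypothetical, not the article's actual dataset.
import pandas as pd

df = pd.DataFrame({
    "Education Level":    ["Undergraduate", "Postgraduate", "High School", "Undergraduate", "None"],
    "Income Level":       ["Medium", "High", "Low", "Medium", "Low"],
    "Previous Loan Paid": ["Yes", "Yes", "No", "Yes", "No"],
})

# One occurrence table per feature: rows are feature class labels,
# columns are the target class labels (Yes / No).
for feature in ["Education Level", "Income Level"]:
    print(pd.crosstab(df[feature], df["Previous Loan Paid"]), "\n")

# Occurrences of the target feature's own class labels.
print(df["Previous Loan Paid"].value_counts())
```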
💡 Step 2
This summary is then used to calculate the likelihood of one target class label occurring given a certain combination of feature class labels.

Let’s say we want to calculate the probability of (Previous Loan Paid = Yes) given that the person has an Undergraduate Education Level and a Medium level of income.

Here we want to calculate P[Yes | Undergraduate, Medium], which in the language of mathematics is read as the probability of Previous Loan Paid being Yes given that the person has an Undergraduate education and a Medium level of income.

This can be calculated using Bayes’ Theorem, which is why the algorithm is called the Naive Bayes Algorithm.

As per Bayes’ Theorem:

$$P\left[A_{1} \mid E\right]=\frac{\text{Likelihood}\left[A_{1} \mid E\right]}{\text{Likelihood}\left[A_{1} \mid E\right]+\text{Likelihood}\left[A_{2} \mid E\right]+\cdots+\text{Likelihood}\left[A_{n} \mid E\right]}$$

$$\text{Likelihood}\left[A_{1} \mid E\right]=\text{Likelihood}\left[A_{1} \mid E_{1}\right] \times \text{Likelihood}\left[A_{1} \mid E_{2}\right] \times \cdots \times \text{Likelihood}\left[A_{1} \mid E_{n}\right] \times \text{Likelihood}\left[A_{1}\right]$$
  • Here A1 is an event (a class label of the target feature)
  • E is the set of evidence features, containing individual features E1, E2, …, En

In the example I am describing here, A1 = Yes and E = {Undergraduate, Medium}.

$$P\left[\text{Yes} \mid \text{Undergraduate}, \text{Medium}\right]=\frac{\text{Likelihood}\left[\text{Yes} \mid \text{Undergraduate}, \text{Medium}\right]}{\text{Likelihood}\left[\text{Yes} \mid \text{Undergraduate}, \text{Medium}\right]+\text{Likelihood}\left[\text{No} \mid \text{Undergraduate}, \text{Medium}\right]}$$

$$\text{Likelihood}\left[\text{Yes} \mid \text{Undergraduate}, \text{Medium}\right]=\text{Likelihood}\left[\text{Yes} \mid \text{Undergraduate}\right] \times \text{Likelihood}\left[\text{Yes} \mid \text{Medium}\right] \times \text{Likelihood}\left[\text{Yes}\right]$$

$$\text{Likelihood}\left[\text{No} \mid \text{Undergraduate}, \text{Medium}\right]=\text{Likelihood}\left[\text{No} \mid \text{Undergraduate}\right] \times \text{Likelihood}\left[\text{No} \mid \text{Medium}\right] \times \text{Likelihood}\left[\text{No}\right]$$

Likelihood [Yes | Undergraduate] = 2/9
Likelihood [Yes | Medium] = 4/9
Likelihood [Yes] = 9/15
Likelihood [Yes | Undergraduate, Medium] = 2/9 * 4/9 * 9/15 = 0.059
Likelihood [Yes | Undergraduate, Medium] = 0.059

Likelihood [No | Undergraduate] = 2/6
Likelihood [No | Medium] = 1/6
Likelihood [No] = 6/15
Likelihood [No | Undergraduate, Medium] = 2/6 * 1/6 * 6/15 = 0.022
Likelihood [No | Undergraduate, Medium] = 0.022

P[Yes | Undergraduate, Medium] = 0.059/(0.059 + 0.022) ≈ 0.73 = 73%
P[No | Undergraduate, Medium] = 1 – P[Yes | Undergraduate, Medium] = 1 – 0.73 = 0.27 = 27%
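The same arithmetic can be checked with a few lines of Python. The counts (2/9, 4/9, 9/15, and so on) are taken directly from the occurrence tables above; only the variable names are mine.

```python
# Steps 2 and 3 sketch: likelihoods from the occurrence tables, then normalisation.
# Counts come from the article's tables; variable names are just for illustration.

# Likelihoods for Previous Loan Paid = Yes
p_under_given_yes = 2 / 9      # Likelihood[Yes | Undergraduate]
p_medium_given_yes = 4 / 9     # Likelihood[Yes | Medium]
p_yes = 9 / 15                 # Likelihood[Yes]
likelihood_yes = p_under_given_yes * p_medium_given_yes * p_yes   # ≈ 0.059

# Likelihoods for Previous Loan Paid = No
p_under_given_no = 2 / 6       # Likelihood[No | Undergraduate]
p_medium_given_no = 1 / 6      # Likelihood[No | Medium]
p_no = 6 / 15                  # Likelihood[No]
likelihood_no = p_under_given_no * p_medium_given_no * p_no       # ≈ 0.022

# Step 3: normalise so the two probabilities sum to one.
total = likelihood_yes + likelihood_no
print(round(likelihood_yes / total, 2))   # ≈ 0.73
print(round(likelihood_no / total, 2))    # ≈ 0.27
```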

In conclusion, the probability of the loan being paid back is about 73% if a person has an Undergraduate education and a Medium level of income (between $50,000 and $100,000 per year).

So whenever a new loan application is submitted to the bank by a person with an Undergraduate education and a Medium level of income, there is a 73% chance that the person will pay back the loan. If these chances were lower, say 50% or less, the bank should reject that person’s loan application.

This is how the Naive Bayes Algorithm works. I’ve explained it using a fairly simple example so that it’s easy to understand.

Calculation of Likelihood, Probability - Naive Bayes Algorithm
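Putting the four steps together, here is a minimal from-scratch sketch of a categorical Naive Bayes classifier in Python. The function names, the helper structures, and the tiny training set are all made up for illustration; it is not the exact code behind any library, just the counting-and-multiplying idea described above.

```python
# Minimal from-scratch sketch of categorical Naive Bayes (illustrative only).
from collections import Counter, defaultdict

def train(rows, target):
    """Step 1: summarise occurrences of every feature value per target class label."""
    class_counts = Counter(row[target] for row in rows)
    # counts[feature][(value, class_label)] -> number of occurrences
    counts = defaultdict(Counter)
    for row in rows:
        for feature, value in row.items():
            if feature != target:
                counts[feature][(value, row[target])] += 1
    return class_counts, counts

def predict(class_counts, counts, example):
    """Steps 2-4: likelihood per class, normalise, pick the most probable class."""
    total_rows = sum(class_counts.values())
    likelihoods = {}
    for label, label_count in class_counts.items():
        likelihood = label_count / total_rows                       # Likelihood[label]
        for feature, value in example.items():
            # A real implementation would add Laplace smoothing here to avoid zero counts.
            likelihood *= counts[feature][(value, label)] / label_count
        likelihoods[label] = likelihood
    total = sum(likelihoods.values())
    probabilities = {label: lk / total for label, lk in likelihoods.items()}
    return max(probabilities, key=probabilities.get), probabilities

# Hypothetical toy data, same shape as the article's example.
rows = [
    {"Education": "Undergraduate", "Income": "Medium", "Paid": "Yes"},
    {"Education": "Postgraduate",  "Income": "High",   "Paid": "Yes"},
    {"Education": "High School",   "Income": "Low",    "Paid": "No"},
    {"Education": "Undergraduate", "Income": "Low",    "Paid": "No"},
    {"Education": "Postgraduate",  "Income": "Medium", "Paid": "Yes"},
]
class_counts, counts = train(rows, target="Paid")
print(predict(class_counts, counts, {"Education": "Undergraduate", "Income": "Medium"}))
```

A production implementation would add Laplace smoothing so that an unseen feature value does not force a likelihood of zero, but the structure above mirrors the four steps exactly.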

Key Characteristics of Naive Bayes Algorithm

  1. Computational Efficiency – Training time is linear in the number of training examples (instances in the training dataset) and the number of attributes (features in the dataset). Classification time is linear in the number of features and is unaffected by the number of training examples.
  2. Low Variance – Because the Naive Bayes Algorithm does not utilise search, it has low variance but high bias.
  3. Incremental Learning – The Naive Bayes Algorithm operates from estimates of low-order probabilities that are derived from the training data. These can readily be updated as new training data is acquired by the model (see the sketch after this list).
  4. Robustness in the face of noise – The Naive Bayes Algorithm always uses all attributes (features in the dataset) for all predictions and is therefore relatively insensitive to noise in the examples being classified. Because it uses probabilities, it is also relatively insensitive to noise in the training data.
  5. Robustness in the face of missing values – Because the Naive Bayes Algorithm always uses all attributes for all predictions, if one attribute value is missing, information from the other attributes is still used, resulting in graceful degradation of performance. Its probabilistic framework also makes it relatively insensitive to missing attribute values in the training data.
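To illustrate point 3, incremental learning with Naive Bayes is just a matter of updating the stored counts (and therefore the probability estimates) as new rows arrive, without retraining from scratch. Here is a minimal sketch: the starting counts reuse the article’s tables (9 Yes, 6 No, and the Undergraduate row), while the `update` helper and the new applicant’s record are hypothetical.

```python
# Incremental learning sketch: the stored counts are simply incremented
# when a new training example arrives (the update helper and new row are hypothetical).
from collections import Counter, defaultdict

# State learned so far: target class counts and per-feature occurrence counts.
class_counts = Counter({"Yes": 9, "No": 6})
feature_counts = defaultdict(Counter)
feature_counts["Education Level"][("Undergraduate", "Yes")] = 2
feature_counts["Education Level"][("Undergraduate", "No")] = 2

def update(class_counts, feature_counts, new_row, target="Previous Loan Paid"):
    """Fold one new training example into the existing counts."""
    label = new_row[target]
    class_counts[label] += 1
    for feature, value in new_row.items():
        if feature != target:
            feature_counts[feature][(value, label)] += 1

# A new applicant's record becomes available later on.
update(class_counts, feature_counts,
       {"Education Level": "Undergraduate", "Income Level": "Medium",
        "Previous Loan Paid": "Yes"})

print(class_counts["Yes"])                                           # 10
print(feature_counts["Education Level"][("Undergraduate", "Yes")])   # 3
```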

