Introduction to Machine Learning (101 Guide)

In short, Machine Learning (ML) is a subset of Artificial Intelligence (AI) comprising a large variety of algorithms that can learn from data without being manually programmed to accomplish a specific task. ML algorithms are designed to find patterns in input data so that those patterns can be used to make informed predictions in the future.

Types of Machine Learning

Broadly speaking, there are two types of Machine Learning – Supervised Machine Learning and Unsupervised Machine Learning. There also exists a third type called Semi-Supervised Machine Learning, which is a combination of the two.

What is Supervised Machine Learning?

Supervised Machine Learning is the process of determining the relationship between a given set of features (or variables) and a target value, also known as a label or class. In other words, it means building ML models that take in certain input data and output a predicted value.

Let’s understand this with an example – "Should a bank give a house loan to an applicant or not?" based upon the information the applicant submits to the bank. Let’s assume that in the house loan application, the applicant submitted their Age, Sex, Education Level, Income Level, Marital Status, Demographics, and whether Previous Loans were paid.

| Age | Sex | Education Level | Income Level (per year) | Marital Status | Demographics | Previous Loan Paid |
|-----|-----|-----------------|-------------------------|----------------|--------------|--------------------|
| 30 | Female | Undergraduate | 98,000 USD | Yes | New York | Yes |
| 27 | Male | Undergraduate | 120,000 USD | Yes | Austin | No |
| 29 | Female | High School | 67,000 USD | No | LA | Yes |
| 45 | Male | None | 54,000 USD | Yes | Baltimore | No |
| 24 | Female | None | 43,000 USD | No | Georgia | No |
| 64 | Male | Undergraduate | 180,000 USD | No | Bay Area | Yes |

Based upon this data, a Supervised Machine Learning model can be trained to provide a yes or no answer to the question "Should the loan be given to the applicant?" for new applicants.
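
As a rough, hedged illustration of this idea (not code from the article), the sketch below encodes a tiny dataset mirroring the table above and fits a scikit-learn decision tree; the `approve_loan` label column and the library choice are assumptions made purely for the example.

```python
# Minimal sketch: training a loan-approval classifier on table-like data.
# The data and the "approve_loan" label column are illustrative, not from the article.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "age": [30, 27, 29, 45, 24, 64],
    "sex": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "education_level": ["Undergraduate", "Undergraduate", "High School",
                        "None", "None", "Undergraduate"],
    "income_per_year": [98_000, 120_000, 67_000, 54_000, 43_000, 180_000],
    "married": ["Yes", "Yes", "No", "Yes", "No", "No"],
    "city": ["New York", "Austin", "LA", "Baltimore", "Georgia", "Bay Area"],
    "previous_loan_paid": ["Yes", "No", "Yes", "No", "No", "Yes"],
    # Hypothetical target column: did the bank approve the loan?
    "approve_loan": ["Yes", "Yes", "Yes", "No", "No", "Yes"],
})

# Encode the categorical columns as numbers (one-hot encoding).
X = pd.get_dummies(data.drop(columns=["approve_loan"]))
y = data["approve_loan"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Predict for a "new" applicant encoded with the same columns.
new_applicant = X.iloc[[0]]          # reuse one row's encoding for illustration
print(model.predict(new_applicant))  # -> ["Yes"] or ["No"]
```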

Supervised Machine Learning models can be further divided into Classification Tasks and Regression Tasks.

Classification Tasks in Supervised Machine Learning

Classification tasks are used to build models out of data with discrete categories as labels. For example – a classification task can be used to predict whether a person will pay back a loan or not. There can be more than two discrete categories, such as predicting the finishing position of a horse in a race, but they must be a finite number.

[Figure: four coloured groups of data points – purple, yellow, blue and brown]

In the above image, the Machine Learning model is classifying each observation in the dataset into one of four coloured categories.
Most classification tasks output the prediction as the probability of an instance belonging to each output label; the assigned label is the one with the highest probability.

Supervised Classification Machine Learning Algorithms

  1. Decision Trees – This algorithm follows a tree-like architecture that simulates a decision process as a series of decisions, considering one variable at a time.
  2. Naive Bayes Classifier – This algorithm relies on a group of probabilistic equations based on Bayes’ theorem, which assumes independence among features. It has the ability to consider several attributes.
  3. Artificial Neural Networks (ANNs) – These replicate the structure and performance of a biological neural network to perform pattern-recognition tasks. An ANN consists of interconnected neurons, laid out with a set architecture, which pass information to one another until a result is achieved. (A short sketch fitting all three classifier families appears below this list.)
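
To make the list above concrete, here is a minimal sketch (assuming scikit-learn and a synthetic dataset) that fits one classifier of each kind – a decision tree, a naive Bayes classifier, and a small neural network:

```python
# Sketch: fitting the three classifier families above on synthetic data.
# Dataset and parameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Neural Network (ANN)": MLPClassifier(max_iter=1000, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```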

Regression Tasks in Supervised Machine Learning

Regression tasks are used for data with continuous quantities as labels. For example – a regression task can be used for predicting house prices. This means the label is represented by a quantity rather than by a fixed set of possible outputs. Output labels can be of integer or float types.

Some commonly used Supervised Machine Learning algorithms for Regression Tasks are Linear Regression, Regression Trees, Support Vector Regression, Artificial Neural Networks (ANNs).
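
A minimal sketch of a regression task, assuming scikit-learn and a synthetic dataset with a continuous label:

```python
# Sketch: a regression task with a continuous label (e.g. a price).
# Synthetic data via scikit-learn; values are illustrative only.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

print("Predicted values:", model.predict(X_test[:3]))
print("Actual values:   ", y_test[:3])
```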

Evaluating Performance of a Supervised Machine Learning Model

Model evaluation is essential for developing models that perform well not just on the data used to train them, but also on data they have not seen yet. When dealing with supervised machine learning problems, assessing the model is particularly simple because the ground truth is already known and can be compared to the model's predictions.

A model will eventually be applied to unseen data that has no label to compare against, so knowing how much to trust its predictions is essential. For example, a model with an accuracy of 95% may lead you to believe that the chances of making an accurate forecast are great and that the model should be considered trustworthy. But that assumption can be wrong unless you understand what the "accuracy" metric actually implies for your problem. A performance metric for a Supervised Machine Learning model should therefore be selected on a case-by-case basis: a metric that is informative for one model or problem can be misleading for another. So be careful while selecting a specific metric to measure the performance of a model.

Evaluating a model's performance should be done on two types of datasets – a validation dataset to fine-tune the model and a testing dataset to evaluate how well the model will function when applied to data it has never seen.
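
One common way to obtain these two datasets (not the only one) is to split the available data three ways; the 60/20/20 proportions below are an arbitrary choice for illustration:

```python
# Sketch: splitting data into training, validation and testing sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First split off 20% as the test set, then 25% of the rest as validation
# (0.25 * 0.8 = 0.2 of the original data).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```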

Metrics used for Measuring Performance of a Supervised Classification Machine Learning Model

Confusion Matrix

A Confusion Matrix is a table that contains information about the performance of a model. In a confusion matrix, columns represent instances of a predicted class, while rows represent instances that actually belong to a class.

Let’s understand the confusion matrix by taking the example of a model that predicts which images in a dataset of 1,000 images (500 of which are actually of dogs) are dog images.

| Actual 👇 \ Prediction 👉 | Dog Image | Not Dog Image |
|---------------------------|-----------|---------------|
| Dog Image | 270 | 230 |
| Not Dog Image | 150 | 350 |

Each cell in a confusion matrix can be classified as True Positives (TP), False Positives (FP), True Negatives (TN), or False Negatives (FN).

| Cell of Confusion Matrix | Description | Example |
|--------------------------|-------------|---------|
| True Positives (TP) | Instances that the model correctly classified as positive | Correctly classifying an image of a dog as a dog image |
| False Positives (FP) | Instances that the model incorrectly classified as positive | Images of other animals being classified as dog images by the model |
| True Negatives (TN) | Instances that the model correctly classified as negative | Images of other animals being classified as not images of a dog by the model |
| False Negatives (FN) | Instances that the model incorrectly classified as negative | Images of a dog being classified as not images of a dog by the model |

Accuracy

The accuracy of a model measures its capability to correctly classify all instances. It can be calculated by summing the number of True Positives (TP) and True Negatives (TN) and then dividing by the total number of instances.

Accuracy = (TP + TN)/Total Number of Instances

Precision

Precision measures the model's ability to correctly classify positive labels by comparing the correctly predicted positives with the total number of instances predicted as positive.

Precision can be calculated by taking the ratio of True Positives (TP) to the sum of True Positives (TP) and False Positives (FP).

Precision = TP / (TP + FP)

Recall

Recall measures the number of correctly predicted positive labels against all actual positive labels.

Recall can be calculated by taking the ratio of True Positives (TP) to the sum of True Positives (TP) and False Negatives (FN).

Recall = TP / (TP + FN)
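
Plugging the numbers from the dog-image confusion matrix above (TP = 270, FN = 230, FP = 150, TN = 350) into these formulas gives the following; the short sketch simply carries out that arithmetic:

```python
# Accuracy, precision and recall from the dog-image confusion matrix above.
TP, FN = 270, 230   # actual dog images: predicted dog / predicted not-dog
FP, TN = 150, 350   # actual non-dog images: predicted dog / predicted not-dog

total = TP + TN + FP + FN                 # 1000 instances
accuracy = (TP + TN) / total              # 620 / 1000 = 0.62
precision = TP / (TP + FP)                # 270 / 420  ≈ 0.643
recall = TP / (TP + FN)                   # 270 / 500  = 0.54

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```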

Metrics used for Measuring Performance of a Supervised Regression Machine Learning Model

Considering that regression tasks are those where the final output is continuous rather than categorical, the performance of the model can be measured by comparing the predicted values with the actual values.

For example – the performance of a Supervised Regression Machine Learning model that predicts the price of a house in a locality can be measured by comparing the actual price of the house with the price predicted by the model.

Let's say the actual price of a house in a locality is 700,000 USD and our model predicts 699,999 USD, which is very close. For this instance the difference between the predicted and actual value is quite low, so the model looks accurate.

For measuring this difference between predicted and actual values Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE) can be used.

Mean Absolute Error (MAE)

It measures the average absolute difference between the predicted and actual values, without taking into account the direction of the error.

    \[ MAE=\frac{1}{m}\sum_{i=1}^{m}\left|y_{i}-\hat{y}_{i}\right| \]

  • m = Number of total instances
  • yi = Actual Value
  • ŷi = Predicted Value

| Actual House Price (yi) | Predicted House Price (ŷi) | Actual – Predicted (yi – ŷi) |
|-------------------------|----------------------------|------------------------------|
| 500,000 USD | 499,012 USD | 988 USD |
| 650,000 USD | 590,918 USD | 59,082 USD |
| 839,193 USD | 832,039 USD | 7,154 USD |
| 127,092 USD | 120,043 USD | 7,049 USD |
| 983,028 USD | 980,832 USD | 2,196 USD |

    \[ \sum_{i=1}^{m}\left|y_{i}-\hat{y}_{i}\right| = 76{,}469 \text{ USD} \]

MAE = 76,469 / 5 = 15,293.8 USD

Root Mean Square Error (RMSE)

It measures the average magnitude of the error between the actual and predicted values. It can be calculated by taking the square root of the average of the squared differences between the actual and predicted values.

    \[ RMSE=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_{i}-\hat{y}_{i}\right)^{2}} \]

  • m = Number of total instances
  • yi = Actual Value
  • ŷi = Predicted Value

| Actual House Price (yi) | Predicted House Price (ŷi) | Actual – Predicted (yi – ŷi) | Squared Error (yi – ŷi)², in USD² |
|-------------------------|----------------------------|------------------------------|-----------------------------------|
| 500,000 USD | 499,012 USD | 988 USD | 976,144 |
| 650,000 USD | 590,918 USD | 59,082 USD | 3,490,682,724 |
| 839,193 USD | 832,039 USD | 7,154 USD | 51,179,716 |
| 127,092 USD | 120,043 USD | 7,049 USD | 49,688,401 |
| 983,028 USD | 980,832 USD | 2,196 USD | 4,822,416 |

    \[ \sum_{i=1}^{m}\left(y_{i}-\hat{y}_{i}\right)^{2} = 3{,}597{,}349{,}401 \text{ USD}^{2} \]

RMSE = √(3,597,349,401 / 5) = √719,469,880.2 ≈ 26,822.94 USD
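
The MAE and RMSE values above can be reproduced with a few lines of NumPy applied to the five house prices from the tables:

```python
# Reproducing the MAE and RMSE values from the house-price tables above.
import numpy as np

actual    = np.array([500_000, 650_000, 839_193, 127_092, 983_028], dtype=float)
predicted = np.array([499_012, 590_918, 832_039, 120_043, 980_832], dtype=float)

errors = actual - predicted
mae  = np.mean(np.abs(errors))        # 76,469 / 5 = 15,293.8
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt(719,469,880.2) ≈ 26,822.94

print("MAE: ", mae)
print("RMSE:", rmse)
```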

Which metric – MAE or RMSE – should be used for measuring the performance of a Supervised Regression Machine Learning Model?

Both Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) express the average error, in a range from 0 to infinity, where the lower the value, the better the performance of the model. The main difference between these two metrics is that MAE assigns the same weight to all errors, while RMSE squares the errors, assigning a higher weight to larger errors.

The RMSE metric is especially useful in cases where larger errors should be penalised, meaning that outliers are taken into account in the measurement of performance. For example – RMSE can be used when a value that is off by 4 should count as more than twice as bad as a value that is off by 2. MAE, on the other hand, is used when a value that is off by 4 is just twice as bad as a value that is off by 2.

What is Unsupervised Machine Learning?

Unsupervised Machine Learning consists of fitting the model to the data without any relationship to an output label, also known as unlabelled data. This means that Unsupervised Machine Learning algorithms try to understand the data and find patterns in it. For example – Unsupervised Machine Learning can be used for profiling the students of a school into different categories based upon their age.

[Figure: assigning labels to children based upon their age]

Unsupervised Machine Learning is further divided based upon the type of task being performed. One common type is clustering tasks.

Clustering in Unsupervised Machine Learning

Clustering is a type of Unsupervised Machine Learning technique where the objective is to arrive at conclusions based on the patterns found within unlabelled input data. This technique is mainly used to segregate large data into subgroups in order to make informed decisions.

Clustering tasks involve creating groups of data while complying with the condition that instances in one group differ visibly from the instances in other groups. The output of any clustering algorithm is a label, which assigns each instance to the cluster of that label.

[Figure: one large group of data points divided into four clusters of different sizes, illustrating clustering in Unsupervised Machine Learning]

The preceding diagram shows a group of clusters, each of a different size based upon the number of instances that belong to it. Although clusters do not need to have the same number of instances, it is possible to set a minimum number of instances per cluster to avoid overfitting the data into tiny clusters of very specific observations.

What are the types of Clustering (Unsupervised Machine Learning)?

Clustering algorithms classify data points into different clusters by using some similarity metric to identify which data points are closely related to each other and to assign those data points to a cluster. Based upon how that similarity metric is applied, clustering algorithms can be of two types – hard or soft.

Hard clustering assigns each data point completely to one cluster. Soft clustering instead determines the likelihood of a point belonging to a given cluster by assigning it a probability.

Considering that clusters are created based on the similarity between data points, Clustering Algorithms can be divided into four categories.

Connectivity-based Models – The approach to similarity used by this model is based on the closeness of data points in a data space. It is possible to create clusters by assigning all data points to a single cluster and then splitting the data into smaller clusters as the distance between points grows. Similarly, the algorithm can begin by allocating each data point to a unique cluster and then combining data points that are near to one another. Hierarchical Clustering is an example of Connectivity-based Machine Learning Algorithm.

[Figure: illustration of how connectivity-based clustering models work]

Density-based Models – The approach to similarity used by this model is based on the density of data points in the data space. If one part of the data space is denser than another, the denser part is considered one cluster and the less dense part another. So the partitioning rule for density-based models is "separate higher-density data space from lower-density data space". The DBSCAN algorithm is an example of a density-based model.

[Figure: illustration of how density-based clustering models work]

Distribution-based Models – The approach to similarity used by this model is based on the probability that all data points in a cluster follow the same distribution, such as the Gaussian distribution. An example of such a model is the Gaussian Mixture algorithm, which assumes that all data points come from a mixture of a finite number of Gaussian distributions. In simple words, in distribution-based models the points of each cluster are assumed to share the same distribution.

[Figure: illustration of how distribution-based clustering models work]

Centroid-based Models – These models are based on algorithms that define a centroid for each cluster, which is updated constantly by an iterative process. Each data point is assigned to the cluster whose centroid is closest to it. An example of such a model is the K-Means algorithm.
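
As a hedged sketch of these four categories, the snippet below runs one scikit-learn algorithm from each of them on the same synthetic blob data; the parameter values are arbitrary and only meant for illustration:

```python
# One example algorithm per clustering category, run on the same toy data.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

models = {
    "Connectivity-based (Hierarchical)": AgglomerativeClustering(n_clusters=4),
    "Density-based (DBSCAN)": DBSCAN(eps=0.7, min_samples=5),
    "Distribution-based (Gaussian Mixture)": GaussianMixture(n_components=4, random_state=0),
    "Centroid-based (K-Means)": KMeans(n_clusters=4, n_init=10, random_state=0),
}

for name, model in models.items():
    labels = model.fit_predict(X)   # each point receives a cluster label
    print(name, "->", len(set(labels)), "distinct labels found")
```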

Common Unsupervised Machine Learning Algorithms

  1. K-Means Algorithm – This focuses on separating the instances into k clusters of equal variance by minimising the sum of the squared distances between each point and its cluster centroid.
  2. Mean-Shift Clustering Algorithm – This creates clusters by using centroids; candidate centroids are repeatedly updated to be the mean of the points within a given region.
  3. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) – This determines clusters as areas with a high density of points, separated by areas with low density.

K-Means Algorithm

The k-means algorithm is used to model data without a labeled class. It involves dividing the data into k number of subgroups.

k-means is a clustering algorithm, and the basic working principle of clustering algorithms is to use some similarity metric to group data into different clusters. For the k-means algorithm, the similarity metric is "minimising the distance between the points in a cluster and its centroid".

The final output of the k-means algorithm is each data point linked to a cluster. Moreover, the centroids of the clusters produced by k-means represent a collection of features that can be used to characterise the data points that belong to them.

Understanding k-Means Algorithm

  1. Initialisation – Based upon the number of clusters defined (the k value), the algorithm picks k random data points from the dataset and treats them as the initial centroids.
  2. Assignment – The algorithm then takes each point in the dataset and computes its distance to all k centroids. The point is allocated to the centroid from which it has the minimum distance. Once this has been done for every point, we have k clusters.
  3. Update – New centroids are computed by taking the mean of all data points belonging to each cluster, and then Step 2 (Assignment) is repeated. This process continues until the defined number of iterations is completed or data points no longer move from one cluster to another during the assignment step (a minimal sketch of these steps follows below).
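
A bare-bones NumPy sketch of these three steps (initialisation, assignment, update) might look like the following; it is an illustrative toy implementation, not production code:

```python
# Bare-bones k-means following the three steps described above (illustrative only).
import numpy as np

def k_means(X, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)

    # 1. Initialisation: pick k random data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(iterations):
        # 2. Assignment: label each point with the index of its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # 3. Update: move each centroid to the mean of the points assigned to it.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)

        if np.allclose(new_centroids, centroids):   # assignments have stabilised
            break
        centroids = new_centroids

    return labels, centroids

# Tiny usage example on random 2-D data.
X = np.random.default_rng(1).random((100, 2))
labels, centroids = k_means(X, k=3)
print(labels[:10])
print(centroids)
```
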
Choosing the value of k for k-Means Algorithm

One of the metrics used to measure the performance of the k-means algorithm is the mean distance of data points from the centroid of the cluster they belong to. However, this measure can be counterproductive: the higher the number of clusters, the smaller the distance between data points and their centroids, which may result in the number of clusters (K) matching the number of data points, thereby defeating the purpose of the clustering algorithm.

To avoid this, you can plot the average distance between data points and their cluster centroid against the number of clusters. The appropriate number of clusters corresponds to the breaking point (the "elbow") of the plot, where the rate of decrease changes drastically.

[Figure: average distance of points to their cluster centroid (y-axis) plotted against the number of clusters (x-axis); the value at the elbow of the graph is selected as the K value for the K-Means Algorithm]
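
A hedged sketch of that plot using scikit-learn's KMeans, whose `inertia_` attribute (the sum of squared distances of points to their closest centroid) serves here as a closely related y-axis value:

```python
# Sketch: the "elbow" plot for choosing k, using KMeans inertia per k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Sum of squared distances to centroid (inertia)")
plt.show()   # the "elbow" of this curve suggests a good value for k
```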

Mean-Shift Algorithm

The mean-shift algorithm works by assigning each data point a cluster based on the density of the data points in the data space, also known as the mode in a distribution function. Contrary to the k-means algorithm, the mean-shift algorithm does not require you to specify the number of clusters as a parameter.

The algorithm works by modeling the data points as a distribution function, where high-density areas (high concentration of data points) represent high peaks. Then, the general idea is to shift each data point until it reaches its nearest peak, which becomes a cluster.

Understanding Mean-Shift Algorithm

The first step of the mean-shift algorithm is to represent the data points as a density distribution. To do so, the algorithm builds upon the idea of Kernel Density Estimation (KDE), which is a method that’s used to estimate the distribution of a set of data.

[Figure: 3D plot with X1 and X2 on the horizontal axes and density on the vertical axis; the data points lie in the plane and two coloured cone-shaped peaks mark high-density areas]

In the preceding diagram, the dots at the bottom of the shape represent the data points that the user inputs, while the cone-shaped lines represent the estimated distribution of the data points. The peaks (high-density areas) will be the clusters. The process of assigning data points to each cluster is as follows:

  1. A window of a specified size (bandwidth) is drawn around each data point.
  2. The mean of the data inside the window is computed.
  3. The center of the window is shifted to the mean.

Steps 2 and 3 are repeated until the data point reaches a peak, which will determine the cluster that it belongs to.

The bandwidth value should be coherent with the distribution of the data points in the dataset. For example, for a dataset normalised between 0 and 1, the bandwidth value should be within that range, while for a dataset with all values between 1,000 and 2,000, it would make more sense to have a bandwidth between 100 and 500.
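
A minimal sketch using scikit-learn's MeanShift; the helper `estimate_bandwidth` picks a bandwidth consistent with the scale of the data, in line with the point above:

```python
# Sketch: Mean-Shift clustering with an estimated bandwidth.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Pick a bandwidth (window size) that matches the scale of the data.
bandwidth = estimate_bandwidth(X, quantile=0.2)

model = MeanShift(bandwidth=bandwidth)
labels = model.fit_predict(X)

print("Estimated bandwidth:", round(bandwidth, 3))
print("Clusters found:", len(model.cluster_centers_))   # no k needs to be specified
```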

In the following diagram, the estimated distribution is represented by the lines, while the data points are the dots. In each of the boxes, the data points shift to the nearest peak. All the data points in a certain peak belong to that cluster.

[Figure: series of plots showing how the Mean-Shift Algorithm shifts data points towards peaks]

The number of shifts that a data point has to make to reach a peak depends on its bandwidth (the size of the window) and its distance from the peak.

DBSCAN Algorithm

The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm groups together points that are close to each other and marks points that are far away, with no close neighbours, as outliers. In other words, the algorithm classifies data points based on the density of all data points in the data space.

Understanding DBSCAN Algorithm

The DBSCAN algorithm requires two main parameters – epsilon and the minimum number of observations.

  • Epsilon – Maximum distance that defines radius within which algorithm searches for neighbours.
  • Minimum number of observations – Number of data points required to form a high-density area.

[Figure: graphs showing the working of the DBSCAN Algorithm]

Based upon the epsilon and minimum-number-of-observations values, DBSCAN classifies each data point in the data space as a Core Point, a Border Point, or a Noise Point (a short sketch using these parameters follows the table below).

| Point in DBSCAN Cluster | Description |
|-------------------------|-------------|
| Core Point | A point that has at least the minimum number of data points within its epsilon radius |
| Border Point | A point that is within the epsilon radius of a core point, but does not have the required number of data points within its own radius |
| Noise Point | Any point that does not meet the criteria for being a Core Point or a Border Point |
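
A short sketch using scikit-learn's DBSCAN, where `eps` corresponds to epsilon, `min_samples` to the minimum number of observations, and points labelled -1 are the noise points:

```python
# Sketch: DBSCAN with the two parameters described above.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

model = DBSCAN(eps=0.5, min_samples=5)    # epsilon radius and minimum observations
labels = model.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))       # noise points are labelled -1
n_core = len(model.core_sample_indices_)  # indices of the core points

print("Clusters:", n_clusters, "| Core points:", n_core, "| Noise points:", n_noise)
```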

Applications of Machine Learning

Machine Learning can be used for a variety of tasks, some of which are the following:

Real-Time Price Prediction

Companies such as Uber, which need to adjust their pricing in real time in response to demand, rely on Machine Learning to accomplish this task. In this case, machine learning models can take into account a variety of factors such as the day of the week, the time of day, the number of drivers currently available in a specific area, and so on to predict how much a trip will cost. Similarly, some people use Machine Learning models to make predictions about stock prices and then trade in the stock market accordingly in the hope of making large profits.

Recommendation Systems

As the web has grown, websites have moved from being simple HTML pages to much more. Nowadays, users can shop, watch videos, or share photos and videos with their friends on social media.

As the web has become a crucial part of people's lives, companies like Twitter, Facebook, Youtube, Amazon, etc. have built Machine Learning-based Recommendation Systems, which recommend content or products to users. These recommendations are based on previous user activity (and many other factors).

If, for example, you purchased a table from Amazon’s website one month ago, when you return to the site, it may recommend you to buy a chair as you have already purchased a table.

Similarly, these Recommendation Models can consider multiple factors at one time and can make content or product recommendations.

Spam Email Filtering

It is possible to use Machine Learning algorithms to classify emails as spam or not spam based on their content. Because most spam emails contain similar types of phishing information, it is relatively easy to train a classification algorithm that can distinguish between spam and non-spam emails as they arrive in the inbox.
When it comes to identifying spam emails, Google Mail (Gmail) has a built-in machine learning system that is quite effective.
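
As a hedged, minimal illustration of how such a filter could be trained (this is not Gmail's actual system), the sketch below uses bag-of-words features and a naive Bayes classifier on a few made-up emails:

```python
# Sketch: a tiny spam filter with bag-of-words features and naive Bayes.
# The example emails and labels are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "You have won a free prize, click here to claim now",
    "Meeting moved to 3pm, see you in the conference room",
    "Urgent: verify your account password immediately",
    "Here are the notes from yesterday's lecture",
]
labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)      # word-count features

model = MultinomialNB()
model.fit(X, labels)

new_email = ["Claim your free account prize now"]
print(model.predict(vectorizer.transform(new_email)))   # likely ["spam"]
```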
