Unsupervised Machine Learning consists of fitting a model to the data without any relationship to an output label, also known as unlabeled data. This means that Unsupervised Machine Learning algorithms try to understand the data and find patterns in it. For example – Unsupervised Machine Learning can be used to group the students of a school into different categories based on their age.
Unsupervised Machine Learning is further divided based on the type of task being performed. One common type is the clustering task.
Clustering in Unsupervised Machine Learning
Clustering is a type of Unsupervised Machine Learning technique where the objective is to arrive at conclusions based on the patterns found within unlabeled input data. This technique is mainly used to segregate large datasets into subgroups in order to make informed decisions.
Clustering tasks involve creating groups of data while complying with the condition that instances in one group differ visibly from the instances in other groups. The output of any clustering algorithm is a label that assigns each instance to the cluster with that label.
The preceding diagram shows a group of clusters, each of a different size based on the number of instances that belong to it. Even though clusters do not need to contain the same number of instances, it is possible to set a minimum number of instances per cluster to avoid overfitting the data into tiny clusters of very specific data.
What are the types of Clustering (Unsupervised Machine Learning)?
Clustering algorithms classify data points into different clusters by using a similarity metric to identify which data points are closely related to each other, and then assigning those data points to a cluster. Based on how the similarity metric is applied, clustering algorithms can be of two types – Hard or Soft.
Hard Clustering assigns each data point completely to a single cluster. Soft Clustering determines the likelihood of a point belonging to a given cluster by assigning a probability to it.
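The difference can be seen in code. The sketch below is a minimal illustration (assuming scikit-learn, which the article does not name): k-means produces hard assignments, while a Gaussian Mixture Model produces a probability per cluster for each point.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Two well-separated blobs of synthetic points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Hard clustering: each point receives exactly one cluster label
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point receives a probability per cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)  # shape (40, 2); each row sums to 1

print(hard_labels[:5])         # integer labels
print(soft_probs[0].round(3))  # probabilities for the first point
```

For well-separated data like this, the soft probabilities are close to 0 or 1; for overlapping clusters they express genuine uncertainty, which hard clustering cannot.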
Considering that clusters are created based on the similarity between data points, Clustering Algorithms can be divided into four categories.
Connectivity-based Models – The approach to similarity used by this model is based on the closeness of data points in the data space. Clusters can be created either by assigning all data points to a single cluster and then splitting it into smaller clusters as the distance between points grows, or by allocating each data point to its own cluster and then merging data points that are close to one another. Hierarchical Clustering is an example of a Connectivity-based Model.
Density-based Models – The approach to similarity used by this model is based on the density of data points in the data space. If one part of the data space is denser than another, the denser part is considered one cluster and the less dense part another. So the partitioning rule for Density-based Models is "separate higher-density regions of the data space from lower-density regions". The DBSCAN algorithm is an example of a Density-based Model.
Distribution-based Models – The approach to similarity used by this model is based on the probability that all data points in a cluster follow the same distribution, such as the Gaussian Distribution. An example of such a model is the Gaussian Mixture Model, which assumes that all data points come from a mixture of a finite number of Gaussian Distributions. So in simple words, Distribution-based Models assume that the points within each cluster follow the same distribution.
Centroid-based Models – These models are based on algorithms that define a centroid for each cluster, which is updated constantly by an iterative process. Each data point is assigned to the cluster whose centroid it is closest to. An example of such a model is the k-means algorithm.
Common Unsupervised Machine Learning Algorithms
- K-Means Algorithm – This separates the instances into n clusters of equal variance by minimising the sum of the squared distances between each point and the centroid of its cluster.
- Mean-Shift Clustering Algorithm – This creates clusters by using centroids; each centroid candidate is updated to be the mean of the points within a given region.
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN) – This determines clusters as areas with a high density of points, separated by areas with low density.
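All three algorithms can be run on the same dataset for comparison. The following is a minimal sketch (assuming scikit-learn and its `make_blobs` helper, neither of which is named in the article); the `eps` and `min_samples` values for DBSCAN are illustrative choices for this synthetic data.

```python
from sklearn.cluster import KMeans, MeanShift, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn around 3 centres
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# K-Means needs the number of clusters up front
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mean-Shift infers the number of clusters from the data's density
meanshift_labels = MeanShift().fit_predict(X)

# DBSCAN labels low-density points as noise (label -1)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(len(set(kmeans_labels)))  # 3 clusters, as requested
```

Note the practical difference: only k-means requires the number of clusters as a parameter; the other two derive it from the structure of the data.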
The k-means algorithm is used to model data without a labeled class. It involves dividing the data into k subgroups.
k-means is a clustering algorithm, and the basic working principle of clustering algorithms is to use a similarity metric to group data into different clusters. For the k-means algorithm, the similarity metric is "minimising the distance between the points in a cluster and the cluster's centroid".
The final output of the k-means algorithm is each data point linked to a cluster. Moreover, the centroids of the clusters produced by the k-means algorithm represent a collection of features that can be used to characterise the data points that belong to them.
Understanding k-Means Algorithm
- Initialisation – Based on the number of clusters defined (the k value), the algorithm picks k random data points from the given dataset and treats them as the initial centroids.
- Assignment – The algorithm then takes each point in the dataset and computes its distance from all k centroids. The point is allocated to the centroid from which it has the minimum distance. Once this has been done for every point in the dataset, we have k clusters.
- Update – New centroids are computed by taking the mean of all the data points belonging to each cluster, and then the Assignment step is repeated. This process continues until the defined number of iterations is completed or data points no longer move from one cluster to another during the Assignment step.
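The three steps above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation (it assumes, for simplicity, that no cluster ever becomes empty during the update step):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialisation: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points
        # (assumes no cluster empties out)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Stop early once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs; k-means should recover them
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids.round(1))  # one centroid near (0, 0), the other near (5, 5)
```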
Choosing the value of k for k-Means Algorithm
One of the metrics used to measure the performance of the k-means algorithm is the mean distance of data points from the centroid of the cluster they belong to. However, this measure can be counterproductive: the higher the number of clusters, the smaller the distance between data points and their centroids, which may result in the number of clusters (k) matching the number of data points, thereby defeating the purpose of the clustering algorithm.
To avoid this, you can plot the average distance between data points and their cluster centroid against the number of clusters. The appropriate number of clusters corresponds to the breaking point (often called the "elbow") of the plot, where the rate of decrease drastically changes.
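The procedure can be sketched as follows (assuming scikit-learn; its `inertia_` attribute is the within-cluster sum of squared distances, which plays the same role as the average distance described above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=1)

# Within-cluster sum of squared distances for each candidate k
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(km.inertia_)

# The elbow is where the drop in inertia flattens out
for k, inertia in zip(range(1, 9), inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against k (for example with matplotlib) shows a sharp drop up to the true number of clusters and only marginal improvement after it; that breaking point is the value of k to choose.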
The mean-shift algorithm works by assigning each data point to a cluster based on the density of the data points in the data space, also known as the mode of a distribution function. Contrary to the k-means algorithm, the mean-shift algorithm does not require you to specify the number of clusters as a parameter.
The algorithm works by modeling the data points as a distribution function, where high-density areas (high concentration of data points) represent high peaks. Then, the general idea is to shift each data point until it reaches its nearest peak, which becomes a cluster.
Understanding Mean-Shift Algorithm
The first step of the mean-shift algorithm is to represent the data points as a density distribution. To do so, the algorithm builds upon the idea of Kernel Density Estimation (KDE), which is a method that’s used to estimate the distribution of a set of data.
In the preceding diagram, the dots at the bottom of the shape represent the data points that the user inputs, while the cone-shaped lines represent the estimated distribution of the data points. The peaks (high-density areas) will be the clusters. The process of assigning data points to each cluster is as follows:
- A window of a specified size (bandwidth) is drawn around each data point.
- The mean of the data inside the window is computed.
- The center of the window is shifted to the mean.
Steps 2 and 3 are repeated until the data point reaches a peak, which will determine the cluster that it belongs to.
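The three steps above can be sketched for one-dimensional data. This is a minimal illustration of the shifting loop only (it uses a flat window rather than a smooth KDE kernel, a simplifying assumption):

```python
import numpy as np

def shift_point(x, data, bandwidth, max_iters=100, tol=1e-5):
    """Shift a single point uphill until it settles at a density peak."""
    for _ in range(max_iters):
        # Step 1: window of the given bandwidth around the current position
        window = data[np.abs(data - x) <= bandwidth]
        # Step 2: mean of the points inside the window
        new_x = window.mean()
        # Step 3: shift the window centre to that mean; stop once it settles
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Two dense regions around 1 and around 5
data = np.array([1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1])
peaks = sorted({round(shift_point(x, data, bandwidth=1.0), 2) for x in data})
print(peaks)  # [1.05, 5.05] – one peak per dense region
```

Every point that shifts to the same peak belongs to the same cluster, so the number of distinct peaks is the number of clusters.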
The bandwidth value should be coherent with the distribution of the data points in the dataset. For example, for a dataset normalised between 0 and 1, the bandwidth value should be within that range, while for a dataset with all values between 1,000 and 2,000, it would make more sense to have a bandwidth between 100 and 500.
In the following diagram, the estimated distribution is represented by the lines, while the data points are the dots. In each of the boxes, the data points shift to the nearest peak. All the data points in a certain peak belong to that cluster.
The number of shifts that a data point has to make to reach a peak depends on the bandwidth (the size of the window) and the point's distance from the peak.
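In practice the bandwidth does not have to be hand-picked. The sketch below assumes scikit-learn, whose `estimate_bandwidth` helper derives a value from the pairwise distances in the data (the `quantile` value here is an illustrative choice):

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic data with 3 dense regions
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=7)

# Estimate a bandwidth from the data instead of hand-picking one
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth).fit(X)
print(len(ms.cluster_centers_))  # number of peaks found; no k was specified
```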
The density-based spatial clustering of applications with noise (DBSCAN) algorithm groups together points that are close to each other and marks points that are far away, with no close neighbours, as outliers. In other words, the algorithm classifies data points based on the density of all the data points in the data space.
Understanding DBSCAN Algorithm
The DBSCAN algorithm requires two main parameters – epsilon and the minimum number of observations.
- Epsilon – The maximum distance that defines the radius within which the algorithm searches for neighbours.
- Minimum number of observations – The number of data points required to form a high-density area.
Based on the epsilon and minimum-number-of-observations values, DBSCAN classifies the data points in the data space as core points, border points, or noise points.
| Point in DBSCAN Cluster | Description |
| --- | --- |
| Core Point | A point that has at least the minimum number of data points within its epsilon radius |
| Border Point | A point that is within the epsilon radius of a core point, but does not have the required number of data points within its own radius |
| Noise Point | Any point that does not meet the criteria for being a core point or a border point |
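These three point types can be inspected directly. The sketch below assumes scikit-learn, where `min_samples` plays the role of the minimum number of observations and `core_sample_indices_` exposes the core points; the isolated point receives the noise label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one far-away point that should become noise
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0], [1.1, 0.9],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1], [8.2, 7.9],
              [50.0, 50.0]])  # isolated point

# epsilon = 0.5, minimum number of observations = 3
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

print(db.labels_)  # cluster labels; -1 marks noise points
print(core_mask)   # True for core points
```

Here every point in the two dense groups has enough neighbours within its epsilon radius to be a core point, while the isolated point at (50, 50) is labeled -1 as noise.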