Unsupervised Machine Learning Series: Clustering (8th Algorithm)

In the previous article, we covered the 7th unsupervised ML algorithm: association rule learning. In this blog, we will cover our 8th unsupervised algorithm, clustering. In the field of machine learning, clustering is a powerful technique used to uncover patterns and group similar data points together in an unsupervised manner. Unlike supervised learning algorithms that rely on labeled data, clustering algorithms operate on unlabeled data, making them ideal for exploratory data analysis, data mining, and pattern recognition. This blog will provide a detailed overview of clustering, including its types, popular algorithms, and a code implementation example.

Types of Clustering

K-means Clustering: K-means clustering is one of the most widely used clustering algorithms. It partitions the data into k distinct clusters, where k is a pre-defined number chosen by the user. The algorithm assigns data points to the nearest centroid, iteratively optimizing the clustering by minimizing the sum of squared distances between the data points and their assigned centroids.
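As a concrete illustration, here is a minimal K-means sketch using scikit-learn (assuming it is installed); the toy data and the choice of k = 2 are invented for this example:

```python
# Minimal K-means sketch (assumes scikit-learn is installed).
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of points (toy data for illustration).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# k = 2 is pre-defined by us; the algorithm assigns each point
# to its nearest centroid and iteratively refines the centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_              # cluster assignment per point
centroids = kmeans.cluster_centers_  # final centroid coordinates
inertia = kmeans.inertia_            # sum of squared distances minimized
```

Here `inertia_` is exactly the quantity described above: the sum of squared distances between the data points and their assigned centroids.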

Hierarchical Clustering: Hierarchical clustering builds a hierarchy of clusters by successively merging or splitting existing clusters. It can be represented as a tree-like structure called a dendrogram. Hierarchical clustering can be agglomerative (bottom-up) or divisive (top-down), and it does not require the user to specify the number of clusters in advance.
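The agglomerative variant can be sketched with SciPy's hierarchical clustering tools (assuming SciPy is available): `linkage` records the sequence of merges that a dendrogram would display, and `fcluster` cuts the tree into flat clusters only after the full hierarchy has been built:

```python
# Agglomerative hierarchical clustering sketch (assumes SciPy is installed).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two pairs of nearby points.
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])

# Build the merge hierarchy with average linkage; Z encodes the
# dendrogram (which clusters merged, and at what distance).
Z = linkage(X, method="average")

# Cut the tree into 2 flat clusters after the fact -- the number of
# clusters did not need to be specified before building the hierarchy.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would plot the tree structure described above.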

Density-Based Clustering: Density-based clustering algorithms, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), group together data points that are closely packed and separate them from sparse, low-density regions. Unlike K-means, DBSCAN does not assume spherical clusters, and it is robust to noise and outliers.

Clustering Algorithms:

Let's take a closer look at the algorithms behind the aforementioned types of clustering:

The K-means algorithm has been covered in the previous article here.

Agglomerative Hierarchical Clustering Algorithm:

  1. Start with each data point as a separate cluster.

  2. Compute the proximity matrix based on the distance between clusters.

  3. Merge the two closest clusters based on a linkage criterion (e.g., complete linkage, single linkage, or average linkage).

  4. Update the proximity matrix.

  5. Repeat steps 2-4 until only a single cluster remains.
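The steps above can be sketched from scratch. This minimal implementation uses single linkage (cluster proximity = closest pair of points) and stops at a target number of clusters, a common stopping point in place of merging all the way down to one cluster; the function and its parameters are illustrative, not a library API:

```python
# From-scratch sketch of the agglomerative steps above (single linkage).
import math

def agglomerative(points, n_clusters):
    # Step 1: start with each data point as its own cluster.
    clusters = [[p] for p in points]

    def single_linkage(c1, c2):
        # Step 2: proximity between clusters = distance of the closest pair.
        return min(math.dist(a, b) for a in c1 for b in c2)

    # Steps 3-5: repeatedly merge the two closest clusters and
    # implicitly update the proximities on the next pass.
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For example, `agglomerative([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)], 2)` merges each pair of nearby points into its own cluster.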

DBSCAN Algorithm

  1. Select a random unvisited data point.

  2. If the point has at least a minimum number of neighbouring points within a specified radius (epsilon), it forms a dense region: create a new cluster and expand it by adding all reachable neighbouring points. Otherwise, mark the point as noise (it may later be reassigned as a border point of a nearby cluster).

  3. Repeat steps 1 and 2 for all unvisited points until all data points are visited.
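These steps can be sketched with scikit-learn's DBSCAN (assumed installed); the `eps` and `min_samples` values below are illustrative choices for this toy data, not universal defaults:

```python
# DBSCAN sketch (assumes scikit-learn is installed).
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [4.0, 4.0], [4.1, 4.0], [4.0, 4.1],
              [10.0, 10.0]])  # the last point is an outlier

# eps is the neighbourhood radius; min_samples is the minimum number
# of points (including the point itself) needed to form a dense region.
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

labels = db.labels_  # noise points receive the label -1
```

The isolated point at (10, 10) has no dense neighbourhood, so DBSCAN labels it -1 (noise) rather than forcing it into a cluster, which is the robustness to outliers mentioned earlier.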

Conclusion

Clustering is a valuable unsupervised learning technique that enables the discovery of hidden patterns and groupings in unlabeled data. With various clustering algorithms available, such as K-means, hierarchical, and density-based clustering, it is possible to apply clustering to diverse problem domains. By implementing and experimenting with these algorithms, data scientists can gain valuable insights and make informed decisions.