Unsupervised Machine Learning Series: Principal Component Analysis (2nd Algorithm)

In the previous blog, we covered our 1st unsupervised ML algorithm: clustering algorithms. In this blog, we will cover our 2nd unsupervised algorithm, Principal Component Analysis (PCA). PCA is a statistical technique that is widely used in data science for reducing the dimensionality of large datasets. It allows us to identify the underlying patterns and relationships in the data and to transform it into a more compact and easily understandable format. We will cover the basics of PCA, its use cases, a practical implementation, and its limitations.

What is PCA?

PCA is a technique that transforms a set of correlated variables into a new set of uncorrelated variables called principal components. Each principal component is a linear combination of the original variables, and the components are ordered by the amount of variance in the data they capture: the first principal component captures the largest share of the variance, followed by the second, and so on.
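
Under the hood, the principal components are the eigenvectors of the data's covariance matrix, and the corresponding eigenvalues measure how much variance each component captures. Here is a minimal NumPy sketch of that idea (the random data and shapes are arbitrary choices for illustration; scikit-learn's PCA, used later in this blog, takes care of these steps for you):

import numpy as np

# Illustrative data: 100 observations, 3 features (arbitrary for this sketch)
rng = np.random.default_rng(0)
X = rng.random((100, 3))

# PCA works on mean-centered data
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors of the covariance matrix are the principal components;
# eigh returns eigenvalues in ascending order, so reverse for largest first
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the data onto the top 2 components
projected = X_centered @ eigenvectors[:, :2]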

PCA is often used to reduce the dimensionality of large datasets while retaining as much of the original information as possible. It is also used for data compression, feature extraction, and visualization.
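
As a quick illustration of the visualization use case, the sketch below projects the classic 4-feature iris dataset (bundled with scikit-learn) down to two principal components so it can be plotted on a plane; the dataset and plot styling are just convenient choices for the example:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load a small 4-feature dataset and project it down to 2 dimensions
iris = load_iris()
pca = PCA(n_components=2)
X_2d = pca.fit_transform(iris.data)

# Plot the two principal components, colored by class label
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=iris.target)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()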

Use Cases of PCA

PCA has a wide range of use cases, some of which are listed below:

  1. Image Compression: PCA can be used to compress digital images by reducing the dimensionality of the image without losing much information.

  2. Genetics: PCA is used in genetics to identify the genetic markers that are associated with a particular disease or trait.

  3. Finance: PCA is used in finance to identify the underlying factors that affect stock prices, bond yields, and other financial instruments.

  4. Machine Learning: PCA is used in machine learning to reduce the number of features in a dataset and to identify the most important features for predicting a particular outcome (see the sketch after this list).
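
To make the machine learning use case concrete, here is a small sketch that uses PCA as a preprocessing step in a scikit-learn pipeline; the digits dataset, 16 components, and logistic regression are arbitrary choices for illustration:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the 64-feature digits dataset and split into train/test sets
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Standardize, reduce 64 features to 16 components, then classify
model = make_pipeline(StandardScaler(), PCA(n_components=16),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))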

Practical Implementation of PCA

Let's look at a simple example of how PCA can be implemented in Python using the scikit-learn library.

import numpy as np
from sklearn.decomposition import PCA

# Create a sample dataset with 100 observations and 5 features
X = np.random.rand(100, 5)

# Fit the PCA model and transform the data down to 2 components
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)

# Print the explained variance ratio of each principal component
print(pca.explained_variance_ratio_)

This will output an array with the explained variance ratio of each principal component. Summed over all five components, these ratios equal 1, representing the total amount of variation in the data; since we kept only 2 components here, the two printed values add up to the fraction of the total variance that those components retain.
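
Rather than fixing the number of components up front, scikit-learn's PCA also accepts a float between 0 and 1 for n_components, in which case it keeps the smallest number of components needed to explain that fraction of the variance. A short sketch, continuing from the X defined above:

# Keep the smallest number of components explaining at least 95% of variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)

print(pca_95.n_components_)                    # number of components kept
print(pca_95.explained_variance_ratio_.sum())  # variance retained (>= 0.95)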

Conclusion

PCA is a powerful technique for reducing the dimensionality of large datasets while retaining as much of the original information as possible, with use cases ranging from image compression and genetics to finance and machine learning. However, PCA has some limitations: it assumes the underlying structure of the data is linear, and some information is inevitably lost during the transformation. It is therefore important to weigh the use case against these limitations before applying PCA to a particular problem. Hope you got value out of this article. Subscribe to the newsletter to get more such informative blogs.

Thanks :)