Principal Component Analysis

Idea behind PCA: 

Principal Component Analysis (PCA) is a dimensionality reduction technique. If you are unsure why we would want to reduce the dimensionality of a dataset in the first place, please skim through this small post here to get a clear idea.

PCA is often performed as part of EDA (Exploratory Data Analysis). The main idea behind PCA is to project all the points onto a lower dimension while retaining as much of the original information as possible. To achieve this, we first identify principal components. Principal components are new features built from linear combinations of the original features of our dataset. Think of them as new features made out of the old ones, except that these new features don't have any physical meaning; they exist only for calculation. As a toy example, say our original dataset had columns named name, age, maturity and gender; after performing PCA we may end up with columns X1, X2, X3 & X4.

Once we have found all the principal components, we sort them in descending order of the information (variance) they explain. The top component carries the maximum information, the second one the second highest amount, and so on. We then pick only as many components from the top as are needed to retain, say, above 90% of the total information (it's up to us to pick what percentage of information we want to retain) and drop the rest.

Say in our previous example we sorted the components and found the order to be X2 > X4 > X3 > X1, with X2 holding 89% of the information, X4 holding 6%, and X1 & X3 each holding 2.5% of the total information in the dataset. We can now keep only X2 & X4 and drop X1 & X3, and we will still retain about 95% of the information while having reduced the number of dimensions. Also, one thing to notice is that the feature "maturity" in our example is correlated with age: it is safe to say that as age increases, maturity goes up. Hence the maturity feature actually held redundant information (information that was already there in the age feature). This is just a very dumbed-down example; in reality it gets a bit more nuanced.
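To make the "keep enough components to cross a threshold" step concrete, here is a rough sketch of how one might pick that number. The variance ratios below are made up to mirror the toy example above, not computed from any real dataset:

```python
import numpy as np

# Hypothetical explained-variance ratios, already sorted in descending order
# (mirroring X2 > X4 > X3 > X1 from the toy example above).
explained_variance_ratio = np.array([0.89, 0.06, 0.025, 0.025])

cumulative = np.cumsum(explained_variance_ratio)
n_components = np.argmax(cumulative >= 0.95) + 1  # smallest k reaching 95%

print(cumulative)     # [0.89  0.95  0.975 1.   ]
print(n_components)   # 2 -> keep the top two components, drop the rest
```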

The features/components we get after PCA are always uncorrelated with each other and hence hold no redundant information. Remember that principal components do not map one-to-one to the original features; it would be wrong to say that X1 signifies name and X2 signifies age.

To achieve this in Python we have the PCA class available in the Sklearn library. Here is a small example for your reference.
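This is a minimal, self-contained sketch using scikit-learn's `sklearn.decomposition.PCA`; the data here is random and purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.rand(100, 4)  # 100 rows, 4 original features (illustrative only)

# Passing a float between 0 and 1 keeps the smallest number of components
# that together explain at least that fraction of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, k) with k <= 4
print(pca.explained_variance_ratio_)   # variance explained by each kept component

# The new components are uncorrelated: off-diagonal correlations are ~0.
print(np.round(np.corrcoef(X_reduced, rowvar=False), 3))
```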


Geometric Intuition of PCA:

Let's understand PCA geometrically. Imagine data points spread across 2D. PCA attempts to draw a line such that, when all the points are projected onto it, the line explains the maximum variance in the data, meaning it best captures the overall spread. Once we have that line, our data has been mapped into a lower dimension (1D in this case).
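Here is a small numpy sketch of that geometric picture: correlated 2D points get collapsed onto the single direction of maximum variance. The data is synthetic, generated just for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(200)
y = 2 * x + rng.randn(200) * 0.3    # points scattered roughly along a line
X = np.column_stack([x, y])

Xc = X - X.mean(axis=0)             # center the data first
cov = np.cov(Xc, rowvar=False)      # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
direction = eigvecs[:, np.argmax(eigvals)]  # direction of maximum variance

projections = Xc @ direction        # each 2D point becomes one number (1D)
print(projections.shape)            # (200,)
```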

A few extras:

Now that we understand the overall idea of PCA, let's see how exactly it works under the hood.

First the algorithm centers the data and computes the eigenvalues & eigenvectors of its covariance matrix. It then sorts the eigenvalues in descending order and computes the principal components by projecting the original data points onto the corresponding eigenvectors. Geometrically, PCA finds the directions that best preserve the maximum amount of information, which means it preserves the directions that explain the maximum variance in the data. The relation between maximum information and maximum variance is this: the more information a direction preserves, the higher the spread of the data points along that line, i.e., the higher the variance it explains.
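Here is a from-scratch sketch of exactly those steps, assuming a numeric data matrix X. The function name and test data are mine, made up for illustration:

```python
import numpy as np

def pca_from_scratch(X, k):
    Xc = X - X.mean(axis=0)                 # 1. center the data
    cov = np.cov(Xc, rowvar=False)          # 2. covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)  # 3. eigenvalues & eigenvectors
    order = np.argsort(eigvals)[::-1]       # 4. sort eigenvalues in descending order
    top_k = eigvecs[:, order[:k]]           # 5. eigenvectors of the top k eigenvalues
    return Xc @ top_k                       # 6. project points onto those directions

rng = np.random.RandomState(1)
X = rng.rand(50, 4)
print(pca_from_scratch(X, 2).shape)  # (50, 2)
```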
