Dimensionality Reduction

The dimension of a tabular dataset is the number of columns it has. If we have a CSV/XLSX file with 14 columns, we say the dataset is 14-dimensional.

At times it becomes difficult to deal with datasets that have a huge number of features/columns. The more dimensions there are, the longer algorithms take to run on the dataset and the higher their time and space complexity become, so performance suffers. This problem has a popular name: the "Curse of Dimensionality".
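To see this cost directly, here is a minimal timing sketch, assuming NumPy and scikit-learn are installed; the row count and column counts are arbitrary choices for illustration, and exact timings will vary by machine:

```python
import time

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n_rows = 2000

# The same nearest-neighbour query on the same number of rows gets
# slower as we add columns.
for n_cols in (10, 100, 1000):
    X = rng.random((n_rows, n_cols))
    start = time.perf_counter()
    NearestNeighbors(n_neighbors=5).fit(X).kneighbors(X)
    print(f"{n_cols:4d} columns: {time.perf_counter() - start:.3f} s")
```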

In such cases we try to shrink the dataset from its original number of columns to a smaller number, so that it becomes easier to handle and analyze.

Say we have a dataset of 30 columns. If we could somehow reduce it to, say, 15 columns without actually losing any information, that would make our life much easier. In machine learning there are certain "dimensionality reduction" techniques which do just that. In reality, however, reducing the number of dimensions (also called feature reduction) does cost us some loss of information, but we can afford that if the loss is negligible. For an even more intuitive understanding of this lowering of dimensions, I recommend the Essence of Linear Algebra video series by 3Blue1Brown on YouTube.
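To make the 30-to-15 example concrete, here is a minimal sketch assuming scikit-learn is available. The 30-column matrix is fabricated so that its columns are correlated, which is exactly the situation where dimensionality reduction loses little information:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Fabricated 30-column dataset: 10 independent columns plus 20 noisy
# linear combinations of them, so the information really lives in
# fewer directions than the raw column count suggests.
base = rng.random((200, 10))
X = np.hstack([base, base @ rng.random((10, 20)) + 0.01 * rng.random((200, 20))])

pca = PCA(n_components=15)
X_reduced = pca.fit_transform(X)

print(X.shape)                              # (200, 30)
print(X_reduced.shape)                      # (200, 15)
print(pca.explained_variance_ratio_.sum())  # close to 1.0: little information lost
```

The explained-variance ratio is PCA's way of quantifying the "negligible loss" mentioned above: if the first 15 components capture, say, 99% of the variance, dropping down to 15 columns costs us almost nothing.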

A few of the popular methods for reducing dimensions are Principal Component Analysis (PCA), t-SNE (t-distributed Stochastic Neighbor Embedding), and feature selection. We have discussed each of these methods in detail in their respective articles.
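As a quick taste before those articles, here is a minimal t-SNE sketch, again assuming scikit-learn; the digits dataset is just a convenient built-in example of a higher-dimensional dataset:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# The digits dataset has 64 columns (one per pixel of an 8x8 image);
# t-SNE squeezes it down to 2 columns suitable for a scatter plot.
X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X.shape, "->", X_2d.shape)  # (1797, 64) -> (1797, 2)
```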
