Step 3. Exploratory Data Analysis

Exploratory Data Analysis or EDA is the third important step in a data science project pipeline. 

The first two are Data Extraction & Data Pre-processing. If you have not yet understood these two steps clearly, or if your memory of them has gone a bit rusty, hop on to the two links below to brush up on the concepts in a jiffy.

Links to Data Extraction & Data Pre-processing.

EDA is where the data has already been fine-tuned and is ready for analysis. In this step we try to understand the data and see whether any pattern exists in it, or whether any inference can be drawn from the underlying information. It is important to note that this is different from the upcoming step, i.e. machine learning implementation: here we focus only on "what was", whereas in the machine learning phase we use the data we already have to predict or forecast "what will be".

EDA can be broken down into a few components: univariate analysis, bivariate analysis and multivariate analysis.



Univariate Analysis:

Univariate Analysis is the analysis done using a single feature. Univariate analysis can be performed on each feature of a dataset. 

Under univariate analysis, we can draw plots and compute summaries.

Univariate plots include boxplots, histograms, density plots, pie charts etc.

Example of a histogram (Titanic dataset)
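To make the idea of a histogram concrete, here is a minimal sketch of the counting that sits behind the plot. The ages and the bin edges are made-up values chosen for illustration (they are not taken from the Titanic dataset), and NumPy is assumed to be available.

```python
import numpy as np

# Made-up passenger ages, for illustration only (not real Titanic data).
ages = np.array([22, 38, 26, 35, 35, 54, 2, 27, 14, 58], dtype=float)

# Assumed bin edges: 0-20, 20-40, 40-60 years.
bin_edges = [0, 20, 40, 60]

# np.histogram counts how many values fall into each bin.
counts, edges = np.histogram(ages, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>4.0f} to {right:<4.0f}: {'#' * int(count)} ({count})")
```

A plotting library draws bars with these heights; the counting step itself is all a histogram really is.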



Example of boxplot



Example of a density plot


For summaries, we can calculate the mean, median, standard deviation, range, interquartile range (IQR) etc.
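The summary measures listed above can be computed in a few lines. This is only a sketch on made-up numbers: the `ages` array and the helper name `summarize` are assumptions introduced for illustration, and the sample standard deviation (`ddof=1`) is one reasonable choice among others.

```python
import numpy as np

# Made-up values for a single numeric feature (illustration only).
ages = np.array([22, 38, 26, 35, 35, 54, 2, 27, 14, 58], dtype=float)

def summarize(values):
    """Return basic univariate summary statistics for a 1-D numeric array."""
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "std": float(np.std(values, ddof=1)),  # sample standard deviation
        "range": float(np.max(values) - np.min(values)),
        "iqr": float(q3 - q1),  # spread of the middle 50% of the values
    }

summary = summarize(ages)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The same function can be applied to each numeric feature of a dataset in turn, which is all that "univariate" means here.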


Bivariate Analysis:

Bivariate analysis is the analysis involving two variables / features. We first check and identify independent and dependent variables in the problem. 

The two variables may or may not be correlated, and bivariate analysis is how we check.

There can be 3 types of bivariate analysis.

(i) Numeric - Numeric 

Plots to be used : Scatter plot, Correlation heat map

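We can check for linear correlation and compute various correlation coefficients.

A correlation coefficient gives an idea of the degree of correlation between the variables of interest. Pearson's coefficient, for example, ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).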
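As a rough illustration of a numeric - numeric check, the sketch below computes the Pearson correlation coefficient by hand and compares it with NumPy's correlation matrix. The two variables (`hours`, `score`) and their values are made up, and `pearson_r` is a helper name introduced here, not something from the original discussion.

```python
import numpy as np

# Made-up measurements: hours studied and exam score (illustration only).
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
score = np.array([52, 55, 61, 64, 70, 71, 78, 83], dtype=float)

def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

r = pearson_r(hours, score)

# np.corrcoef returns the full correlation matrix;
# a correlation heat map is simply a colored plot of this matrix.
corr_matrix = np.corrcoef(hours, score)

print(f"Pearson r = {r:.3f}")
print(corr_matrix)
```

A value close to +1, as here, points to a strong positive linear relationship; a scatter plot of the same two arrays would show the points lying close to a rising straight line.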

(ii) Categorical - Categorical

Plots that can be used: Stacked Column chart, combination chart

Example of a Stacked Column chart


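To check for association between two categorical variables, we use the Chi-Square test.

The Chi-Square test gives a probability (the p-value): the probability of seeing an association at least as strong as the one in the sample if the two variables were actually independent.

A p-value close to 0 = strong evidence that the variables are dependent

A p-value close to 1 = no evidence against independence

Note that the p-value measures the strength of the evidence for dependency, not the strength of the dependency itself. A common convention is to treat the variables as dependent when the p-value is below 0.05.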
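A Chi-Square test on two categorical variables could look like the sketch below. It assumes SciPy is installed and uses a made-up contingency table; the 0.05 cut-off is a common convention and an assumption here, not a rule from the text above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table (counts are for illustration only):
# rows = two categories of one variable, columns = two categories of another.
observed = np.array([
    [60, 40],
    [30, 70],
])

# Returns the test statistic, the p-value, the degrees of freedom and the
# counts we would expect if the two variables were independent.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.5f}")
print("expected counts under independence:")
print(expected)

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Evidence of an association between the two variables.")
else:
    print("No evidence against independence.")
```

The p-value here is the probability of seeing a table at least this lopsided if the variables were independent, so a small value counts as evidence of dependency.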

(iii) Numeric - Categorical

Plots - Line charts with error bars

Example of a Line chart with error bars (X Axis - Cat, Y Axis - Numeric)

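A z test or t test is performed. The two tests are closely related: both compare group averages, but they use different reference distributions.

The t test is used when the population standard deviation is unknown and has to be estimated from the sample, which matters most for small samples (as a rule of thumb, a row count < 30).

The z test is used when the population standard deviation is known, or as an approximation for large samples (row count > 30), where the t distribution is nearly identical to the normal distribution.

The t test is done to determine whether the averages of 2 groups (the groups defined by 2 values of a categorical variable) are different from each other.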

You might wonder how to compare the averages of two groups defined by categorical values. The demonstration below on the iris dataset may help.

Numeric - Cat Comparison




So a Numeric to Categorical comparison is NOT a direct comparison. It just compares the values of a numeric feature across 2 or more values of a categorical feature.
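The comparison described above can be sketched as follows: split the numeric feature by the categorical feature, then run a t test on the two groups. The petal lengths and the species labels "A" and "B" are made-up stand-ins, not real iris measurements, and using Welch's variant of the t test (`equal_var=False`) is a choice made for this sketch.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up petal lengths (cm) with a made-up species label per row.
petal_length = np.array([1.4, 1.5, 1.3, 1.6, 1.4, 4.7, 4.5, 4.9, 4.0, 4.6])
species = np.array(["A"] * 5 + ["B"] * 5)

# Split the numeric feature into one group per categorical value.
group_a = petal_length[species == "A"]
group_b = petal_length[species == "B"]

# Welch's t test does not assume that both groups have equal variance.
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)

print(f"mean A = {group_a.mean():.2f}, mean B = {group_b.mean():.2f}")
print(f"t = {t_stat:.2f}, p-value = {p_value:.6f}")
```

A small p-value suggests that the difference between the two group averages is unlikely to be due to chance alone.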

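The t test is for a categorical variable with 2 values (2 groups).

The ANOVA test is for a categorical variable with more than 2 values (more than 2 groups).

The ANOVA test checks whether the averages of more than 2 groups are all equal, or whether at least one of them is statistically different from the others.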
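For more than 2 groups, a one-way ANOVA could be run as in the sketch below. It assumes SciPy is available, and the three groups are made-up numbers chosen only to show the call.

```python
from scipy.stats import f_oneway

# Made-up values of one numeric feature for three categories (illustration only).
group_a = [1.4, 1.5, 1.3, 1.6, 1.4]
group_b = [4.7, 4.5, 4.9, 4.0, 4.6]
group_c = [6.0, 5.1, 5.9, 5.6, 5.8]

# One-way ANOVA: are all group averages equal, or does at least one differ?
f_stat, p_value = f_oneway(group_a, group_b, group_c)

print(f"F = {f_stat:.2f}, p-value = {p_value:.8f}")
```

A small p-value says that at least one group average differs from the others; it does not say which one, so a follow-up comparison between pairs of groups is usually needed.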


Multivariate Analysis:

This is the most pragmatic kind of analysis. In the real world almost every analysis is multivariate, as hardly any outcome can be predicted from just one or two independent factors. We cannot predict the weather of an area based only on its elevation & humidity. Several other factors, such as the location (latitude & longitude), radiant energy of the region, moisture, temperature etc., contribute to the weather condition of a region. As the name suggests, it involves the analysis of more than 2 variables.

A simple example of multivariate analysis: a cluster analysis showing 3 groups
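A cluster analysis with 3 groups, like the one in the caption above, might be sketched as below. Everything here is an assumption made for illustration: the data is synthetic (three blobs generated around made-up centers), the features are unnamed, and k-means via SciPy's `kmeans2` is just one of many clustering methods.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Synthetic data: 3 features per row, drawn around three made-up centers.
rng = np.random.default_rng(0)
true_centers = np.array([
    [0.0, 0.0, 0.0],
    [5.0, 5.0, 0.0],
    [0.0, 5.0, 5.0],
])
points = np.vstack([
    center + rng.normal(scale=0.5, size=(30, 3)) for center in true_centers
])

# k-means with k=3; "++" picks well-spread starting centroids.
# The result depends on the random initialization, so runs may differ.
centroids, labels = kmeans2(points, 3, minit="++")

print("estimated centroids:")
print(np.round(centroids, 2))
print("points per cluster:", np.bincount(labels, minlength=3))
```

Each row gets a cluster label based on all three features at once, which is what makes this a multivariate analysis rather than a series of one-feature checks.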


We will dig deeper into multivariate analysis in later discussions, as this segment requires special attention.

One Extra Bite:

The tests we discussed above can be categorized into two kinds: parametric tests & non-parametric tests.

 

Parametric tests:

Tests which make assumptions about the parameters of the population from which the sample is drawn (often the assumption is that the population data is normally distributed).

 

Examples of parametric tests: Pearson Correlation, t test / z test

 

Non-parametric tests:

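These tests do not assume a specific distribution for the population, so they can be used on data from a wide range of distributions. They are not entirely assumption-free, though; for example, they usually still assume that the observations are independent.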

 

Examples of non-parametric tests: Kendall’s rank correlation, Spearman’s rank correlation

* Generally, parametric tests have more statistical power than non-parametric tests, provided their assumptions hold.
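To see the difference between a parametric and a non-parametric measure, the sketch below compares Pearson, Spearman and Kendall on a relationship that is perfectly monotonic but not linear. The data (`y = x ** 3`) is made up for the purpose, and SciPy is assumed to be installed.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

# Made-up data: y always increases with x, but not along a straight line.
x = np.arange(1, 11, dtype=float)
y = x ** 3

pearson_r, _ = pearsonr(x, y)      # parametric: measures linear association
spearman_rho, _ = spearmanr(x, y)  # rank-based: measures monotonic association
kendall_tau, _ = kendalltau(x, y)  # rank-based: counts concordant pairs

print(f"Pearson  r   = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
print(f"Kendall  tau = {kendall_tau:.3f}")
```

The two rank-based coefficients come out as exactly 1 because the ordering of x and y agrees everywhere, while Pearson stays below 1 because the points do not lie on a straight line.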



Once we are done with the exploratory analysis, we have a clearer understanding of our data. We understand the underlying behaviors better and can explain the data far better than by random guessing. Many BI tools in use today do exactly what we just discussed. Tools such as Microsoft Power BI or Tableau, or any other tool used to create dashboards that tell stories based on data, make use of exploratory data analysis to do what they do.

When we are done with the exploration, we then try to use that knowledge to make predictions. We want to use what we already know from our data to predict what the future could be.

So the next topic on our list is Application of Machine Learning.