Before looking at dummy variables, let us be clear about categorical variables. A categorical variable in a dataset is one whose values are labels or categories rather than numbers, for example Gender or Nationality. In the following example, the STATE, STATUS & Region columns are categorical variables.
To perform numerical calculations with categorical variables, we convert them to numbers. Once the conversion is done, an analyst can run a regression or other mathematical operations on the variable, which was impossible before: we cannot take the average of three values when the values are "high", "medium" & "low". The numeric variables created by this conversion are called dummy variables. In the above example, converting the values of the "STATUS" column into 1s & 0s required 3 new columns. The 3 columns from the right are the dummy variables, and together they represent the same information as the categorical variable "STATUS".
Dummy variables are quantitative in nature. They take only two values, 0 & 1, where 1 means the observation belongs to a given category and 0 means it does not.
How to decide the optimal number of dummy variables needed?
Please note that the number of dummy variables required to represent a categorical variable should not be equal to the number of values that variable can assume. To express a categorical variable that can assume n values, we need to define only n-1 dummy variables.
Let us understand this with the example in hand. Assume we are interested in 'academic success', a variable that can assume three values, 'high', 'medium' & 'low'. We can represent the same using only 2 dummy variables as follows.
X = 1 when 'academic success' is 'high' else 0
Y = 1 when 'academic success' is 'low' else 0
Notice that we did not have to create a dummy variable for the value "medium". If X=0 AND Y=0, then 'academic success' must be "medium".
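The X and Y definitions above can be sketched in a few lines of plain Python. The list of observations and the function name encode are made up for illustration only.

```python
# Hypothetical observations of the 'academic success' variable.
academic_success = ["high", "medium", "low", "medium", "high"]


def encode(value):
    """Return (X, Y) for one observation; 'medium' gets no dummy of its own."""
    x = 1 if value == "high" else 0
    y = 1 if value == "low" else 0
    return x, y


dummies = [encode(v) for v in academic_success]

# Print each original value next to its two dummy variables.
for value, (x, y) in zip(academic_success, dummies):
    print(f"{value:<6}  X={x}  Y={y}")
```

Every "medium" row comes out as X=0, Y=0, so two columns are enough to tell the three values apart.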
Likewise, we always use n-1 dummy variables for a categorical variable with n values. It might be tempting to create n dummy variables, but the nth one is redundant: it carries no information that cannot be derived from the other n-1. In a regression that includes an intercept, keeping all n makes the dummy columns add up to the intercept column, which causes perfect multicollinearity (often called the dummy variable trap), so the extra variable should be left out.
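The redundancy of the nth dummy variable can be checked numerically. The sketch below assumes NumPy is installed and uses made-up data; it compares the rank of a regression design matrix built with all 3 dummies against one built with only 2.

```python
import numpy as np

# Hypothetical 'academic success' observations.
levels = ["high", "medium", "low", "medium", "high", "low"]

intercept = np.ones(len(levels))
high = np.array([1.0 if v == "high" else 0.0 for v in levels])
medium = np.array([1.0 if v == "medium" else 0.0 for v in levels])
low = np.array([1.0 if v == "low" else 0.0 for v in levels])

# Design matrix with an intercept and all n = 3 dummy variables.
full = np.column_stack([intercept, high, medium, low])
# Design matrix with an intercept and only n - 1 = 2 dummy variables.
reduced = np.column_stack([intercept, high, low])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print("columns:", full.shape[1], "rank:", rank_full)
print("columns:", reduced.shape[1], "rank:", rank_reduced)
```

The full matrix has 4 columns but only rank 3, because high + medium + low equals the intercept column in every row. The reduced matrix has 3 columns and rank 3, so its coefficients can be estimated uniquely.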
Once a categorical variable is represented as a dummy variable, that dummy variable can be utilized in regression analysis just like any other quantitative variable.
For example, assume we want to understand the relation between individual income and academic success (i.e., high, medium or low). The regression equation might be:
Income = b0 + b1*X1 + b2*X2
where b0, b1, and b2 are regression coefficients. X1 and X2 are dummy variables defined as:
X1 = 1, if Academic success is high; X1 = 0, otherwise.
X2 = 1, if Academic success is low; X2 = 0, otherwise.
The value of the categorical variable that is not represented explicitly by a dummy variable is called the reference group. In this example, the reference group consists of people with medium academic success.
In analysis, each dummy variable is compared with the reference group. In this example, a positive regression coefficient means that income is higher for the group coded by that dummy variable than for the reference group; a negative regression coefficient means that it is lower. If the regression coefficient is statistically significant, the income difference from the reference group is also statistically significant.
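This means that b0 is the expected income of the reference group (medium academic success), b0 + b1 is the expected income of people with high academic success, and b0 + b2 is the expected income of people with low academic success.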
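To see the comparison with the reference group in numbers, here is a minimal sketch that fits the income equation by ordinary least squares. It assumes NumPy is installed; the income figures are invented purely for illustration and are not real data.

```python
import numpy as np

# Hypothetical data: two people in each 'academic success' group.
success = ["high", "high", "medium", "medium", "low", "low"]
income = np.array([60.0, 70.0, 40.0, 50.0, 20.0, 30.0])

# X1 = 1 for 'high', X2 = 1 for 'low'; 'medium' is the reference group.
x1 = np.array([1.0 if s == "high" else 0.0 for s in success])
x2 = np.array([1.0 if s == "low" else 0.0 for s in success])

# Columns: intercept, X1, X2.
design = np.column_stack([np.ones(len(success)), x1, x2])

# Least-squares fit of Income = b0 + b1*X1 + b2*X2.
coeffs, *_ = np.linalg.lstsq(design, income, rcond=None)
b0, b1, b2 = coeffs

print("b0:", round(b0, 2))
print("b1:", round(b1, 2))
print("b2:", round(b2, 2))
```

With these made-up numbers the group means are 65 (high), 45 (medium) and 25 (low), so the fit gives b0 of about 45, b1 of about 20 and b2 of about -20: each coefficient is the gap between that group's mean income and the reference group's.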
How to create dummy variables in Python:
Please note that even though the STATUS column shown here displays only two values, high & low, it actually contains three values: high, med & low. That is why get_dummies(), when called, created three columns, one for each categorical value, and filled them with the corresponding 1s & 0s. By default get_dummies() creates one column per value; pass drop_first=True to get n-1 columns instead.
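The call described above can be sketched as follows. This assumes pandas is installed; the STATE and STATUS values are hypothetical stand-ins, not the dataset used in this post.

```python
import pandas as pd

# Hypothetical data with a three-valued STATUS column.
df = pd.DataFrame({
    "STATE": ["A", "B", "C", "D"],
    "STATUS": ["high", "low", "med", "high"],
})

# Default behaviour: one 0/1 column per distinct value (3 columns here).
all_dummies = pd.get_dummies(df["STATUS"], prefix="STATUS", dtype=int)

# drop_first=True drops the first value in sorted order (n - 1 = 2 columns).
reduced_dummies = pd.get_dummies(
    df["STATUS"], prefix="STATUS", dtype=int, drop_first=True
)

# Attach the dummy columns to the original table.
encoded = pd.concat([df, all_dummies], axis=1)

print(encoded)
print(reduced_dummies)
```

Here drop_first=True removes STATUS_high, which makes "high" the reference group. If a different reference group is wanted, one option is to create all the columns and drop the chosen one by hand.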
7 Comments
Wow. Loved your article. I was struggling to find one single article on internet that could explain dummy variables to me & show me the python code for it in literally dummy language. Thanks for writing this. Love from Volgograd.
Man!! your video is so easy to interpret. & u've got a lovely Indian accent.
Love from Minsk.
Your code isn't working for me. get dummies is not defined... can anyone help
Hey Anonymous :P
Actually he has missed this one line of code that you will need to run beforehand.
import pandas as pd
this should fix your issue.
Thanks @Zahar N.
It resolved my issue.
The elaboration and the details of the concept is very precise.
Appreciate it.
Very detailed elaboration of the concept. Good to know the back-story of frequently used python function.
BRING THIS DEAD POST TO LIFE