What is a dummy variable, and how do you introduce dummy variables in regression using Python?



Before understanding dummy variables, let us be clear about categorical variables. A categorical variable in a dataset is a variable whose values are labels rather than numbers, for example Gender or Nationality. In the following example, the STATE column, the STATUS column & the Region column can all be called "categorical variables".


To perform numerical calculations with categorical variables, we first convert them to numbers. Once the conversion is done, an analyst can run a regression or any other mathematical operation on the variable, which was impossible before: we cannot take the average of three values when the values are "high", "medium" & "low". The numeric columns produced by this conversion are called dummy variables. In the above example, when the values of the column "STATUS" are converted into 1s & 0s, we need 3 new columns to express them. The 3 columns from the right are the dummy variables, and they represent the same information as the categorical variable "STATUS".

Dummy variables are quantitative in nature. They have a small range and generally take only two values, 0 & 1.


How to decide the optimal number of dummy variables needed?

Please note that the number of dummy variables required to represent a categorical variable does not have to equal the number of values that variable can take. To express a categorical variable that can assume n values, we need to define only n-1 dummy variables.

Let us understand this with the example in hand. Assume we are interested in 'academic success', a variable that can assume three values, 'high', 'medium' & 'low'. We can represent the same using only 2 dummy variables as follows.


X = 1 when 'academic success' is 'high' else 0

Y = 1 when 'academic success' is 'low' else 0


Here, if you noticed, we did not have to create a dummy variable for the value "medium". If X = 0 AND Y = 0, it means that 'academic success' = "medium".
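The encoding above can be sketched in a few lines of plain Python. This is only an illustration: the sample values in `levels` and the helper name `encode` are made up for this example and are not part of any dataset in this post.

```python
# Hypothetical sample values for 'academic success' (not from a real dataset)
levels = ["high", "medium", "low", "medium", "high"]

def encode(level):
    """Return (X, Y) for one 'academic success' value; 'medium' is the baseline."""
    x = 1 if level == "high" else 0
    y = 1 if level == "low" else 0
    return x, y

encoded = [encode(level) for level in levels]

for level, (x, y) in zip(levels, encoded):
    print(f"{level:<6} -> X={x}, Y={y}")
```

Every "medium" row comes out as X=0, Y=0, so no third column is needed to tell the three values apart.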

Likewise, in every case we choose n-1 dummy variables for a categorical variable with n values. It might be tempting to use n dummy variables, but the nth one is redundant: it carries no information that cannot be derived from the other n-1. It also costs extra computation, and in a regression that includes an intercept the n dummies always add up to 1, which makes them perfectly collinear with the intercept term (this is often called the dummy variable trap). Hence the nth dummy variable should be left out.
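To see the redundancy in numbers, here is a small sketch using NumPy. The observations in `levels` are hypothetical. It builds one design matrix with an intercept plus all three dummies and one with an intercept plus only two, then compares the number of columns with the matrix rank.

```python
import numpy as np

# Hypothetical observations of 'academic success'
levels = ["high", "medium", "low", "medium", "high", "low"]

intercept = np.ones(len(levels))
high = np.array([1.0 if v == "high" else 0.0 for v in levels])
medium = np.array([1.0 if v == "medium" else 0.0 for v in levels])
low = np.array([1.0 if v == "low" else 0.0 for v in levels])

# Intercept + n dummies: 4 columns, but high + medium + low equals the intercept column
X_full = np.column_stack([intercept, high, medium, low])

# Intercept + (n-1) dummies: 3 columns, none derivable from the others
X_reduced = np.column_stack([intercept, high, low])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("full:   ", X_full.shape[1], "columns, rank", rank_full)
print("reduced:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

Both matrices have rank 3, so the fourth column in the full version adds nothing, and a regression on it has no unique solution for the coefficients.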


Once a categorical variable is represented as a dummy variable, that dummy variable can be utilized in regression analysis just like any other quantitative variable.

For example, assume we want to understand the relation between individual income and academic success (i.e., high, medium or low). The regression equation might be:

Income = b0 + b1X1 + b2X2

where b0, b1, and b2 are regression coefficients. X1 and X2 are dummy variables defined as:
X1 = 1, if Academic success is high; X1 = 0, otherwise.
X2 = 1, if Academic success is low; X2 = 0, otherwise.

The value of the categorical variable that is not represented explicitly by a dummy variable is called the reference group. In this example, the reference group consists of people with medium academic success.

In analysis, each dummy variable is compared with the reference group. In this example, a positive regression coefficient means that income is higher for the group the dummy variable represents than for the reference group; a negative regression coefficient means that income is lower. If the regression coefficient is statistically significant, the income difference from the reference group is also statistically significant.

To elaborate further, say:

$1000 = b0 + b1X1 + b2X2 (When X1 = 1 & X2 = 0)
$250  = b0 + b1X1 + b2X2 (When X1 = 0 & X2 = 1)
$500  = b0 + b1X1 + b2X2 (When X1 = 0 & X2 = 0)

This means,


$1000 = b0 + b1
$250   = b0 + 0 + b2
$500   = b0 

So we can say that a "high" academic success translates to a $1000 income,
a "medium" academic success translates to a $500 income &
a "low" academic success translates to a $250 income.
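The three equations above can also be solved for the coefficients directly. The sketch below uses NumPy (my choice for illustration) and treats the three incomes as exact, which is a simplification: a real regression fits noisy data rather than solving three equations.

```python
import numpy as np

# One row per equation; columns are: intercept, X1 (high), X2 (low)
A = np.array([
    [1.0, 1.0, 0.0],  # high:   X1 = 1, X2 = 0
    [1.0, 0.0, 1.0],  # low:    X1 = 0, X2 = 1
    [1.0, 0.0, 0.0],  # medium: X1 = 0, X2 = 0 (reference group)
])
income = np.array([1000.0, 250.0, 500.0])

# Solve A @ [b0, b1, b2] = income
b0, b1, b2 = np.linalg.solve(A, income)

print("b0 =", b0)
print("b1 =", b1)
print("b2 =", b2)
```

This gives b0 = 500, b1 = 500 and b2 = -250. The intercept b0 is the income of the reference group ("medium"), b1 is how much more the "high" group earns than the reference group, and b2 is how much less the "low" group earns.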

How to introduce dummy variables in Python:


Everything we have covered so far is the statistical reasoning behind dummy variables, & it will certainly help you build a strong intuition about the subject. However, if you are going to use Python for your analysis, you will rarely need to go through all the above steps by hand & work out how many dummy variables your dataset needs. Python's Pandas library already has a super cool function which takes care of that & saves us the hassle.

Here is some sample code that creates dummy variables for the categorical column we need.

These are the top 5 rows of our dataset. When we apply the code below,




in the first line, we use a function from Pandas called get_dummies(), which creates the dummy variables as new columns and stores them, in our example, in df_dummy. By default it creates one column per category; passing drop_first=True keeps only n-1 of them. We then concatenate the new DataFrame with our old DataFrame so as to view them together for better understanding. Here is how the output appears.


Please note that although the STATUS column shows only two values here, high & low, it actually has three values: high, med & low. That is why get_dummies(), when called, created three columns, one for each categorical value, & filled them with the corresponding 0s & 1s.
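Since the original code and dataset are not reproduced here, the following is a minimal sketch of the steps described above, not the exact code from the post. The DataFrame contents are made up, and only the column name STATUS and the variable name df_dummy come from the text. Note the `import pandas as pd` line, which is needed before get_dummies() can be called.

```python
import pandas as pd

# Hypothetical data; the first 5 rows show only 'high' & 'low', the 6th is 'med'
df = pd.DataFrame({
    "STATE": ["A", "B", "C", "D", "E", "F"],
    "STATUS": ["high", "low", "high", "low", "high", "med"],
})

# One dummy column per category (the default behaviour of get_dummies)
df_dummy = pd.get_dummies(df["STATUS"], dtype=int)

# Put the dummy columns next to the original columns to view them together
df_combined = pd.concat([df, df_dummy], axis=1)
print(df_combined.head())

# Keep only n-1 dummy columns; the dropped category becomes the reference group
df_dummy_reduced = pd.get_dummies(df["STATUS"], drop_first=True, dtype=int)
print(df_dummy_reduced.columns.tolist())
```

With drop_first=True, Pandas drops the first category in sorted order, which is "high" in this sketch, so "high" becomes the reference group here rather than "med".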




 

7 Comments

  1. Wow. Loved your article. I was struggling to find one single article on internet that could explain dummy variables to me & show me the python code for it in literally dummy language. Thanks for writing this. Love from Volgograd.

  2. Man!! your video is so easy to interpret. & u've got a lovely Indian accent.
    Love from Minsk.

  3. Your code isn't working for me. get dummies is not defined... can anyone help

  4. Hey Anonymous :P
    Actually he has missed this one line of code that you will need to run beforehand.
    import pandas as pd

    this should fix your issue.


  5. Thanks @Zahar N.
    It resolved my issue.

  6. The elaboration and the details of the concept is very precise.
    Appreciate it.

  7. Very detailed elaboration of the concept. Good to know the back-story of frequently used python function.
