Concepts of Variance, Standard Deviation, Median, Percentile, Quantile, Median Absolute Deviation and IQR

The mean tells us about the central tendency of the data, but it is sensitive to outliers. If our data has even one or a few outliers, the mean can be pulled far away from the typical values.

There is also the concept of spread. Of course we can look at a plot of our data & get an idea of the spread of the distribution. However, we also need a quantitative measure to express the spread of the data. This is where variance comes into the picture. Variance represents spread: it is the average squared distance of the values from the mean.


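variance : $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$, where $\mu$ is the mean of the data and $n$ is the number of values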



Standard Deviation is the square root of variance. Hence

std : $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$



Like variance, std gives us an idea of the spread of the data, and it is expressed in the same units as the data itself.


Python code for std is : np.std( column )

Low std = low spread

high std = high spread


However, here too the presence of outliers can distort our measure of spread immensely, because the formula for variance, & hence for std as well, depends on the mean value.
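
To make this concrete, here is a minimal sketch that computes variance and std by hand and then shows how a single outlier inflates std. The sample values and the helper name population_variance are made up for illustration; the helper divides by n, which is what np.std does by default.

```python
import numpy as np

# Hypothetical sample: five ordinary values, then the same values plus one outlier.
clean = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
with_outlier = np.append(clean, 100.0)

def population_variance(values):
    """Mean of the squared deviations from the mean (divides by n)."""
    mean = values.mean()
    return ((values - mean) ** 2).mean()

var_clean = population_variance(clean)
std_clean = np.sqrt(var_clean)        # std is the square root of variance
std_outlier = np.std(with_outlier)    # np.std also divides by n by default

print("variance (clean):   ", var_clean)
print("std (clean):        ", std_clean)
print("std (with outlier): ", std_outlier)
```

One extra value is enough to make std many times larger, even though the other five values did not change.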


Median: 

The median can save us from this outlier influence problem. The median is the center value of our dataset & it also describes the central tendency of the data.

python code : np.median( column )

The presence of a few outliers barely affects it. To find the median, sort the data & take the middle value. If the number of values is even, take the average of the two middle values.
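
The sort-and-take-the-middle recipe can be sketched as below. The data and the function name median_by_sorting are illustrative assumptions; in practice np.median does the same job.

```python
import numpy as np

def median_by_sorting(values):
    """Sort the data and return the middle value (or the mean of the two middle values)."""
    ordered = np.sort(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical sample, with and without one outlier.
clean = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
with_outlier = np.append(clean, 100.0)

median_clean = median_by_sorting(clean)
median_outlier = median_by_sorting(with_outlier)

print("mean:  ", clean.mean(), "->", with_outlier.mean())
print("median:", median_clean, "->", median_outlier)
```

The mean jumps when the outlier is added, while the median moves only slightly.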


Percentile:

The p-th percentile is the value below which p percent of the data lies. For example, 90% of the values lie below the 90th percentile.


Quantiles:

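Quantiles are cut points that split the sorted data into equal-sized parts. When the data is split into four parts they are called quartiles: the 25th, 50th, 75th, 100th percentiles = 1st, 2nd, 3rd, 4th quartile. The 50th percentile is the median.

Python code: np.percentile( column, np.arange( 25, 101, 25 ) )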
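
A small sketch of the quartile computation on made-up data (the integers 1 to 10). np.quantile is the same calculation expressed with fractions between 0 and 1 instead of percentages.

```python
import numpy as np

# Hypothetical data: the integers 1 through 10.
data = np.arange(1, 11)

# 25th, 50th, 75th and 100th percentiles, i.e. the four quartiles.
quartiles = np.percentile(data, [25, 50, 75, 100])

# The same cut points, requested as fractions.
same_cut_points = np.quantile(data, [0.25, 0.50, 0.75, 1.00])

print("quartiles:", quartiles)
print("2nd quartile equals the median:", quartiles[1] == np.median(data))
```

With NumPy's default settings, a percentile that falls between two data points is linearly interpolated, so the result does not have to be a value that appears in the data.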


Median Absolute Deviation:

python code: 

from statsmodels import robust

robust.mad( column )


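MAD is the median of the absolute deviations of the values from the median of the data. It plays a similar role to std as a measure of spread, but because it is built on medians instead of the mean, outliers barely affect it.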
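
Here is a minimal sketch of MAD written with plain NumPy, so the definition is visible step by step. The function name median_abs_deviation and the data are assumptions for illustration. As far as I know, robust.mad in statsmodels additionally divides the result by a constant of about 0.6745 by default, so that it is comparable to std for normally distributed data; the sketch shows that scaling as a separate, optional step.

```python
import numpy as np

def median_abs_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    return np.median(np.abs(values - center))

# Hypothetical sample, with and without one outlier.
clean = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
with_outlier = np.append(clean, 100.0)

mad_clean = median_abs_deviation(clean)
mad_outlier = median_abs_deviation(with_outlier)

# Assumed normal-consistency constant (about 0.6745); dividing by it
# puts MAD on the same scale as std for normally distributed data.
NORMAL_SCALE = 0.6745
scaled_mad_clean = mad_clean / NORMAL_SCALE

print("std:", np.std(clean), "->", np.std(with_outlier))
print("MAD:", mad_clean, "->", mad_outlier)
```

Adding the outlier changes MAD only a little, while std grows by more than an order of magnitude.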




IQR (Inter Quartile Range):

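IQR = 75th percentile - 25th percentile

The middle 50% of the data lies between these two percentiles, so the IQR is the width of the range that holds the central half of the values.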
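
The IQR can be sketched with np.percentile as follows. The data (the integers 1 to 100) is made up for illustration, and the last lines check what fraction of the values falls between the two percentiles.

```python
import numpy as np

# Hypothetical data: the integers 1 through 100.
data = np.arange(1, 101)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fraction of values lying between the 25th and 75th percentiles.
inside = (data >= q1) & (data <= q3)
fraction_inside = inside.mean()

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fraction of data inside [Q1, Q3]:", fraction_inside)
```

On small datasets the fraction will not be exactly one half, because the percentiles are interpolated between data points.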


