The mean tells us about the central tendency of our data, but it is sensitive to outliers. If our data has one or a few outliers, the mean can be pulled far away from the typical values.
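To see how strongly a single outlier can pull the mean, here is a minimal sketch. The numbers are made up purely for illustration, and it uses Python's built-in statistics module so it runs without any extra packages.

```python
from statistics import mean

# Made-up sample values, used only for illustration.
values = [10, 12, 11, 13, 12]
with_outlier = values + [100]   # the same data plus one extreme value

mean_clean = mean(values)
mean_outlier = mean(with_outlier)

print(f"mean without outlier: {mean_clean:.2f}")
print(f"mean with outlier:    {mean_outlier:.2f}")
```

One extra value more than doubles the mean here, even though five of the six points are unchanged.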
There is also the concept of spread. Of course we can look at a plot of our data and get an idea of the spread of the distribution, but we also need a quantitative measure to express it. This is where variance comes into the picture: variance measures how far the values are spread around the mean.
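variance : σ² = (1/n) · Σ (xᵢ − μ)², where μ is the mean and n is the number of values
Standard deviation is the square root of variance. Hence
std : σ = √( (1/n) · Σ (xᵢ − μ)² )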
Std also gives us an idea of the spread of the data, and it is in the same units as the data itself.
Python code for std: np.std( column )
Low std = low spread
High std = high spread
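As a rough sketch of what the std measures, the snippet below computes the population variance and std directly from their definitions (dividing by n, which is also what np.std does by default). The helper names and the two small datasets are invented for illustration.

```python
import math

def variance(data):
    """Population variance: mean of the squared deviations from the mean."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

def std(data):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(data))

# Two made-up datasets with the same mean (5) but different spread.
tight = [4, 5, 5, 5, 6]   # values close to the mean
wide = [1, 3, 5, 7, 9]    # values far from the mean

print(f"tight: variance={variance(tight):.2f}, std={std(tight):.2f}")
print(f"wide:  variance={variance(wide):.2f}, std={std(wide):.2f}")
```

Both datasets have the same mean, so the mean alone cannot tell them apart. The std can.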
However, an outlier can badly corrupt this measure of spread too, because the formula for variance, and hence for std as well, depends on the mean.
Median:
The median can save us from this outlier problem. The median is the center value of our dataset, so it also describes the central tendency of the data.
Python code: np.median( column )
The presence of outliers barely affects it. To find the median, sort the data and take the middle value (with an even number of values, take the average of the two middle values).
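The sort-and-take-the-middle recipe can be sketched in a few lines. This is a hand-written illustration with made-up numbers, not how np.median is implemented internally.

```python
def median(data):
    """Sort the data and return the middle value (or the mean of the two middle values)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Made-up sample values, used only for illustration.
values = [10, 12, 11, 13, 12]
with_outlier = values + [100]

print(median(values))        # middle of [10, 11, 12, 12, 13]
print(median(with_outlier))  # average of the two middle values of [10, 11, 12, 12, 13, 100]
```

Both calls give 12, so the extreme value 100 leaves the median where it was.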
Percentile:
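The p-th percentile is the value below which p% of the data lie. In other words, the percentile of a value tells us what percentage of the data lie below it.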
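Read the other way round, we can ask what percentage of the data lie below a given value. A minimal sketch, where the function name percent_below and the sample data are hypothetical:

```python
def percent_below(data, value):
    """Percentage of the data points that are strictly smaller than value."""
    count = sum(1 for x in data if x < value)
    return 100 * count / len(data)

# Made-up scores from 1 to 10, used only for illustration.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(percent_below(scores, 8))   # 7 of the 10 scores are below 8
```

Note that libraries differ in how they treat ties and how they interpolate between data points, so this strict "less than" count is only one possible convention.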
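Quantiles:
Quantiles split the sorted data into equal-sized parts. With four parts they are called quartiles: the 25th, 50th and 75th percentiles are the 1st, 2nd and 3rd quartiles (the 2nd quartile is the median), and the 100th percentile is the maximum.
Python code: np.percentile( column, np.arange( 25, 101, 25 ) )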
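For a version that runs without NumPy, the standard library's statistics.quantiles can compute the three quartile cut points. This sketch assumes the "inclusive" method, which interpolates linearly between data points and should agree with np.percentile's default behaviour. The data is made up.

```python
from statistics import quantiles

# Made-up sample values, used only for illustration.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

print(f"25th percentile (1st quartile): {q1}")
print(f"50th percentile (median):       {q2}")
print(f"75th percentile (3rd quartile): {q3}")
```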
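MAD (Median Absolute Deviation):
The median of the absolute deviations of the values from their median.
Python code:
from statsmodels import robust
robust.mad( column )
It plays a role similar to that of the std as a measure of spread, but since it is built on the median, outliers barely affect it.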
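To make the comparison with std concrete, here is a sketch that computes the median absolute deviation (the quantity robust.mad is based on) by hand and compares it with the std on data with and without an outlier. The helper raw_mad and the data are invented for illustration. The rescaling step reflects my understanding that statsmodels divides by a normal-consistency constant of roughly 0.6745 by default, so treat that part as an assumption and check the statsmodels documentation.

```python
from statistics import NormalDist, median, pstdev

def raw_mad(data):
    """Median of the absolute deviations from the median (unscaled)."""
    center = median(data)
    return median([abs(x - center) for x in data])

# Made-up sample values; the second list swaps one value for an outlier.
values = [10, 12, 11, 13, 12, 14, 9]
with_outlier = [10, 12, 11, 13, 12, 140, 9]

# Assumption: dividing by the 75th percentile of the standard normal
# (about 0.6745) makes the MAD comparable to the std for normal data.
scale = NormalDist().inv_cdf(0.75)
scaled_mad = raw_mad(values) / scale

print(f"std: {pstdev(values):.2f} -> {pstdev(with_outlier):.2f}")
print(f"MAD: {raw_mad(values)} -> {raw_mad(with_outlier)}")
print(f"scaled MAD (no outlier): {scaled_mad:.4f}")
```

The std jumps by more than an order of magnitude once the outlier is present, while the MAD does not move at all.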
IQR (Interquartile Range):
75th percentile - 25th percentile
The middle 50% of the data lies within the IQR, i.e. between the 25th and 75th percentiles.
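The IQR can be sketched with the standard library as well. The helper name iqr and the data are made up, and the "inclusive" method is an assumption chosen to mirror linear interpolation between data points.

```python
from statistics import quantiles

def iqr(data):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return q3 - q1

# Made-up sample values; the second list replaces the largest value with an outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

full_range_clean = max(values) - min(values)
full_range_outlier = max(with_outlier) - min(with_outlier)

print(f"IQR:   {iqr(values)} -> {iqr(with_outlier)}")
print(f"range: {full_range_clean} -> {full_range_outlier}")
```

Because the IQR only looks at the middle half of the sorted data, the extreme value changes the full range drastically but leaves the IQR untouched.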