One way of describing your data further is to measure its dispersion, that is, how spread out the values in your distribution are.
Three ways to do this are to calculate the range, the inter-quartile range and the standard deviation.
The range is the difference between your largest observed value and your smallest observed value.
The range = largest observed value - the smallest observed value
However, it is heavily affected by extreme values (outliers) in your sample, so it has to be used with caution.
Using the Inter-quartile range or standard deviation would be more useful to get an accurate picture of your distribution.
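As a minimal sketch of how a single outlier changes the range, the Python snippet below applies the formula above to a small set of heights. The data and the function name `value_range` are invented for illustration only.

```python
def value_range(values):
    """Return the largest observed value minus the smallest observed value."""
    return max(values) - min(values)

# Hypothetical heights in centimetres, invented for illustration.
heights = [158, 162, 165, 167, 170, 171, 174, 176]
heights_with_outlier = heights + [250]  # the same data plus one extreme value

range_without = value_range(heights)             # 176 - 158
range_with = value_range(heights_with_outlier)   # 250 - 158

print("Range without outlier:", range_without)
print("Range with outlier:", range_with)
```

One extra observation is enough to change the range from 18 to 92, which is why the range has to be used with caution.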
Inter-quartile range (IQR) = upper quartile (Q3) - lower quartile (Q1) (see under Averages and Percentiles)
i.e. it is the spread of the middle 50% of the distribution.
It is usually used when the Median is used as a measure of central tendency.
Using the inter-quartile range removes most of the effect of extreme values in your distribution, because the lowest 25% and the highest 25% of values are ignored.
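The inter-quartile range can be sketched in Python with the standard library's `statistics.quantiles`. There are several conventions for locating quartiles, so the exact figures depend on the method used; this sketch assumes Python's default ("exclusive") method, and the data are invented for illustration.

```python
import statistics

def interquartile_range(values):
    """Return Q3 - Q1 using Python's default quartile method."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical heights in centimetres, invented for illustration.
heights = [158, 162, 165, 167, 170, 171, 174, 176]
heights_with_outlier = heights + [250]  # the same data plus one extreme value

iqr_without = interquartile_range(heights)
iqr_with = interquartile_range(heights_with_outlier)

print("IQR without outlier:", iqr_without)
print("IQR with outlier:", iqr_with)
```

Adding the extreme value of 250 moves the inter-quartile range only slightly, whereas the range of the same data would change a great deal.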
If your data is normally distributed it will cluster around a central point in a symmetrical pattern.
When the values of that distribution are plotted against the frequency of their occurrence, the shape will appear as a 'bell curve' (see below). The bell curve is a symmetric distribution with the highest frequencies clustered around the mid-point and with the tails falling off evenly on either side. It has a number of characteristics:
There are many situations where you will find things distributed normally (i.e. where there are lots of occurrences around the middle of the distribution and fewer at each end, e.g. height). However, your data could be distributed in the following ways:
"Skewness" (the amount of skew) can be calculated. For example, you could use the SKEW() function in Excel.
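If you are not working in Excel, skewness can be sketched in Python. The function below uses the adjusted sample skewness formula, n / ((n - 1)(n - 2)) × Σ((x - mean) / s)³, which is understood to be the formula documented for Excel's SKEW(); treat that correspondence as an assumption. The function name and the data are invented for illustration.

```python
import statistics

def sample_skewness(values):
    """Adjusted sample skewness (assumed to match Excel's SKEW())."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three values are needed")
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (N - 1 divisor)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

# Hypothetical data sets, invented for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # long tail to the right
left_skewed = [-x for x in right_skewed]   # mirror image, long tail to the left

print("Symmetric:", sample_skewness(symmetric))
print("Right-skewed:", sample_skewness(right_skewed))
print("Left-skewed:", sample_skewness(left_skewed))
```

A value close to zero suggests a symmetrical distribution, a positive value a longer tail to the right, and a negative value a longer tail to the left.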
When you take repeated samples from a population and each sample is reasonably large (usually over 30 observations), the distribution of the sample Means will start to resemble a normal distribution, even if the population itself is not normally distributed. This is important because when we infer from a sample to a population, many common methods assume that this distribution of sample Means is approximately normal.
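This behaviour can be illustrated with a small simulation. The sketch below draws repeated samples of 30 values from a strongly skewed population (an exponential distribution, chosen here purely as an example) and looks at the distribution of the sample Means. The sample size, number of samples and random seed are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so that the run is repeatable

SAMPLE_SIZE = 30         # observations per sample
NUM_SAMPLES = 2000       # number of repeated samples

def draw_sample_mean():
    """Draw one sample from a skewed population and return its Mean."""
    # Exponential population with Mean 1 and standard deviation 1.
    sample = [rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    return statistics.mean(sample)

sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = 1.0 / SAMPLE_SIZE ** 0.5   # population SD / sqrt(sample size)

# Share of sample Means lying within one standard deviation of their Mean.
within_one_sd = sum(
    1 for m in sample_means if abs(m - mean_of_means) <= sd_of_means
) / NUM_SAMPLES

print("Mean of sample Means:", round(mean_of_means, 3))
print("SD of sample Means:", round(sd_of_means, 3), "expected about", round(expected_sd, 3))
print("Share within one SD:", round(within_one_sd, 3))
```

Although the population is far from symmetrical, the sample Means cluster around the population Mean, and roughly two thirds of them fall within one standard deviation, much as they would for a normal distribution.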
The Standard Deviation is a measure of how spread out the values in your distribution are.
It tells us how widely dispersed the values are around the Mean, that is, how far away they are from the Mean.
If the values are all close to the Mean then the values are not very spread out.
If there are values lying far from the Mean then that distribution of values would be wider than one where the values lie closer together.
It is useful to have a measure of this spread.
The Standard Deviation indicates, roughly, the typical amount by which the values deviate from the Mean.
The larger the standard deviation the greater is the spread of your data.
It is represented by SD, by the lower case Greek letter sigma (σ, usually for a population) or by the Latin letter s (usually for a sample).
It is usually used when your data is not too skewed (e.g. when it is approximately normally distributed).
A high standard deviation would indicate that your data is spread out.
A lower value would indicate that your values are less spread out and more concentrated around the Mean.
The Variance is the average of the squared differences from the Mean, and is part of working out the standard deviation.
The standard deviation is the square root of the variance. The standard deviation tends to be used in preference to the Variance because it is in the same units (e.g. centimetres, minutes etc.) as your original data, so it is easier to interpret.
To calculate the variance (of a sample):
s² = Σ(x - x̄)² / (N - 1)
To calculate the standard deviation (of a sample):
s = √[ Σ(x - x̄)² / (N - 1) ]
where:
x are the observed values of your sample
x̄ is the Mean value of your observations
N is the number of observations in your sample
Σ (sigma) signifies working out the sum (or total) of a set of values (see below for more information).
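The calculation of the sample variance and standard deviation can be sketched step by step in Python. This sketch assumes the N - 1 divisor that is conventionally used for a sample, and cross-checks the result against the standard library's `statistics.variance` and `statistics.stdev`. The function names and the data are invented for illustration.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared differences from the Mean, divided by N - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    squared_differences = [(x - mean) ** 2 for x in values]
    return sum(squared_differences) / (n - 1)

def sample_standard_deviation(values):
    """Square root of the sample variance, in the units of the data."""
    return math.sqrt(sample_variance(values))

# Hypothetical waiting times in minutes, invented for illustration.
times = [2, 4, 4, 4, 5, 5, 7, 9]

variance = sample_variance(times)                       # in minutes squared
standard_deviation = sample_standard_deviation(times)   # in minutes

print("Variance:", round(variance, 3))
print("Standard deviation:", round(standard_deviation, 3))

# Cross-check against the standard library (which also uses N - 1).
print("Library variance:", round(statistics.variance(times), 3))
print("Library standard deviation:", round(statistics.stdev(times), 3))
```

Here the Mean is 5 minutes and the squared differences add up to 32, so the variance is 32 / 7 and the standard deviation is its square root, a little over 2 minutes.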
The standard deviation slices up a normal distribution into 'standard' sized pieces, and each slice contains a known percentage of the total observations.
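As a rough sketch of where these known percentages come from: for an exactly normal distribution, the share of observations lying within k standard deviations of the Mean is erf(k / √2), where erf is the error function available in Python's `math` module. The function name `share_within` is invented for illustration, and real data that is only approximately normal will match these figures only approximately.

```python
import math

def share_within(k):
    """Share of a normal distribution within k standard deviations of the Mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    percentage = 100 * share_within(k)
    print(f"Within {k} standard deviation(s) of the Mean: {percentage:.1f}%")
```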
If a data distribution is approximately normal then approximately: