
Quantitative data collection and analysis


One way of describing your data further is to measure the dispersion of the distribution of your data in order to see how spread out it is.

Three ways to do this are to calculate:

  • The range
  • The Inter-quartile range
  • The standard deviation

The range is the difference between your largest observed value and your smallest observed value.

The range = largest observed value - the smallest observed value

However, it is heavily affected by extreme values (outliers) in your sample, so it has to be used with caution.
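The effect of an outlier on the range can be seen in a short Python sketch (the heights below are made-up illustrative data):

```python
# Range = largest observed value - smallest observed value.
# A single extreme value (outlier) can change it dramatically.
heights = [150, 152, 155, 158, 160, 162, 165]  # hypothetical sample (cm)

data_range = max(heights) - min(heights)
print(data_range)  # 15

# Adding one extreme value inflates the range:
with_outlier = heights + [210]
print(max(with_outlier) - min(with_outlier))  # 60
```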

Using the Inter-quartile range or standard deviation would be more useful to get an accurate picture of your distribution.

Inter-quartile range (IQR) = upper quartile (Q3) - lower quartile (Q1) (see under Averages and Percentiles)

i.e. it is the middle 50% of the distribution

It is usually used when the Median is used as a measure of central tendency.

Using the inter-quartile range will eliminate the effect of extreme values in your distribution.
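As a sketch of this robustness, assuming a small made-up data set, the IQR can be calculated with Python's standard `statistics` module:

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical data

# statistics.quantiles with n=4 returns the three quartile
# cut points [Q1, Q2, Q3]; method="inclusive" interpolates
# between data points in the same way as most spreadsheets
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1  # the middle 50% of the distribution
print(iqr)  # 5.0

# Replacing the top value with an extreme outlier leaves the
# IQR unchanged, while the range jumps from 10 to 99
with_outlier = scores[:-1] + [100]
q1b, _, q3b = statistics.quantiles(with_outlier, n=4, method="inclusive")
print(q3b - q1b)  # 5.0
```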

If your data is normally distributed it will cluster around a central point in a symmetrical pattern.

When the values of that distribution are plotted against the frequency of their occurrence, the shape appears as a 'bell curve'. The bell curve is a symmetric distribution with the highest frequencies clustered around the mid-point and with the tails falling off evenly on either side. It has a number of characteristics:

  • The Mean, Median and Mode are all the same value
  • It is symmetrical around the Mean value (i.e. one half of the curve exactly mirrors the other half)
  • The tails are asymptotic - they get ever closer to the horizontal (X) axis but never reach it

Image of a normal distribution bell curve


There are many situations where you will find things distributed normally (i.e. where there are lots of occurrences around the middle of the distribution and fewer at each end, e.g. height). However, your data could be distributed in the following ways:

  • Skewed to the left, i.e. there is a long tail to the left (negative skew)
  • Skewed to the right, i.e. there is a long tail to the right (positive skew)
  • All mixed up (with no obvious pattern)

"Skewness" (the amount of skew) can be calculated, for example you could use the SKEW() function in Excel.

When you deal with lots of data (usually samples of over 30 cases) and you take repeated samples from a population, you will find that the distribution of the sample means starts to resemble a normal distribution, even if the population itself is not normal. This is important because when we infer from a sample to a population we rely on this assumption of normality.
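This tendency can be illustrated with a small simulation. The "population" below is hypothetical exponential waiting-time data, which is strongly right-skewed, yet the means of repeated samples of 30 cluster around the population mean:

```python
import random
import statistics

random.seed(42)  # make the simulation reproducible

# A clearly non-normal (right-skewed) population: exponential waiting times
population = [random.expovariate(1.0) for _ in range(10_000)]

# Take many repeated samples of size 30 and record each sample's mean
sample_means = [
    statistics.mean(random.sample(population, 30)) for _ in range(1_000)
]

# The sample means cluster symmetrically around the population mean,
# even though the population itself is skewed
print(abs(statistics.mean(sample_means) - statistics.mean(population)) < 0.1)  # True
```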

The Standard Deviation is a measure of how spread out the values in your distribution are.

It tells us how widely dispersed the values are around the Mean, that is, how far away they are from the Mean.

If the values are all close to the Mean then these values are not spread out (distributed).

If there are values lying far from the Mean then that distribution of values would be wider than one where the values lie closer together.  

It is useful to have a measure of this spread. 


The Standard Deviation indicates the average amount by which all the values deviate from the Mean.

The larger the standard deviation the greater is the spread of your data.

It is represented by SD, the lower case Greek letter sigma (σ), or the Latin letter s.

It is usually used when your data is not too skewed (i.e. roughly normally distributed).

A high standard deviation would indicate that your data is spread out.

A lower value would indicate that your values are less spread out and more concentrated around the Mean.

The Variance is the average of the squared differences from the Mean - and is part of working out the standard deviation.

Standard deviation is the square root of the variance. This tends to be used in preference to the Variance as it has the benefit of being in the same units (e.g. centimetres, minutes etc.) as your original data so is easier to interpret.


To calculate the variance (of a sample):

  • Work out the Mean (the average) of your set of variables
  • For each value, subtract the Mean from it and square the result (the squared difference). Squaring makes all the differences positive and gives extra weight to larger deviations.
  • Calculate the average of those squared differences: add up the values you have just calculated and divide by the number of cases minus one (see below as to why minus one)
  • If you are working with a sample (rather than the whole population) you need to use N - 1 (subtract one from the number of your observations) to extrapolate to the population. Just use N when you are only interested in your own set of data and don't wish to make any generalisations.

To calculate the standard deviation (of a sample):

  • Take the square root of the variance
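The steps above can be sketched in Python with made-up observations; the standard library's `statistics.variance` and `statistics.stdev` give the same answers:

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 7]  # hypothetical observations

# Step 1: work out the Mean
mean = sum(data) / len(data)

# Step 2: the squared differences from the Mean
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: average the squared differences, dividing by N - 1
# because this is a sample rather than the whole population
variance = sum(squared_diffs) / (len(data) - 1)

# Step 4: the standard deviation is the square root of the variance,
# which puts the result back into the same units as the original data
sd = math.sqrt(variance)

print(variance)       # 3.5
print(round(sd, 2))   # 1.87
```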


The formula for the standard deviation of a sample is:

s = √( Σ(xᵢ - x̄)² / (n - 1) )

where:

  • xᵢ are the observed values of your sample
  • x̄ is the Mean value of your observations
  • n is the number of observations in your sample
  • Σ (sigma) signifies working out the sum of (or total of) a set of values (see below for more information)


The standard deviation slices up a normal distribution into 'standard' sized pieces, each slice contains a known percentage of the total observations.

If a data distribution is approximately normal then around about:

  • 68 percent of the data values are within one standard deviation of the mean
  • 95 percent are within two standard deviations
  • 99.7 percent lie within three standard deviations
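These percentages can be checked with a quick simulation of normally distributed values (hypothetical test scores with a mean of 100 and a standard deviation of 15):

```python
import random

random.seed(0)  # make the simulation reproducible
mu, sigma = 100, 15  # hypothetical mean and standard deviation
values = [random.gauss(mu, sigma) for _ in range(100_000)]

def within(k):
    """Fraction of values within k standard deviations of the mean."""
    return sum(abs(v - mu) <= k * sigma for v in values) / len(values)

print(round(within(1), 2))   # ~0.68
print(round(within(2), 2))   # ~0.95
print(round(within(3), 3))   # ~0.997
```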

Normal distribution and standard deviations