
Quantitative data collection and analysis

Samples to population

The population - in an ideal world you would ask everyone who meets the criteria for being included in the research, i.e. the whole population (e.g. all 18 year olds in Middlesbrough).

The sample - obtaining data from the whole population may be impractical or impossible, so you may instead look at a sample, i.e. a subset selected from that population.

It is important that the sample is representative of the whole population, i.e. it mirrors what would be found in the whole population, so choosing your sample requires careful thought.

There are different ways of sampling - see under Sampling and sampling errors on the Quantitative data collection page of this guide (see top option on the left-hand menu).

 

Statistics - are based on your sample data. They can be descriptive (simply describing or summarizing what you found) or they can be used to make a prediction or estimate about the larger population from the smaller group (statistical inference). In equations they are normally represented by Roman characters, e.g. x̄ ('x-bar') to represent the mean of a sample.

Parameters - refer to the whole population and tend to be represented by Greek characters, e.g. μ ('mu'), which represents the mean of the population.

Allocation

This is the process of assigning or allocating the people in the sample to a specific condition.

Again it is important that the people assigned to each condition are representative of the population.

Each person should be randomly allocated to a condition.

The validity of many statistical procedures is very often based on an assumption that the underlying population from which a sample is taken is normally distributed. These are parametric statistics. 

Non-parametric statistical procedures do not rely on such assumptions about the shape or form of the distribution from which your data was drawn.

Parametric

  • Underlying population has a normal distribution
  • More statistically powerful - i.e. more likely to detect a significant effect (e.g. a real association between two variables) when one exists
  • Does not need as large a sample
  • Results are easier to interpret (uses the actual data)
  • Best used when the mean is an appropriate measure of central tendency
  • Used for continuous data; results can be affected by outliers

Non-parametric

  • Do not rely on any assumptions about the shape or form of your data
  • Generally less statistically powerful - i.e. less likely to detect a significant effect when one exists
  • Requires a larger sample to have the same power
  • Results are often less easy to interpret (as they usually involve rankings rather than the actual data)
  • Best used when the median is the best measure of central tendency for your data
  • Some non-parametric tests can handle ordinal or ranked data and hence are not affected by outliers

 

Parametric tests and their non-parametric equivalents

  • 1-sample t-test → 1-sample Wilcoxon
  • 2-sample t-test → Mann-Whitney U test
  • One-way ANOVA → Kruskal-Wallis

   

Standard Error

This is a measure of the variability of the difference between a sample measure and a population measure.

It is calculated as the standard deviation of the difference between these two measures.

The higher the result, the more spread out the data is (i.e. the more variable the data is).


Standard Error is the standard deviation of the population divided by the square root of the sample size: SE = σ / √n.
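As a quick illustration, the formula SE = σ / √n can be computed directly. The population standard deviation and sample size below are made-up example values:

```python
import math

# Hypothetical values for illustration only
sigma = 12.0   # population standard deviation
n = 36         # sample size

# Standard Error = population SD divided by the square root of the sample size
standard_error = sigma / math.sqrt(n)
print(standard_error)  # 12 / 6 = 2.0
```

Note how quadrupling the sample size would only halve the standard error, since the sample size sits under a square root.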

Sample Means

If you work out the mean of the means of a set of different samples from the same population, you get the mean of the sample means. e.g. you take five samples from a population and their means are 3, 5, 6, 7 and 9. The mean of the sample means would be (3 + 5 + 6 + 7 + 9) divided by 5 = 6.
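The calculation above is just an ordinary average of the five sample means from the example:

```python
# Means of five samples drawn from the same population (example values)
sample_means = [3, 5, 6, 7, 9]

# The mean of the sample means estimates the population mean
mean_of_sample_means = sum(sample_means) / len(sample_means)
print(mean_of_sample_means)  # 30 / 5 = 6.0
```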

 

This would be representative of the true population mean, and the sample means themselves would be approximately normally distributed. The standard deviation of the sample means is also known as the Standard Error.

 

The Standard Error can be used to see whether a sample differs from a population just by chance (sampling variability) or whether the sample is genuinely quite different from the population.

Confidence Intervals

The confidence interval is a range of values within which the true value of the population parameter is likely to lie.

You can make estimates at any level of confidence. The two most often used are:

  • 95% confidence interval - 95% certain the population value will lie between two values (19 out of 20 times the true value will lie within this range).
  • 99% confidence interval - 99% certain the population value will lie between two values

e.g. for heights of a population we could say that the 95% Confidence Interval of the mean height of a population is 175 cm ± 6.20 cm. 

We can therefore be 95% confident that the average height of the population is between 168.8 cm and 181.2 cm (95% CI = (168.8, 181.2)).

There are 3 assumptions to make when using Confidence intervals:

  1. The sample is a random one
  2. Data points are independent - i.e. all respondents are independent and not influenced by others’ responses.
  3. The values are normally distributed.

 

The Confidence Interval is based on the mean and standard deviation:

Confidence Interval = X̄ ± Z × (s / √n)

where:

X̄ is the sample mean

Z is the Z-value for the confidence level (i.e. 95% or 99%) - obtained from a normal distribution table (see below for an example)

s is the standard deviation of the sample (if the sample is not too small)

n is the number of observations

The ± (plus or minus) term, Z × (s / √n), is the margin of error.
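As a minimal sketch of the calculation, here is the interval X̄ ± Z × (s / √n) computed for made-up sample values (mean 175, sample standard deviation 10, 40 observations) at the 95% confidence level:

```python
import math

# Hypothetical sample summary (example values only)
x_bar = 175.0   # sample mean
s = 10.0        # sample standard deviation
n = 40          # number of observations
z = 1.96        # Z-value for a 95% confidence level

# Margin of error = Z * (s / sqrt(n))
margin = z * (s / math.sqrt(n))

# Confidence interval = mean plus or minus the margin of error
lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 1), round(upper, 1))  # 171.9 178.1
```

So for this made-up sample we would be 95% confident the population mean lies between roughly 171.9 and 178.1.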

Confidence Interval table

Common Z-values are 1.960 for a 95% confidence level and 2.576 for a 99% confidence level (a fuller table can be obtained from the Maths is Fun web-site).

A larger sample size and a smaller sample variability produce a more accurate estimate.

Confidence Intervals can be represented graphically with hi-lo plots.

Sometimes you may be more interested in proportions rather than Means.

e.g. the proportion of teenagers who have been excluded from mainstream education. 

To work out the proportion, divide the number of instances you are interested in by the total number in your sample.

(To make it a percentage, multiply this result by 100.)
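For instance, using the exclusion example above with made-up counts (say 12 excluded teenagers in a sample of 150):

```python
# Hypothetical counts (example values only)
excluded = 12      # number of instances of interest
sample_size = 150  # total number in the sample

# Proportion = instances of interest / total sample size
proportion = excluded / sample_size
percentage = proportion * 100
print(proportion, percentage)  # 0.08 and 8.0
```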

 

Z scores

As with sample means, if we take repeated samples and find the proportion p for each of these samples, then the mean of all these sample proportions should approximate the population proportion.

The Z score allows you to estimate the chance of an outcome occurring:

Z = (p - π) / √(π(1 - π) / n)

where:

p = the sample proportion

π = ('pi') the population proportion

n = the number in the sample
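Assuming the standard one-sample formula for the Z score of a proportion, Z = (p - π) / √(π(1 - π) / n), a sketch with made-up values (sample proportion 0.29, assumed population proportion 0.25, sample of 200) looks like this:

```python
import math

# Hypothetical values (example only)
p = 0.29     # sample proportion
pi_ = 0.25   # assumed population proportion
n = 200      # number in the sample

# Z = (p - pi) / sqrt(pi * (1 - pi) / n)
z = (p - pi_) / math.sqrt(pi_ * (1 - pi_) / n)
print(round(z, 2))  # 1.31
```

The resulting Z score can then be looked up in a normal distribution table to estimate how likely such a sample proportion would be if the assumed population proportion were correct.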

 

Confidence intervals can also be used when working with proportions, to ascertain whether your sample proportion reflects what would be found in the population; however, the calculation is different, using the standard error of a proportion: p ± Z × √(p(1 - p) / n).


e.g. taken from Foster, L., Diamond, I. and Jefferies, J. (2015) Beginning statistics: an introduction for Social Scientists. 2nd edn. London: SAGE, p.152.

Out of 200 pensioners surveyed, 58 replied that they had retired when they expected to.

That is, p = 58/200 = 0.29, or 29%.

We can conclude with 95% confidence that the proportion of pensioners in the population who retired when expected would be between 0.2271 and 0.3529 i.e. between 23% and 35% of pensioners.
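The worked example can be checked directly, assuming the standard 95% confidence interval for a proportion, p ± 1.96 × √(p(1 - p) / n):

```python
import math

p = 58 / 200   # sample proportion = 0.29
n = 200        # sample size
z = 1.96       # Z-value for a 95% confidence level

# Margin of error for a proportion: Z * sqrt(p * (1 - p) / n)
margin = z * math.sqrt(p * (1 - p) / n)
lower, upper = p - margin, p + margin
print(round(lower, 4), round(upper, 4))  # 0.2271 0.3529
```

This reproduces the interval quoted above: roughly 23% to 35% of pensioners.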