Correlation measures the association between two continuous variables.
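A correlation can be positive or negative.
In a positive correlation both variables tend to rise together, e.g. the higher the temperature is, the more ice-cream is sold.
In a negative correlation one variable tends to fall as the other rises.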
A continuous variable is one where an observed value can take any value on a continuous scale, e.g. interval and ratio variables.
A scatter graph can help you examine and show the linear relationship between two variables.
To quantify the association between two variables you can calculate the correlation coefficient.
Regression analysis allows you to predict the value of one variable from the values of another.
Note: Just because there is an association between two variables does not necessarily mean that one variable explains the other. There could be a third factor (confounding variable) influencing the variables.
The correlation coefficient can be used to measure numerically the strength of a linear relationship between two variables.
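Pearson's r is the correlation coefficient most commonly used. It can be used to measure the strength of any linear relationship and takes values from -1 (a perfect negative linear relationship) through 0 (no linear relationship) to +1 (a perfect positive linear relationship).
The correlation coefficient measures the strength of a linear relationship, i.e. how closely the data points fall along a straight line when you draw a scatter graph.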
However, if the data follows a curved relationship then r will not be a good measure of the strength of association between your two variables.
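As an illustration of the two points above, here is a minimal Python sketch that computes Pearson's r from its standard definition (the sum of products of deviations from the means, divided by the square root of the product of the sums of squared deviations). The function name pearson_r and the two small data sets are made up for this example; they are not taken from any particular library or study.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (numerator of r).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable (used in the denominator).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)

# Illustrative (made-up) data: the same x values paired with two different y series.
xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # points lying exactly on a straight line
curved = [x ** 2 for x in xs]      # points lying exactly on a U-shaped curve

r_linear = pearson_r(xs, linear)
r_curved = pearson_r(xs, curved)
print("straight line:", round(r_linear, 3))
print("U-shaped curve:", round(r_curved, 3))
```

In this sketch the straight-line data gives an r of 1, while the U-shaped data gives an r of 0 even though y is completely determined by x. This is why it is worth looking at a scatter graph before relying on r.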
Regression analysis is a technique that allows you to predict values of a dependent variable when you know the values of one or more independent variables.
The dependent variable is what is being studied or measured, i.e. it is what changes as a result of changes made to the independent variable. This normally goes along the y (vertical) axis on a graph.
The independent variable is the one controlled by the experimenter. It is the one that is changed or manipulated. This normally goes along the x (horizontal) axis on a graph.
For instance, you may want to estimate a score on an End of Course assessment (ECA) based on what someone got on an In-course assessment (ICA). You could do this if there was some correlation between these two marks, which would allow you to give a better informed estimate of the ECA. The stronger the correlation between the two marks, the better the estimate will be.
100% accurate estimates or predictions would only be possible if there was a perfect correlation between the variables (i.e. on a scatter graph the points would lie exactly on a straight line). In practice we find a ‘line of best fit’, which acts like a measure of central tendency (i.e. it tries to average out variations), and use this line to make a more precise estimate of values on the other variable. If you know the ICA mark, you can read the corresponding ECA mark from the graph. The line could be drawn by eye, but this would be difficult and different people would come up with different results.
Regression analysis is the way of computing this line of ‘best fit’. A straight line can be written as:
y = mx + b
y is the value on the vertical axis (this is the value we want to estimate)
x is the value on the horizontal axis (this is the value we know)
m is the slope or gradient (i.e. how steep the line is). The gradient is the change in height divided by the change in horizontal distance; on a graph that is the change in y divided by the change in x.
b is where the line crosses the y axis (the y-intercept)
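The notes do not say which method is used to compute the line. The sketch below assumes ordinary least squares, a common choice, in which m is the sum of products of deviations divided by the sum of squared deviations in x, and b is chosen so that the line passes through the point of means. The function name fit_line and the ICA and ECA marks are invented purely for illustration.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and sum of products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are identical")
    m = sxy / sxx              # slope: change in y per unit change in x
    b = mean_y - m * mean_x    # intercept: the line passes through the means
    return m, b

# Hypothetical marks for five students (not real data).
ica = [40, 50, 60, 70, 80]   # x: the known In-course assessment marks
eca = [45, 52, 58, 71, 74]   # y: the End of Course assessment marks

m, b = fit_line(ica, eca)
predicted = m * 65 + b       # estimated ECA mark for an ICA mark of 65
print("slope m:", round(m, 2))
print("intercept b:", round(b, 2))
print("estimated ECA for ICA of 65:", round(predicted, 2))
```

For these made-up marks the fitted line is roughly y = 0.77x + 13.8, so an ICA mark of 65 gives an estimated ECA mark of about 63.85. Unlike a line drawn by eye, the same data always gives the same line.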