Correlation / Regression
The material covered in the
probability and
confidence interval pages should be understood or at least be familiar to
you before proceeding.
Correlation is a measure of how two measurements relate to each other. The
correlation coefficient, r (the sample estimate of the population value, rho),
is a value between -1 and 1, where r = 1 corresponds to a perfect positive
linear relationship, r = -1 a perfect negative linear relationship, and r = 0
implies no linear relationship.
The
two most commonly used methods to evaluate correlation are Pearson's Correlation
Coefficient (used for parametric outcomes, i.e. continuous measurements that are
normally distributed) and Spearman's
Correlation Coefficient (used for nonparametric outcomes). This page describes
Pearson's Correlation Coefficient.
Formulae for Pearson Correlation Coefficient
-
n is the number of paired measurements (an outcome for both x and y)
-
From the data collected, calculate x, x², y, y²,
and xy
-
Obtain the sum of these five (5) parameters
-
Sum of Squares for x, SSqX = sum(x²) - sum(x)sum(x)/n
-
Sum of Squares for y, SSqY = sum(y²) - sum(y)sum(y)/n
-
Sum of products, SPr = sum(xy) - sum(x)sum(y)/n
-
Pearson's Correlation Coefficient, r = SPr / sqrt(SSqX x SSqY)
-
The correlation coefficient is r
-
Transform the correlation coefficient to approximate the Normal
distribution (Fisher's z transformation), Z = 0.5 Log((1+r)/(1-r)), where Log
is the natural logarithm
-
The Standard Error of Z is 1/sqrt(n-3), and the z-value for a 95% confidence
interval is 1.96
-
The 95% CI for the Normal approximation is F = Z - 1.96/sqrt(n-3) to G =
Z + 1.96/sqrt(n-3)
-
Transformed back to correlation coefficient values, the 95% CI is
(exp(2F)-1) / (exp(2F)+1) to (exp(2G)-1) / (exp(2G)+1)
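The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the program referred to on this page: the function name `pearson_with_ci` and the `z_crit` parameter are assumptions made for the example, and the data are the reading and math scores from this page's worked example.

```python
import math


def pearson_with_ci(x, y, z_crit=1.96):
    """Return (r, lower, upper) using the sums-of-squares formulae and Fisher's Z."""
    n = len(x)
    if n != len(y) or n < 4:
        raise ValueError("need at least 4 paired measurements")
    sum_x, sum_y = sum(x), sum(y)
    ssq_x = sum(v * v for v in x) - sum_x * sum_x / n             # SSqX
    ssq_y = sum(v * v for v in y) - sum_y * sum_y / n             # SSqY
    s_pr = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n   # SPr
    r = s_pr / math.sqrt(ssq_x * ssq_y)
    z = 0.5 * math.log((1 + r) / (1 - r))    # natural log; undefined if r is exactly +1 or -1
    half_width = z_crit / math.sqrt(n - 3)   # z-value times the Standard Error of Z
    f, g = z - half_width, z + half_width
    lower = (math.exp(2 * f) - 1) / (math.exp(2 * f) + 1)
    upper = (math.exp(2 * g) - 1) / (math.exp(2 * g) + 1)
    return r, lower, upper


reading = [5, 3, 6, 7, 4, 2, 9, 5]
maths = [4, 3, 4, 8, 3, 2, 10, 5]
r, lower, upper = pearson_with_ci(reading, maths)
print(f"r = {r:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because the sketch carries r at full precision instead of rounding it to two decimal places before the transformation, its lower limit (about 0.67) can differ slightly from a hand calculation that uses rounded intermediate values.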
Example calculation of Pearson's Correlation Coefficient
A group of children are selected to see if their reading ability is related
to their mathematical ability. The data are as follows
| Name | reading score (x) | math score (y) | x² | y² | xy |
| Tom | 5 | 4 | 25 | 16 | 20 |
| Dick | 3 | 3 | 9 | 9 | 9 |
| Harry | 6 | 4 | 36 | 16 | 24 |
| Sue | 7 | 8 | 49 | 64 | 56 |
| Mary | 4 | 3 | 16 | 9 | 12 |
| Bill | 2 | 2 | 4 | 4 | 4 |
| Gloria | 9 | 10 | 81 | 100 | 90 |
| Tracy | 5 | 5 | 25 | 25 | 25 |
| Sum | 41 | 39 | 245 | 243 | 240 |
The plotted values are shown in
this scatter plot. We want to calculate the correlation coefficient and
its 95% Confidence Interval (CI) between reading and mathematical abilities.
Answers
-
Sum of x (Σx) = 41, sum of x² (Σx²) = 245
-
Sum of y (Σy) = 39, sum of y² (Σy²) = 243
-
Sum of x multiplied by y (Σxy) = 240, number of pairs (n) = 8
-
Sum of Squares for x (SSqX) = 245 - 41 x 41 / 8 = 34.875
-
Sum of Squares for y (SSqY) = 243 - 39 x 39 / 8 = 52.875
-
Sum of Products (SPr) = 240 - 41 x 39 / 8 = 40.125
-
Pearson's Correlation Coefficient, r = 40.125 / sqrt(34.875 x 52.875) = 0.93
-
Z = 0.5 Log((1+0.93)/(1-0.93)) = 0.5 Log(1.93 / 0.07) = 1.66. The z-value for a
95% CI is 1.96
-
F = 1.66 - 1.96 / sqrt(8-3) = 0.78
-
G = 1.66 + 1.96 / sqrt(8-3) = 2.53
-
95% CI = (exp(2 x 0.78)-1) / (exp(2 x 0.78)+1) to (exp(2 x 2.53)-1) / (exp(2 x
2.53)+1) = 0.65 to 0.99
-
Please note that CIs for correlation coefficients are asymmetric: the interval
is shorter between r and 1 (or -1) and longer between r and 0
A Computer Program to calculate the correlation coefficient and its CI is
available at the end of this page.
Regression also measures the relationship between two measurements, but from a
different perspective than correlation. In simple linear regression, one
measurement variable, x (predictor, independent variable) is used to predict
another, y (outcome, dependent variable). Calculations minimise the error of
that prediction. The formula produced is y = a + bx, where b is
the slope of the regression line, and a is the intercept (value of y when x=0).
This formula predicts a y value for any value of x. Please note that if y is
used to predict x instead (i.e., x = a + by), then the a and b calculated will be
different, as it is a completely different prediction.
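To make the asymmetry concrete, the sketch below fits both directions to the same eight pairs of reading and math scores used in this page's examples. The helper `fit_line` is a hypothetical name introduced for illustration; it applies the usual least-squares slope (sum of products divided by the predictor's sum of squares).

```python
def fit_line(pred, out):
    """Least-squares intercept a and slope b for out = a + b * pred."""
    n = len(pred)
    ssq = sum(v * v for v in pred) - sum(pred) ** 2 / n              # sum of squares of the predictor
    s_pr = sum(p * o for p, o in zip(pred, out)) - sum(pred) * sum(out) / n
    b = s_pr / ssq
    a = sum(out) / n - b * sum(pred) / n
    return a, b


x = [5, 3, 6, 7, 4, 2, 9, 5]    # reading scores
y = [4, 3, 4, 8, 3, 2, 10, 5]   # math scores

a_yx, b_yx = fit_line(x, y)     # y predicted from x
a_xy, b_xy = fit_line(y, x)     # x predicted from y

print(f"y = {a_yx:.2f} + {b_yx:.2f} x")
print(f"x = {a_xy:.2f} + {b_xy:.2f} y")
print(f"slope from simply inverting the first line: {1 / b_yx:.2f}")
```

The second slope is not the reciprocal of the first. The product of the two slopes equals r squared, so the two lines coincide only when the correlation is perfect.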
The
assumptions of regression and correlation are different. Correlation is
symmetric, and both x and y are assumed to be normally distributed. Regression is
asymmetric, in that x is used to predict y and not the other way around. Values
of y are usually assumed to be normally distributed, but x need not be.
Formulae for Regression Analysis
The first part of the calculation is the same as that for correlation
-
n is the number of paired measurements (an outcome for both x and y)
-
From the data collected, calculate x, x², y, y²,
and xy
-
Obtain the sum of these five (5) parameters
-
Sum of Squares for x, SSqX = sum(x²) - sum(x)sum(x)/n
-
Sum of Squares for y, SSqY = sum(y²) - sum(y)sum(y)/n
-
Sum of products, SPr = sum(xy) - sum(x)sum(y)/n
Regression coefficient (b) for y = a + bx and its confidence interval
-
Regression slope b = SPr / SSqX
-
Residual Mean Square (RMS) = (SSqY - SPr x SPr / SSqX) / (n - 2)
-
Standard Error of the slope, SEb = sqrt(RMS / SSqX)
-
CI for b = b - t x SEb to b + t x SEb. Please
note that for regression t and not z is used, and t is sample size dependent
Intercept (a) for y = a + bx and confidence intervals for various predicted y values
-
Intercept, a = sum(y)/n - b sum(x)/n
-
The Confidence Interval (CI) of the estimated y value depends on two
things: the Standard Error of the regression line and the residual
variance of y. The CI is narrowest at the mean of x and
widens at more extreme values of x.
-
Standard error of an estimated y value (yx) at a given x, SEyx = sqrt( RMS (1 /
n + (x - mean of x)² / SSqX) )
-
CI for yx = yx - t x SEyx to yx + t x SEyx
-
Please note that for regression t and not z is used, and t is dependent
on sample size (i.e., degrees of freedom)
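The formulae above translate directly into code. The following is a minimal sketch, not the program referred to on this page: the function name `simple_regression` and its dictionary keys are made up for the example, and the t-value is passed in by the caller (looked up from a t table for n - 2 degrees of freedom) because the Python standard library has no t-distribution quantile function.

```python
import math


def simple_regression(x, y, t_crit):
    """Slope, intercept, RMS and CI of the slope for y = a + bx.

    t_crit is the two-sided t-value for n - 2 degrees of freedom,
    supplied by the caller.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired measurements")
    sum_x, sum_y = sum(x), sum(y)
    ssq_x = sum(v * v for v in x) - sum_x * sum_x / n             # SSqX
    ssq_y = sum(v * v for v in y) - sum_y * sum_y / n             # SSqY
    s_pr = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n   # SPr
    b = s_pr / ssq_x                                              # slope
    a = sum_y / n - b * sum_x / n                                 # intercept
    rms = (ssq_y - s_pr * s_pr / ssq_x) / (n - 2)                 # Residual Mean Square
    se_b = math.sqrt(rms / ssq_x)                                 # Standard Error of the slope
    return {
        "a": a,
        "b": b,
        "rms": rms,
        "se_b": se_b,
        "ci_b": (b - t_crit * se_b, b + t_crit * se_b),
    }


reading = [5, 3, 6, 7, 4, 2, 9, 5]
maths = [4, 3, 4, 8, 3, 2, 10, 5]
result = simple_regression(reading, maths, t_crit=2.45)   # t for df = 6
print({k: v for k, v in result.items()})
```

Passing `t_crit` explicitly keeps the sketch dependency-free. A statistics package that provides t quantiles could supply the value instead.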
Example calculation for regression analysis
The same data as in the correlation section are used. We wish to
determine if the children's reading ability is related to their mathematical
ability. Each child is given an exam, and a reading score and a math
score are obtained. The data are as follows:
| Name | reading score (x) | math score (y) | x² | y² | xy |
| Tom | 5 | 4 | 25 | 16 | 20 |
| Dick | 3 | 3 | 9 | 9 | 9 |
| Harry | 6 | 4 | 36 | 16 | 24 |
| Sue | 7 | 8 | 49 | 64 | 56 |
| Mary | 4 | 3 | 16 | 9 | 12 |
| Bill | 2 | 2 | 4 | 4 | 4 |
| Gloria | 9 | 10 | 81 | 100 | 90 |
| Tracy | 5 | 5 | 25 | 25 | 25 |
| Sum | 41 | 39 | 245 | 243 | 240 |
We want to calculate the regression coefficient (how reading scores may
predict math scores) and its 95% CI, and the predicted values and their
95% CIs
Answers
-
Sum of x (Σx) = 41, sum of x² (Σx²) = 245
-
Sum of y (Σy) = 39, sum of y² (Σy²) = 243
-
Sum of x multiplied by y (Σxy) = 240, number of pairs (n) = 8
-
Sum of Squares for x (SSqX) = 245 - 41 x 41 / 8 = 34.875
-
Sum of Squares for y (SSqY) = 243 - 39 x 39 / 8 = 52.875
-
Sum of Products (SPr) = 240 - 41 x 39 / 8 = 40.125
The regression coefficient (b), the intercept (a) and the 95% CI of b
-
b = 40.125 / 34.875 = 1.15
-
a = 39/8 - 1.15 (41/8) = -1.02
-
The formula is y = -1.02 + 1.15x
-
RMS = (52.875 - 40.125 x 40.125 / 34.875) / (8-2) = 1.12
-
Please note that the degrees of freedom (df) for regression are n-2 = 8-2
= 6
-
SEb = sqrt(1.12 / 34.875) = 0.18
-
95% CI for regression coefficient b = 1.15 - 2.45(0.18) to 1.15 +
2.45(0.18) = 0.71 to 1.59
-
Please note 2.45 is the two-sided t-value for p = 0.05 when df = 6
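The example asks for predicted values and their 95% CIs as well. A minimal sketch of that step follows, applying the SEyx formula from the formulae section to the same data. The function name `predict_with_ci` and the chosen x values are illustrative assumptions; the t-value of 2.45 for df = 6 is the one quoted on this page. The interval computed is for the estimated (mean) y at a given x, as in the formula on this page.

```python
import math

x = [5, 3, 6, 7, 4, 2, 9, 5]    # reading scores
y = [4, 3, 4, 8, 3, 2, 10, 5]   # math scores
T_CRIT = 2.45                   # two-sided t-value for df = n - 2 = 6

n = len(x)
mean_x = sum(x) / n
ssq_x = sum(v * v for v in x) - sum(x) ** 2 / n
ssq_y = sum(v * v for v in y) - sum(y) ** 2 / n
s_pr = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
b = s_pr / ssq_x
a = sum(y) / n - b * mean_x
rms = (ssq_y - s_pr * s_pr / ssq_x) / (n - 2)


def predict_with_ci(x0):
    """Estimated y at x0 with the lower and upper limits of its 95% CI."""
    y_hat = a + b * x0
    se_yx = math.sqrt(rms * (1 / n + (x0 - mean_x) ** 2 / ssq_x))
    return y_hat, y_hat - T_CRIT * se_yx, y_hat + T_CRIT * se_yx


for x0 in (2, 5, 9):
    y_hat, lo, hi = predict_with_ci(x0)
    print(f"x = {x0}: y = {y_hat:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Running the loop shows the interval widening as x moves away from the mean reading score of 5.125.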
A Computer Program to perform regression analysis and calculate its CIs is available at the end of this page.
In biological systems, particularly in clinical patient care, outcomes
usually have multiple influences, and many of the influences are often
correlated. To model a system of multiple influences, and
particularly to isolate and identify individual influences, multiple regression
is often used.
Statistical calculations for multiple regression are complicated, and involve
the solving of simultaneous equations using matrix inversion. Most
statistical packages provide a program for this calculation, so the mathematics
will not be presented here.
A particular case of multiple regression is curve fitting. This
is used when the relationship between x and y is curved, as is the case with
many biological systems. For example, growth tends to increase
rapidly after birth, slows down for a few years, increases again with the growth
spurt, then slows down again in the late teens.
There are a number of approaches to curve fitting. The simplest
group consists of transforming the measurements to a form that has a linear
relationship, then using simple regression analysis to model the
data. An example is fitting the dose-response relationship of a drug, often in the
form of y = a + b(log(x)).
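As a small illustration of the transformation approach, the sketch below regresses y on log(x) using the simple regression formulae. The function name `fit_log_curve` is hypothetical, the natural logarithm is assumed, and the dose and response values are synthetic numbers placed exactly on a known curve so the fit can be checked; they are not real drug data.

```python
import math


def fit_log_curve(x, y):
    """Fit y = a + b * log(x) by simple regression of y on log(x)."""
    log_x = [math.log(v) for v in x]            # transform the predictor
    n = len(log_x)
    ssq = sum(v * v for v in log_x) - sum(log_x) ** 2 / n
    s_pr = sum(u * v for u, v in zip(log_x, y)) - sum(log_x) * sum(y) / n
    b = s_pr / ssq
    a = sum(y) / n - b * sum(log_x) / n
    return a, b


dose = [1, 2, 4, 8, 16, 32]                         # hypothetical doses
response = [2 + 3 * math.log(d) for d in dose]      # synthetic: a = 2, b = 3
a, b = fit_log_curve(dose, response)
print(f"y = {a:.2f} + {b:.2f} log(x)")
```

Only the predictor is transformed here, so the residuals stay on the original scale of y.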
The most complex group consists of Fourier transforms, which represent the
outcome as a function of multiple sine waves. These are often used
to model fluctuating measurements such as heart rate, diurnal rhythms, or
economic cycles.
A commonly used method in biological and medical science is polynomial curve fitting, which is a
variant of multiple regression. The theory is that the outcome y
is influenced by the independent variables x, x², x³, and so on to
any power. Realistically, however, the cubic (third) power is sufficient
to fit most biological systems.
For any pair of x and y measurements, additional measurements in the form of x²
and x³ are created, and these are then subjected to multiple regression
analysis. Depending on the highest power used, a straight line (x only), a
quadratic curve (x and x²), or a cubic curve (x, x² and x³) can be fitted.
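A minimal sketch of the polynomial approach follows: the powers of x are created as extra predictors and the resulting simultaneous (normal) equations are solved. The function names `solve` and `fit_polynomial` are made up for the example, and the data are synthetic values placed exactly on a known cubic so the result can be checked. A statistical package would normally do this step, and would be preferable for large or badly scaled data.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        tail = sum(aug[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (aug[r][n] - tail) / aug[r][r]
    return coef


def fit_polynomial(x, y, degree=3):
    """Least-squares coefficients [b0, b1, ...] for y = b0 + b1 x + b2 x^2 + ..."""
    k = degree + 1
    powers = [[v ** p for p in range(k)] for v in x]  # columns 1, x, x^2, x^3
    xtx = [[sum(row[i] * row[j] for row in powers) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(powers, y)) for i in range(k)]
    return solve(xtx, xty)


x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [1 + 2 * v - 0.5 * v ** 2 + 0.1 * v ** 3 for v in x]   # synthetic cubic
coef = fit_polynomial(x, y, degree=3)
print([round(c, 4) for c in coef])
```

Setting `degree` to 1 or 2 fits a straight line or a quadratic curve with the same code.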
Being able to curve fit and produce a line relating two
measurements is often sufficient for laboratory work, such as establishing a
dose-response curve for a drug. For most clinicians, particularly
those who study growth and development, the confidence interval and the ability
to establish percentiles are also important. The difficulty here is
that the normal way of calculating the Standard Error is too complicated, as the
error for each coefficient and the residual error have to be combined.
An
additional difficulty is that variation often increases with the size of the
measurement. For example, growth in the early embryo has a SD of a
few micrograms. By term, babies vary by around 500g, and by the late
teens growth varies by a few kg. Unfortunately, multiple regression
analysis assumes that the variance is uniform throughout, and this is clearly
not the case. A typical example is fitting a growth curve such
as femur length in early pregnancy; the increase in
variation with age can be shown in this fit.
Altman presented a two-stage procedure, in which the mean
relationship is first fitted, the departure from the mean (absolute residual)
for each case is estimated, and this is also curve fitted. After
applying a transformation factor, a curve-fitted formula for the Standard
Deviation
is also obtained. Confidence intervals can then be calculated around
the mean. The result is shown in the
following graph.
The reference for this method is: Altman DG (1993)
Constructing age-related reference centiles using absolute residuals. Statistics
in Medicine 12(10):917-924. A Computer Program to curve
fit both the mean and SE is available at the end of this page.
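The two-stage idea can be sketched as follows. Several simplifications are assumed and are not taken from this page: both stages use a straight line rather than a polynomial, the transformation factor is taken to be sqrt(pi/2) (the ratio of the SD to the mean absolute deviation for normally distributed residuals), and the function names and data are invented so that the result can be checked by hand.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + bx."""
    n = len(x)
    ssq_x = sum(v * v for v in x) - sum(x) ** 2 / n
    s_pr = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
    b = s_pr / ssq_x
    return sum(y) / n - b * sum(x) / n, b


def fit_mean_and_sd(x, y):
    """Return (mean_at, sd_at): functions giving the fitted mean and SD at any x."""
    a, b = fit_line(x, y)                                   # stage 1: fit the mean
    abs_res = [abs(yi - (a + b * xi)) for xi, yi in zip(x, y)]
    c, d = fit_line(x, abs_res)                             # stage 2: fit the absolute residuals
    factor = math.sqrt(math.pi / 2)                         # assumed transformation factor

    def mean_at(x0):
        return a + b * x0

    def sd_at(x0):
        return factor * (c + d * x0)

    return mean_at, sd_at


# Synthetic data: mean 2 + 3x, with scatter that grows in proportion to x
x = [v for v in range(1, 11) for _ in range(2)]
y = [2 + 3 * v + sign * 0.5 * v for v in range(1, 11) for sign in (1, -1)]
mean_at, sd_at = fit_mean_and_sd(x, y)
low = mean_at(10) - 1.96 * sd_at(10)
high = mean_at(10) + 1.96 * sd_at(10)
print(f"at x = 10: mean {mean_at(10):.2f}, SD {sd_at(10):.2f}, 95% range {low:.2f} to {high:.2f}")
```

Because the SD is modelled as its own function of x, the interval widens where the data scatter more, which a single uniform residual variance cannot represent.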
The
formula to estimate the sample size required to calculate Pearson’s correlation
coefficient is given below. Although calculations in simple linear regression
are similar to those for Pearson’s correlation coefficient, the sample sizes
required by the two differ slightly. The main reason is that
regression analysis requires the estimation of two parameters (intercept and
slope) while correlation estimates only one (the coefficient). Only the sample
size formula for the correlation coefficient is discussed.
Formulae to calculate sample size for Pearson's Correlation Coefficient
The sample size formula to calculate the correlation coefficient is based on
three (3) parameters: Type I error (usually 0.05), Type II error (usually
0.2), and the effect size which is the expected correlation coefficient (r).
-
Obtain the z value of
alpha (za). For a one-sided alpha of 0.05, za = 1.65
-
Note: This is for a one-sided Correlation Coefficient, where the sign of r
(+ or -) is chosen in advance. If a two-sided coefficient is required (it is
not known whether r will be + or -, although it is difficult to envisage that
situation arising), then alpha should be halved before z is obtained (i.e., z=1.96
rather than 1.65)
-
Obtain the z value of
beta (zb). For the usual beta of 0.2, this page uses zb = 0.85 (0.84 to two
decimal places)
-
Gamma, g = (za
+ zb)²
Sample size, n = g / m² + 3, where m = Log((1+r)/(1-r)) / 2 + r / (2(n-1))
The formula is iterative: start with n = Log((1+r)/(1-r)) / 2, calculate a new
n, and substitute each newly calculated n back into the formula until the
difference between two successive values is less than one (1).
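The iteration can be sketched as below. This is an illustration under stated assumptions, not the program referred to on this page: the function name `sample_size_for_r` is made up, the z values are taken at full precision from the standard library's `statistics.NormalDist` rather than rounded to two decimal places, and the loop stops when successive values differ by less than a tolerance of 0.01 (stricter than the stopping rule of one stated above) before rounding up.

```python
import math
from statistics import NormalDist


def sample_size_for_r(r, alpha=0.05, beta=0.2, tol=0.01, max_iter=100):
    """One-sided sample size for an expected correlation r, by iteration."""
    r = abs(r)
    if not 0 < r < 1:
        raise ValueError("r must be strictly between 0 and 1 in absolute value")
    z_a = NormalDist().inv_cdf(1 - alpha)             # one-sided z for alpha
    z_b = NormalDist().inv_cdf(1 - beta)              # z for beta
    g = (z_a + z_b) ** 2                              # gamma
    fisher_z = 0.5 * math.log((1 + r) / (1 - r))      # natural log
    n = fisher_z                                      # starting value
    for _ in range(max_iter):
        m = fisher_z + r / (2 * (n - 1))
        n_new = g / (m * m) + 3
        if abs(n_new - n) < tol:
            return math.ceil(n_new)                   # round up to whole cases
        n = n_new
    raise RuntimeError("iteration did not converge")


print(sample_size_for_r(0.5))
```

The first pass usually overshoots by a wide margin, because the starting value is far from the solution, and later passes settle quickly. For a two-sided calculation, `alpha / 2` could be passed in place of `alpha`.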
Example for sample size estimation for Pearson's Correlation Coefficient
Find the sample size required if the expected correlation coefficient is
0.5, with Type I error of 0.05 and Type II error of 0.2.
Answer
-
z value of alpha, za = 1.65
-
z value of beta, zb = 0.85
-
Gamma, g = (1.65+0.85)² = 2.5² = 6.25
-
Initial n = Log((1+0.5)/(1-0.5)) / 2 = 0.5493
-
Iterative calculations are as follows
| Iteration | n_in | m = Log((1+r)/(1-r)) / 2 + r / (2(n_in - 1)) | n_out = g / m² + 3 |
| 1 | 0.5493 | -0.0054 | 210845.4 |
| 2 | 210845.4 | 0.5493 | 23.71 |
| 3 | 23.71 | 0.5603 | 22.91 |
| 4 | 22.91 | 0.5607 | 22.88 |
| 5 | 22.88 | 0.5607 | 22.88 |
Iteration stopped at n=22.88, so rounding up to the next whole number
gives a sample size of 23 cases.
Exercise in sample size for Correlation Coefficient
Create a sample size
table for correlation coefficients of 0.05 to 0.95 at 0.05 intervals,
with alpha of 0.05 and a beta of 0.2.
Answer
Alpha Beta r Sample size
0.05 0.2 0.05 2471
0.05 0.2 0.1 617
0.05 0.2 0.15 273
0.05 0.2 0.2 153
0.05 0.2 0.25 97
0.05 0.2 0.3 67
0.05 0.2 0.35 49
0.05 0.2 0.4 37
0.05 0.2 0.45 29
0.05 0.2 0.5 23
0.05 0.2 0.55 19
0.05 0.2 0.6 16
0.05 0.2 0.65 13
0.05 0.2 0.7 11
0.05 0.2 0.75 9
0.05 0.2 0.8 8
0.05 0.2 0.85 7
0.05 0.2 0.9 6
0.05 0.2 0.95 5
A
Computer Program to estimate the sample size required for Pearson's Correlation Coefficient is available at the end of this page
Version 1.0 Last change 18th October 2006