Correlation / Regression

Introduction

You should understand, or at least be familiar with, the material covered in the probability and confidence interval pages before proceeding.

Correlation

Correlation is a measure of how two measurements relate to each other. The correlation coefficient, written rho (ρ) for a population and r for its sample estimate, takes a value between -1 and 1, where r = 1 corresponds to a perfect positive linear relationship, r = -1 to a perfect negative linear relationship, and r = 0 to no linear relationship.

The two most commonly used methods to evaluate correlation are Pearson's Correlation Coefficient, used for parametric outcomes (continuous measurements that are normally distributed), and Spearman's Correlation Coefficient, used for nonparametric outcomes. This page describes Pearson's Correlation Coefficient.

A Computer Program to calculate the correlation coefficient and its confidence interval (CI) is available at the end of this page.
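As an illustration of the calculation, the following is a minimal Python sketch of Pearson's r and an approximate 95% CI. The CI uses Fisher's z transformation, which is a common choice but is an assumption here, as this page does not state which method its program uses. The function names and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for paired measurements x and y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be the same length, with at least 2 pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate CI for r via Fisher's z transformation (assumed method).

    z_crit = 1.959964 gives a two-sided 95% interval. Requires n > 3 and |r| < 1.
    """
    if n <= 3 or abs(r) >= 1:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical paired measurements, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
r = pearson_r(x, y)
low, high = fisher_ci(r, len(x))
print(f"r = {r:.4f}, 95% CI {low:.4f} to {high:.4f}")
```

Because the interval is calculated on the transformed scale and then converted back, it is not symmetric around r, and it always stays within -1 and 1.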

Regression

Regression also measures the relationship between two measurements, but from a different perspective than correlation. In simple linear regression, one measurement variable, x (the predictor or independent variable), is used to predict another, y (the outcome or dependent variable). The calculations minimise the error of that prediction. The formula produced is y = a + bx, where b is the slope of the regression line and a is the intercept (the value of y when x = 0). This formula predicts a y value for any value of x. Please note that if y is used to predict x (i.e., x = a + by), then the a and b calculated will be different, as it is a completely different prediction.

The assumptions of regression and correlation are different. Correlation is symmetric, and both x and y are assumed to be normally distributed. Regression is asymmetric, in that x is used to predict y and not the other way around. Values of y are usually assumed to be normally distributed, but x need not be.

A Computer Program to perform regression analysis and calculate its CIs is available at the end of this page.
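The asymmetry of regression can be seen in a small worked example. The sketch below, in Python, fits the least squares line in both directions on hypothetical data. The function name and the data are illustrative assumptions and are not taken from the program on this page.

```python
def linear_regression(x, y):
    """Least squares fit of y = a + b*x. Returns (a, b)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be the same length, with at least 2 pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through the two means
    return a, b

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

a_yx, b_yx = linear_regression(x, y)   # x predicts y: y = a + b*x
a_xy, b_xy = linear_regression(y, x)   # y predicts x: x = a + b*y

print(f"y = {a_yx:.3f} + {b_yx:.3f}x")
print(f"x = {a_xy:.3f} + {b_xy:.3f}y")
```

For these data the first fit is y = 1.3 + 0.9x. Rearranging that line algebraically would give x = -1.444 + 1.111y, but the fitted regression of x on y is x = -0.6 + 0.9y, because it minimises the error in x rather than in y. The product of the two slopes equals r squared (0.81 here), which links the two regressions back to the correlation coefficient.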

Multiple Regression and curve fitting

In biological systems, particularly in clinical patient care, outcomes usually have multiple influences, and these influences are often correlated with one another. To model a system of multiple influences, and particularly to isolate and identify individual influences, multiple regression is often used.

Statistical calculations for multiple regression are complicated, and involve solving simultaneous equations using matrix inversion. Most statistical packages provide a program for this calculation, so the mathematics will not be presented here.

A particular case of multiple regression is curve fitting. This is used when the relationship between x and y is curved, as is the case in many biological systems. For example, growth is rapid after birth, slows down for a few years, increases again with the growth spurt, then slows down once more in the late teens.

There are a number of approaches to curve fitting. The simplest group consists of transforming the measurements to a form that has a linear relationship, then using simple regression analysis to model the data. An example is fitting the dose response relationship of a drug, often in the form y = a + b(log(x)).
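The transformation approach can be sketched in a few lines of Python: the dose is log transformed, and an ordinary straight line is then fitted to the transformed values. The natural logarithm is used here as an assumption, since the base is not specified above; a different base only rescales b. The function names and the data are hypothetical.

```python
import math

def fit_line(x, y):
    """Least squares fit of y = a + b*x. Returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fit_log_dose_response(dose, response):
    """Fit response = a + b*log(dose) by regressing response on log(dose)."""
    if any(d <= 0 for d in dose):
        raise ValueError("doses must be positive to take logarithms")
    log_dose = [math.log(d) for d in dose]   # natural log (assumed base)
    return fit_line(log_dose, response)

# Hypothetical doses and responses that follow y = 2 + 3*log(x) exactly
dose = [1, 2, 4, 8, 16]
response = [2 + 3 * math.log(d) for d in dose]

a, b = fit_log_dose_response(dose, response)
predicted_at_10 = a + b * math.log(10)       # prediction for a dose of 10
print(f"a = {a:.3f}, b = {b:.3f}, predicted response at dose 10 = {predicted_at_10:.3f}")
```

The curved relationship on the original dose scale becomes a straight line on the log scale, so everything available for simple linear regression applies to the transformed data.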

The most complex group is based on the Fourier transform, which represents the outcome as a function of multiple sine waves. These methods are often used to model fluctuating measurements such as heart rate, diurnal rhythms, or economic cycles.

A commonly used method in biological and medical science is polynomial curve fitting, which is a variant of multiple regression. The theory is that the outcome y is influenced by the independent variables x, x^2, x^3, and so on, to any power. Realistically, however, the cubic (third) power is sufficient to fit most biological systems.

For any pair of x and y measurements, additional measurements in the form of x^2 and x^3 are created, and these are then subjected to multiple regression analysis. According to the power used, the following curves can be fitted.
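A minimal Python sketch of polynomial fitting as multiple regression follows. It builds the columns 1, x, x^2 and x^3 for each observation and solves the resulting simultaneous (normal) equations by Gaussian elimination. This is one straightforward way to do the calculation and is not necessarily how the program on this page does it; the function names and the data are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * coef = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("equations are singular; too few distinct x values?")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):                             # back substitution
        known = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - known) / m[i][i]
    return coef

def polynomial_fit(x, y, degree=3):
    """Least squares fit of y = c0 + c1*x + ... + c_degree*x^degree."""
    k = degree + 1
    # Each row holds the created measurements 1, x, x^2, ..., x^degree
    rows = [[xi ** p for p in range(k)] for xi in x]
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data lying exactly on y = 1 + 2x - 0.5x^2 + 0.1x^3
x = [0, 1, 2, 3, 4, 5, 6]
y = [1 + 2 * xi - 0.5 * xi ** 2 + 0.1 * xi ** 3 for xi in x]
print([round(c, 6) for c in polynomial_fit(x, y, degree=3)])
```

Setting degree to 1 gives the ordinary straight line, and each higher power allows one more bend in the fitted curve. With wide x ranges or high powers the normal equations become numerically unstable, so a statistical package is preferable for real data.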

Being able to fit a curve that relates two measurements is often sufficient for laboratory work, such as establishing a dose response curve for a drug. For most clinicians, particularly those who study growth and development, the confidence interval and the ability to establish percentiles are also important. The difficulty here is that the normal way of calculating the Standard Error is too complicated, as the error for each coefficient and the residual error have to be combined.

An additional difficulty is that variation often increases with the size of the measurement. For example, growth in the early embryo has an SD of a few micrograms. By term, babies vary by around 500 g, and by the late teens growth varies by a few kg. Unfortunately, multiple regression analysis assumes that the variance is uniform throughout, and this is clearly not the case. A typical example is fitting a growth curve such as femur length in early pregnancy; the increase in variation with age can be shown in this fit.

Altman presented a two-stage procedure. First the mean relationship is fitted. The departure from the mean (absolute residual) is then estimated for each case, and these absolute residuals are also curve fitted. After applying a transformation factor, a curve fitted formula for the Standard Deviation is obtained. Confidence intervals can then be calculated around the mean. The result is shown in the following graph.

The reference for this method is: Altman DG (1993) Constructing age-related reference centiles using absolute residuals. Statistics in Medicine 12(10):917-924. A Computer Program to curve fit both the mean and SE is available at the end of this page.
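The two-stage procedure can be sketched in Python as below. Two simplifying assumptions are made: both stages use a straight line rather than a polynomial, and the transformation factor is taken to be the square root of pi/2, which converts the mean absolute deviation of a normal distribution into its SD. The function names and the data are hypothetical, and this is not the code of the program on this page.

```python
import math

def fit_line(x, y):
    """Least squares fit of y = a + b*x. Returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def absolute_residual_reference(age, y, z=1.959964):
    """Two-stage fit: mean curve first, then SD curve from absolute residuals.

    Returns three functions of age: mean, SD, and (lower, upper) limits at
    mean -/+ z*SD. z = 1.959964 gives the central 95% range.
    """
    # Stage 1: fit the mean relationship
    a_mean, b_mean = fit_line(age, y)
    # Stage 2: fit the absolute residuals against age
    abs_resid = [abs(yi - (a_mean + b_mean * t)) for t, yi in zip(age, y)]
    a_res, b_res = fit_line(age, abs_resid)
    factor = math.sqrt(math.pi / 2)   # assumed transformation factor (normal data)

    def mean_at(t):
        return a_mean + b_mean * t

    def sd_at(t):
        return factor * (a_res + b_res * t)

    def limits_at(t):
        return mean_at(t) - z * sd_at(t), mean_at(t) + z * sd_at(t)

    return mean_at, sd_at, limits_at

# Hypothetical data in which the spread grows with age
age = [t for t in range(1, 11) for _ in range(2)]
y = [10 + 2 * t + sign * 0.5 * t for t in range(1, 11) for sign in (-1, 1)]

mean_at, sd_at, limits_at = absolute_residual_reference(age, y)
for t in (2, 5, 10):
    low, high = limits_at(t)
    print(f"age {t}: mean {mean_at(t):.2f}, SD {sd_at(t):.2f}, range {low:.2f} to {high:.2f}")
```

The limits widen as age increases, which a single residual SD from ordinary regression could not show.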

Sample size

The formula to estimate the sample size required to calculate Pearson's correlation coefficient is given below. Although the calculations in simple linear regression are similar to those for Pearson's correlation coefficient, the sample sizes required for the two differ slightly. The main reason is that regression analysis requires the estimation of two parameters (intercept and slope), while correlation estimates only one (the coefficient). Only the sample size formula for the correlation coefficient is discussed.

A Computer Program to estimate the sample size required for Pearson's Correlation Coefficient is available at the end of this page.
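One widely used approximation for this sample size is based on Fisher's z transformation: n = ((z_alpha + z_beta) / C)^2 + 3, where C = 0.5 x ln((1 + r) / (1 - r)). The Python sketch below implements that approximation. It is an assumption that this matches the formula used by the program on this page, and the function name and default values are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_for_correlation(r, alpha=0.05, power=0.80, two_sided=True):
    """Approximate number of pairs needed to detect a correlation of size r.

    Uses n = ((z_alpha + z_beta) / C)^2 + 3, with C = atanh(r), the Fisher z
    transform of the expected correlation (assumed formula). Rounded up.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be non-zero and strictly between -1 and 1")
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)   # critical value for Type I error
    z_beta = NormalDist().inv_cdf(power)       # critical value for Type II error
    c = math.atanh(abs(r))                     # 0.5 * ln((1 + r) / (1 - r))
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)

# Expected correlation 0.3, two-sided alpha 0.05, power 0.80
print(sample_size_for_correlation(0.3))
```

With these inputs the calculation gives 84.93, which rounds up to 85 pairs. Smaller expected correlations need far larger samples, because C shrinks towards zero as r does.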

Computer Programs

Version 1.0  Last change 18th October 2006