Sample size for comparing two groups

Introduction

The material covered in the probability and statistical significance pages should be understood, or at least be familiar to you, before proceeding, in particular the normal distribution and Type I and Type II errors.

The changing emphasis in statistics has made sample size determination a central task in research planning.  This page covers the probability theory behind sample size calculations and how sample sizes are calculated in a number of standard statistical models.

Algorithms for calculating sample sizes are based on the text book: Machin D, Campbell M, Fayers P, Pinol A (1997) Sample Size Tables for Clinical Studies. Second Ed. Blackwell Science. ISBN 0-86542-870-0.  Those interested in more detail about sample sizes may like to consult this excellent book.

Why sample size

As explained in the statistical significance page, statistical decisions became possible with the development of Type I error by Fisher.  A later improvement came through the use of Type II error and the statistical significance model by Pearson.  Pearson’s model was developed with the intention of providing a statistical decision.  However, it also provides a theoretical framework that establishes the mathematical relationships between five parameters: Type I error, Type II error, a non-trivial value that represents a difference that matters, the Standard Deviation (SD), and sample size.  If four (4) of these parameters are known, then the fifth can be calculated.  Specifically, the sample size required to compare two groups can be estimated if the other four (4) parameters are nominated.  In other words, researchers are able to know the sample size they need for their results to be interpreted with confidence.

An understanding of this model, together with the availability of an objective method for calculating the optimal sample size, has allowed sample size theory and practice to develop further and to receive greater emphasis.

At the technical level, statistical modelling allows sample size calculations to extend to data with different types of distributions (e.g., proportions, counts, durations, ranks) and  specialised research situations (e.g., phase II drug trials, post marketing trials, quality testing).  

From the researcher's point of view, the availability of sample size estimation for common research models greatly assists the planning and evaluation of research.

Knowledge of the appropriate sample size allows the researcher to estimate the time and resources required to complete the study, and therefore the feasibility and viability of the project.  This assists in resource planning and reduces the risk of research projects being left incomplete.

An undersized study produces either uninterpretable results or results that will not stand the test of time.  On the other hand, a study larger than necessary wastes resources, inconveniences colleagues, and imposes unnecessary risks and discomfort on research subjects.

The absence of adequate sample size considerations symbolises poor research design and indicates bad and possibly unethical research.  Increasingly, if sample size considerations are inadequate, granting bodies will not support, regulating bodies will not approve, and editors of scientific publications will not accept results of the research project.

The importance of sample size is discussed clearly and comprehensively by Cohen J  (1992) A Power Primer. Psychological Bulletin Vol 112 No. 1 p. 155-159.  Participants are strongly encouraged to read this paper.

Nomenclature

Sample size requirement in a two group comparison depends on the four (4) parameters:

1. Type I error
2. Type II error
3. A meaningful difference in the mean values between the two groups
4. Standard Deviation of the measured outcomes in the population to be studied

Type I and Type II errors are defined and discussed in the statistical significance page.  Although researchers should decide what values to use, a common convention is to set the Type I error at 0.05 and the Type II error at 0.2 (power of 80%).

A difference that matters

This is a nominated value and represents a difference that is clinically meaningful or of practical importance to the researcher.  Although a common practice is to obtain this value from published papers, this is unnecessary or even inappropriate, as the intention is to detect a difference that matters, not one that has been found by others.

Population Standard Deviation

The population Standard Deviation (SD) represents the background variation.  Unfortunately, the SD can be difficult to estimate and the validity of the model depends on it. The SD of the data obtained cannot be guaranteed to be similar to that nominated for the research project.  This value is often based on published results, but is sometimes guessed (guesstimated).

Effect Size

Conceptually, the effect size (ES) is the difference to be detected compared with the background variation.  The sample size required is inversely related to the effect size.  In other words, the smaller the difference to be detected, or the larger the background variation, the greater the sample size needed.
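
As a minimal sketch, the effect size described here can be written as the difference to be detected divided by the SD.  The function name effect_size and the example figures are assumptions made for illustration and are not part of the original page.

```python
def effect_size(difference, sd):
    """Standardised effect size: difference to detect divided by the SD.

    Hypothetical helper for illustration; both arguments must be in the
    same units as the measured outcome.
    """
    if sd <= 0:
        raise ValueError("sd must be positive")
    return abs(difference) / sd


# Illustrative figures only: a 5 unit difference against an SD of 10.
print(effect_size(5, 10))    # 0.5
# Halving the difference (or doubling the SD) halves the effect size.
print(effect_size(2.5, 10))  # 0.25
```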

Effect size calculations under different research models

Sample size estimation requires knowledge of the effect size.  This is usually obtained from known population parameters or from published observations.  When research is planned for a novel situation, such parameters are not always available and some sort of guesstimate will have to be used.

Experience suggests that an effect size of 0.3 or less can be considered small and requires a very large sample size.  An effect size of 0.7 or more is large and requires only a small sample size.  Most clinical studies require an effect size between these two extremes.  An effect size of 0.5 can be used when no reliable data can be referenced.
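
The following sketch shows how these effect sizes translate into sample sizes.  It assumes a two-sided test, the normal approximation n = 2(z_alpha + z_beta)^2 / ES^2 per group, and the conventional Type I error of 0.05 and Type II error of 0.2.  The function name n_per_group is hypothetical, and tables based on the t distribution may give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, beta=0.2):
    """Approximate sample size per group for comparing two means.

    Assumes a two-sided test and the normal approximation; this is a
    sketch, not the algorithm used by any particular program.
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for Type I error
    z_beta = z.inv_cdf(1 - beta)          # critical value for Type II error
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


for es in (0.3, 0.5, 0.7):
    print(es, n_per_group(es))
# 0.3 175
# 0.5 63
# 0.7 33
```

The sample size grows with the inverse square of the effect size, which is why small effect sizes become expensive so quickly.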

Power

The term power = 1 – beta is commonly presented as a percentage.  A Type II error (beta) of 0.2 means power equals 80%, which is commonly used for sample size calculations.  The term power is intuitively easier to understand.  It means the ability to detect a difference if it is really there.
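
Power can also be estimated for a given sample size.  The sketch below assumes two equal groups, a two-sided test and the normal approximation, and it ignores the negligible chance of rejecting in the wrong direction.  The function name approximate_power is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) for comparing two means.

    Assumes equal group sizes and the normal approximation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic, in standard error units.
    signal = abs(effect_size) * sqrt(n_per_group / 2)
    return z.cdf(signal - z_alpha)


# With 63 subjects per group and an effect size of 0.5,
# the power is close to the conventional 80%.
print(round(approximate_power(0.5, 63), 2))  # 0.8
```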

A power analysis can be carried out at the end of the study using the collected data.  This checks whether the nominated difference and SD are close to those actually obtained in the data and, if not, how the interpretation of the data should be modified.  Power analysis is particularly important if statistical decisions are based on statistical significance.  A statistically non-significant conclusion is validated when the power in the data collected accurately reflects that proposed during planning.

With the increased use of confidence intervals, power analysis becomes less important.  Much of the information for decision making is conveyed through confidence intervals.  If a difference (less than which can be considered trivial) is defined along with a tolerance interval, then a conclusion can be drawn as to whether the difference is large enough to be considered significant, small enough to be considered equivalent, or whether the data lack power.

Comparison of two means

Comparison of two proportions
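
As an illustrative sketch only, one widely used normal approximation for comparing two proportions is shown below.  It assumes a two-sided test and equal group sizes, and it is not necessarily the algorithm used in the reference text book.  The function name n_per_group_proportions and the example proportions are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group_proportions(p1, p2, alpha=0.05, beta=0.2):
    """Approximate sample size per group for comparing two proportions.

    Uses the pooled proportion under the null hypothesis and the
    separate proportions under the alternative (normal approximation).
    """
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("p1 and p2 must differ and lie strictly between 0 and 1")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    p_bar = (p1 + p2) / 2                 # pooled proportion
    null_term = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ceil((null_term + alt_term) ** 2 / (p1 - p2) ** 2)


# Illustrative figures: detecting 60% versus 40%.
print(n_per_group_proportions(0.6, 0.4))  # 97
```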

Unequal groups

In two group comparisons, the sample size in each group does not necessarily have to be equal. Two situations where this might occur are given below.

The first is an anticipation that cases will be difficult to obtain in one of the groups, so a ratio between the two sample sizes is determined.  The second is a realisation that it is not possible to obtain more than a set number of cases in one group, which falls short of the sample size requirement, so the other group must be increased to keep the total sample size adequate.

Computer programs for sample size often include corrections for different sample sizes between groups.  However, if the total sample size has been calculated on the basis of equal numbers in each group, a correction to obtain unequal group numbers is still possible.

The algorithm provided here calculates the approximate adjustments required to obtain groups of unequal size.  The original sample size must be based on calculations for groups of equal size.  The reference is the text book Gerald van Belle (2002). Statistical Rules of Thumb. John Wiley and Sons, New York. ISBN 0-471-40227-3. p. 45-46.
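
A minimal sketch of such an adjustment is given below.  It assumes the adjustment keeps 1/n1 + 1/n2 equal to 2/n, where n is the sample size per group calculated for equal groups, so that the precision of the comparison is unchanged.  The function names and example figures are hypothetical, and the page's own algorithm may differ in detail.

```python
from math import ceil


def unequal_by_ratio(n_equal, ratio):
    """Group sizes (n1, n2) when n2 is to be `ratio` times n1.

    n_equal is the per-group size calculated for equal groups.
    """
    if n_equal <= 0 or ratio <= 0:
        raise ValueError("n_equal and ratio must be positive")
    n1 = n_equal * (1 + ratio) / (2 * ratio)
    n2 = ratio * n1
    return ceil(n1), ceil(n2)


def unequal_by_fixed_group(n_equal, n_fixed):
    """Size needed in the other group when one group is limited to n_fixed.

    No solution exists unless n_fixed is more than half of n_equal.
    """
    if 2 * n_fixed <= n_equal:
        raise ValueError("n_fixed must exceed half of n_equal")
    return ceil(n_equal * n_fixed / (2 * n_fixed - n_equal))


# Illustrative figures: 63 per group was calculated for equal groups.
print(unequal_by_ratio(63, 2))         # (48, 95)
print(unequal_by_fixed_group(63, 40))  # 149
```

In both examples the total sample size is larger than the 126 needed with equal groups, which is the price paid for the imbalance.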

Multiple comparisons

Sample size considerations so far have been based on the comparison of two groups, which is the most common situation in research.  However, studies with multiple groups may arise in two common situations.

The first is when there are more than two groups.  For instance, comparing people from different religious affiliations or a range of social classes would require more than two groups.  The second situation is in the factorial design, where more than one influence is considered, such as the effects of sex and gestation on birth weight.

More elaborate models exist to calculate sample size under these considerations, based on whether all the groups are equally important, and when there is more than one factor, whether they interact.  A statistician's service is usually required for sample size estimation using these models.

One way to simplify the complexities is to consider multiple group studies as a series of two group studies that will take place simultaneously.  Here, the sample size is calculated for all pair-wise comparisons (grp1/grp2, grp1/grp3, grp2/grp3, etc).  The largest sample size estimated is used.

A refinement of this approach is to use Bonferroni's correction, where the Type I error is divided by the number of comparisons.  For example, if a Type I error of 0.05 is used to reject the null hypothesis and there are three (3) groups (i.e., 3 pair-wise comparisons), then the Type I error used in sample size calculations should be 0.05 / 3 = 0.017.  The increase in precision from this procedure may be trivial compared with the uncertainty in the guesstimates of the difference to detect and the population Standard Deviation.  Therefore, this correction is not always used.
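
The effect of this correction on sample size can be sketched as follows.  The example assumes a two-sided test, the normal approximation for comparing two means, all pair-wise comparisons between the groups, and an effect size of 0.5.  The function name n_per_group is hypothetical.

```python
from math import ceil, comb
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, beta=0.2):
    """Approximate per-group sample size (two-sided, normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


groups = 3
comparisons = comb(groups, 2)          # number of pair-wise comparisons: 3
alpha_adjusted = 0.05 / comparisons    # Bonferroni-corrected Type I error

print(round(alpha_adjusted, 3))                  # 0.017
print(n_per_group(0.5))                          # 63 without correction
print(n_per_group(0.5, alpha=alpha_adjusted))    # 84 with correction
```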

Version 1.0  Last change 6th July 2006