This page covers the historical development of Type I and Type II errors. The logic and controversy surrounding their development are explained, along with some uses and misuses. Other statistical methodologies evolved from these concepts.
Explanations in this page assume that you have covered the material in the probability page. These concepts should be understood or at least be familiar to you before proceeding.
To the beginner in statistics, the vocabulary and concepts used in statistics appear convoluted and back to front, almost beyond comprehension. Therefore, we will do our best to simplify things.
Scientifically, it is difficult to know what the truth is about anything. Our knowledge is incomplete and we sometimes do not even know what knowledge we lack. Therefore, most scientists accept a view of reality as the best current estimate of truth until something better comes along.
Usually it is much easier to know whether something is not true. A theory can be dismissed when it is confronted with contradicting observations. Therefore, a useful approach is to establish a proposition or hypothesis and then try to disprove it. At least you know what is not true.
This approach is so often successful that it has become one of the standard ways of proving something. Instead of trying to demonstrate what is true, scientists try to demonstrate what is not true, what cannot be true, or in the case of statistics what is unlikely to be true.
Statistics is a method of handling observations and assumes that these observations approximate underlying realities. Decisions are therefore made in terms of probability. The approach is to set up an idealised hypothesis and evaluate a set of observations against that hypothesis. Conclusions are drawn on the basis of how large the error may be if we accept or reject the hypothesis.
The two major hypotheses used in statistics are the null hypothesis and the alternative hypothesis. Observations are evaluated against the hypotheses using Type I and Type II errors.
It was Gauss who pointed out that repeated observations produce different measurements that cluster around a central value, and that observations become less frequent the further they fall from that central value. This pattern became known as the Normal Distribution. De Moivre had earlier produced its mathematical expression as an approximation to the binomial distribution curve. Although this was an interesting discovery of a natural phenomenon, not much use was made of it until Fisher came along.
Fisher produced four (4) related concepts that form the foundation of statistics.
Fisher used the mathematics of integration to calculate the area under the Normal distribution curve. The total area under this curve equals one (1) so any portion of the area has a value between 0 and 1. Probability is directly related to the area under the curve. Following this, Fisher devised the concept of Standard Deviation (SD) and defined the probability of obtaining a value greater than a particular SD value. The probability is defined by the area under the normal distribution curve greater than that value. The concept is presented in this diagram.
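As a minimal sketch of the area-under-the-curve idea, the following uses Python's standard library to compute the probability of a value lying more than a given number of SDs above the mean. The function name upper_tail_probability and the chosen z values are illustrative assumptions, not part of the original page.

```python
from statistics import NormalDist

def upper_tail_probability(z):
    """Area under the standard Normal curve to the right of z (in SD units)."""
    # The total area is 1, so the upper tail is 1 minus the area to the left.
    return 1.0 - NormalDist().cdf(z)

# Illustrative z values only.
for z in (0.0, 1.0, 1.645, 1.96, 3.0):
    print(f"z = {z:5.3f}   P(value > z) = {upper_tail_probability(z):.4f}")
```

The further the value lies from the mean, the smaller the remaining area, and so the smaller the probability.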
Fisher developed the random sampling theory, so that the mean obtained from measuring a small sample can be used to estimate the mean of the population. As the sample mean is only an estimate, Fisher devised a measurement called the Standard Error of the mean (SE). Conceptually, the Standard Error of the mean is the standard deviation of all the mean values that would be obtained if samples of the same size were taken repeatedly. The concept of SE is represented in this diagram.
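The idea that the SE is "the standard deviation of the sample means" can be checked by simulation. The sketch below compares the usual formula SE = SD / sqrt(n) with the spread of means from repeated random samples. The population values (mean 100, SD 10), the sample size of 25 and the function names are assumptions chosen only for illustration.

```python
import math
import random
import statistics

def standard_error(sd, n):
    """Standard Error of the mean from the usual formula SD / sqrt(n)."""
    return sd / math.sqrt(n)

def simulated_standard_error(pop_mean, pop_sd, n, repeats=5000, seed=1):
    """SD of the means of many repeated samples of size n (Normal population)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(pop_mean, pop_sd) for _ in range(n)])
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

print("formula   :", standard_error(10.0, 25))
print("simulation:", round(simulated_standard_error(100.0, 10.0, 25), 3))
```

The two figures should agree closely, and the agreement improves as the number of repeated samples increases.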
Fisher further developed the analysis of variance (ANOVA), where the variations of measurements are partitioned into within and between groups variance. From this, the difference between two means can be taken as an estimated difference of the two populations. The Standard Error of this difference is estimated from the partitioned variance. This concept is represented by this diagram.
Fisher then devised the Null Hypothesis: a hypothetical population with a mean of null (0) and a Standard Error equal to the SE of the difference. The question asked is, “How likely is it to obtain a value greater than the observed difference if the Null Hypothesis is true?” Formally stated, Fisher asked, “What is the probability of Error if the Null Hypothesis is rejected?” The further the difference is from null, the smaller the error, and the more likely that the difference is not null. This error is called Type I error (abbreviated as alpha, α), and the probability calculated from the data is often referred to as Fisher's p. The concept is represented by this diagram.
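The last two steps can be sketched together: estimate the Standard Error of the difference from the pooled within-group variance, then find the area of the Null Hypothesis curve beyond the observed difference. This is a simplified illustration under stated assumptions: the two small data sets are invented, the function names are hypothetical, and the Normal curve is used throughout, whereas with samples this small a t distribution would ordinarily be used.

```python
import math
from statistics import NormalDist, fmean, variance

def se_of_difference(group_a, group_b):
    """SE of the difference between two means, from the pooled within-group variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))

def p_under_null(difference, se):
    """Probability of a value greater than `difference` if the true mean is 0."""
    return 1.0 - NormalDist(mu=0.0, sigma=se).cdf(difference)

# Invented measurements, for illustration only.
control = [10.1, 9.8, 10.4, 9.9, 10.0, 10.3]
treated = [10.6, 10.9, 10.2, 10.8, 10.5, 11.0]

difference = fmean(treated) - fmean(control)
se = se_of_difference(control, treated)
print(f"difference = {difference:.3f}, SE = {se:.3f}, p = {p_under_null(difference, se):.4f}")
```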
The development of Type I error allows an estimation of error in decision making. This facilitates objective judgement of whether a new treatment or intervention is successful or not. Such a tool allows a scientific approach to agriculture and manufacturing and was responsible for rapid scientific advances in the last century.
In practice, a number of problems have been identified.
Although robust conclusions can be drawn if Type I error is small, no conclusion can be drawn if Type I error is large. A large error in rejecting the Null Hypothesis does not necessarily mean that there is a small error in accepting it.
Type I error is sensitive to sample size. When few observations are made, the Standard Error is large and so is the Type I error. A false conclusion that the Null Hypothesis cannot be rejected can be made. With a large sample the reverse happens. The Null Hypothesis is rejected, but the difference may be trivial and of no practical importance.
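The sensitivity to sample size can be seen by holding the difference and SD fixed and changing only the number of observations. The sketch below assumes two equal groups, a known SD and a one-tailed Normal test; the numbers (a difference of 2 with an SD of 10) are invented for illustration.

```python
import math
from statistics import NormalDist

def type_i_error(difference, sd, n_per_group):
    """One-tailed probability of exceeding `difference` under the Null Hypothesis."""
    se = sd * math.sqrt(2.0 / n_per_group)  # SE of the difference between two means
    return 1.0 - NormalDist(mu=0.0, sigma=se).cdf(difference)

# The same small difference, evaluated with increasing sample sizes.
for n in (10, 100, 1000, 10000):
    print(f"n per group = {n:6d}   Type I error = {type_i_error(2.0, 10.0, n):.4f}")
```

With few observations the same difference cannot be distinguished from null, while with many observations it is declared significant regardless of whether a difference of that size matters in practice.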
Pearson disagreed with Fisher because he considered the Null Hypothesis as only half the story. Pearson proposed that another hypothesis should also be utilized, called the Alternative Hypothesis. It consists of a hypothetical population with the same Standard Error as the Null hypothesis, but a mean value that represents a meaningful difference between the two groups. Stating the Alternative hypothesis makes it possible to calculate Type II error. Type II error is the probability of error if the Alternative Hypothesis is rejected when it is true.
Type I and Type II errors for any observed difference can be estimated when both the Null and Alternative Hypotheses are used. Thus, Pearson argued that a decision to reject the Null Hypothesis or the Alternative Hypothesis can be made. The concept is represented by this diagram.
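A minimal sketch of the two-hypothesis model follows. It treats the Null Hypothesis as a Normal curve centred on 0 and the Alternative Hypothesis as a Normal curve with the same Standard Error centred on a meaningful difference. The function name and the example values are assumptions for illustration.

```python
from statistics import NormalDist

def errors_for_observed(observed, alt_mean, se):
    """Return (Type I, Type II) errors for an observed difference.

    Type I : area of the Null curve (mean 0) above the observed difference.
    Type II: area of the Alternative curve (mean alt_mean) below it.
    """
    type_i = 1.0 - NormalDist(mu=0.0, sigma=se).cdf(observed)
    type_ii = NormalDist(mu=alt_mean, sigma=se).cdf(observed)
    return type_i, type_ii

# Illustrative values: meaningful difference of 5, SE of 2.
for observed in (1.0, 2.5, 4.0):
    t1, t2 = errors_for_observed(observed, alt_mean=5.0, se=2.0)
    print(f"observed = {observed:3.1f}   Type I = {t1:.4f}   Type II = {t2:.4f}")
```

As the observed difference moves from null towards the alternative mean, Type I error falls and Type II error rises; the two are equal when the observed difference lies halfway between them.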
Although elegant, this model is difficult to use because it is difficult to logically nominate the mean of the Alternative Hypothesis. The concept was therefore reorganised in the following manner.
Nominate Type I Error to be used to reject the Null Hypothesis (usually 0.05)
Nominate Type II Error to reject the Alternative Hypothesis (usually 0.2)
Estimate background Standard Deviation (SD) of the population
Arbitrarily decide on a difference that is of practical importance, and use this as the mean of the alternative hypothesis
The sample size required to test this model can now be calculated, along with a critical value for the difference that enables a statistical decision. At the end of collecting this sample, if the difference exceeds the critical value, the Null Hypothesis is rejected and the Alternative Hypothesis accepted; the difference is then declared statistically significant. If the difference is less than the critical value, the Null Hypothesis is accepted (the Alternative Hypothesis is rejected) and the difference is said to be statistically not significant. The concept is represented by this diagram.
A statistician's viewpoint: Statistical tests are designed so that only the Null hypothesis is accepted or rejected. The Alternative hypothesis is never actually rejected and is stated above only to assist understanding. When the observed difference is less than the critical value, the Null is accepted, and when the observed difference is greater than the critical value, the Null is rejected. The Alternative hypothesis is automatically accepted when the Null is rejected. No matter what decision is made (Null accepted or rejected), sufficient power is always achieved when design specifications (i.e., SD) are met. Formally stated, Type II error is the probability of accepting the Null hypothesis when the Alternative hypothesis is true.
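The four nominated quantities are enough to sketch the calculation. The example below uses a common Normal-approximation formula for comparing two equal-sized groups with a one-tailed test, n per group = 2 × ((z_alpha + z_beta) × SD / difference)², which is an assumption on our part; the original page does not state which formula it uses. The function name plan_study and the input values are also illustrative.

```python
import math
from statistics import NormalDist

def plan_study(alpha, beta, sd, difference):
    """Return (sample size per group, critical value) for a one-tailed two-group design."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha)  # cut-off for the nominated Type I error
    z_beta = z.inv_cdf(1.0 - beta)    # cut-off for the nominated Type II error
    n_per_group = math.ceil(2.0 * ((z_alpha + z_beta) * sd / difference) ** 2)
    se = sd * math.sqrt(2.0 / n_per_group)  # SE of the difference at that sample size
    critical_value = z_alpha * se
    return n_per_group, critical_value

n, critical = plan_study(alpha=0.05, beta=0.2, sd=10.0, difference=5.0)
print(f"n per group = {n}, critical difference = {critical:.2f}")
```

An observed difference above the critical value leads to rejection of the Null Hypothesis; one below it leads to acceptance.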
Pearson's model of statistical significance became widely accepted and was responsible for many advances in psychosocial and clinical medical research. With usage, however, some of the weaknesses in the model became apparent. Towards the end of the century there was an attempt to move away from this model. Some of the problems are as follows.
The model is only valid if all the parameters used are reflected in the data. The population SD nominated often differs from the SD observed in the data. This distorts the conclusion, particularly if the difference is close to the critical value.
The mean difference obtained may be less than the critical value, but because the observed SD differs from the nominated SD, the Type I error may still be less than 0.05. An erroneous conclusion that the difference is statistically significant is then made, and a significant difference is found that may not be meaningful.
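The distortion can be illustrated with invented numbers. Suppose a study was planned with a nominated SD of 10 and 50 subjects per group, but the collected data show an SD of 8 and a difference of 3. The sketch assumes a one-tailed Normal test with two equal groups; all values and function names are hypothetical.

```python
import math
from statistics import NormalDist

def critical_value(alpha, sd, n_per_group):
    """Smallest difference declared significant under the planned design."""
    se = sd * math.sqrt(2.0 / n_per_group)
    return NormalDist().inv_cdf(1.0 - alpha) * se

def type_i_error(difference, sd, n_per_group):
    """One-tailed Type I error for a difference, given the SD actually used."""
    se = sd * math.sqrt(2.0 / n_per_group)
    return 1.0 - NormalDist(mu=0.0, sigma=se).cdf(difference)

planned_sd, observed_sd, n, observed_difference = 10.0, 8.0, 50, 3.0

print(f"planned critical value      = {critical_value(0.05, planned_sd, n):.2f}")
print(f"Type I error (planned SD)   = {type_i_error(observed_difference, planned_sd, n):.4f}")
print(f"Type I error (observed SD)  = {type_i_error(observed_difference, observed_sd, n):.4f}")
```

Here the difference falls short of the planned critical value, yet the Type I error computed from the observed SD drops below 0.05, so the two ways of reaching a decision disagree.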
The weaknesses of Pearson's model became increasingly apparent. Some of the conclusions derived from that model could not stand the test of time and a number of new methods evolved to compensate for these weaknesses.
Post hoc power analysis: Check that the planned model remains valid by re-estimating Type I and Type II errors using the data collected.
Meta-Analysis: Recognise that the results of any single study could be flawed. Conclusions are safer if a number of studies are pooled and the data re-analysed.
The Confidence Interval Model: Related to Type I error, but instead the difference and its Standard Error are calculated. A confidence interval of the difference is then determined. This approach addresses many of the problems that arise from using Type I error and does not incur the risk of misinterpretation that exists with Pearson's model.
Sample size estimation: Pearson's model was developed further into methods for estimating sample size, to ensure that research projects contain an appropriate amount of data for adequate interpretation.
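As a brief sketch of the confidence interval approach, the interval is the observed difference plus or minus a multiple of its Standard Error. The example assumes a two-sided 95% interval based on the Normal curve; the function name and the input values (a difference of 3 with an SE of 1.6) are invented for illustration.

```python
from statistics import NormalDist

def confidence_interval(difference, se, level=0.95):
    """Two-sided confidence interval for a difference, using the Normal curve."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    return difference - z * se, difference + z * se

lower, upper = confidence_interval(3.0, 1.6)
print(f"difference = 3.0, 95% CI = {lower:.2f} to {upper:.2f}")
```

Reporting the interval shows both the size of the difference and the uncertainty around it, rather than only a significant or not significant verdict.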