n–1
The table above shows only the t -tests for population means. Another common t -test is for correlation coefficients . You use this t -test to decide if the correlation coefficient is significantly different from zero.
When you define the hypothesis, you also define whether you have a one-tailed or a two-tailed test. You should make this decision before collecting your data or doing any calculations. You make this decision for all three of the t -tests for means.
To explain, let’s use the one-sample t -test. Suppose we have a random sample of protein bars, and the label for the bars advertises 20 grams of protein per bar. The null hypothesis is that the unknown population mean is 20. Suppose we simply want to know if the data shows we have a different population mean. In this situation, our hypotheses are:
$ \mathrm H_o: \mu = 20 $
$ \mathrm H_a: \mu \neq 20 $
Here, we have a two-tailed test. We will use the data to see if the sample average differs sufficiently from 20 – either higher or lower – to conclude that the unknown population mean is different from 20.
Suppose instead that we want to know whether the advertising on the label is correct. Does the data support the idea that the unknown population mean is at least 20? Or not? In this situation, our hypotheses are:
$ \mathrm H_o: \mu >= 20 $
$ \mathrm H_a: \mu < 20 $
Here, we have a one-tailed test. We will use the data to see if the sample average is sufficiently less than 20 to reject the hypothesis that the unknown population mean is 20 or higher.
See the "tails for hypotheses tests" section on the t -distribution page for images that illustrate the concepts for one-tailed and two-tailed tests.
For all of the t -tests involving means, you perform the same steps in analysis:
An official website of the United States government
The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.
The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.
Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .
Tae kyun kim.
Department of Anesthesia and Pain Medicine, Pusan National University School of Medicine, Busan, Korea.
In statistic tests, the probability distribution of the statistics is important. When samples are drawn from population N (µ, σ 2 ) with a sample size of n, the distribution of the sample mean X ̄ should be a normal distribution N (µ, σ 2 / n ). Under the null hypothesis µ = µ 0 , the distribution of statistics z = X ¯ - µ 0 σ / n should be standardized as a normal distribution. When the variance of the population is not known, replacement with the sample variance s 2 is possible. In this case, the statistics X ¯ - µ 0 s / n follows a t distribution ( n-1 degrees of freedom). An independent-group t test can be carried out for a comparison of means between two independent groups, with a paired t test for paired data. As the t test is a parametric test, samples should meet certain preconditions, such as normality, equal variances and independence.
A t test is a type of statistical test that is used to compare the means of two groups. It is one of the most widely used statistical hypothesis tests in pain studies [ 1 ]. There are two types of statistical inference: parametric and nonparametric methods. Parametric methods refer to a statistical technique in which one defines the probability distribution of probability variables and makes inferences about the parameters of the distribution. In cases in which the probability distribution cannot be defined, nonparametric methods are employed. T tests are a type of parametric method; they can be used when the samples satisfy the conditions of normality, equal variance, and independence.
T tests can be divided into two types. There is the independent t test, which can be used when the two groups under comparison are independent of each other, and the paired t test, which can be used when the two groups under comparison are dependent on each other. T tests are usually used in cases where the experimental subjects are divided into two independent groups, with one group treated with A and the other group treated with B. Researchers can acquire two types of results for each group (i.e., prior to treatment and after the treatment): preA and postA, and preB and postB. An independent t test can be used for an intergroup comparison of postA and postB or for an intergroup comparison of changes in preA to postA (postA-preA) and changes in preB to postB (postB-preB) ( Table 1 ).
Treatment A | Treatment B | |||||||
---|---|---|---|---|---|---|---|---|
ID | preA | postA | ΔA | ID | preB | postB | ΔB | |
1 | 63 | 77 | 14 | 11 | 81 | 101 | 20 | |
2 | 69 | 88 | 19 | 12 | 87 | 103 | 16 | |
3 | 76 | 90 | 14 | 13 | 77 | 107 | 30 | |
4 | 78 | 95 | 17 | 14 | 80 | 114 | 34 | |
5 | 80 | 96 | 16 | 15 | 76 | 116 | 40 | |
6 | 89 | 96 | 7 | 16 | 86 | 116 | 30 | |
7 | 90 | 102 | 12 | 17 | 98 | 116 | 18 | |
8 | 92 | 104 | 12 | 18 | 87 | 120 | 33 | |
9 | 103 | 110 | 7 | 19 | 105 | 120 | 15 | |
10 | 112 | 115 | 3 | 20 | 69 | 127 | 58 |
ID: individual identification, preA, preB: before the treatment A or B, postA, postB: after the treatment A or B, ΔA, ΔB: difference between before and after the treatment A or B.
On the other hand, paired t tests are used in different experimental environments. For example, the experimental subjects are not divided into two groups, and all of them are treated initially with A. The amount of change (postA-preA) is then measured for all subjects. After all of the effects of A disappear, the subjects are treated with B, and the amount of change (postB-preB) is measured for all of the subjects. A paired t test is used in such crossover test designs to compare the amount of change of A to that of B for the same subjects ( Table 2 ).
Treatment A | Treatment B | |||||||
---|---|---|---|---|---|---|---|---|
ID | preA | postA | ΔA | ID | preB | postB | ΔB | |
1 | 63 | 77 | 14 | 1 | 73 | 103 | 30 | |
2 | 69 | 88 | 19 | 2 | 74 | 104 | 30 | |
3 | 76 | 90 | 14 | 3 | 76 | 107 | 31 | |
4 | 78 | 95 | 17 | 4 | 84 | 108 | 24 | |
5 | 80 | 96 | 16 | wash out | 5 | 84 | 110 | 26 |
6 | 89 | 96 | 7 | 6 | 86 | 110 | 24 | |
7 | 90 | 102 | 12 | 7 | 92 | 113 | 21 | |
8 | 92 | 104 | 12 | 8 | 95 | 114 | 19 | |
9 | 103 | 110 | 7 | 9 | 103 | 118 | 15 | |
10 | 112 | 115 | 3 | 10 | 115 | 120 | 5 |
Statistics is basically about probabilities. A statistical conclusion of a large or small difference between two groups is not based on an absolute standard but is rather an evaluation of the probability of an event. For example, a clinical test is performed to determine whether or not a patient has a certain disease. If the test results are either higher or lower than the standard, clinicians will determine that the patient has the disease despite the fact that the patient may or may not actually have the disease. This conclusion is based on the statistical concept which holds that it is more statistically valid to conclude that the patient has the disease than to declare that the patient is a rare case among people without the disease because such test results are statistically rare in normal people.
The test results and the probability distribution of the results must be known in order for the results to be determined as statistically rare. The criteria for clinical indicators have been established based on data collected from an entire population or at least from a large number of people. Here, we examine a case in which a clinical indicator exhibits a normal distribution with a mean of µ and a variance of σ 2 . If a patient's test result is χ, is this statistically rare against the criteria (e.g., 5 or 1%)? Probability is represented as the surface area in a probability distribution, and the z score that represents either 5 or 1%, near the margins of the distribution, becomes the reference value. The test result χ can be determined to be statistically rare compared to the reference probability if it lies in a more marginal area than the z score, that is, if the value of χ is located in the marginal ends of the distribution ( Fig. 1 ).
This is done to compare one individual's clinical indicator value. This however raises the question of how we would compare the mean of a sample group (consisting of more than one individual) against the population mean. Again, it is meaningless to compare each individual separately; we must compare the means of the two groups. Thus, do we make a statistical inference using only the distribution of the clinical indicators of the entire population and the mean of the sample? No. In order to infer a statistical possibility, we must know the indicator of interest and its probability distribution. In other words, we must know the mean of the sample and the distribution of the mean. We can then determine how far the sample mean varies from the population mean by knowing the sampling distribution of the means.
The sample mean we can get from a study is one of means of all possible samples which could be drawn from a population. This sample mean from a study was already acquired from a real experiment, however, how could we know the distribution of the means of all possible samples including studied sample? Do we need to experiment it over and over again? The simulation in which samples are drawn repeatedly from a population is shown in Fig. 2 . If samples are drawn with sample size n from population of normal distribution (µ, σ 2 ), the sampling distribution shows normal distribution with mean of µ and variance of σ 2 / n . The number of samples affects the shape of the sampling distribution. That is, the shape of the distribution curve becomes a narrower bell curve with a smaller variance as the number of samples increases, because the variance of sampling distribution is σ 2 / n . The formation of a sampling distribution is well explained in Lee et al. [ 2 ] in a form of a figure.
Now that the sampling distribution of the means is known, we can locate the position of the mean of a specific sample against the distribution data. However, one problem remains. As we noted earlier, the sampling distribution exhibits a normal distribution with a variance of σ 2 / n , but in reality we do not know σ 2 , the variance of the population. Therefore, we use the sample variance instead of the population variance to determine the sampling distribution of the mean. The sample variance is defined as follows:
In such cases in which the sample variance is used, the sampling distribution follows a t distribution that depends on the 0degree of freedom of each sample rather than a normal distribution ( Fig. 3 ).
A t test is also known as Student's t test. It is a statistical analysis technique that was developed by William Sealy Gosset in 1908 as a means to control the quality of dark beers. A t test used to test whether there is a difference between two independent sample means is not different from a t test used when there is only one sample (as mentioned earlier). However, if there is no difference in the two sample means, the difference will be close to zero. Therefore, in such cases, an additional statistical test should be performed to verify whether the difference could be said to be equal to zero.
Let's extract two independent samples from a population that displays a normal distribution and compute the difference between the means of the two samples. The difference between the sample means will not always be zero, even if the samples are extracted from the same population, because the sampling process is randomized, which results in a sample with a variety of combinations of subjects. We extracted two samples with a size of 6 from a population N (150, 5 2 ) and found the difference in the means. If this process is repeated 1,000 times, the sampling distribution exhibits the shape illustrated in Fig. 4 . When the distribution is displayed in terms of a histogram and a density line, it is almost identical to the theoretical sampling distribution: N(0, 2 × 5 2 /6) ( Fig. 4 ).
However, it is difficult to define the distribution of the difference in the two sample means because the variance of the population is unknown. If we use the variance of the sample instead, the distribution of the difference of the samples means would follow a t distribution. It should be noted, however, that the two samples display a normal distribution and have an equal variance because they were independently extracted from an identical population that has a normal distribution.
Under the assumption that the two samples display a normal distribution and have an equal variance, the t statistic is as follows:
population mean difference (µ 1 - µ 2 ) was assumed to be 0; thus:
The population variance was unknown and so a pooled variance of the two samples was used:
However, if the population variance is not equal, the t statistic of the t test would be
and the degree of freedom is calculated based on the Welch Satterthwaite equation.
It is apparent that if n 1 and n 2 are sufficiently large, the t statistic resembles a normal distribution ( Fig. 3 ).
A statistical test is performed to verify the position of the difference in the sample means in the sampling distribution of the mean ( Fig. 4 ). It is statistically very rare for the difference in two sample means to lie on the margins of the distribution. Therefore, if the difference does lie on the margins, it is statistically significant to conclude that the samples were extracted from two different populations, even if they were actually extracted from the same population.
Paired t tests are can be categorized as a type of t test for a single sample because they test the difference between two paired results. If there is no difference between the two treatments, the difference in the results would be close to zero; hence, the difference in the sample means used for a paired t test would be 0.
Let's go back to the sampling distribution that was used in the independent t test discussed earlier. The variance of the difference between two independent sample means was represented as the sum of each variance. If the samples were not independent, the variance of the difference of two variables A and B, Var (A-B), can be shown as follows,
where σ 1 2 is the variance of variable A, σ 2 2 is the variance of variable B, and ρ is the correlation coefficient for the two variables. In an independent t test, the correlation coefficient is 0 because the two groups are independent. Thus, it is logical to show the variance of the difference between the two variables simply as the sum of the two variances. However, for paired variables, the correlation coefficient may not equal 0. Thus, the t statistic for two dependent samples must be different, meaning the following t statistic,
must be changed. First, the number of samples are paired; thus, n 1 = n 2 = n , and their variance can be represented as s 1 2 + s 2 2 - 2ρ s 1 s 2 considering the correlation coefficient. Therefore, the t statistic for a paired t test is as follows:
In this equation, the t statistic is increased if the correlation coefficient is greater than 0 because the denominator becomes smaller, which increases the statistical power of the paired t test compared to that of an independent t test. On the other hand, if the correlation coefficient is less than 0, the statistical power is decreased and becomes lower than that of an independent t test. It is important to note that if one misunderstands this characteristic and uses an independent t test when the correlation coefficient is less than 0, the generated results would be incorrect, as the process ignores the paired experimental design.
As previously explained, if samples are extracted from a population that displays a normal distribution but the population variance is unknown, we can use the sample variance to examine the sampling distribution of the mean, which will resemble a t distribution. Therefore, in order to reach a statistical conclusion about a sample mean with a t distribution, certain conditions must be satisfied: the two samples for comparison must be independently sampled from the same population, satisfying the conditions of normality, equal variance, and independence.
Shapiro's test or the Kolmogorov-Smirnov test can be performed to verify the assumption of normality. If the condition of normality is not met, the Wilcoxon rank sum test (Mann-Whitney U test) is used for independent samples, and the Wilcoxon sign rank test is used for paired samples for an additional nonparametric test.
The condition of equal variance is verified using Levene's test or Bartlett's test. If the condition of equal variance is not met, nonparametric test can be performed or the following statistic which follows a t distribution can is used.
However, this statistics has different degree of freedom which was calculated by the Welch-Satterthwaite [ 3 , 4 ] equation.
Owing to user-friendly statistics software programs, the rich pool of statistics information on the Internet, and expert advice from statistics professionals at every hospital, using and processing statistics data is no longer an intractable task. However, it remains the researchers' responsibility to design experiments to fulfill all of the conditions of their statistic methods of choice and to ensure that their statistical assumptions are appropriate. In particular, parametric statistical methods confer reasonable statistical conclusions only when the statistical assumptions are fully met. Some researchers often regard these statistical assumptions inconvenient and neglect them. Even some statisticians argue on the basic assumptions, based on the central limit theory, that sampling distributions display a normal distribution regardless of the fact that the population distribution may or may not follow a normal distribution, and that t tests have sufficient statistical power even if they do not satisfy the condition of normality [ 5 ]. Moreover, they contend that the condition of equal variance is not so strict because even if there is a ninefold difference in the variance, the α level merely changes from 0.5 to 0.6 [ 6 ]. However, the arguments regarding the conditions of normality and the limit to which the condition of equal variance may be violated are still bones of contention. Therefore, researchers who unquestioningly accept these arguments and neglect the basic assumptions of a t test when submitting papers will face critical comments from editors. Moreover, it will be difficult to persuade the editors to neglect the basic assumptions regardless of how solid the evidence in the paper is. Hence, researchers should sufficiently test basic statistical assumptions and employ methods that are widely accepted so as to draw valid statistical conclusions.
The results of independent and paired t tests of the examples are illustrated in Tables 1 and 2. The tests were conducted using the SPSS Statistics Package (IBM® SPSS® Statistics 21, SPSS Inc., Chicago, IL, USA).
First, we need to examine the degree of normality by confirming the Kolmogorov-Smirnov or Shapiro-Wilk test in the second table. We can determine that the samples satisfy the condition of normality because the P value is greater than 0.05. Next, we check the results of Levene's test to examine the equality of variance. The P value is again greater than 0.05; hence, the condition of equal variance is also met. Finally, we read the significance probability for the "equal variance assumed" line. If the condition of equal variance is not met (i.e., if the P value is less than 0.05 for Levene's test), we reach a conclusion by referring to the significance probability for the "equal variance not assumed" line, or we perform a nonparametric test.
A paired t test is identical to a single-sample t test. Therefore, we test the normality of the difference in the amount of change for treatment A and treatment B (ΔA-ΔB). The normality is verified based on the results of Kolmogorov-Smirnov and Shapiro-Wilk tests, as shown in the second table. In conclusion, there is a significant difference between the two treatments (i.e., the P value is less than 0.001).
BMC Medical Research Methodology volume 12 , Article number: 81 ( 2012 ) Cite this article
63k Accesses
147 Citations
15 Altmetric
Metrics details
Student’s two-sample t test is generally used for comparing the means of two independent samples, for example, two treatment arms. Under the null hypothesis, the t test assumes that the two samples arise from the same normally distributed population with unknown variance. Adequate control of the Type I error requires that the normality assumption holds, which is often examined by means of a preliminary Shapiro-Wilk test. The following two-stage procedure is widely accepted: If the preliminary test for normality is not significant, the t test is used; if the preliminary test rejects the null hypothesis of normality, a nonparametric test is applied in the main analysis.
Equally sized samples were drawn from exponential, uniform, and normal distributions. The two-sample t test was conducted if either both samples (Strategy I) or the collapsed set of residuals from both samples (Strategy II) had passed the preliminary Shapiro-Wilk test for normality; otherwise, Mann-Whitney’s U test was conducted. By simulation, we separately estimated the conditional Type I error probabilities for the parametric and nonparametric part of the two-stage procedure. Finally, we assessed the overall Type I error rate and the power of the two-stage procedure as a whole.
Preliminary testing for normality seriously altered the conditional Type I error rates of the subsequent main analysis for both parametric and nonparametric tests. We discuss possible explanations for the observed results, the most important one being the selection mechanism due to the preliminary test. Interestingly, the overall Type I error rate and power of the entire two-stage procedure remained within acceptable limits.
The two-stage procedure might be considered incorrect from a formal perspective; nevertheless, in the investigated examples, this procedure seemed to satisfactorily maintain the nominal significance level and had acceptable power properties.
Peer Review reports
Statistical tests have become more and more important in medical research [ 1 – 3 ], but many publications have been reported to contain serious statistical errors [ 4 – 10 ]. In this regard, violation of distributional assumptions has been identified as one of the most common problems: According to Olsen [ 9 ], a frequent error is to use statistical tests that assume a normal distribution on data that are actually skewed. With small samples, Neville et al. [ 10 ] considered the use of parametric tests erroneous unless a test for normality had been conducted before. Similarly, Strasak et al. [ 7 ] criticized that contributors to medical journals often failed to examine and report that assumptions had been met when conducting Student’s t test.
Probably one of the most popular research questions is whether two independent samples differ from each other. Altman, for example, stated that “most clinical trials yield data of this type, as do observational studies comparing different groups of subjects” ([ 11 ], p. 191). In Student’s t test, the expectations of two populations are compared. The test assumes independent sampling from normal distributions with equal variance. If these assumptions are met and the null hypothesis of equal population means holds true, the test statistic T follows a t distribution with n X + n Y – 2 degrees of freedom:
where m X and m Y are the observed sample means, n X and n Y are the sample sizes of the two groups, and s is an estimate of the common standard deviation. If the assumptions are violated, T is compared with the wrong reference distribution, which may result in a deviation of the actual Type I error from the nominal significance level [ 12 , 13 ], in a loss of power relative to other tests developed for similar problems [ 14 ], or both. In medical research, normally distributed data are the exception rather than the rule [ 15 , 16 ]. In such situations, the use of parametric methods is discouraged, and nonparametric tests (which are also referred to as distribution-free tests) such as the two-sample Mann–Whitney U test are recommended instead [ 11 , 17 ].
Guidelines for contributions to medical journals emphasize the importance of distributional assumptions [ 18 , 19 ]. Sometimes, special recommendations are provided. When addressing the question of how to compare changes from baseline in randomized clinical trials if data do not follow a normal distribution, Vickers, for example, concluded that such data are best analyzed with analysis of covariance [ 20 ]. In clinical trials, a detailed description of the statistical analysis is mandatory [ 21 ]. This description requires good knowledge about the clinical endpoints, which is often limited. Researchers, therefore, tend to specify alternative statistical procedures in case the underlying assumptions are not satisfied (e.g., [ 22 ]). For the t test, Livingston [ 23 ] presented a list of conditions that must be considered (e.g., normal distribution, equal variances, etc.). Consequently, some researchers routinely check if their data fulfill the assumptions and change the analysis method if they do not (for a review, see [ 24 ]).
In a preliminary test, a specific assumption is checked; the outcome of the pretest then determines which method should be used for assessing the main hypothesis [ 25 – 28 ]. For the paired t test, Freidlin et al. ([ 29 ], p. 887) referred to as “a natural adaptive procedure (…) to first apply the Shapiro-Wilk test to the differences: if normality is accepted, the t test is used; otherwise the Wilcoxon signed ranked test is used.” Similar two-stage procedures including a preliminary test for normality are common for two-sample t tests [ 30 , 31 ]. Therefore, conventional statistical practice for comparing continuous outcomes from two independent samples is to use a pretest for normality (H 0 : “The true distribution is normal” against H 1 : “The true distribution is non-normal”) at significance level α pre before testing the main hypothesis. If the pretest is not significant, the statistic T is used to test the main hypothesis of equal population means at significance level α . If the pretest is significant, Mann-Whitney’s U test may be applied to compare the two groups. Such a two-stage procedure ( Additional file 1 ) appears logical, and goodness-of-fit tests for normality are frequently reported in articles [ 32 – 35 ].
Some authors have recently warned against preliminary testing [ 24 , 36 – 45 ]. First of all, theoretical drawbacks exist with regard to the preliminary testing of assumptions. The basic difficulty of a typical pretest is that the desired result is often the acceptance of the null hypothesis. In practice, the conclusion about the validity of, for example, the normality assumption is then implicit rather than explicit: Because insufficient evidence exists to reject normality, normality will be considered true. In this context, Schucany and Ng [ 41 ] speak about a “logical problem”. Further critiques of preliminary testing focused on the fact that assumptions refer to characteristics of populations and not to characteristics of samples. In particular, small to moderate sample sizes do not guarantee matching of the sample distribution with the population distribution. For example, Altman ([ 11 ], Figure 4.7, p. 60) showed that even sample sizes of 50 taken from a normal distribution may look non-normal. Second, some preliminary tests are accompanied by their own underlying assumptions, raising the question of whether these assumptions also need to be examined. In addition, even if the preliminary test indicates that the tested assumption does not hold, the actual test of interest may still be robust to violations of this assumption. Finally, preliminary tests are usually applied to the same data as the subsequent test, which may result in uncontrolled error rates. For the one-sample t test, Schucany and Ng [ 41 ] conducted a simulation study of the consequences of the two-stage selection procedure including a preliminary test for normality. Data were sampled from normal, uniform, exponential, and Cauchy populations. The authors estimated the Type I error rate of the one-sample t test, given that the sample had passed the Shapiro-Wilk test for normality with a p value greater than α pre . For exponentially distributed data, the conditional Type I error rate of the main test turned out to be strikingly above the nominal significance level and even increased with sample size. For two-sample tests, Zimmerman [ 42 – 45 ] addressed the question of how the Type I error and power are modified if a researcher’s choice of test (i.e., t test for equal versus unequal variances) is based on sample statistics of variance homogeneity. Zimmerman concluded that choosing the pooled or separate variance version of the t test solely on the inspection of the sample data does neither maintain the significance level nor protect the power of the procedure. Rasch et al. [ 39 ] assessed the statistical properties of a three-stage procedure including testing for normality and for homogeneity of the variances. The authors concluded that assumptions underlying the two-sample t test should not be pre-tested because “pre-testing leads to unknown final Type I and Type II risks if the respective statistical tests are performed using the same set of observations”. Interestingly, none of the studies cited above explicitly addressed the unconditional error rates of the two-stage procedure as a whole. The studies rather focused on the conditional error rates, that is, the Type I and Type II error of single arms of the two-stage procedure.
In the present study, we investigated the statistical properties of Student’s t test and Mann-Whitney’s U test for comparing two independent groups with different selection procedures. Similar to Schucany and Ng [ 41 ], the tests to be applied were chosen depending on the results of the preliminary Shapiro-Wilk tests for normality of the two samples involved. We thereby obtained an estimate of the conditional Type I error rates for samples that were classified as normal although the underlying populations were in fact non-normal, and vice-versa. This probability reflects the error rate researchers may face with respect to the main hypothesis if they mistakenly believe the normality assumption to be satisfied or violated. If, in addition, the power of the preliminary Shapiro-Wilk test is taken into account, the potential impact of the entire two-stage procedure on the overall Type I error rate and power can be directly estimated.
In our simulation study, equally sized samples for two groups were drawn from three different distributions, covering a variety of shapes of data encountered in clinical research. Two selection strategies were examined for the main test to be applied. In Strategy I, the two-sample t test was conducted if both samples had passed the preliminary Shapiro-Wilk test for normality; otherwise, we applied Mann-Whitney’s U test. In Strategy II, the t test was conducted if the residuals ( x i − m X ) , ( y i − m Y ) from both samples had passed the pretest; otherwise, we used the U test. The difference between the two strategies is that, in Strategy I, the Shapiro-Wilk test for normality is separately conducted on raw data from each sample, whereas in Strategy II, the preliminary test is applied only once, i.e. to the collapsed set of residuals from both samples.
Statistical language R 2.14.0 [ 46 ] was used for the simulations. Random sample pairs of size n X = n Y = 10, 20, 30, 40, 50 were generated from the following distributions: (1) exponential distribution with unit expectation and variance; (2) uniform distribution in [0, 1]; and (3) the standard normal distribution. This procedure was repeated until 10,000 pairs of samples had passed the preliminary screening for normality (either Strategy I or II, with α pre = .100, .050, .010, .005, or no pretest). For these samples, the null hypothesis μ X = μ Y was tested against the alternative μ X ≠ μ Y using Student’s t test at the two-sided significance level α = .05. The conditional Type I errors rates (left arm of the decision tree in Additional file 1 ) were then estimated by the number of significant t tests divided by 10,000. The precision of the results thereby amounts to maximally ±1% (width of the 95% confidence interval for proportion 0.5). In a second run, sample generation was repeated until 10,000 pairs were collected that had failed preliminary screening for normality (Strategy I or II), and the conditional Type I error was estimated for Mann-Whitney’s U test (right part of Additional file 1 ).
Finally, 100,000 pairs of samples were generated from exponential, uniform, and normal distributions to assess the unconditional Type I error of the entire two-stage procedure. Depending on whether the preliminary Shapiro-Wilk test was significant or not, Mann-Whitney’s U test or Student’s t test was conducted for the main analysis. The Type I error rate of the entire two-stage procedure was estimated by the number of significant tests ( t or U ) and division by 100,000.
The first strategy required both samples to pass the preliminary screening for normality to proceed with the two-sample t test; otherwise, we used Mann-Whitney’s U test. This strategy was motivated by the well-known assumption that the two-sample t test requires data within each of the two groups to be sampled from normally distributed populations (e.g., [ 11 ]).
Table 1 (left) summarizes the estimated conditional Type I error probabilities of the standard two-sample t test (i.e., t test assuming equal variances) at the two-sided nominal level α = .05 after both samples had passed the Shapiro-Wilk test for normality, as well as the unconditional Type I error rate of the t test without a pretest for normality. Figure 1 additionally plots the corresponding estimates if the underlying distribution was either (A) exponential, (B) uniform, or (C) normal. As can be seen from Table 1 and Figure 1 , the unconditional two-sample t test (i.e., without pretest) was α -robust, even if the underlying distribution was exponential or uniform. In contrast, the observed conditional Type I error rates differed from the nominal significance level. For the exponential distribution, the selective application of the two-sample t test to pairs of samples that had been accepted as normal led to Type I error rates of the final t test that were considerably larger than α = .05 (Figure 1 A). Moreover, the violation of the significance level increased with sample size and α pre . For example, for n = 30, the observed Type I error rates of the two-sample t test turned out to be 10.8% for α pre = .005 and even 17.0% for α pre = .100, whereas the unconditional Type I error rate was 4.7%. If the underlying distribution was uniform, the conditional Type I error rates declined below the nominal level, particularly as samples became larger and preliminary significance levels increased (Figure 1 B). For normally distributed populations, conditional and unconditional Type I error rates roughly followed the nominal significance level (Figure 1 C).
Estimated Type I error probability of the two-sample t test at α = .05 after both samples had passed the Shapiro-Wilk test for normality at α pre = .100, .050, .010, .005 (conditional), and without pretest (unconditional). Samples of equal size from the ( A ) exponential, ( B ) uniform, and ( C ) normal distribution.
For pairs in which at least one sample had not passed the pretest for normality, we conducted Mann-Whitney’s U test. The estimated conditional Type I error probabilities are summarized in Table 1 (right): For exponential samples, only a negligible tendency towards conservative decisions was observed, but samples from the uniform distribution, and, to a lesser extent, samples from the normal distribution proved problematic. In contrast to the pattern observed for the conditional t test, however, the nominal significance level was mostly violated in small samples and numerically low significance levels of the pretest (e.g., α pre = .005).
The two-sample t test is a special case of a linear model that assumes independent normally distributed errors. Therefore, the normality assumption can be examined through residuals instead of raw data. In linear models, residuals are defined as differences between observed and expected values. In the two-sample comparison, the expected value for a measurement corresponds to the mean of the sample from which it derived, so that the residual simplifies to the difference between the observed value and the sample mean. In regression modeling, the assumption of normality is often checked by the plotting of residuals after parameter estimation. However, this order may be reversed, and formal tests of normality based on residuals may be carried out. In Strategy II, one single Shapiro-Wilk test was applied to the collapsed set of residuals from both samples; thus, in contrast to Strategy I, only one pretest for normality had to be passed.
Table 2 (left) and Figure 2 show the estimated conditional Type I error probabilities of the two-sample t test at α = .05 (two-sided) after residuals had passed the Shapiro-Wilk test for the three different underlying distributions and for different α pre levels as well as the corresponding unconditional Type I error rates (i.e., without pretest). For the normal distribution, the conditional and the unconditional Type I error rates were very close to the nominal significance level for all sample sizes and α pre levels considered. Thus, if the underlying distribution was normal, the preliminary Shapiro-Wilk test for normality of the residuals did not affect the Type I error probability of the subsequent two-sample t test.
Estimated Type I error probability of the two-sample t test at α = .05 after the residuals had passed the Shapiro-Wilk test for normality at α pre = .100, .050, .010, .005 (conditional), and without pretest (unconditional). Samples of equal size from the ( A ) exponential, ( B ) uniform, and ( C ) normal distribution.
For the two other distributions, the results were strikingly different. For samples from the exponential distribution, conditional Type I error rates were much larger than the nominal significance level (Figure 2 A). For example, at α pre = .005, conditional Type I error rates ranged between 6.4% for n = 10 up to 79.2% in samples of n = 50. For the largest preliminary α pre level of .100, samples of n = 30 reached error rates above 70%. Thus, the discrepancy between the observed Type I error rate and the nominal α was even more pronounced than for Strategy I and increased again with growing preliminary α pre and increasing sample size.
Surprisingly and in remarkable contrast to the results observed for Strategy I, samples from the uniform distribution that had passed screening for normality of residuals also led to conditional Type I error rates that were far above 5% (Figure 2 B). The distortion of the Type I error rate was only slightly less extreme for the uniform than for the exponential distribution, resulting in error rates up to 50%. The conditional Type I error rate increased again with growing sample size and increasing preliminary significance level of the Shapiro-Wilk test. For example, at α pre = .100, conditional Type I error rates were between 6.5% for n = 10 and even 52.9% for n = 50. Similarly, in samples of n = 50, the conditional Type I error rate was between 13.8% for α pre = .005 and 52.9% for α pre = .100, whereas the Type I error rate without pretest was close to 5.0%.
If the distribution of the residuals was judged as non-normal by the preliminary Shapiro-Wilk test, the two samples were compared by means of Mann-Whitney’s U test (Table 2 , right). As for Strategy I, the Type I error rate of the conditional U test was closest to the nominal α for samples from the exponential distribution. For samples from the uniform distribution, the U test did not fully exhaust the significance level but showed remarkably anti-conservative behavior for samples drawn from the normal distribution, which was most pronounced in small samples and numerically low α pre .
Biased decisions within the two arms of the decision tree in Additional file 1 are mainly a matter of theoretical interest, whereas the unconditional Type I error and power of the two-stage procedure reflect how the algorithm works in practice. Therefore, we directly assessed the practical consequences of the entire two-stage procedure with respect to the overall, unconditional, Type I error. This evaluation was additionally motivated by the anticipation that, although the observed conditional Type I error rates of both the main parametric test and the nonparametric test were seriously altered by screening for normality, these results will rarely occur in practice because the Shapiro-Wilk test is very powerful in large samples. Again, pairs of samples were generated from exponential, uniform, and normal distributions. Depending on whether the preliminary Shapiro-Wilk test was significant or not, Mann-Whitney’s U test or Student’s t test was conducted in the main analysis. Table 3 outlines the estimated unconditional Type I error rates. In line with this expectation, the results show that the two-stage procedure as a whole can be considered robust with respect to the unconditional Type I error rate. This holds true for all three distributions considered, irrespectively of the strategy chosen for the preliminary test.
Because the two-stage procedure seemed to keep the nominal significance level, we additionally investigated the corresponding statistical power. To this end, 100,000 pairs of samples were drawn from unit variance normal distributions with means 0.0 and 0.6, from uniform distributions in [0.0, 1.0] and [0.2, 1.2], and from exponential distributions with rate parameters 1.0 and 2.0.
As Table 4 shows, statistical power to detect a shift in two normal distributions corresponds to the weighted sum of the power of the unconditional use of Student’s t test and Mann-Whitney’s U test. When both samples must pass the preliminary test for normality (Strategy I), the weights correspond to (1 – α pre ) 2 and 1 – (1 – α pre ) 2 respectively, which is consistent with the rejection rate of the Shapiro-Wilk test under the normality assumption. For Strategy II, the weights roughly correspond to 1 – α pre and α pre respectively (a minimal deviation can be expected here because the residuals from the two samples are not completely independent). Similar results were observed for shifted uniform distributions and exponential distributions with different rate parameters: In both distributions, the overall power of the two-stage procedure seemed to lie in-between the power estimated for the unconditional t test and the U test.
The appropriateness of a statistical test, which depends on underlying distributional assumptions, is generally not a problem if the population distribution is known in advance. If the assumption of normality is known to be wrong, a nonparametric test may be used that does not require normally distributed data. Difficulties arise if the population distribution is unknown—which, unfortunately, is the most common scenario in medical research. Many statistical textbooks and articles state that assumptions should be checked before conducting statistical tests, and that tests should be chosen depending on whether the assumptions are met (e.g., [ 22 , 28 , 47 , 48 ]). Various options for testing assumptions are easily available and sometimes even automatically generated within the standard output of statistical software (e.g., see SAS or SPSS for the assumption of variance homogeneity for the t test; for a discussion see [ 42 – 45 ]). Similarly, methodological guidelines for clinical trials generally recommend checking for conditions underlying statistical methods. According to ICH E3, for example, when presenting the results of a statistical analysis, researchers should demonstrate that the data satisfied the crucial underlying assumptions of the statistical test used [ 49 ]. Although it is well-known that decision-making after inspection of sample data can lead to altered Type I and Type II error probabilities and sometimes to spurious rejection of the null hypothesis, researchers are often confused or unaware of the potential shortcomings of such two-stage procedures.
We demonstrated the dramatic effects of preliminary testing for normality on the conditional Type I error rate of the main test (see Tables 1 and 2 , and Figures 1 and 2 ). Most of these consequences were qualitatively similar for Strategy I (separate preliminary test for each sample) and Strategy II (preliminary test based on residuals), but quantitatively more pronounced for Strategy II than for Strategy I. On the one hand, the results replicated those found for the one-sample t test [ 41 ]. On the other hand, our study revealed interesting new findings: Preliminary testing not only affects the Type I error of the t test on samples from non-normal distributions but also the performance of Mann-Whitney’s U test for equally sized samples from uniform and normal distributions. Since we focused on a two-stage procedure assuming homogenous variances, it can be expected that an additional test for homogeneity of variances should lead to a further distortion of the conditional Type I error rates (e.g., [ 39 , 42 – 45 ]).
Detailed discussion on potential reasons for the detrimental effects of preliminary tests is provided elsewhere [ 30 , 41 , 50 ]; therefore, only a global argument is given here: Exponentially distributed variables follow an exponential distribution, and uniformly distributed variables follow a uniform distribution. This trivial statement holds, regardless of whether a preliminary test for normality is applied to the data or not. A sample or a pair of samples is not normally distributed just because the result of the Shapiro-Wilk test suggests it. From a formal perspective, a sample is a set of fixed ‘realizations’; it is not a random variable which could be said to follow some distribution. The preliminary test cannot alter this basic fact; it can only select samples which appear to be drawn from a normal distribution. If, however, the underlying population is exponential, the preliminary test selects samples that are not representative of the underlying population. Of course, the Type I error rates of hypotheses tests are strongly altered if they are based on unrepresentative samples. Similarly, if the underlying distribution is normal, the pretest will filter out samples that do not appear normal with probability α pre . These latter samples are again not representative for the underlying population, so that the Type I error of the subsequent nonparametric test will be equally affected.
In general, the problem is that the distribution of the test statistic of the test of interest depends on the outcome of the pretest. More precisely, errors occurring at the preliminary stage change the distribution of the test statistic at the second stage [ 38 ]. As can be seen in Tables 1 and 2 , the distortion of the Type I error observed for Strategy I and II is based on at least two different mechanisms. The first mechanism is related to the power of the Shapiro-Wilk test: For the exponential distribution, Strategy I considerably affects the t test, but Strategy II does so even more. As both tables show, distortion of the Type I error, if present, is most pronounced in large samples. In line with this result, Strategy II alters the conditional Type I error to a greater extent than Strategy I, probably because in Strategy II, the pretest is applied to the collapsed set of residuals, that is, the pretest is based on a sample twice the size of that used in Strategy I.
To illustrate the second mechanism, asymmetry, we consider the interesting special case of Strategy I applied to samples from uniform distribution. In Strategy I, Mann-Whitney’s U test was chosen if the pretest for normality failed in at least one sample. Large violations of the nominal significance level of Mann-Whitney’s U test were observed for small samples and numerically low significance levels for the pretest (23.3% for α pre = .005 and n = 10). At α pre = .005 and n = 10, the Shapiro-Wilk test has low power, so that only samples with extreme properties will be identified. In general, however, samples from the uniform distribution do not have extreme properties, such that, in most cases, only one member of the sample pair will be sufficiently extreme to be detected by the Shapiro-Wilk test. Consequently, pairs of samples are selected by the preliminary test for which one member is extreme and the other member is representative; the main significance test will then indicate that the samples differ indeed. For these pairs of samples, the Shapiro-Wilk test and the Mann–Whitney U test essentially yield the same result because they test similar hypotheses. In contrast, in Strategy II, the pretest selected pairs of samples for which the set of residuals (i.e., the two samples shifted over each other) appeared non-normal. This result mostly corresponds to the standard situation in nonparametric statistics, so that the conditional Type I error rate of Mann-Whitney’s U test applied to samples from uniform distribution was unaffected by the asymmetry mechanism.
On the one hand, our study showed that conditional Type I error rates may heavily deviate from the nominal significance level (Tables 1 and 2 ). On the other hand, direct assessment of the unconditional Type I error rate (Table 3 ) and power (Table 4 ) of the two-stage procedure suggests that the two-stage procedure as a whole has acceptable statistical properties. What might be the reason for this discrepancy? To assess the consequences of preliminary tests for the entire two-stage procedure, the power of the pretest needs to be taken into account,
with P(Type I error | Pretest n.s. ) denoting the conditional Type I error rate of the t test (Tables 1 and 2 left), P(Type I error | Pretest sig. ) denoting the conditional Type I error rate of the U test (Tables 1 and 2 right), and P(Pretest sig. ) and P(Pretest n.s. ) denoting the power and 1 – power of the pretest for normality. In Strategy I, P(Pretest sig. ) corresponds to the probability to reject normality for at least one of the two samples, whereas in Strategy II, it is the probability to reject the assumption of normality of the residuals from both samples.
For the t test, unacceptable rates of false decisions due to selection effects of the preliminary Shapiro-Wilk test occur for large samples and numerically high significance levels α pre (e.g., left column in Table 2 ). In these settings, however, the Shapiro-Wilk test detects deviations from normality with nearly 100% power, so that the Student’s t test is practically never used. Instead, the nonparametric test is used that seems to protect the Type I error for those samples. This pattern of results holds for both Strategy I and Strategy II. Conversely, it was demonstrated above that Mann-Whitney’s U test is biased for normally distributed data if the sample size is low and the preliminary significance level is strict (e.g., α pre = .005, right columns of Tables 1 or 2 ). For samples from normal distribution, however, deviation from normality is only rarely detected at α pre = .005, so that the consequences for the overall Type I error of the entire two-stage procedure are again very limited.
A similar argument holds for statistical power: For a given alternative, the overall power of the two-stage procedure corresponds, by construction, to the weighted sum of the conditional power of the t test and U test. When populations deviate only slightly from normality, the pretest for normality has low power, and the power of the two-stage procedure will tend towards the unconditional power of Student’s t test; this fact only does not hold in those rare cases in which the preliminary test indicates non-normality, so that the slightly less powerful Mann–Whitney U test is applied. When the populations deviate considerably from normality, the power of the Shapiro-Wilk test is high for both strategies, and the overall power of the two-stage procedure will tend towards the unconditional power of Mann-Whitney’s U test.
Finally, it should be emphasized that the conditional Type I error rates shown in Tables 1 and 2 correspond to the rather unlikely scenario in which researchers would continue sampling until the assumptions are met. In contrast, the unconditional Type I error and power of the two-stage procedure are most relevant because in practice, researchers do not continue sampling until they obtain normality. Researchers who do not know in advance whether the underlying population distribution is normal, usually base their decision on the samples obtained. If by chance a sample from a non-normal distribution happens to look normal, the researcher could falsely assume that the normality assumption holds. However, this chance is rather low because of the high power of the Shapiro-Wilk test, particularly for larger sample sizes.
From a formal perspective, preliminary testing for normality is incorrect and should therefore be avoided. Normality has to be established for the populations under consideration; if this is not possible, “support for the assumption of normality must come from extra-data sources” ([ 30 ], p. 7). For example, when planning a study, assumptions may be based on the results of earlier trials [ 21 ] or pilot studies [ 36 ]. Although often limited in size, pilot studies could serve to identify substantial deviations from normality. From a practical perspective, however, preliminary testing does not seem to cause much harm, at least for the cases we have investigated. The worst that can be said is that preliminary testing is unnecessary: For large samples, the t test has been shown to be robust in many situations [ 51 – 55 ] (see also Tables 1 and 2 of the present paper) and for small samples, the Shapiro-Wilk test lacks power to detect deviations from normality. If the application of the t test is doubtful, the unconditional use of nonparametric tests seems to be the best choice [ 56 ].
Altman DG: Statistics in medical journals. Stat Med. 1982, 1: 59-71. 10.1002/sim.4780010109.
Article CAS PubMed Google Scholar
Altman DG: Statistics in medical journals: Developments in the 1980s. Stat Med. 1991, 10: 1897-1913. 10.1002/sim.4780101206.
Altman DG: Statistics in medical journals: Some recent trends. Stat Med. 2000, 19: 3275-3289. 10.1002/1097-0258(20001215)19:23<3275::AID-SIM626>3.0.CO;2-M.
Glantz SA: Biostatistics: How to detect, correct and prevent errors in medical literature. Circulation. 1980, 61: 1-7. 10.1161/01.CIR.61.1.1.
Pocock SJ, Hughes MD, Lee RJ: Statistical problems in the reporting of clinical trials—A survey of three medical journals. N Engl J Med. 1987, 317: 426-432. 10.1056/NEJM198708133170706.
Altman DG: Poor-quality medical research: What can journals do?. JAMA. 2002, 287: 2765-2767. 10.1001/jama.287.21.2765.
Article PubMed Google Scholar
Strasak AM, Zaman Q, Marinell G, Pfeiffer KP, Ulmer H: The use of statistics in medical research: A comparison of The New England Journal of Medicine and Nature Medicine. Am Stat. 2007, 61: 47-55. 10.1198/000313007X170242.
Article Google Scholar
Fernandes-Taylor S, Hyun JH, Reeder RN, Harris AHS: Common statistical and research design problems in manuscripts submitted to high-impact medical journals. BMC Res Notes. 2011, 4: 304-10.1186/1756-0500-4-304.
Article PubMed PubMed Central Google Scholar
Olsen CH: Review of the use of statistics in Infection and Immunity. Infect Immun. 2003, 71: 6689-6692. 10.1128/IAI.71.12.6689-6692.2003.
Article CAS PubMed PubMed Central Google Scholar
Neville JA, Lang W, Fleischer AB: Errors in the Archives of Dermatology and the Journal of the American Academy of Dermatology from January through December 2003. Arch Dermatol. 2006, 142: 737-740. 10.1001/archderm.142.6.737.
Altman DG: Practical Statistics for Medical Research. 1991, Chapman and Hall, London
Google Scholar
Cressie N: Relaxing assumptions in the one sample t-test. Aust J Stat. 1980, 22: 143-153. 10.1111/j.1467-842X.1980.tb01161.x.
Ernst MD: Permutation methods: A basis for exact inference. Stat Sci. 2004, 19: 676-685. 10.1214/088342304000000396.
Wilcox RR: How many discoveries have been lost by ignoring modern statistical methods?. Am Psychol. 1998, 53: 300-314.
Micceri T: The unicorn, the normal curve, and other improbable creatures. Psychol Bull. 1989, 105: 156-166.
Kühnast C, Neuhäuser M: A note on the use of the non-parametric Wilcoxon-Mann–Whitney test in the analysis of medical studies. Ger Med Sci. 2008, 6: 2-5.
New England Journal of Medicine: Guidelines for manuscript submission. (Retrieved from http://www.nejm.org/page/author-center/manuscript-submission ); 2011
Altman DG, Gore SM, Gardner MJ, Pocock SJ: Statistics guidelines for contributors to medical journals. Br Med J. 1983, 286: 1489-1493. 10.1136/bmj.286.6376.1489.
Article CAS Google Scholar
Moher D, Schulz KF, Altman DG for the CONSORT Group: The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomized trials. Ann Intern Med. 2001, 134: 657-662.
Vickers AJ: Parametric versus non-parametric statistics in the analysis of randomized trials with non-normally distributed data. BMC Med Res Meth. 2005, 5: 35-10.1186/1471-2288-5-35.
ICH E9: Statistical principles for clinical trials. 1998, International Conference on Harmonisation, London, UK
Gebski VJ, Keech AC: Statistical methods in clinical trials. Med J Aust. 2003, 178: 182-184.
PubMed Google Scholar
Livingston EH: Who was Student and why do we care so much about his t-test?. J Surg Res. 2004, 118: 58-65. 10.1016/j.jss.2004.02.003.
Shuster J: Diagnostics for assumptions in moderate to large simple trials: do they really help?. Stat Med. 2005, 24: 2431-2438. 10.1002/sim.2175.
Meredith WM, Frederiksen CH, McLaughlin DH: Statistics and data analysis. Annu Rev Psychol. 1974, 25: 453-505. 10.1146/annurev.ps.25.020174.002321.
Bancroft TA: On biases in estimation due to the use of preliminary tests of significance. Ann Math Statist. 1944, 15: 190-204. 10.1214/aoms/1177731284.
Paull AE: On a preliminary test for pooling mean squares in the analysis of variance. Ann Math Statist. 1950, 21: 539-556. 10.1214/aoms/1177729750.
Gurland J, McCullough RS: Testing equality of means after a preliminary test of equality of variances. Biometrika. 1962, 49: 403-417.
Freidlin B, Miao W, Gastwirth JL: On the use of the Shapiro-Wilk test in two-stage adaptive inference for paired data from moderate to very heavy tailed distributions. Biom J. 2003, 45: 887-900. 10.1002/bimj.200390056.
Easterling RG, Anderson HE: The effect of preliminary normality goodness of fit tests on subsequent inference. J Stat Comput Simul. 1978, 8: 1-11. 10.1080/00949657808810243.
Pappas PA, DePuy V: An overview of non-parametric tests in SAS: When, why and how. Proceeding of the. SouthEast SAS Users Group Conference (SESUG 2004): Paper TU04. 2004, Miami, FL, SouthEast SAS Users Group, 1-5.
Bogaty P, Dumont S, O’Hara G, Boyer L, Auclair L, Jobin J, Boudreault J: Randomized trial of a noninvasive strategy to reduce hospital stay for patients with low-risk myocardial infarction. J Am Coll Cardiol. 2001, 37: 1289-1296. 10.1016/S0735-1097(01)01131-7.
Holman AJ, Myers RR: A randomized, double-blind, placebo-controlled trial of pramipexole, a dopamine agonist, in patients with fibromyalgia receiving concomitant medications. Arthritis Rheum. 2005, 53: 2495-2505.
Lawson ML, Kirk S, Mitchell T, Chen MK, Loux TJ, Daniels SR, Harmon CM, Clements RH, Garcia VF, Inge TH: One-year outcomes of Roux-en-Y gastric bypass for morbidly obese adolescents: a multicenter study from the Pediatric Bariatric Study Group. J Pediatr Surg. 2006, 41: 137-143. 10.1016/j.jpedsurg.2005.10.017.
Norager CB, Jensen MB, Madsen MR, Qvist N, Laurberg S: Effect of darbepoetin alfa on physical function in patients undergoing surgery for colorectal cancer. Oncology. 2006, 71: 212-220. 10.1159/000106071.
Shuster J: Student t-tests for potentially abnormal data. Stat Med. 2009, 28: 2170-2184. 10.1002/sim.3581.
Schoder V, Himmelmann A, Wilhelm KP: Preliminary testing for normality: Some statistical aspects of a common concept. Clin Exp Dermatol. 2006, 31: 757-761. 10.1111/j.1365-2230.2006.02206.x.
Wells CS, Hintze JM: Dealing with assumptions underlying statistical tests. Psychol Sch. 2007, 44: 495-502. 10.1002/pits.20241.
Rasch D, Kubinger KD, Moder K: The two-sample t test: pretesting its assumptions does not pay. Stat Papers. 2011, 52: 219-231. 10.1007/s00362-009-0224-x.
Zimmerman DW: A simple and effective decision rule for choosing a significance test to protect against non-normality. Br J Math Stat Psychol. 2011, 64: 388-409. 10.1348/000711010X524739.
Schucany WR, Ng HKT: Preliminary goodness-of-fit tests for normality do not validate the one-sample student t. Commun Stat Theory Methods. 2006, 35: 2275-2286. 10.1080/03610920600853308.
Zimmerman DW: Some properties on preliminary tests of equality of variances in the two-sample location problem. J Gen Psychol. 1996, 123: 217-231. 10.1080/00221309.1996.9921274.
Zimmerman DW: Invalidation of parametric and nonparametric statistical tests by concurrent violation of two assumptions. J Exp Educ. 1998, 67: 55-68. 10.1080/00220979809598344.
Zimmerman DW: Conditional probabilities of rejecting H0 by pooled and separate-variances t tests given heterogeneity of sample variances. Commun Stat Simul Comput. 2004, 33: 69-81. 10.1081/SAC-120028434.
Zimmerman DW: A note on preliminary tests of equality of variances. Br J Math Stat Psychol. 2004, 57: 173-181. 10.1348/000711004849222.
R Development Core Team: R: A language and environment for statistical computing. 2011, R Foundation for Statistical Computing, Vienna, Austria
Lee AFS: Student’s t statistics. Encyclopedia of Biostatistics. Edited by: Armitage P, Colton T. 2005, Wiley, New York, 2
Rosner B: Fundamentals of Biostatistics. 1990, PWS-Kent, Boston, 3
ICH E3: Structure and content of clinical study reports. 1995, International Conference on Harmonisation, London, UK
Rochon J, Kieser M: A closer look at the effect of preliminary goodness-of-fit testing for normality for the one-sample t-test. Br J Math Stat Psychol. 2011, 64: 410-426. 10.1348/2044-8317.002003.
Armitage P, Berry G, Matthews JNS: Statistical Methods in Medical Research. 2002, Blackwell, Malden, MA
Book Google Scholar
Boneau CA: The effects of violations underlying the t test. Psychol Bull. 1960, 57: 49-64.
Box GEP: Non-normality and tests of variances. Biometrika. 1953, 40: 318-335.
Rasch D, Guiard V: The robustness of parametric statistical methods. Psychology Science. 2004, 46: 175-208.
Sullivan LM, D’Agostino RB: Robustness of the t test applied to data distorted from normality by floor effects. J Dent Res. 1992, 71: 1938-1943. 10.1177/00220345920710121601.
Akritas MG, Arnold SF, Brunner E: Nonparametric hypotheses and rank statistics for unbalanced factorial designs. J Am Stat Assoc. 1997, 92: 258-265. 10.1080/01621459.1997.10473623.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/12/81/prepub
Download references
Authors and affiliations.
Institute of Medical Biometry and Informatics, University of Heidelberg, Im Neuenheimer Feld 305, 69120, Heidelberg, Germany
Justine Rochon, Matthias Gondan & Meinhard Kieser
You can also search for this author in PubMed Google Scholar
Correspondence to Justine Rochon .
Competing interests.
The authors declare that they have no competing interests.
All authors jointly designed the study. JR carried out the simulations, MG assisted in the simulations and creation of the figures. JR and MG drafted the manuscript. MK planned the study and finalized the manuscript. All authors read and approved the final manuscript.
Additional file 1 : two-stage procedure including a preliminary test for normality. (pdf 72 kb), authors’ original submitted files for images.
Below are the links to the authors’ original submitted files for images.
Authors’ original file for figure 2, rights and permissions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License( http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Reprints and permissions
Cite this article.
Rochon, J., Gondan, M. & Kieser, M. To test or not to test: Preliminary assessment of normality when comparing two independent samples. BMC Med Res Methodol 12 , 81 (2012). https://doi.org/10.1186/1471-2288-12-81
Download citation
Received : 27 December 2011
Accepted : 31 May 2012
Published : 19 June 2012
DOI : https://doi.org/10.1186/1471-2288-12-81
Anyone you share the following link with will be able to read this content:
Sorry, a shareable link is not currently available for this article.
Provided by the Springer Nature SharedIt content-sharing initiative
ISSN: 1471-2288
Do you need support in running a pricing or product study? We can help you with agile consumer research and conjoint analysis.
Conjointly offers a great survey tool with multiple question types, randomisation blocks, and multilingual support. The Basic tier is always free.
Fully-functional online survey tool with various question types, logic, randomisation, and reporting for unlimited number of surveys.
Completely free for academics and students .
The t-test assesses whether the means of two groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two groups, and especially appropriate as the analysis for the posttest-only two-group randomized experimental design .
Figure 1 shows the distributions for the treated (blue) and control (green) groups in a study. Actually, the figure shows the idealized distribution – the actual distribution would usually be depicted with a histogram or bar graph . The figure indicates where the control and treatment group means are located. The question the t-test addresses is whether the means are statistically different.
What does it mean to say that the averages for two groups are statistically different? Consider the three situations shown in Figure 2. The first thing to notice about the three situations is that the difference between the means is the same in all three . But, you should also notice that the three situations don’t look the same – they tell very different stories. The top example shows a case with moderate variability of scores within each group. The second situation shows the high variability case. the third shows the case with low variability. Clearly, we would conclude that the two groups appear most different or distinct in the bottom or low-variability case. Why? Because there is relatively little overlap between the two bell-shaped curves. In the high variability case, the group difference appears least striking because the two bell-shaped distributions overlap so much.
This leads us to a very important conclusion: when we are looking at the differences between scores for two groups, we have to judge the difference between their means relative to the spread or variability of their scores. The t-test does just this.
The formula for the t-test is a ratio. The top part of the ratio is just the difference between the two means or averages. The bottom part is a measure of the variability or dispersion of the scores. This formula is essentially another example of the signal-to-noise metaphor in research: the difference between the means is the signal that, in this case, we think our program or treatment introduced into the data; the bottom part of the formula is a measure of variability that is essentially noise that may make it harder to see the group difference. Figure 3 shows the formula for the t-test and how the numerator and denominator are related to the distributions.
The top part of the formula is easy to compute – just find the difference between the means. The bottom part is called the standard error of the difference . To compute it, we take the variance for each group and divide it by the number of people in that group. We add these two values and then take their square root. The specific formula for the standard error of the difference between the means is:
Remember, that the variance is simply the square of the standard deviation .
The final formula for the t-test is:
The t -value will be positive if the first mean is larger than the second and negative if it is smaller. Once you compute the t -value you have to look it up in a table of significance to test whether the ratio is large enough to say that the difference between the groups is not likely to have been a chance finding. To test the significance, you need to set a risk level (called the alpha level ). In most social research, the “rule of thumb” is to set the alpha level at .05 . This means that five times out of a hundred you would find a statistically significant difference between the means even if there was none (i.e. by “chance”). You also need to determine the degrees of freedom (df) for the test. In the t-test , the degrees of freedom is the sum of the persons in both groups minus 2 . Given the alpha level, the df, and the t -value, you can look the t -value up in a standard table of significance (available as an appendix in the back of most statistics texts) to determine whether the t -value is large enough to be significant. If it is, you can conclude that the difference between the means for the two groups is different (even given the variability). Fortunately, statistical computer programs routinely print the significance test results and save you the trouble of looking them up in a table.
The t-test, one-way Analysis of Variance (ANOVA) and a form of regression analysis are mathematically equivalent (see the statistical analysis of the posttest-only randomized experimental design ) and would yield identical results.
Conjointly uses essential cookies to make our site work. We also use additional cookies in order to understand the usage of the site, gather audience analytics, and for remarketing purposes.
For more information on Conjointly's use of cookies, please read our Cookie Policy .
I am new to conjointly, i am already using conjointly.
The t-test is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis. It can be used to determine if two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.
T-test uses means and standard deviations of two samples to make a comparison. The formula for T-test is given below:
\begin{array}{l}\qquad t=\frac{\bar{X}_{1}-\bar{X}_{2}}{s_{\bar{\Delta}}} \\ \text { where } \\ \qquad s_{\bar{\Delta}}=\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}} \\ \end{array}
Where, \(\begin{array}{l}\overline{x}\end{array} \) = Mean of first set of values \(\begin{array}{l}\overline{x}_{2}\end{array} \) = Mean of second set of values \(\begin{array}{l}S_{1}\end{array} \) = Standard deviation of first set of values \(\begin{array}{l}S_{2}\end{array} \) = Standard deviation of second set of values \(\begin{array}{l}n_{1}\end{array} \) = Total number of values in first set \(\begin{array}{l}n_{2}\end{array} \) = Total number of values in second set.
The formula for standard deviation is given by:
Where, x = Values given \(\begin{array}{l}\overline{x}\end{array} \) = Mean n = Total number of values.
Question 1: Find the t-test value for the following two sets of values: 7, 2, 9, 8 and 1, 2, 3, 4?
Formula for standard deviation: \(\begin{array}{l}S=\sqrt{\frac{\sum\left(x-\overline{x}\right)^{2}}{n-1}}\end{array} \)
Number of terms in first set: \(\begin{array}{l}n_{1}\end{array} \) = 4
Mean for first set of data: \(\begin{array}{l}\overline{x}_{1}\end{array} \) = 6.5
Construct the following table for standard deviation:
7 | 0.5 | 0.25 |
2 | -4.5 | 20.25 |
9 | 2.5 | 6.25 |
8 | 1.5 | 2.25 |
Standard deviation for the first set of data: S 1 = 3.11
Number of terms in second set: n 2 = 4
1 | -1.5 | 2.25 |
2 | -0.5 | 0.25 |
3 | 0.5 | 0.25 |
4 | 1.5 | 2.25 |
Standard deviation for first set of data: \(\begin{array}{l}S_{2}\end{array} \) = 1.29
Formula for t-test value:
t = 2.3764 = 2.36 (approx)
More topics in T Test Formula | |
FORMULAS Related Links | |
Your Mobile number and Email id will not be published. Required fields are marked *
Request OTP on Voice Call
Post My Comment
Register with byju's & watch live videos.
Our websites may use cookies to personalize and enhance your experience. By continuing without changing your cookie settings, you agree to this collection. For more information, please see our University Websites Privacy Notice .
Neag School of Education
An introduction to statistics usually covers t tests, anovas, and chi-square. for this course we will concentrate on t tests, although background information will be provided on anovas and chi-square., a powerpoint presentation on t tests has been created for your use..
The t test is one type of inferential statistics. It is used to determine whether there is a significant difference between the means of two groups. With all inferential statistics, we assume the dependent variable fits a normal distribution . When we assume a normal distribution exists, we can identify the probability of a particular outcome. We specify the level of probability (alpha level, level of significance, p ) we are willing to accept before we collect data ( p < .05 is a common value that is used). After we collect data we calculate a test statistic with a formula. We compare our test statistic with a critical value found on a table to see if our results fall within the acceptable level of probability. Modern computer programs calculate the test statistic for us and also provide the exact probability of obtaining that test statistic with the number of subjects we have.
When the difference between two population averages is being investigated, a t test is used. In other words, a t test is used when we wish to compare two means (the scores must be measured on an interval or ratio measurement scale ). We would use a t test if we wished to compare the reading achievement of boys and girls. With a t test, we have one independent variable and one dependent variable. The independent variable (gender in this case) can only have two levels (male and female). The dependent variable would be reading achievement. If the independent had more than two levels, then we would use a one-way analysis of variance (ANOVA).
The test statistic that a t test produces is a t -value. Conceptually, t -values are an extension of z -scores. In a way, the t -value represents how many standard units the means of the two groups are apart.
With a t tes t, the researcher wants to state with some degree of confidence that the obtained difference between the means of the sample groups is too great to be a chance event and that some difference also exists in the population from which the sample was drawn. In other words, the difference that we might find between the boys’ and girls’ reading achievement in our sample might have occurred by chance, or it might exist in the population. If our t test produces a t -value that results in a probability of .01, we say that the likelihood of getting the difference we found by chance would be 1 in a 100 times. We could say that it is unlikely that our results occurred by chance and the difference we found in the sample probably exists in the populations from which it was drawn.
Assumptions underlying the t test.
This is concerned with the difference between the average scores of a single sample of individuals who are assessed at two different times (such as before treatment and after treatment). It can also compare average scores of samples of individuals who are paired in some way (such as siblings, mothers, daughters, persons who are matched in terms of a particular characteristics).
Note: The F-Max test can be substituted for the Levene test. The t test Excel spreadsheet that I created for our class uses the F -Max.
A bit of history… William Sealy Gosset (1905) first published a t-test. He worked at the Guiness Brewery in Dublin and published under the name Student. The test was called Studen t Test (later shortened to t test).
t tests can be easily computed with the Excel or SPSS computer application. I have created an Excel Spreadsheet that does a very nice job of calculating t values and other pertinent information.
IMAGES
COMMENTS
A t test is a statistical test used to compare the means of two groups. The type of t test you use depends on what you want to find out. ... (a.k.a. the Student's t-test) is shown below. In this formula, ... methodology, or research bias, make sure to check out some of our other articles with explanations and examples. Statistics.
A paired two-sample t-test can be used to capture the dependence of measurements between the two groups. These variations of the student's t-test use observed or collected data to calculate a test statistic, which can then be used to calculate a p-value. Often misinterpreted, the p-value is equal to the probability of collecting data that is at ...
Example question: Calculate a paired t test by hand for the following data: Step 1: Subtract each Y score from each X score. Step 2: Add up all of the values from Step 1 then set this number aside for a moment. Step 3: Square the differences from Step 1. Step 4: Add up all of the squared differences from Step 3.
The t-test is a test in statistics that is used for testing hypotheses regarding the mean of a small sample taken population when the standard deviation of the population is not known. The t-test is used to determine if there is a significant difference between the means of two groups. The t-test is used for hypothesis testing to determine ...
We'll use a two-sample t test to evaluate if the difference between the two group means is statistically significant. The t test output is below. In the output, you can see that the treatment group (Sample 1) has a mean of 109 while the control group's (Sample 2) average is 100. The p-value for the difference between the groups is 0.112.
Student's t-test is a statistical test used to test whether the difference between the response of two groups is statistically significant or not. ... "The T-Test", Research Methods Knowledge Base, conjoint.ly; on YouTube by Mark Thoma This page was last edited on 25 August 2024, at 19:18 (UTC). Text is available under the Creative ...
A t test is a statistical technique used to quantify the difference between the mean (average value) of a variable from up to two samples (datasets). The variable must be numeric. Some examples are height, gross income, and amount of weight lost on a particular diet. A t test tells you if the difference you observe is "surprising" based on ...
8. Use a table of critical t-values (see the one at the back of this document) The critical t-value at the p = .05 significance level, for a two-tailed test, is: 2.262. Our t-value (from the experiment) was: 2.183. In order for this to be significant, it must be LARGER than the critical t-value derived from the table.
Student's t test (t test), analysis of variance (ANOVA), and analysis of covariance (ANCOVA) are statistical methods used in the testing of hypothesis for comparison of means between the groups.The Student's t test is used to compare the means between two groups, whereas ANOVA is used to compare the means among three or more groups. In ANOVA, first gets a common P value.
For example, suppose a researcher wishes to test the hypothesis that a sample of size n = 25 with mean x = 79 and standard deviation s = 10 was drawn at random from a population with mean μ = 75 and unknown standard deviation. Using the formula for the t-statistic, the calculated t equals 2. For a two-sided test at a common level of significance α = 0.05, the critical values from the t ...
Part IV is about reporting t-test results in both text and table formats and concludes with a guide to interpreting confidence intervals. Keywords: Educational research, Significance testing, Statistics, t-test 1. Introduction In 1908 William Sealy Gosset, an Englishman publishing under the pseudonym Student developed the t-test and t distribution.
Aug 5, 2022. 6. Photo by Andrew George on Unsplash. Student's t-tests are commonly used in inferential statistics for testing a hypothesis on the basis of a difference between sample means. However, people often misinterpret the results of t-tests, which leads to false research findings and a lack of reproducibility of studies.
Chapter 6The t-test and Basic Inference PrinciplesThe t-test is used as an examp. e of the basic principles of statistical inference.One of the simplest situations for which we might design an experiment is the case of a nominal two-level expla. atory variable and a quantitative.
In statistical methods, t-test, also known as Student 's t-test, is widely used t o compare groups' means for a particular vari-. able. The test replaces z-test whenever the standar d ...
Student's t-test is a method of testing hypotheses about the mean of a small sample drawn from a normally distributed population when the population standard deviation is unknown.In 1908 William Sealy Gosset, an Englishman publishing under the pseudonym Student, developed the t-test.This article discusses the types of T test and shows a simple way of doing a T test.
A t -test (also known as Student's t -test) is a tool for evaluating the means of one or two populations using hypothesis testing. A t-test may be used to evaluate whether a single group differs from a known value (a one-sample t-test), whether two groups differ from each other (an independent two-sample t-test), or whether there is a ...
Parametric methods refer to a statistical technique in which one defines the probability distribution of probability variables and makes inferences about the parameters of the distribution. In cases in which the probability distribution cannot be defined, nonparametric methods are employed. T tests are a type of parametric method; they can be ...
Student's two-sample t test is generally used for comparing the means of two independent samples, for example, two treatment arms. ... Statistical Methods in Medical Research. 2002, Blackwell, Malden, MA. Book Google Scholar Boneau CA: The effects of violations underlying the t test. Psychol Bull. 1960, 57: 49-64. Article CAS ...
In the t-test, the degrees of freedom is the sum of the persons in both groups minus 2. Given the alpha level, the df, and the t-value, you can look the t-value up in a standard table of significance (available as an appendix in the back of most statistics texts) to determine whether the t-value is large enough to be significant. If it is, you ...
Product: Sage Research Methods Video: Quantitative and Mixed Methods; Type of Content: Tutorial Title: Student's T-Test Publisher: Muhammad Shakil Ahmad Series: Statistics for Clinical Research - A Practical Guide; Publication year: 2022; Online pub date: November 28, 2023
The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis. It can be used to determine if two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.
The t test is one type of inferential statistics. It is used to determine whether there is a significant difference between the means of two groups. With all inferential statistics, we assume the dependent variable fits a normal distribution. When we assume a normal distribution exists, we can identify the probability of a particular outcome.