Note: This is an introduction, and I deliberately avoid getting into technical details. Entire books have been written about the details, and, if you use statistics, you will likely have your own sources.

Statistical significance is used widely – in all the sciences, in education research, in medicine, and, to a lesser extent, in the humanities and even in research in the arts. In nearly all these fields, it is abused – it answers the wrong question. The questions we are really interested in are, nearly always, not about statistical significance, but about effect size.

What is statistical significance? It is related to the widespread practice of null hypothesis significance testing. We first set up what is called a null hypothesis. This is usually that there is nothing going on. Then we collect data. If the data are unlikely to have arisen under the null hypothesis, we say they are statistically significant.

For example, suppose we think that Black people are more likely to have voted for Obama than White people. We first set up a null hypothesis – that the two groups are equally likely to have voted for Obama. Then we collect data, asking a bunch of people whether they are Black or White, and whether they voted for Obama, McCain, or someone else. We can then use a statistical procedure to see how likely these results would be if the null hypothesis were true. If they are very unlikely, we reject the null and conclude that the two groups were not equally likely to have voted for Obama. By convention, “very unlikely” nearly always means that the chance of these results, or more extreme results, is less than 0.05.

(This simplifies things quite a bit, but gives a good general picture.)
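For readers who like to see the arithmetic, here is a minimal sketch in Python of one such statistical procedure, a two-proportion z-test. The function name and the counts are my own inventions for illustration, and the z-test is only one of several procedures that could be used here.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test of the null hypothesis that two proportions are equal."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under the null, both groups share one proportion, estimated by pooling.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Chance of a result at least this extreme, in either direction, under the null.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented counts: 90 of 100 people in one group and 45 of 100 in the other
# say they voted for the candidate.
z, p_value = two_proportion_z_test(90, 100, 45, 100)
print(f"z = {z:.2f}, p = {p_value:.2g}")
print("reject the null" if p_value < 0.05 else "do not reject the null")
```

Notice what the output does and does not tell us: it says the data would be very unlikely if the two groups were equally likely to vote for the candidate, but it does not directly say how big the difference is.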

Similar procedures can be followed in all sorts of cases:

Does a drug cure a disease?

Does early education improve later grades?

Is the chance of getting HIV related to where you live?

And so on, and so on.

What’s wrong with this procedure?

The main problem is that it answers the wrong question. We are almost never interested in the question “How likely are these results if the null hypothesis is true?” Rather, we are interested in the question “How big an effect is there?” How many more people will get better with the drug than without it? How much better are high school grades when kids have had early education than when they have not? How much does HIV prevalence vary with location?

Those questions are answered by looking at measures of effect size, not statistical significance. A measure of effect size gives our best guess at how big an effect there is. It yields answers such as “People who take this drug are half as likely to die as people who do not” or “Kids who have early education have HS grades that are 10 points better than those who do not” or “People who live in New York City are half as likely to get HIV as people who live in South Africa” (I made all those numbers up – my point is to give you an idea of what the answers would be like).

Once we have answered that question, we should also be interested in how good our guess is. A statement like “People who take this drug are half as likely to die as people who do not, but they could easily be one tenth as likely, or twice as likely” has very different implications from a statement like “People who take this drug are half as likely to die as people who do not, but they might be 0.4 times as likely or 0.6 times as likely”. Questions like this are answered with confidence intervals, not tests of significance.
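To make the difference concrete, here is a rough sketch in Python of how a risk ratio and an approximate 95% confidence interval might be computed. The counts are invented, the function name risk_ratio_with_ci is my own, and the interval uses the common log-scale normal approximation, which is only one of several ways to build such an interval.

```python
import math

def risk_ratio_with_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Risk ratio (treated vs. control) with an approximate 95% confidence interval."""
    risk_ratio = (events_treated / n_treated) / (events_control / n_control)
    # Standard error of log(risk ratio), using the usual large-sample approximation.
    se_log = math.sqrt(1 / events_treated - 1 / n_treated
                       + 1 / events_control - 1 / n_control)
    lower = math.exp(math.log(risk_ratio) - z * se_log)
    upper = math.exp(math.log(risk_ratio) + z * se_log)
    return risk_ratio, lower, upper

# Invented counts: deaths among people who took the drug vs. those who did not.
small_study = risk_ratio_with_ci(5, 50, 10, 50)
large_study = risk_ratio_with_ci(500, 5000, 1000, 5000)

for label, (rr, lower, upper) in [("small", small_study), ("large", large_study)]:
    print(f"{label} study: risk ratio {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Both made-up studies give the same best guess, a risk ratio of one half. The small one has an interval running from roughly 0.18 to 1.36, so the drug might even be harmful; the large one runs from roughly 0.45 to 0.55.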

Another problem is that statistical significance depends not just on effect size, but also on sample size. If the sample is very large, almost any effect will be significant, even if it is of no practical importance; if the sample is very small, almost no effect will be significant, even if it is quite large.
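A small sketch in Python shows the dependence on sample size. It assumes two equal-sized groups and uses a simple normal-approximation test for a difference between two proportions; the rates, sample sizes, and function name are all made up for illustration, and the approximation is crude for the very small sample.

```python
import math

def p_value_for_difference(rate_a, rate_b, n_per_group):
    """Two-sided p-value for a difference between two observed proportions."""
    # With equal group sizes, the pooled rate is just the mean of the two rates.
    pooled = (rate_a + rate_b) / 2
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (rate_a - rate_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

# The same tiny effect (51% vs. 50%) at two very different sample sizes.
tiny_effect = {n: p_value_for_difference(0.51, 0.50, n) for n in (100, 100_000)}

# A big effect (70% vs. 40%) with only 10 people per group.
big_effect_small_sample = p_value_for_difference(0.70, 0.40, 10)

for n, p in tiny_effect.items():
    print(f"51% vs. 50%, {n} per group: p = {p:.2g}")
print(f"70% vs. 40%, 10 per group: p = {big_effect_small_sample:.2g}")
```

The one-point difference is nowhere near significant with 100 people per group and highly significant with 100,000 per group, while the thirty-point difference in the tiny sample falls short of the 0.05 cutoff.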

How does this affect you?

Well, for instance, if a drug is tested on a very large number of people (say, 10,000), then even a tiny difference can be significant. Suppose there is a drug for curing toenail fungus. The company making it tests it on 10,000 people. They find that it reduces toenail fungus by 1%, and the result is statistically significant. They report “Our drug leads to significant reduction in toenail fungus”. But the drug will cost money, and it might have side effects. Would you take a drug if you knew it would reduce toenail fungus by 1% or so? Perhaps not.

Or suppose a company reports that enrolling in their program leads to significantly higher college grades. But it costs $10,000 and requires a considerable investment of time. How much higher are those grades? An increase in GPA of 0.1? Would that be worth it? Or is it an increase of a whole letter grade?

Similar problems occur whenever you see a report of statistical significance without a measure of effect size and a confidence interval around it.

For further reading: there is a lot of literature on this topic. As a starter, see:

Two books:

The Cult of Statistical Significance, by Deirdre McCloskey and Stephen Ziliak

and

Regression Modeling Strategies, by Frank Harrell.

and two articles:

Paul Meehl and the evolution of statistical methods in psychology, by James Steiger in Applied and Preventive Psychology, volume 11; published in 2004.

Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology, by Paul Meehl in the Journal of Consulting and Clinical Psychology, volume 46; published in 1978.

And further references cited in those works.