Relationship between p-values and Confidence Intervals

Are there really lies, damned lies and statistics? You don’t need to get bogged down in mathematics to answer that. You just need to understand the logic behind probability tests, p-values and Confidence Intervals.

All conventional statistical tests of probability take the form of an if–then statement.

An if–then statement takes the form:

If x is true, then y should happen.

The “if” is one of the assumptions underlying the test. All statistical tests of probability rely on assumptions. A common assumption is that the different factors that may influence the result act independently of one another. These assumptions are not always explicitly acknowledged. They tend to be ignored by those who use statistics

“like a drunken man leaning on a lamppost, for support rather than illumination”.

The underlying assumptions must be valid; otherwise, the statistical test is not. So you should never believe the results of a statistical test without knowing what assumptions were made in applying it and checking whether they really hold for the data.
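To make the independence assumption mentioned above concrete, here is a minimal simulation sketch (hypothetical data, assuming the numpy and scipy libraries, with an ordinary two-sample t-test standing in for whatever test a real study might use). Each patient is measured twice, and the duplicate measurements are naively fed to the test as if they were extra, independent patients; the test then declares “significance” far more often than its nominal 5% error rate.

```python
# A minimal sketch (hypothetical data, assuming numpy and scipy): an ordinary
# two-sample t-test assumes independent observations. Here each patient is
# measured twice, and the duplicates are naively treated as extra independent
# patients, so the test sees twice as much "information" as really exists and
# its false positive rate climbs well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials = 5_000
false_positives = 0

for _ in range(n_trials):
    control = rng.normal(size=15)   # 15 patients per group,
    treated = rng.normal(size=15)   # no real difference between groups
    # Violate the independence assumption: count every measurement twice.
    _, p = stats.ttest_ind(np.repeat(control, 2), np.repeat(treated, 2))
    if p < 0.05:
        false_positives += 1

print(f"Nominal 5% test, actual false positive rate: {false_positives / n_trials:.3f}")
```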

It is technically incorrect to say that “you can prove anything with statistics”. Statistics don’t lie; people lie. Mostly, it comes down to a question of trust. In fact, you can never prove anything to be true with statistics. You can only reach a known level of probability, and even that known level of probability rests on the assumptions that always have to be made in designing the hypothesis test.

A p-value is the probability that results at least as striking as ours would have arisen by chance alone if there were, in fact, no real effect. A conventional standard is to accept a p-value of 0.05 as statistically significant. Using such a cut-off means we accept that, when there really is no association, we will conclude that there is one 5% of the time. Put more simply, of every twenty studies testing treatments that in reality do nothing, about one will report a positive association at the p = 0.05 level. Being wrong in this way is known as a Type I error, or a false positive finding.
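As a rough illustration of that 5% error rate, here is a minimal sketch (simulated data, assuming numpy and scipy, with a two-sample t-test as the example test): many trials of a treatment that does nothing are each tested at the p < 0.05 cut-off, and about one in twenty comes out “significant” anyway.

```python
# A minimal sketch (simulated data, assuming numpy and scipy): 10,000 trials of
# a treatment that genuinely does nothing, each tested with a two-sample t-test
# at the p < 0.05 cut-off. About 5% come out "significant" by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_patients = 10_000, 30

# Control and treated outcomes drawn from the same population: no real effect.
control = rng.normal(size=(n_trials, n_patients))
treated = rng.normal(size=(n_trials, n_patients))

# One t-test per simulated trial (axis=1 tests row by row).
p_values = stats.ttest_ind(control, treated, axis=1).pvalue

print(f"Share of null trials declared significant: {np.mean(p_values < 0.05):.3f}")
```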

There may be instances when we wish to be more certain. Choosing a smaller cut-off of 0.01 as statistically significant means we will make this kind of error only one time in a hundred; with 0.001, one time in a thousand; and with 0.0001, one in ten thousand. This can go on indefinitely, but we can never be entirely sure that the results have not arisen by chance. The probability of any single ticket winning the UK lottery jackpot is around 0.00000007 (one in fourteen million), yet the jackpot is won by chance most weeks, simply because millions of tickets are played.

The medical literature is full of Type I statistical errors. For example, if a busy medical journal published 100 trials of treatments that were in reality ineffective, we would expect about five of them to report a positive finding at the p = 0.05 level purely through Type I error. And we wouldn’t know which were the wrong ones. We might have an idea that something doesn’t sound very plausible, but the only way we could really find out would be when the study was replicated and the positive finding was not repeated. Even with a pair of positive studies at the p = 0.05 level, there is still a chance that we are observing a chance effect. Provided the two studies are independent, that probability is the product of the two p-values: 0.05 times 0.05 gives 0.0025, a one in four hundred chance. One in four hundred events do happen quite often.
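The arithmetic of those last few sentences, as a tiny sketch in Python (the 0.05 values are the significance thresholds, and the calculation assumes the two studies really are independent):

```python
# A minimal sketch of the arithmetic, assuming two genuinely independent
# studies of a treatment that in reality does nothing.
alpha = 0.05
both_false_positive = alpha * alpha    # both clear the p < 0.05 bar by luck alone
print(both_false_positive)             # 0.0025
print(round(1 / both_false_positive))  # 400 -> one chance in four hundred
```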

When using the results of trials to help guide clinical practice, it is generally much more useful to know how much of an effect the treatment had. This is known as the Effect Size (ES). The measured effect size comes directly from the trial data and is known. But it would not be exactly the same if we repeated the study, nor would it have been exactly the same if we had happened to recruit slightly different patients from those we actually did. We therefore also need an estimate of how accurately the study measured the effect size.
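A minimal sketch of that idea (hypothetical, simulated trial data, assuming numpy; the effect size here is taken simply as the difference in mean outcome between groups): each re-run of the same trial design gives a slightly different measured effect size, which is why a point estimate alone is not enough.

```python
# A minimal sketch (hypothetical trial, assuming numpy): the measured effect
# size is taken here as the difference in mean outcome between treated and
# control groups, and it shifts a little each time a fresh batch of patients
# is drawn from the same underlying population.
import numpy as np

rng = np.random.default_rng(2)

def run_trial(true_effect=2.0, n=40):
    control = rng.normal(loc=10.0, scale=4.0, size=n)
    treated = rng.normal(loc=10.0 + true_effect, scale=4.0, size=n)
    return treated.mean() - control.mean()   # measured effect size

for i in range(3):
    print(f"Repetition {i + 1}: measured effect size = {run_trial():.2f}")
# Same design, same true effect, slightly different answer each time: hence the
# need for an estimate of precision as well as the point value.
```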

The spread of plausible values of the effect size, up to a chosen level of probability, is known as the Confidence Interval (CI). The 95% CI is used most often, by convention, and is analogous to the p-value cut-off of 0.05: roughly speaking, there is a one in twenty chance that the real effect size is greater than the upper limit, or less than the lower limit, of the interval.

A 99% confidence interval means that, if the study were repeated many times, only about one in a hundred of the intervals calculated in this way would miss the real effect size. Naturally, a 99% confidence interval is wider than a 95% confidence interval for the same data. When we have two sets of observations, we can calculate an effect size and a confidence interval at our chosen level. If the chosen confidence interval for the effect size includes zero, then we cannot say that we have demonstrated a positive effect of treatment.
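Putting the last two paragraphs together, here is a minimal sketch (hypothetical, simulated data, assuming numpy and scipy; the effect size is again a simple difference in means, and the degrees of freedom are a simple approximation): it computes the effect size, builds 95% and 99% confidence intervals from the standard error, and checks whether each interval includes zero.

```python
# A minimal sketch (hypothetical data, assuming numpy and scipy): effect size
# as a difference in means, with 95% and 99% confidence intervals built from
# its standard error. The 99% interval is wider for the same data, and an
# interval that includes zero does not demonstrate a positive treatment effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(loc=10.0, scale=4.0, size=40)
treated = rng.normal(loc=12.0, scale=4.0, size=40)

effect = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size)
df = treated.size + control.size - 2   # a simple approximation to the degrees of freedom

for level in (0.95, 0.99):
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df)
    lower, upper = effect - t_crit * se, effect + t_crit * se
    verdict = "positive effect demonstrated" if lower > 0 else "does not exclude zero"
    print(f"{level:.0%} CI: ({lower:.2f}, {upper:.2f}) -> {verdict}")
```

Note how the 99% interval comes out wider than the 95% interval for exactly the same data, since a larger critical value is needed to reach the higher level of confidence.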