Confounding bias

The mythical “P-value”

Statistical and clinical significance of confounders in RCTs

One thing I tell my students is that thinking hard is hard, but necessary. Of course we would all prefer to make decisions the easy way, with simple heuristics or decision rules. Unfortunately, when we fail to grapple with the nuances in each new decision task, we increase our risk of error.

Statistical and clinical significance with confounders

In the past it was traditional for authors of randomized controlled trials (RCTs) to report “p-values” in tables that compared baseline characteristics between two treatment groups. However, the practice of reporting p-values is declining for reasons that are largely philosophical.1,2 Nonetheless, it is important for readers of RCTs to examine the differences between the two groups at baseline because of concerns about confounding bias. Readers with a background in science might prefer to rely on p-values to decide whether any differences are “statistically significant” because they feel it will help them infer whether the differences are “clinically significant.” However, relying on p-values in this way could overlook two important considerations:

  1. The nuances of what is implied by the p-value in the context of differences in baseline characteristics
  2. How to interpret the “clinical significance” of large but statistically non-significant differences in key baseline characteristics

The p-value in the context of differences in baseline characteristics

Scientists calculate a p-value to help them think about the meaning of differences observed in experimental or quasi-experimental studies. The interpretation of the p-value is straightforward when the context is inferential hypothesis testing. When the context is comparing baseline differences between two randomly allocated groups, as in the case of RCTs, there is a little more nuance to consider.

In a nutshell, the p-value is the estimated probability that, just by chance, researchers could randomly sample groups from the same population with differences at least as large as what was observed. For example, consider the figure, which shows a table of baseline characteristics for escitalopram compared to placebo in an RCT by Kim and colleagues.3 In blue, I have written in the risk ratios (RRs) and p-values for some of the more pronounced differences. For the percentage of men in the placebo group compared to escitalopram, the RR is 1.04, which means the proportion of men is 4% higher, in relative terms, in the placebo group. The p-value represents the estimated probability that a difference of this magnitude could occur just by chance. It’s 0.654, which means that if investigators were to repeatedly draw two random groups of the same size (about 150 each) from the same population, the proportion of men would differ by 4% or more between the groups about 65% of the time. Now consider the “rented accommodation” characteristic, which is an indicator of socioeconomic status (SES). In this case, the proportion of renters is 60% higher in the placebo group than in the escitalopram group, a difference that has a smaller probability of occurring just by chance (5.9%). This means that a difference this large (or larger) would occur only about once in every 17 samples.

Figure. A few “baseline characteristics” reported by study authors for escitalopram compared to placebo. SD is for standard deviation, a measure of variability. RR is for risk ratio, which is the proportion in the placebo group divided by the proportion in the escitalopram group. The “p-value” is the probability of randomly sampling two groups with differences at least this large.
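The repeated-sampling idea described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the counts (92 vs 88 men in groups of 150) are hypothetical numbers chosen to give an RR near 1.04, not values from the trial, and the function name `rerandomization_p_value` is my own. Because it shuffles group labels rather than running the test the figure's p-values came from, the estimate will not match 0.654 exactly.

```python
import random


def rerandomization_p_value(men_a, men_b, n_a, n_b, n_sims=10_000, seed=0):
    """Estimate how often random allocation alone produces a gap in
    proportions at least as large as the observed one (two-sided)."""
    rng = random.Random(seed)
    total_men = men_a + men_b
    # 1 = man, 0 = woman. The pooled cohort is fixed; only allocation varies.
    cohort = [1] * total_men + [0] * (n_a + n_b - total_men)
    observed_gap = abs(men_a / n_a - men_b / n_b)
    hits = 0
    for _ in range(n_sims):
        rng.shuffle(cohort)
        sim_a = sum(cohort[:n_a])      # men allocated to group A this time
        sim_b = total_men - sim_a      # the rest land in group B
        simulated_gap = abs(sim_a / n_a - sim_b / n_b)
        # Small tolerance so ties with the observed gap are counted.
        if simulated_gap >= observed_gap - 1e-12:
            hits += 1
    return hits / n_sims


# Hypothetical counts, NOT taken from the trial's baseline table.
risk_ratio = (92 / 150) / (88 / 150)
p_men = rerandomization_p_value(92, 88, 150, 150)
print(f"RR = {risk_ratio:.2f}, estimated p = {p_men:.2f}")
```

With counts like these, most random allocations produce a gap at least this large, which is the sense in which a small imbalance is unremarkable.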

How do we interpret this? In a straightforward hypothesis test, such as when we are comparing the two groups to see if escitalopram was likely the reason that there was a lower risk of dying, we would deem the finding to be “statistically significant” if the p-value is small enough that the difference seems unlikely to be caused by chance. By convention, the threshold for statistical significance is 5%. This means that, if the estimated probability that we would see a difference as large as this one is smaller than 1 in 20, we would conclude that the difference was unlikely to be due to chance. And if the only thing that varied between the two groups was the treatment, then we would conclude that the treatment likely caused the difference. (Conversely, if the p-value were greater than 5%, we would not conclude that the difference was due to chance; rather, we would say that the test was inconclusive. That’s because of something called assay sensitivity.)

But how do we interpret the p-value in the case of baseline characteristics? What does it mean to estimate the probability that we have sampled these two groups just by chance when we already know that the two groups have been sampled by chance, using randomization? If we used p-values for these comparisons, then whenever one was less than 0.05, we would conclude that the difference was not due to chance. However, this would be an erroneous conclusion, because the groups were randomly sampled. (Unless, of course, there was allocation bias, attrition bias, investigator incompetence, or fraud.) The 5% threshold is arbitrary; we accept it by convention. It’s just a dogmatic rule. But the truth is that things that have a 5% probability happen by chance every day, all over the world! That’s why it’s important to engage our brains and think about the hypothesis we are testing in each case.
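To see how often the 0.05 rule would flag a baseline difference when randomization is the only thing at work, here is a rough simulation. Everything in it is an assumption made for illustration: the prevalence of 30%, the group size of 150, the number of simulated trials, and the use of a pooled two-proportion z-test (a normal approximation, not necessarily the test any trial author would use).

```python
import math
import random


def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test
    (normal approximation)."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (x_a / n_a - x_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def share_flagged_by_chance(n_per_group=150, prevalence=0.3,
                            n_trials=2000, seed=1):
    """Fraction of simulated trials in which a baseline characteristic
    reaches p < 0.05 even though both groups share one population."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_trials):
        x_a = sum(rng.random() < prevalence for _ in range(n_per_group))
        x_b = sum(rng.random() < prevalence for _ in range(n_per_group))
        if two_proportion_p_value(x_a, n_per_group, x_b, n_per_group) < 0.05:
            flagged += 1
    return flagged / n_trials


share = share_flagged_by_chance()
print(f"Share of chance-only comparisons with p < 0.05: {share:.3f}")
```

The share should land in the neighborhood of 1 in 20, so a baseline table with many rows can be expected to contain the occasional "significant" difference produced by nothing but randomization.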

Personally, I think it would be helpful to have a hypothesis test to help us consider whether one of the biases listed above is a likely explanation of the differences observed in RCTs. However, such a test would not tell us anything about how important the differences are. The importance of such differences has to be interpreted in the context of clinical significance.

The clinical significance of differences in baseline characteristics

Although none of the p-values in the figure is less than 0.05, that doesn’t mean that these differences are unimportant. It’s possible that the only reason these differences are not significant is that the sample size was small; with sufficiently larger samples, differences of this magnitude would have been statistically significant. The fact is that social determinants of health have a powerful impact on outcomes like heart disease, cancer, and death.4 The social determinants of health don’t exert a smaller effect on health when their prevalence is smaller; social determinants of health don’t care one whit about prevalence or statistical significance. What matters is how a man’s health is impacted by his living alone in poverty. Differences in these characteristics of the magnitudes observed here are almost certainly going to be important, especially over follow-up times of 10 years or more.
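The sample-size point can be checked with a little arithmetic. In this sketch the proportions (16% vs 10%, an RR of 1.6) are hypothetical and are not the trial's actual figures; the pooled two-proportion z-test is likewise an assumed stand-in for whatever test one prefers. The same relative difference is evaluated at 150 per group and at 1,500 per group.

```python
import math


def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test
    (normal approximation)."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (x_a / n_a - x_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 16% vs 10% with the characteristic in both scenarios.
# (count in group A, count in group B, size of each group)
scenarios = [(24, 15, 150), (240, 150, 1500)]

p_values = {}
for x_a, x_b, n in scenarios:
    rr = (x_a / n) / (x_b / n)
    p_values[n] = two_proportion_p_value(x_a, n, x_b, n)
    print(f"n per group = {n}: RR = {rr:.2f}, p = {p_values[n]:.2g}")

p_small, p_large = p_values[150], p_values[1500]
```

The risk ratio is identical in both scenarios; only the p-value moves. That is why the p-value cannot tell us whether an imbalance of a given size matters clinically.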


  1. Senn S. Testing for baseline balance in clinical trials. Statistics in Medicine. 1994;13(17):1715-26.
  2. Austin PC, Manca A, Zwarenstein M, et al. A substantial and confusing variation exists in handling of baseline covariates in randomized controlled trials: A review of trials published in leading medical journals. Journal of Clinical Epidemiology. 2010;63(2):142-53.
  3. Kim J-M, Stewart R, Lee Y-S, et al. Effect of escitalopram vs placebo treatment for depression on long-term cardiac outcomes in patients with acute coronary syndrome: A randomized clinical trial. JAMA. 2018;320(4):350-8.
  4. Shrank WH, Patrick AR, Brookhart MA. Healthy user and related biases in observational studies of preventive interventions: A primer for physicians. Journal of General Internal Medicine. 2011;26(5):546-50.
