Is this $p$-hacking?
If you have to ask, it probably is.
Sara Stoudt
Bucknell University
I got asked an amazing question last fall while I was giving a crash-course workshop on statistics (shout out to the Internship Network in the Mathematical Sciences). It was one of those questions that I really wanted to dig into, and now I finally am.
The question had to do with $p$-hacking, a concept that comes up a lot when thinking about the reproducibility and replicability of statistical results. There has been plenty of discourse about it, but the gist is that you keep digging around in the data until you find something significant. By doing this, we run the risk of making false discoveries that wouldn’t replicate in a different study. We trade long-term progress for short-term gains.
The context of this question was regression and how to treat a categorical variable with more than two categories as a covariate in a model. Before I reveal the specific question, let’s make sure we’re on the same page with the statistical setting.
We’ll start with a concrete scenario. Suppose I want to model ice cream consumption and one covariate is what flavor is on sale (vanilla, chocolate, or strawberry) and one covariate is the temperature outside. Note: in this fictional world, one and only one ice cream flavor can be on sale at any given time.
Ice cream bowl by Twemoji, CC BY 4.0 Deed; Strawberry by anavrin-stock, CC BY-NC-ND 3.0 Deed; thermometer by HitomiAkane, CC BY-SA 4.0 Deed; vanilla and chocolate are public domain.
Now we could assign different numbers to each sale flavor: 1 for vanilla, 2 for chocolate, and 3 for strawberry. But that feels unsatisfying because it’s not like chocolate being on sale counts for twice as much as vanilla being on sale. What if we break up this one categorical variable into multiple binary categorical variables instead? So we make the so-called “dummy” variable isVanilla and set it equal to 1 if vanilla is on sale or 0 if it isn’t. Then isChocolate is 1 if chocolate is on sale or 0 if it isn’t. That leaves strawberry as what we call the baseline: if isVanilla and isChocolate are both zero, that means we’re looking at a situation where strawberry is on sale.
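If it helps to see that coding step spelled out, here’s a minimal sketch in Python with pandas. The tiny data frame and the column names (flavor_on_sale, is_vanilla, is_chocolate) are made up for illustration.

```python
import pandas as pd

# A handful of hypothetical observations of which flavor is on sale.
df = pd.DataFrame({"flavor_on_sale": ["vanilla", "chocolate", "strawberry",
                                      "vanilla", "strawberry"]})

# One 0/1 indicator column per flavor...
dummies = pd.get_dummies(df["flavor_on_sale"], prefix="is", dtype=int)

# ...then drop the strawberry column so strawberry acts as the baseline:
# a row with is_vanilla == 0 and is_chocolate == 0 is a strawberry row.
df = df.join(dummies.drop(columns="is_strawberry"))
print(df)
```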
Now when we fit this model, we’ll get coefficients on both isVanilla and isChocolate, and these will tell us how these flavors being on sale are associated with ice cream consumption in comparison to how strawberry being on sale is associated with ice cream consumption.
If the coefficient on isVanilla is -1.2 and the one on isChocolate is 0.8, that means that when vanilla is the flavor on sale, we expect ice cream consumption to be 1.2 units lower on average than it would be if strawberry were on sale, holding all else constant. Phew, that’s a mouthful! Similarly, when chocolate is the flavor on sale, we expect ice cream consumption to be 0.8 units higher on average than it would be if strawberry were on sale, holding all else constant.
Now, here’s where it gets weird. Each of those coefficients comes with a $p$-value. But what statistical test is actually being performed? Each coefficient’s $p$-value comes from what amounts to a fancy difference-in-means test: vanilla versus strawberry for isVanilla, and chocolate versus strawberry for isChocolate, after accounting for temperature.
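To make this concrete, here’s a hedged sketch in Python using simulated data; the effect sizes and the statsmodels treatment coding with strawberry as the reference level are my own choices for illustration, not anything from a real data set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
flavor = rng.choice(["vanilla", "chocolate", "strawberry"], size=n)
temperature = rng.normal(75, 10, size=n)

# Invented "true" effects, loosely matching the numbers above:
# vanilla on sale lowers consumption, chocolate raises it, relative to strawberry.
effect = {"vanilla": -1.2, "chocolate": 0.8, "strawberry": 0.0}
consumption = (2 + 0.05 * temperature
               + np.array([effect[f] for f in flavor])
               + rng.normal(0, 1, size=n))

df = pd.DataFrame({"consumption": consumption, "flavor": flavor,
                   "temperature": temperature})

# Treatment (dummy) coding with strawberry as the baseline level.
fit = smf.ols("consumption ~ C(flavor, Treatment(reference='strawberry')) + temperature",
              data=df).fit()

# Each flavor coefficient's t-test compares that flavor to strawberry.
print(fit.summary())
```

Each flavor row in that summary table is exactly the comparison-to-strawberry test described above.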
We’re finally to the part where I can tell you the original question that inspired this blog post. Are you ready? The question was...
Can you $p$-hack by fiddling with the baseline level in a model? We all want our model to be significant, right?!
Now, $p$-hacking is really what statisticians call a “multiple testing” problem. (xkcd explains the issue here too.) One of the things that a $p$-value cutoff of 0.05 implies is that the probability of a false positive (rejecting the null hypothesis when you shouldn’t) is only 5%. That seems pretty unlikely. But what happens if you start doing more tests? The probability of at least one unlikely thing happening across many, many tests turns out to be, well, not that unlikely anymore. By the time we run ten tests, we already have over a 40% chance of seeing at least one false positive.
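That “over 40%” figure is just $1 - 0.95^{10} \approx 0.40$, assuming the tests are independent and every null hypothesis is actually true. A quick check:

```python
# Chance of at least one false positive across m independent tests,
# each run at the 0.05 level, when every null hypothesis is actually true.
for m in (1, 5, 10, 20):
    print(f"{m} tests: {1 - 0.95 ** m:.3f}")
```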
So for instance, suppose, I’m a bit of a chocoholic and I choose chocolate as the baseline.
Is it possible for the significance of the coefficients on isVanilla and isStrawberry to not match the significance of the coefficients on isVanilla and isChocolate in the other model?
Let’s sketch it out! Below we have some hypothetical relationships between the flavors and the response variable, complete with confidence intervals. Each interval represents one of the flavors: vanilla, chocolate, or strawberry. Roughly, if two intervals don’t overlap, the test comparing those two flavors would come out significant. In this example, two of the comparisons are significant, while one isn’t. If the baseline is chosen as in Scenario 1, the model results will have two significant coefficients, while in the other two scenarios, there will only be one significant coefficient.
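If you’d rather poke at this with code than with a picture, here’s a rough simulation along the lines of that sketch: strawberry’s mean is far from the other two, vanilla and chocolate sit close together, and we refit the same model three times, changing only the baseline. All the numbers are invented, and the exact counts depend on the random draw, but you should typically see two significant coefficients when strawberry is the baseline and only one otherwise.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 60  # observations per flavor
means = {"vanilla": 5.0, "chocolate": 5.1, "strawberry": 7.0}  # hypothetical

df = pd.concat(
    [pd.DataFrame({"flavor": flavor,
                   "consumption": rng.normal(mu, 1.0, size=n)})
     for flavor, mu in means.items()],
    ignore_index=True,
)

# Same data, same model, three different choices of baseline.
for baseline in ["strawberry", "vanilla", "chocolate"]:
    formula = f"consumption ~ C(flavor, Treatment(reference='{baseline}'))"
    fit = smf.ols(formula, data=df).fit()
    flavor_pvals = fit.pvalues.drop("Intercept")
    n_signif = (flavor_pvals < 0.05).sum()
    print(f"{baseline} as baseline: {n_signif} significant flavor coefficient(s)")
```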
Now with three levels of the categorical variable, things can’t get too wild. The difference between reporting one and two significant coefficients doesn’t sound too sneaky if we’re thinking in the context of $p$-hacking. But the number of comparisons is going to escalate quickly. If we have four flavors of ice cream, we go from Scenario 1 showing three significant coefficients in its model output to the rest of the scenarios only reporting one. That seems a bit misleading.
But what if we had a categorical variable with more levels? For example, the Census breaks income into 9 brackets (for example, see Table A2 here). Can you draw a picture of 9 intervals where a strategic choice of baseline results in an even more dramatic discrepancy in significance output? Similarly, what if we started adding interaction terms between our dummy variables and the temperature variable? There is a lot of room for things to get weird.
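To put a rough number on how much room there is: any single model shows only $k - 1$ flavor (or bracket) coefficients for a $k$-level categorical variable, but there are $k(k-1)/2$ pairwise comparisons you could be implicitly choosing among when you pick the baseline.

```python
# Coefficients reported in one model vs. pairwise comparisons hiding behind them.
for k in (3, 4, 9):  # three flavors, four flavors, nine income brackets
    print(f"{k} levels: {k - 1} coefficients shown, {k * (k - 1) // 2} pairwise comparisons")
```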
So what’s the take-away from this tale? Choosing a baseline often happens by default, but it can have an impact on the results we see. Even though messing around with the baseline may not seem as pernicious as some other more egregious ways of $p$-hacking, thinking about this scenario can make us more aware of the impact that our modeling choices can have.