{"id":1927,"date":"2024-03-01T00:01:22","date_gmt":"2024-03-01T05:01:22","guid":{"rendered":"https:\/\/mathvoices.ams.org\/featurecolumn\/?p=1927"},"modified":"2024-07-15T12:01:14","modified_gmt":"2024-07-15T16:01:14","slug":"is-this-p-hacking","status":"publish","type":"post","link":"https:\/\/mathvoices.ams.org\/featurecolumn\/2024\/03\/01\/is-this-p-hacking\/","title":{"rendered":"Is this $p$-hacking?"},"content":{"rendered":"<p><span id=\"pullQuote\"><em>The number of comparisons is going to escalate quickly. If we have four flavors of ice cream, we go from Scenario 1 showing three significant variables in its model outputs to the rest of the scenarios only reporting one&#8230;<\/em><\/span><\/p>\n<h1 class=\"headlineText\">Is this $p$-hacking?<\/h1>\n<h2 class=\"headlineText\">If you have to ask, it probably is.<\/h2>\n<p><strong>Sara Stoudt<\/strong><br \/>\n<strong>Bucknell University<\/strong><\/p>\n<p>I got asked an amazing question last semester while I was giving a crash-course workshop on statistics (shout out to the <a href=\"https:\/\/inmas.us\/\">Internship Network in the Mathematical Sciences<\/a>) in the fall. It was one of those questions that I really wanted to dig into, and now I finally am.<\/p>\n<p>The question had to do with $p$-hacking, which is a concept that comes up a lot when thinking about the reproducibility and replicability of statistical results. There has been <a href=\"https:\/\/www.tandfonline.com\/toc\/utas20\/73\/sup1\">plenty of<\/a> <a href=\"https:\/\/projects.fivethirtyeight.com\/p-hacking\/\">discourse<\/a> about the concept, but the gist is that you keep digging around in the data until you find something significant. By doing this, we run the risk of making false discoveries that wouldn\u2019t be able to be replicated in a different study. 
We trade long-term progress for short-term gains.<\/p>\n<p>The context of this question was regression and how to treat a categorical variable with more than two categories as a covariate in a model. Before I reveal the specific question, let\u2019s make sure we\u2019re on the same page with the statistical setting.<\/p>\n<p>We\u2019ll start with a concrete scenario. Suppose I want to model ice cream consumption and one covariate is which flavor is on sale (vanilla, chocolate, or strawberry) and one covariate is the temperature outside. Note: in this fictional world, one and only one ice cream flavor can be on sale at any given time.<\/p>\n<div align=\"center\">\n<img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2024\/03\/mod1.png?w=600&#038;ssl=1\" alt=\"Linear model notation for ice cream consumption as a function of sale flavor and outside temperature. Each of the three variables has an icon that illustrates it: a bowl of ice cream with many scoops, individual scoops of three flavors, and a thermometer.\"  \/><br \/>\n<br \/>\nIce cream bowl by Twemoji, <a href=\"https:\/\/creativecommons.org\/licenses\/by\/4.0\/\">CC BY 4.0 Deed<\/a>; Strawberry by anavrin-stock, <a href=\"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/3.0\/\">CC BY-NC-ND 3.0 Deed<\/a>; thermometer by HitomiAkane, <a href=\"https:\/\/creativecommons.org\/licenses\/by-sa\/4.0\/deed.en\">CC BY-SA 4.0 Deed<\/a>; vanilla and chocolate are public domain.\n<\/div>\n<p>Now we could assign different numbers to each sale flavor: 1 for vanilla, 2 for chocolate, and 3 for strawberry. But that feels unsatisfying because it\u2019s not like chocolate being on sale counts for twice as much as vanilla being on sale. What if we break up this one categorical variable into multiple <em>binary<\/em> categorical variables instead? 
So we make the so-called \u201cdummy\u201d variable isVanilla and set it equal to 1 if vanilla is on sale or 0 if it isn\u2019t. Then isChocolate is 1 if chocolate is on sale or 0 if it isn\u2019t. That leaves strawberry as what we call the baseline: if isVanilla and isChocolate are both zero, that means we\u2019re looking at a situation where strawberry is on sale.<\/p>\n<div align=\"center\">\n<img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2024\/03\/mod2.png?w=600&#038;ssl=1\" alt=\"Linear model notation for ice cream consumption as a function of isVanilla and isChocolate as dummy variables and outside temperature. Each of the variables have icons that illustrate: a bowl of ice cream with many scoops, individual scoops of the two sale flavors, and a thermometer.\"  \/><\/div>\n<p>Now when we fit this model, we\u2019ll get coefficients on both isVanilla and isChocolate, and these will tell us how these flavors being on sale are associated with ice cream consumption <em>in comparison to<\/em> how strawberry being on sale is associated with ice cream consumption.<\/p>\n<p>If the coefficient on isVanilla is -1.2 and the one on isChocolate is 0.8, that means that if the ice cream flavor that is on sale is vanilla, we expect to see ice cream consumption decrease by 1.2 units on average compared to what ice cream consumption would be if strawberry was on sale, holding all else constant. Phew, that\u2019s a mouthful! Similarly, if the ice cream flavor that is on sale is chocolate, we expect to see ice cream consumption increase by 0.8 units on average compared to what ice cream consumption would be if strawberry was on sale, holding all else constant. <\/p>\n<p>Now, here\u2019s where it gets weird. Each of those coefficients has a level of significance. But what statistical test is being performed? 
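As a quick illustration, here is a minimal sketch in Python of fitting this dummy-coded model to simulated data (the data, effect sizes, and variable names are invented for illustration, and a normal approximation stands in for the exact $t$-based $p$-values):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n = 300
# Simulated covariates: which flavor is on sale, and the outside temperature
flavor = rng.choice(['vanilla', 'chocolate', 'strawberry'], size=n)
temp = rng.uniform(10, 35, size=n)
# Invented true effects relative to strawberry: vanilla -1.2, chocolate +0.8
effect = {'vanilla': -1.2, 'chocolate': 0.8, 'strawberry': 0.0}
y = 5 + np.array([effect[f] for f in flavor]) + 0.3 * temp + rng.normal(0, 1, size=n)

# Design matrix with strawberry as the baseline level
X = np.column_stack([
    np.ones(n),                             # intercept
    (flavor == 'vanilla').astype(float),    # isVanilla
    (flavor == 'chocolate').astype(float),  # isChocolate
    temp,                                   # temperature
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])   # residual variance estimate
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
# Two-sided p-values via a normal approximation to the t distribution
pvals = [2 * (1 - NormalDist().cdf(abs(b / s))) for b, s in zip(beta, se)]
for name, b, p in zip(['intercept', 'isVanilla', 'isChocolate', 'temp'], beta, pvals):
    print(name, round(b, 2), round(p, 4))
```

Running this prints one row per coefficient; the $p$-value attached to each dummy coefficient is the comparison of that flavor against the strawberry baseline.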
Each coefficient\u2019s inference values actually come from a fancy difference-in-means test: vanilla versus strawberry and chocolate versus strawberry, respectively. <\/p>\n<p>We\u2019re finally at the part where I can tell you the original question that inspired this blog post. Are you ready? The question was&#8230;<\/p>\n<p>Can you $p$-hack by fiddling with the baseline level in a model? We all want our model to be significant, right?!<\/p>\n<p>Now, $p$-hacking is really what statisticians call a \u201cmultiple testing\u201d problem. (xkcd explains the issue <a href=\"https:\/\/www.explainxkcd.com\/wiki\/index.php\/882:_Significant\">here<\/a> too.) One of the things that a $p$-value cutoff of 0.05 implies is that the probability of a false positive (rejecting the null hypothesis when you shouldn\u2019t) is only 5%. That seems pretty unlikely. But what happens if you start doing more tests? The probability of at least one unlikely thing happening across many, many tests turns out to be, well, not that unlikely anymore. By the time we run ten tests, we already have over a 40% chance of having at least one false positive ($1 - 0.95^{10} \\approx 0.40$). <\/p>\n<p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2024\/03\/multTest.png?w=600&#038;ssl=1\" alt=\"A graph with number of tests on the x-axis and the probability of at least one false positive on the y-axis. The probability sharply increases, and by 100 tests, the graph flattens out at a probability of 1. The title of the plot is \u201cUh oh! 
Multiple Testing is at it again.\u201d\"  \/><\/p>\n<p>So, for instance, suppose I\u2019m a bit of a chocoholic and I choose chocolate as the baseline.<\/p>\n<div align=\"center\">\n<img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2024\/03\/mod3.png?w=600&#038;ssl=1\" alt=\"Linear model notation for ice cream consumption as a function of isVanilla and isStrawberry as dummy variables and outside temperature. Each of the variables has an icon that illustrates it: a bowl of ice cream with many scoops, individual scoops of the two sale flavors, and a thermometer.\"  \/>\n<\/div>\n<p>Is it possible for the significance of the coefficients on isVanilla and isStrawberry to not match the significance of the coefficients on isVanilla and isChocolate in the other model?<\/p>\n<p>Let\u2019s sketch it out! Below we have some hypothetical relationships between the flavors and the response variable, complete with confidence intervals. Each interval represents one of the flavors: vanilla, chocolate, or strawberry. Roughly, if the intervals don\u2019t overlap, the significance test comparing them would turn up significant. In this example, two comparisons are significant, while one isn\u2019t. If the baseline is chosen as in Scenario 1, the model results are going to have two significant coefficients, while in the other two scenarios there will be only one significant coefficient. <\/p>\n<p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2024\/03\/intervalFig1.png?w=600&#038;ssl=1\" alt=\"Three confidence intervals where two overlap and one doesn\u2019t. There are three scenarios where each interval takes a turn at representing the baseline. 
Color-coded lines represent which pairwise relationships are significant\/not significant and which would appear in the model\u2019s results table.\"  \/><\/p>\n<p>Now with three levels of the categorical variable, things can\u2019t get too wild. The difference between reporting one and two significant variables doesn\u2019t sound too sneaky if we\u2019re thinking in the context of $p$-hacking. But the number of comparisons is going to escalate quickly. If we have four flavors of ice cream, we go from Scenario 1 showing three significant variables in its model outputs to the rest of the scenarios only reporting one. That seems a bit misleading.<\/p>\n<p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2024\/03\/intervalFig2.png?w=600&#038;ssl=1\" alt=\"Four confidence intervals where three overlap and one doesn\u2019t. There are four scenarios where each interval takes a turn at representing the baseline. Color-coded lines represent which pairwise relationships are significant\/not significant and which would appear in the model\u2019s results table.\"  \/><\/p>\n<p>But what if we had a categorical variable with more levels? For example, the Census breaks income into 9 brackets (see Table A2 <a href=\"https:\/\/www.census.gov\/content\/dam\/Census\/library\/publications\/2023\/demo\/p60-279.pdf\">here<\/a>). Can you draw a picture of 9 intervals where a strategic choice of baseline results in an even more dramatic discrepancy in significance output? Similarly, what if we started adding interaction terms between our dummy variables and the temperature variable? There is a lot of room for things to get weird.<\/p>\n<p>So what\u2019s the takeaway from this tale? Choosing a baseline often happens by default, but it can have an impact on the results we see. 
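To see the baseline effect concretely, here is a minimal sketch (again with simulated data and invented effect sizes, using a normal approximation to the $p$-values) that refits the identical model under two different baseline choices:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n = 300
flavors = ['vanilla', 'chocolate', 'strawberry']
flavor = rng.choice(flavors, size=n)
temp = rng.uniform(10, 35, size=n)
# Invented truth: vanilla and chocolate sit close together, far from strawberry
effect = {'vanilla': 0.9, 'chocolate': 1.0, 'strawberry': 0.0}
y = 5 + np.array([effect[f] for f in flavor]) + 0.3 * temp + rng.normal(0, 1, size=n)

def dummy_pvals(baseline):
    # Dummy-code every flavor except the chosen baseline, then fit OLS
    others = [f for f in flavors if f != baseline]
    X = np.column_stack(
        [np.ones(n)] + [(flavor == f).astype(float) for f in others] + [temp])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    pv = [2 * (1 - NormalDist().cdf(abs(b / s))) for b, s in zip(beta, se)]
    # Return the p-value for each non-baseline flavor coefficient
    return {f: p for f, p in zip(others, pv[1:3])}

for base in ['strawberry', 'chocolate']:
    pv = dummy_pvals(base)
    print('baseline =', base, {f: round(p, 3) for f, p in pv.items()})
```

Because vanilla and chocolate were simulated to sit close together but far from strawberry, the strawberry-baseline table will typically flag both flavor coefficients while the chocolate-baseline table flags only strawberry, even though both fits describe the same underlying model.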
Even though messing around with the baseline may not seem as pernicious as some other more egregious ways of $p$-hacking, thinking about this scenario can make us more aware of the impact that our modeling choices can have. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>The number of comparisons is going to escalate quickly. If we have four flavors of ice cream, we go from Scenario 1 showing three significant variables in its model outputs to the rest of the scenarios only reporting one&#8230; Is this $p$-hacking? If you have to ask, it probably is.<span class=\"more-link\"><a href=\"https:\/\/mathvoices.ams.org\/featurecolumn\/2024\/03\/01\/is-this-p-hacking\/\">Read More &rarr;<\/a><\/span><\/p>\n","protected":false},"author":2,"featured_media":2037,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[147,25,77],"tags":[152,153],"class_list":["entry","author-uwhitcher","post-1927","post","type-post","status-publish","format-standard","has-post-thumbnail","category-147","category-probability-and-statistics","category-sara-stoudt","tag-p-hacking","tag-statistics"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2024\/07\/cropped-FC1380x500x2.png?fit=1380%2C288&ssl=1","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/posts\/1927","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/m
athvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/comments?post=1927"}],"version-history":[{"count":10,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/posts\/1927\/revisions"}],"predecessor-version":[{"id":1944,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/posts\/1927\/revisions\/1944"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/media\/2037"}],"wp:attachment":[{"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/media?parent=1927"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/categories?post=1927"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/tags?post=1927"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}