{"id":748,"date":"2022-03-01T00:01:50","date_gmt":"2022-03-01T05:01:50","guid":{"rendered":"https:\/\/mathvoices.ams.org\/featurecolumn\/?p=748"},"modified":"2022-03-02T16:55:09","modified_gmt":"2022-03-02T21:55:09","slug":"ordinary-least-squares","status":"publish","type":"post","link":"https:\/\/mathvoices.ams.org\/featurecolumn\/2022\/03\/01\/ordinary-least-squares\/","title":{"rendered":"The Origins of Ordinary Least Squares Assumptions"},"content":{"rendered":"<p><span id=\"pullQuote\"><em>When we start to think more about it, more questions arise. What makes a line \u201cgood\u201d? How do we tell if a line is the \u201cbest\u201d?<\/em> <\/span><\/p>\n<h1>The Origins of Ordinary Least Squares Assumptions<\/h1>\n<h2>Some Are More Breakable Than Others<\/h2>\n<p><strong>Sara Stoudt<\/strong><br \/>\n<strong>Bucknell University<\/strong><\/p>\n<h1 class=\"headlineText\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"wp-image-766 aligncenter\" style=\"font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;font-size: 16px\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-17-at-2.04.33-PM.png?resize=620%2C480&#038;ssl=1\" alt=\"Three scatterplots signifying residual plots for linear regression. The first shows a nonlinear relationship and a stick figure remarking \u201cHm\u2026 that L in OLS stands for linear right?\u201d. The second shows a fan shape (signifying non-constant variance) with a stick figure exclaiming \u201cThe dreaded fan shape!!\u201d. The third shows points not following a line especially on the ends (signifying a lack of normality in a quantile quantile plot) with a stick figure asking \u201cNot so normal are we?\u201d. \" width=\"620\" height=\"480\" srcset=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-17-at-2.04.33-PM.png?resize=300%2C232&amp;ssl=1 300w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-17-at-2.04.33-PM.png?resize=768%2C595&amp;ssl=1 768w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-17-at-2.04.33-PM.png?resize=465%2C360&amp;ssl=1 465w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-17-at-2.04.33-PM.png?resize=645%2C500&amp;ssl=1 645w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-17-at-2.04.33-PM.png?w=879&amp;ssl=1 879w\" sizes=\"auto, (max-width: 620px) 100vw, 620px\" \/><\/h1>\n<h3>Introduction<\/h3>\n<p><span style=\"font-weight: 400\">Fitting a line to a set of points\u2026 how hard can it be? When those points represent the temperature outside and a town\u2019s ice cream consumption, I\u2019m really invested in that line helping me to understand the relationship between those two quantities. (What if my favorite flavor runs out?!) I might even want to predict new values of ice cream consumption based on new temperature values. A line can give us a way to do that too. But when we start to think more about it, more questions arise. What makes a line \u201cgood\u201d? How do we tell if a line is the \u201cbest\u201d?<\/span><\/p>\n<p><span style=\"font-weight: 400\">A technique called ordinary least squares (OLS), aka linear regression, is a principled way to pick the \u201cbest\u201d line where \u201cbest\u201d is defined as the one that minimizes the sum of the squared distances between the line and each point. We chant the assumptions of OLS and know what to look for in diagnostic plots, but where do these assumptions come from? Are there assumptions that are more hurtful than others if broken? These questions are something I even second guess myself, though I frequently use and teach linear regression.<\/span><\/p>\n<h3>Bring on the Math<\/h3>\n<p><span style=\"font-weight: 400\">While proving \u201cgood\u201d performance of OLS, we need to make certain assumptions to streamline the process. Walking through the proof reveals what assumptions are most crucial and what the impact is of breaking each one.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Let\u2019s first write out our theoretical model in matrix notation and recall what the dimensions of each piece of the puzzle are. We are looking for a linear relationship between $X$ and $Y$, but we know there may be some error, which we represent by $\\epsilon$.<\/span><\/p>\n<h1 class=\"headlineText\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-756 aligncenter\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.05.58-PM.png?resize=325%2C215&#038;ssl=1\" alt=\"The equation Y equals X times Beta + Epsilon with brackets below each variable denoting the size of each variable. Y has n rows and 1 column. X has n rows and p columns. Beta has p rows and 1 column 1. Epsilon has n rows and 1 column. \" width=\"325\" height=\"215\" srcset=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.05.58-PM.png?resize=300%2C198&amp;ssl=1 300w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.05.58-PM.png?resize=465%2C306&amp;ssl=1 465w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.05.58-PM.png?w=621&amp;ssl=1 621w\" sizes=\"auto, (max-width: 325px) 100vw, 325px\" \/><\/h1>\n<p><span style=\"font-weight: 400\">The OLS approach estimates the unknown parameters $\\beta$ by <\/span><span style=\"font-weight: 400\">$\\hat{\\beta} = (X\u2019X)^{-1}X\u2019Y$. Here $X\u2019$ is the transpose of $X$, so the product $X\u2019X$ is a square matrix.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h1 class=\"headlineText\" style=\"text-align: center\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-757 \" style=\"font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;font-size: 16px\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.02-PM.png?resize=403%2C144&#038;ssl=1\" alt=\"The equation beta-hat equals the inverse of parenthesis X transpose multiplied by X close parenthesis multiplied by X transpose times Y. The inverse notation is emphasized and an annotation prompts: \u201ccan we invert this?\u201d.\" width=\"403\" height=\"144\" srcset=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.02-PM.png?resize=300%2C107&amp;ssl=1 300w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.02-PM.png?resize=465%2C166&amp;ssl=1 465w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.02-PM.png?w=623&amp;ssl=1 623w\" sizes=\"auto, (max-width: 403px) 100vw, 403px\" \/><\/h1>\n<p><span style=\"font-weight: 400\">Have we implicitly assumed anything yet? Actually yes, there is an inverse in play! What would $X\u2019X$ have to look like to be invertible? Let\u2019s put our linear algebra hats on.<\/span><\/p>\n<p><span style=\"font-weight: 400\">This matrix would have to be \u201cfull rank\u201d. Informally, that means that there is no redundant information in the columns of $X$. So if the number of predictor variables $p$ is greater than the number of data points $n$, that would be a problem. Even if $p &lt; n$, the columns of $X$ still need to be linearly independent. Multicollinearity (when one covariate is highly correlated with another covariate) would make us suspicious of this. Let\u2019s keep that in mind just in case it comes back to haunt us.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Now that we have our estimator for $\\beta$, there are two properties we want to check on.<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">We want to on average get the \u201cright\u201d answer.<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">We want to be able to describe the uncertainty in our estimate. In other words, how much would our estimate wiggle around if we happened to have a new sample?<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400\">Let\u2019s start with determining if we \u201cexpect\u201d to get the right answer on average. It will help to rewrite $Y$ based on the proposed model.<\/span><\/p>\n<h1 class=\"headlineText\" style=\"text-align: center\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-758 \" style=\"font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;font-size: 16px\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.12-PM.png?resize=671%2C486&#038;ssl=1\" alt=\"We want the expected value of beta-hat conditional on X to be equal to beta. Beta-hat can be rewritten as the inverse of parenthesis X transpose multiplied by X close parenthesis times X transpose times Y. This is equal to the inverse of parenthesis X transpose multiplied by X close parenthesis times X transpose times parenthesis X times beta plus epsilon close parenthesis. This breaks down into parenthesis X transpose multiplied by X close parenthesis times X transpose times X times beta plus parenthesis X transpose multiplied by X close parenthesis times X transpose times epsilon. The first part of the first term in the sum, parenthesis X transpose multiplied by X close parenthesis times X transpose times X, cancels leaving us with beta. Now we just need parenthesis X transpose multiplied by X close parenthesis times X transpose times epsilon to go away. An annotation tells us it is \u201cexpectation time!\u201d.\" width=\"671\" height=\"486\" srcset=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.12-PM.png?resize=300%2C217&amp;ssl=1 300w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.12-PM.png?resize=768%2C555&amp;ssl=1 768w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.12-PM.png?resize=465%2C336&amp;ssl=1 465w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.12-PM.png?resize=690%2C500&amp;ssl=1 690w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.12-PM.png?resize=691%2C500&amp;ssl=1 691w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.12-PM.png?w=954&amp;ssl=1 954w\" sizes=\"auto, (max-width: 671px) 100vw, 671px\" \/><\/h1>\n<p><span style=\"font-weight: 400\">As we break this expression down, we can see that we are on the right track. However, somehow that second term needs to have expectation zero.<\/span><\/p>\n<h1 class=\"headlineText\" style=\"text-align: center\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-759 \" style=\"font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;font-size: 16px\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.21-PM.png?resize=656%2C496&#038;ssl=1\" alt=\"The expected value of beta-hat conditional on X is equal to the expected value of parenthesis beta plus parenthesis X transpose X close parenthesis times X transpose times epsilon close parenthesis conditional on X. This is equal to the expectation of beta conditional on X plus the expectation of the inverse of parenthesis X transpose X times X transpose times epsilon conditional on X. Everything but epsilon is fixed and can come out of the expectation. To be left with beta only we need the expected value of epsilon conditional on X to be zero. Is it?\" width=\"656\" height=\"496\" srcset=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.21-PM.png?resize=300%2C227&amp;ssl=1 300w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.21-PM.png?resize=768%2C582&amp;ssl=1 768w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.21-PM.png?resize=465%2C353&amp;ssl=1 465w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.21-PM.png?resize=660%2C500&amp;ssl=1 660w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.06.21-PM.png?w=893&amp;ssl=1 893w\" sizes=\"auto, (max-width: 656px) 100vw, 656px\" \/><\/h1>\n<p><span style=\"font-weight: 400\">How can we make this happen? This is where two regression assumptions are born. First we need the errors, $\\epsilon$, to be independent of $X$. This seems plausible. If the errors depend on $X$, somehow we still have some information leftover that is not accounted for in the model. If the errors <em>did<\/em> depend on $X$, that would be a form of heteroscedasticity (non-constant variance for short). That sets our OLS assumption alarm bells ringing. We also need to make an assumption about the magnitude of the errors themselves. It would be nice if they were not systematically positive or negative (that doesn\u2019t seem like a very good model), so assuming they are on average zero seems like a reasonable path forward.<\/span><\/p>\n<h1 class=\"headlineText\" style=\"text-align: center\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-760 \" style=\"font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;font-size: 16px\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.13.50-PM.png?resize=527%2C390&#038;ssl=1\" alt=\"Does the expected value of epsilon conditional on X equal zero? Epsilon being independent of X implies that the expected value of epsilon conditional on X is equal to the expected value of epsilon. Then if the expected value of epsilon is assumed to be zero, the expected value of beta-hat conditional on X equals beta as desired.\" width=\"527\" height=\"390\" srcset=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.13.50-PM.png?resize=300%2C222&amp;ssl=1 300w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.13.50-PM.png?resize=768%2C569&amp;ssl=1 768w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.13.50-PM.png?resize=465%2C344&amp;ssl=1 465w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.13.50-PM.png?resize=675%2C500&amp;ssl=1 675w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-2.13.50-PM.png?w=791&amp;ssl=1 791w\" sizes=\"auto, (max-width: 527px) 100vw, 527px\" \/><\/h1>\n<p><span style=\"font-weight: 400\">With these assumptions in hand we now have an estimator that is <em>unbiased<\/em>. We expect to get the right answer on average. So far we have seen the motivation for two assumptions. If either of those are broken, this unbiasedness is not guaranteed. So if our goal is to make predictions or interpret these coefficients in context, we will be out of luck if these assumptions aren\u2019t met. The next step is to understand the estimator\u2019s uncertainty. Let\u2019s see what other assumptions reveal themselves in the process.<\/span><\/p>\n<h1 class=\"headlineText\" style=\"text-align: center\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-761 \" style=\"font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;font-size: 16px\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.14-PM.png?resize=561%2C398&#038;ssl=1\" alt=\"The covariance of beta-hat conditional on X is equal to the inverse of parenthesis X transpose times X close parenthesis times X transpose times the covariance of parenthesis epsilon conditional on X close parenthesis times X times the inverse of parenthesis X transpose times X close parenthesis. If we could unite X transpose and X we could cancel one of their combined inverses. If the covariance of parenthesis epsilon conditional on X close parenthesis was constant, we could shuffle it around. Hence if we assume the epsilon subscript i values are independent (meaning we don\u2019t have to worry about covariances) and identically distributed (meaning the variance is the same for each, call it sigma squared), we are left with sigma squared times inverse of parenthesis X transpose times X close parenthesis. A stick figure laments \u201cAh! So many Xs!\u201d.\" width=\"561\" height=\"398\" srcset=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.14-PM.png?resize=300%2C213&amp;ssl=1 300w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.14-PM.png?resize=1024%2C727&amp;ssl=1 1024w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.14-PM.png?resize=768%2C546&amp;ssl=1 768w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.14-PM.png?resize=465%2C330&amp;ssl=1 465w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.14-PM.png?resize=695%2C494&amp;ssl=1 695w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.14-PM.png?w=1067&amp;ssl=1 1067w\" sizes=\"auto, (max-width: 561px) 100vw, 561px\" \/><\/h1>\n<p><span style=\"font-weight: 400\">So let\u2019s recap what we\u2019ve had to assume about the errors so far.<\/span><\/p>\n<h1 class=\"headlineText\" style=\"text-align: center\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-762 \" style=\"font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;font-size: 16px\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.31-PM.png?resize=603%2C169&#038;ssl=1\" alt=\"Epsilon subscript i is drawn iid from some distribution with mean zero and variance equal to sigma squared. Notably no normal distribution in sight! A cartoon normal distribution asks \u201cam I needed yet?\u201d.\" width=\"603\" height=\"169\" srcset=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.31-PM.png?resize=300%2C84&amp;ssl=1 300w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.31-PM.png?resize=1024%2C287&amp;ssl=1 1024w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.31-PM.png?resize=768%2C215&amp;ssl=1 768w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.31-PM.png?resize=465%2C130&amp;ssl=1 465w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.31-PM.png?resize=695%2C195&amp;ssl=1 695w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.31-PM.png?w=1113&amp;ssl=1 1113w\" sizes=\"auto, (max-width: 603px) 100vw, 603px\" \/><\/h1>\n<p><span style=\"font-weight: 400\">What <em>haven\u2019t<\/em> we had to assume yet? Normality! Then why is everyone always worried about that when doing regression? Why all of those qqplots?! Well, we do need to know the whole sampling distribution of the estimates if we want to do inference, or say something about a more general population based on a sample. If we assume the distribution that those errors have is actually Normal, that lets us get a normal sampling distribution of $\\hat{\\beta}$.<\/span><\/p>\n<h1 class=\"headlineText\" style=\"text-align: center\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-763 \" style=\"font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;font-size: 16px\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.46-PM.png?resize=567%2C210&#038;ssl=1\" alt=\"A cartoon normal distribution says \u201cI feel left out\u201d. Epsilon subscript i is drawn iid from a normal distribution with mean zero and variance equal to sigma squared implies that beta-hat is drawn from a Normal distribution with mean beta and variance equal to sigma squared multiplied by the inverse of parenthesis X transpose times X close parenthesis. A confidence interval around beta-hat subscript 1 and a distribution centered at beta with an area to the right of beta-hat subscript 1 shaded in is shown.\" width=\"567\" height=\"210\" srcset=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.46-PM.png?resize=300%2C111&amp;ssl=1 300w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.46-PM.png?resize=1024%2C379&amp;ssl=1 1024w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.46-PM.png?resize=768%2C284&amp;ssl=1 768w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.46-PM.png?resize=465%2C172&amp;ssl=1 465w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.46-PM.png?resize=695%2C257&amp;ssl=1 695w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.37.46-PM.png?w=1055&amp;ssl=1 1055w\" sizes=\"auto, (max-width: 567px) 100vw, 567px\" \/> <span style=\"font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;font-size: 16px\">\u00a0<\/span><\/h1>\n<p><span style=\"font-weight: 400\">Ah, but there is that pesky inverse again! If we have multicollinearity, that inverse might get unstable, affecting our understanding of the spread of the sampling distribution. And of course nothing in life is perfectly normal. However, we can often get an approximately normal sampling distribution if our sample size is large enough thanks to the Central Limit Theorem (an explainer for another time), so there is some robustness to breaking this particular assumption.<\/span><\/p>\n<p><span style=\"font-weight: 400\">We\u2019ve now seen all of the regression assumptions unfold, so we should be able to build a sort of hierarchy of assumptions.\u00a0<\/span><\/p>\n<h1 class=\"headlineText\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-765 aligncenter\" style=\"font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;font-size: 16px\" src=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.55.59-PM.png?resize=658%2C373&#038;ssl=1\" alt=\"The hierarchy of modeling assumptions. The most important components are about unbiasedness, namely the mean response is linearly related to X, the errors are independent of X and the errors have mean zero. Next are the assumptions that help us estimate the standard error, namely the errors are independent of one another and the errors have constant variance. Least important are the assumptions that give us the sampling distribution of beta-hat, namely that the errors are normally distributed. \" width=\"658\" height=\"373\" srcset=\"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.55.59-PM.png?resize=300%2C170&amp;ssl=1 300w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.55.59-PM.png?resize=1024%2C579&amp;ssl=1 1024w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.55.59-PM.png?resize=768%2C434&amp;ssl=1 768w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.55.59-PM.png?resize=465%2C263&amp;ssl=1 465w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.55.59-PM.png?resize=695%2C393&amp;ssl=1 695w, https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/Screen-Shot-2022-02-03-at-3.55.59-PM.png?w=1155&amp;ssl=1 1155w\" sizes=\"auto, (max-width: 658px) 100vw, 658px\" \/><\/h1>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">This hierarchy is also borne out in simulation studies like <\/span><a href=\"https:\/\/doi.org\/10.3758\/s13428-021-01587-5\"><span style=\"font-weight: 400\">this one<\/span><\/a><span style=\"font-weight: 400\"> that try to stress test OLS. You may even want to code up a simple simulation yourself if you want to look for any gaps between the theory and practice.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">Some broken assumptions are fixable though. We might transform variables to deal with a lack of linearity or use a Generalized Linear Model that allows the relationship between $X$ and $Y$ to take a non-linear form. We might just be happy with estimating the best linear projection of the true relationship. If errors aren\u2019t independent and\/or do not have constant variance, methods like generalized least squares (GLS) can step in. If errors are not normally distributed, they might be approximately normally distributed thanks to the Central Limit Theorem, or we might want to use a Generalized Linear Model (GLM) that can handle error distributions of other forms. We always have options!<\/span><\/p>\n<p><span style=\"font-weight: 400\">It\u2019s great to see the theory behind the scenes informing the practice, and in general, assumptions of any method have to come from somewhere. Seeking out where in the math assumptions help make our lives easier can help demystify where these assumptions actually come from and gives some insight into which ones, if broken, are more dangerous than others. Happy line fitting!<\/span><\/p>\n<h3>All the Math in One Place<\/h3>\n<p><span style=\"font-weight: 400\">$Y = X \\beta + \\epsilon$ <\/span><\/p>\n<p><span style=\"font-weight: 400\">$\\hat{\\beta} = (X\u2019X)^{-1}X\u2019Y$<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">$E[\\hat{\\beta} | X] = \\beta$<\/span><\/p>\n<p><span style=\"font-weight: 400\">$\\hat{\\beta} = (X\u2019X)^{-1} X\u2019 Y = (X\u2019X)^{-1} X\u2019 (X\\beta + \\epsilon)$<\/span><\/p>\n<p><span style=\"font-weight: 400\">$= (X\u2019X)^{-1} X\u2019X\\beta + (X\u2019X)^{-1} X\u2019\\epsilon$<\/span><\/p>\n<p><span style=\"font-weight: 400\">$= \\beta + (X\u2019X)^{-1}X\u2019 \\epsilon$<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">$E[\\hat{\\beta} | X] = E[(\\beta + (X\u2019X)^{-1} X\u2019 \\epsilon) | X]$<\/span><\/p>\n<p><span style=\"font-weight: 400\">$ = E[\\beta | X] + E[(X\u2019X)^{-1}X\u2019 \\epsilon | X]$<\/span><\/p>\n<p><span style=\"font-weight: 400\">$ = \\beta + (X\u2019X)^{-1} X\u2019 E[\\epsilon | X]$<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">$E[\\epsilon | X] = 0?$<\/span><\/p>\n<p>$\\epsilon_i \\stackrel{iid}{\\sim} N(0, \\sigma^2)$<\/p>\n<p>$\\hat{\\beta} \\sim N(\\beta, \\sigma^2 (X&#8217;X)^{-1})$<\/p>\n<h3>References<\/h3>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Statistical Models: Theory and Practice by David A. Freedman, <\/span><i><span style=\"font-weight: 400\">Cambridge University Press<\/span><\/i><span style=\"font-weight: 400\"> (2009)<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Mostly Harmless Econometrics: An Empiricist\u2019s Companion by Joshua D. Angrist and J<\/span><span style=\"font-weight: 400\">\u00f6<\/span><span style=\"font-weight: 400\">rn-Steffen Pischke, <\/span><i><span style=\"font-weight: 400\">Princeton University Press<\/span><\/i><span style=\"font-weight: 400\"> (2009)<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">Special thanks to my students in my Stat 2 class this semester for raising good questions about these regression assumptions and to my colleagues for helping me work through this hierarchy before reporting back. Also, thanks to Features Writer <\/span><span style=\"font-weight: 400\">Courtney R. Gibbons for including drawings in her last post. That inspired me to do some doodling myself.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Note: The website <\/span><a href=\"https:\/\/drawdata.xyz\/\"><span style=\"font-weight: 400\">https:\/\/drawdata.xyz\/<\/span><\/a><span style=\"font-weight: 400\"> made it possible to generate data to make the plots in the opening image.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>When we start to think more about it, more questions arise. What makes a line \u201cgood\u201d? How do we tell if a line is the \u201cbest\u201d? The Origins of Ordinary Least Squares Assumptions Some Are More Breakable Than Others Sara Stoudt Bucknell University Introduction Fitting a line to a set<span class=\"more-link\"><a href=\"https:\/\/mathvoices.ams.org\/featurecolumn\/2022\/03\/01\/ordinary-least-squares\/\">Read More &rarr;<\/a><\/span><\/p>\n","protected":false},"author":2,"featured_media":793,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4,25,77],"tags":[79,78],"class_list":["entry","author-uwhitcher","post-748","post","type-post","status-publish","format-standard","has-post-thumbnail","category-4","category-probability-and-statistics","category-sara-stoudt","tag-linear-regression","tag-ordinary-least-squares"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/mathvoices.ams.org\/featurecolumn\/wp-content\/uploads\/sites\/2\/2022\/03\/cropped-cropped-mathvoices-banner-feat-col-1.png?fit=1200%2C279&ssl=1","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/posts\/748","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/comments?post=748"}],"version-history":[{"count":23,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/posts\/748\/revisions"}],"predecessor-version":[{"id":880,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/posts\/748\/revisions\/880"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/media\/793"}],"wp:attachment":[{"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/media?parent=748"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/categories?post=748"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mathvoices.ams.org\/featurecolumn\/wp-json\/wp\/v2\/tags?post=748"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}