The Origins of Ordinary Least Squares Assumptions

Some Are More Breakable Than Others

Sara Stoudt
Bucknell University

Three scatterplots representing diagnostic plots for linear regression. The first shows a nonlinear relationship and a stick figure remarking “Hm… that L in OLS stands for linear right?”. The second shows a fan shape (signifying non-constant variance) with a stick figure exclaiming “The dreaded fan shape!!”. The third shows points not following a line, especially on the ends (signifying a lack of normality in a quantile-quantile plot), with a stick figure asking “Not so normal are we?”.

Introduction

Fitting a line to a set of points… how hard can it be? When those points represent the temperature outside and a town’s ice cream consumption, I’m really invested in that line helping me to understand the relationship between those two quantities. (What if my favorite flavor runs out?!) I might even want to predict new values of ice cream consumption based on new temperature values. A line can give us a way to do that too. But when we start to think more about it, more questions arise. What makes a line “good”? How do we tell if a line is the “best”?

A technique called ordinary least squares (OLS), aka linear regression, is a principled way to pick the “best” line, where “best” is defined as the one that minimizes the sum of the squared vertical distances between each point and the line. We chant the assumptions of OLS and know what to look for in diagnostic plots, but where do these assumptions come from? Are there assumptions that are more hurtful than others if broken? These are questions I second-guess myself on, even though I frequently use and teach linear regression.

Bring on the Math

To prove that OLS performs “well,” we need to make certain assumptions that streamline the process. Walking through the proof reveals which assumptions are most crucial and what the impact of breaking each one is.

Let’s first write out our theoretical model in matrix notation and recall what the dimensions of each piece of the puzzle are. We are looking for a linear relationship between $X$ and $Y$, but we know there may be some error, which we represent by $\epsilon$.

The equation Y equals X times Beta plus Epsilon with brackets below each variable denoting the size of each variable. Y has n rows and 1 column. X has n rows and p columns. Beta has p rows and 1 column. Epsilon has n rows and 1 column.

The OLS approach estimates the unknown parameters $\beta$ by $\hat{\beta} = (X'X)^{-1}X'Y$. Here $X'$ is the transpose of $X$, so the product $X'X$ is a square matrix.
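
To make the formula concrete, here is a minimal sketch of my own (not part of the original post) that computes $\hat{\beta} = (X'X)^{-1}X'Y$ with NumPy on invented temperature and ice cream numbers.

```python
# A minimal sketch, assuming made-up temperature / ice cream data:
# compute beta-hat = (X'X)^{-1} X'Y literally and compare with a least squares solver.
import numpy as np

rng = np.random.default_rng(0)
n = 100
temperature = rng.uniform(10, 35, size=n)          # hypothetical predictor
X = np.column_stack([np.ones(n), temperature])     # design matrix with an intercept column
beta_true = np.array([2.0, 0.5])                   # invented "true" coefficients
y = X @ beta_true + rng.normal(0, 1.0, size=n)     # simulated ice cream consumption

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y        # the OLS formula, written out as-is
print("by the formula:", beta_hat)

# In practice a solver is preferred to forming the inverse explicitly.
print("by lstsq:      ", np.linalg.lstsq(X, y, rcond=None)[0])
```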

 

The equation beta-hat equals the inverse of parenthesis X transpose multiplied by X close parenthesis multiplied by X transpose times Y. The inverse notation is emphasized and an annotation prompts: “can we invert this?”.

Have we implicitly assumed anything yet? Actually yes, there is an inverse in play! What would $X’X$ have to look like to be invertible? Let’s put our linear algebra hats on.

This matrix would have to be “full rank”. Informally, that means that there is no redundant information in the columns of $X$. So if the number of predictor variables $p$ is greater than the number of data points $n$, that would be a problem. Even if $p < n$, the columns of $X$ still need to be linearly independent. Multicollinearity (when one covariate is highly correlated with, or even an exact linear combination of, other covariates) would make us suspicious of this. Let’s keep that in mind just in case it comes back to haunt us.
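
To see this in action, here is a quick sketch (again with invented data) showing that an exactly redundant column makes $X'X$ rank deficient, while a nearly redundant one leaves it technically invertible but numerically shaky.

```python
# A quick check with invented data: a redundant column makes X'X rank deficient.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
X_ok = np.column_stack([np.ones(n), x1])
X_bad = np.column_stack([np.ones(n), x1, 2 * x1])  # third column repeats the second

print(np.linalg.matrix_rank(X_ok.T @ X_ok))    # 2: full rank, the inverse exists
print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # 2, but the matrix is 3 x 3: not invertible

# Near-duplicates (multicollinearity) don't break the inverse outright,
# but they make it numerically unstable; the condition number balloons.
x2 = x1 + rng.normal(0, 1e-6, size=n)
X_near = np.column_stack([np.ones(n), x1, x2])
print(np.linalg.cond(X_near.T @ X_near))       # a huge condition number
```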

Now that we have our estimator for $\beta$, there are two properties we want to check on.

  1. We want to get the “right” answer on average.
  2. We want to be able to describe the uncertainty in our estimate. In other words, how much would our estimate wiggle around if we happened to have a new sample?

Let’s start with determining if we “expect” to get the right answer on average. It will help to rewrite $Y$ based on the proposed model.

We want the expected value of beta-hat conditional on X to be equal to beta. Beta-hat can be rewritten as the inverse of parenthesis X transpose multiplied by X close parenthesis times X transpose times Y. This is equal to the inverse of parenthesis X transpose multiplied by X close parenthesis times X transpose times parenthesis X times beta plus epsilon close parenthesis. This breaks down into the inverse of parenthesis X transpose multiplied by X close parenthesis times X transpose times X times beta plus the inverse of parenthesis X transpose multiplied by X close parenthesis times X transpose times epsilon. The first part of the first term in the sum, the inverse of parenthesis X transpose multiplied by X close parenthesis times X transpose times X, cancels leaving us with beta. Now we just need the inverse of parenthesis X transpose multiplied by X close parenthesis times X transpose times epsilon to go away. An annotation tells us it is “expectation time!”.

As we break this expression down, we can see that we are on the right track. However, somehow that second term needs to have expectation zero.

The expected value of beta-hat conditional on X is equal to the expected value of parenthesis beta plus the inverse of parenthesis X transpose X close parenthesis times X transpose times epsilon close parenthesis conditional on X. This is equal to the expectation of beta conditional on X plus the expectation of the inverse of parenthesis X transpose X close parenthesis times X transpose times epsilon conditional on X. Everything but epsilon is fixed and can come out of the expectation. To be left with beta only, we need the expected value of epsilon conditional on X to be zero. Is it?

How can we make this happen? This is where two regression assumptions are born. First we need the errors, $\epsilon$, to be independent of $X$. This seems plausible. If the errors depend on $X$, somehow we still have some information leftover that is not accounted for in the model. One common way for the errors to depend on $X$ is through their spread, which is heteroscedasticity (in plainer terms, non-constant variance). That sets our OLS assumption alarm bells ringing. We also need to make an assumption about the center of the errors themselves. It would be nice if they were not systematically positive or negative (that doesn’t seem like a very good model), so assuming they are on average zero seems like a reasonable path forward.

Does the expected value of epsilon conditional on X equal zero? Epsilon being independent of X implies that the expected value of epsilon conditional on X is equal to the expected value of epsilon. Then if the expected value of epsilon is assumed to be zero, the expected value of beta-hat conditional on X equals beta as desired.

With these assumptions in hand, we now have an estimator that is unbiased. We expect to get the right answer on average. So far we have seen the motivation for two assumptions. If either of those is broken, this unbiasedness is not guaranteed. So if our goal is to make predictions or interpret these coefficients in context, we will be out of luck if these assumptions aren’t met. The next step is to understand the estimator’s uncertainty. Let’s see what other assumptions reveal themselves in the process.

The covariance of beta-hat conditional on X is equal to the inverse of parenthesis X transpose times X close parenthesis times X transpose times the covariance of parenthesis epsilon conditional on X close parenthesis times X times the inverse of parenthesis X transpose times X close parenthesis. If we could unite X transpose and X we could cancel one of their combined inverses. If the covariance of parenthesis epsilon conditional on X close parenthesis were a constant times the identity matrix, we could shuffle it out of the way. Hence if we assume the epsilon subscript i values are independent (meaning we don’t have to worry about covariances) and identically distributed (meaning the variance is the same for each, call it sigma squared), we are left with sigma squared times the inverse of parenthesis X transpose times X close parenthesis. A stick figure laments “Ah! So many Xs!”.
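
Both claims, unbiasedness and the form of the covariance, can be checked numerically. Here is a small simulation sketch of my own (the sample size, coefficients, and noise level are invented) that holds $X$ fixed, redraws the errors many times, and compares the simulated mean and covariance of $\hat{\beta}$ to $\beta$ and $\sigma^2 (X'X)^{-1}$.

```python
# A simulation sketch with invented numbers: hold X fixed, redraw the errors many times,
# and compare the simulated mean and covariance of beta-hat to beta and sigma^2 (X'X)^{-1}.
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 200, 1.5
beta = np.array([2.0, 0.5])
X = np.column_stack([np.ones(n), rng.uniform(10, 35, size=n)])
XtX_inv = np.linalg.inv(X.T @ X)

estimates = []
for _ in range(5000):
    eps = rng.normal(0, sigma, size=n)              # iid, mean zero, variance sigma^2
    y = X @ beta + eps
    estimates.append(XtX_inv @ X.T @ y)
estimates = np.array(estimates)

print("mean of estimates:", estimates.mean(axis=0))     # close to beta (unbiased)
print("simulated covariance:\n", np.cov(estimates.T))   # close to...
print("sigma^2 (X'X)^{-1}:\n", sigma**2 * XtX_inv)       # ...the theoretical covariance
```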

So let’s recap what we’ve had to assume about the errors so far.

Epsilon subscript i is drawn iid from some distribution with mean zero and variance equal to sigma squared. Notably no normal distribution in sight! A cartoon normal distribution asks “am I needed yet?”.

What haven’t we had to assume yet? Normality! Then why is everyone always worried about that when doing regression? Why all of those quantile-quantile plots?! Well, we do need to know the whole sampling distribution of the estimates if we want to do inference, that is, to say something about a more general population based on a sample. If we assume that those errors actually follow a normal distribution, then the sampling distribution of $\hat{\beta}$ is also normal.

A cartoon normal distribution says “I feel left out”. Epsilon subscript i is drawn iid from a normal distribution with mean zero and variance equal to sigma squared implies that beta-hat is drawn from a Normal distribution with mean beta and variance equal to sigma squared multiplied by the inverse of parenthesis X transpose times X close parenthesis. A confidence interval around beta-hat subscript 1 and a distribution centered at beta with an area to the right of beta-hat subscript 1 shaded in is shown.  

Ah, but there is that pesky inverse again! If we have multicollinearity, that inverse might get unstable, affecting our understanding of the spread of the sampling distribution. And of course nothing in life is perfectly normal. However, we can often get an approximately normal sampling distribution if our sample size is large enough thanks to the Central Limit Theorem (an explainer for another time), so there is some robustness to breaking this particular assumption.
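
To get a feel for how forgiving this particular assumption is, here is a small sketch of my own (again with invented settings) where the errors are strongly skewed, yet the simulated sampling distribution of the slope still behaves roughly like a normal distribution.

```python
# A sketch with skewed (exponential) errors, centered to have mean zero:
# the sampling distribution of the slope still looks roughly normal for moderate n.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 5000
beta = np.array([2.0, 0.5])
X = np.column_stack([np.ones(n), rng.uniform(10, 35, size=n)])
XtX_inv = np.linalg.inv(X.T @ X)

slopes = []
for _ in range(reps):
    eps = rng.exponential(scale=1.0, size=n) - 1.0   # skewed, but mean zero, variance 1
    y = X @ beta + eps
    slopes.append((XtX_inv @ X.T @ y)[1])
slopes = np.array(slopes)

z = (slopes - slopes.mean()) / slopes.std()
# For a normal distribution, about 95% of standardized values fall within +/- 1.96.
print("fraction within 1.96 sd:", np.mean(np.abs(z) < 1.96))
```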

We’ve now seen all of the regression assumptions unfold, so we should be able to build a sort of hierarchy of assumptions. 

The hierarchy of modeling assumptions. The most important components are about unbiasedness, namely the mean response is linearly related to X, the errors are independent of X and the errors have mean zero. Next are the assumptions that help us estimate the standard error, namely the errors are independent of one another and the errors have constant variance. Least important are the assumptions that give us the sampling distribution of beta-hat, namely that the errors are normally distributed.

 

This hierarchy is also borne out in simulation studies like this one that try to stress test OLS. You may even want to code up a simple simulation yourself to look for any gaps between the theory and the practice.
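
As one possible starting point, here is a stress-test sketch of my own (settings invented for illustration): it checks how often a nominal 95% confidence interval for the slope actually covers the true value when the assumptions hold, when the errors are skewed but otherwise well behaved, and when the variance is strongly non-constant.

```python
# A stress-test sketch with invented settings: coverage of a nominal 95% CI for the slope
# under (a) iid normal errors, (b) skewed but iid errors, (c) strongly non-constant variance.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 100, 4000
beta = np.array([2.0, 0.5])
x = rng.uniform(10, 35, size=n)
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

def coverage(draw_errors):
    hits = 0
    for _ in range(reps):
        y = X @ beta + draw_errors()
        b = XtX_inv @ X.T @ y
        resid = y - X @ b
        sigma2_hat = resid @ resid / (n - 2)          # usual variance estimate
        se = np.sqrt(sigma2_hat * XtX_inv[1, 1])      # standard error of the slope
        if abs(b[1] - beta[1]) < 1.96 * se:
            hits += 1
    return hits / reps

print("normal errors:        ", coverage(lambda: rng.normal(0, 1, size=n)))
print("skewed errors:        ", coverage(lambda: rng.exponential(1, size=n) - 1))
print("non-constant variance:", coverage(lambda: rng.normal(0, 0.1 + 0.2 * x)))
# The third case typically drifts away from the nominal 95%, echoing the hierarchy above.
```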

Some broken assumptions are fixable though. We might transform variables to deal with a lack of linearity or use a Generalized Linear Model that allows the relationship between $X$ and $Y$ to take a non-linear form. We might just be happy with estimating the best linear projection of the true relationship. If errors aren’t independent and/or do not have constant variance, methods like generalized least squares (GLS) can step in. If errors are not normally distributed, they might be approximately normally distributed thanks to the Central Limit Theorem, or we might want to use a Generalized Linear Model (GLM) that can handle error distributions of other forms. We always have options!
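
If you want to try some of those remedies in code, one possible route is sketched below using the statsmodels library (my choice, not something prescribed in this post; the data and weights are invented). Weighted least squares is the simplest special case of GLS, and a Poisson GLM stands in for the broader GLM family.

```python
# A sketch of the remedies via statsmodels (an assumption on my part; the post
# doesn't prescribe a library). Data are invented for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(10, 35, size=n)
X = sm.add_constant(x)                                   # intercept + predictor

# Non-constant variance: weighted least squares with weights 1 / variance.
y_het = X @ np.array([2.0, 0.5]) + rng.normal(0, 0.1 + 0.2 * x)
wls_fit = sm.WLS(y_het, X, weights=1.0 / (0.1 + 0.2 * x) ** 2).fit()

# A non-linear mean and non-normal errors: a generalized linear model (here Poisson).
y_counts = rng.poisson(np.exp(0.1 + 0.05 * x))
glm_fit = sm.GLM(y_counts, X, family=sm.families.Poisson()).fit()

print(wls_fit.params)
print(glm_fit.params)
```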

It’s great to see the theory behind the scenes informing the practice, and in general, the assumptions of any method have to come from somewhere. Seeking out where in the math each assumption makes our lives easier helps demystify where these assumptions actually come from and gives some insight into which ones, if broken, are more dangerous than others. Happy line fitting!

All the Math in One Place

$Y = X \beta + \epsilon$

$\hat{\beta} = (X'X)^{-1}X'Y$

$E[\hat{\beta} | X] = \beta$

$\hat{\beta} = (X'X)^{-1} X' Y = (X'X)^{-1} X' (X\beta + \epsilon)$

$= (X'X)^{-1} X'X\beta + (X'X)^{-1} X'\epsilon$

$= \beta + (X'X)^{-1}X' \epsilon$

$E[\hat{\beta} | X] = E[(\beta + (X'X)^{-1} X' \epsilon) | X]$

$= E[\beta | X] + E[(X'X)^{-1}X' \epsilon | X]$

$= \beta + (X'X)^{-1} X' E[\epsilon | X]$

$E[\epsilon | X] = 0?$

$Cov(\hat{\beta} | X) = (X'X)^{-1} X' Cov(\epsilon | X) X (X'X)^{-1} = \sigma^2 (X'X)^{-1}$

$\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$

$\hat{\beta} \sim N(\beta, \sigma^2 (X'X)^{-1})$

References

  • Statistical Models: Theory and Practice by David A. Freedman, Cambridge University Press (2009)
  • Mostly Harmless Econometrics: An Empiricist’s Companion by Joshua D. Angrist and Jörn-Steffen Pischke, Princeton University Press (2009)

Special thanks to my students in my Stat 2 class this semester for raising good questions about these regression assumptions and to my colleagues for helping me work through this hierarchy before reporting back. Also, thanks to Features Writer Courtney R. Gibbons for including drawings in her last post. That inspired me to do some doodling myself.

Note: The website https://drawdata.xyz/ made it possible to generate data to make the plots in the opening image.
