Linear Regression Assumptions

Ali Mahzoon
5 min read · Jun 26, 2021

For simplicity, I decided to base most of my examples on a simple regression model (one independent variable and the target variable). However, they also apply to multiple linear regression models, and can indeed be extended to other forms of general linear models with a single target variable, such as ANOVA, ANCOVA, and independent-samples t-tests.

To build a better model in this regard, consistency and efficiency play a vital role. Throughout this article, consider our estimation method to be Ordinary Least Squares (OLS), as is usually the case.

Central Limit Theorem

Consistency: when we estimate a parameter from sample data, a consistent estimator tends to get closer to the true population value as the sample size grows larger. Relatedly, we call an estimator unbiased if the mean of its sampling distribution equals the population parameter.

Efficiency: refers to how precise our estimates are; in other words, the smaller the variance of our estimates around the true value, the more efficient our estimator is.

Under repeated sampling, the sampling distribution of our estimates tends to be normal (this is the central limit theorem), which is what allows us to calculate confidence intervals and p-values (significance tests). These calculations are most trustworthy when the model errors are normally distributed.
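
Below is a minimal simulation sketch of these ideas (pure NumPy, with made-up data and a made-up true slope): the OLS slope estimates from repeated samples are centred on the true value (unbiasedness), their spread shrinks as the sample size grows (consistency), and their distribution is approximately normal.

```python
# Simulation sketch: repeated sampling and the behaviour of the OLS slope estimate.
import numpy as np

rng = np.random.default_rng(0)
true_intercept, true_slope = 2.0, 0.5   # hypothetical true parameters

def slope_estimates(n, reps=2000):
    """Fit a simple OLS line to `reps` simulated samples of size `n`."""
    estimates = np.empty(reps)
    for i in range(reps):
        x = rng.uniform(0, 10, size=n)
        y = true_intercept + true_slope * x + rng.normal(0, 1, size=n)
        estimates[i] = np.polyfit(x, y, deg=1)[0]   # fitted slope
    return estimates

for n in (20, 200, 2000):
    est = slope_estimates(n)
    print(f"n={n:5d}  mean={est.mean():.3f}  sd={est.std():.3f}")
# The mean stays near 0.5 while the spread shrinks with n, and a histogram
# of the estimates looks approximately normal.
```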

The Normality Assumption

We often spend a lot of time checking normality of the raw variables, but the assumption is really about the errors. When the normality assumption is satisfied, we may assume that the errors are normally distributed for any combination of values of the predictor variables.

We can say that regression is highly robust to violations of the normally distributed errors assumption, because even if the errors in our data are not normally distributed, the sampling distribution of the coefficients will approach a normal distribution as the sample size grows larger.

The linearity assumption

If we consider our linear regression model as Y = B0 + B1X1 + B2X2 + … + BnXn, our target variable Y is supposed to have a linear relationship with the predictors (X1, X2, …, Xn); this means that the target variable is assumed to be a linear function of the coefficients (B0, B1, …, Bn), but not necessarily a linear function of the predictor variables. For instance, predictors can be (X^2, X^3, …) and we can still say that we have a linear model.
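
A brief sketch of this point, with simulated data: the model below includes an x² term, yet it is still a linear model because the target is a linear combination of the coefficients.

```python
# Linear in the coefficients, non-linear in the predictor: y = b0 + b1*x + b2*x^2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.5, size=200)

# Design matrix with an intercept, x, and x^2 columns; OLS still applies
# because y is modelled as a linear combination of these columns.
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # roughly [1.0, 2.0, -0.5]
```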

Assumptions about the model errors

The following four assumptions apply to the errors rather than to the target or independent variables; consequently, it is not possible to investigate these assumptions without estimating the actual regression model.

It is a common misconception that assumption checking can and should be fully completed before model estimation. Assumption checking should be an ongoing process throughout any data analysis.

1. Zero conditional means of errors

The errors are assumed to have a mean of zero for any combination of values of the predictor variables. If this assumption is violated, the regression coefficients may be biased. A common cause is unmodeled non-linearity: for example, the model specifies a linear relationship between the predictor and the response while the true relationship is non-linear.
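
As a sketch of how this violation might be spotted (using simulated data and a deliberately misspecified straight-line fit), a residuals-versus-fitted plot with a systematic curve indicates that the conditional mean of the errors is not zero:

```python
# Residuals-vs-fitted plot exposing unmodelled non-linearity.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 0.2 * x**2 + rng.normal(0, 1, size=300)   # true relationship is quadratic

slope, intercept = np.polyfit(x, y, deg=1)           # misspecified straight-line fit
fitted = intercept + slope * x
resid = y - fitted

plt.scatter(fitted, resid, s=10)
plt.axhline(0, color="red")
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()   # a U-shaped pattern instead of a flat band signals the violation
```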

2. Independence of errors

The errors are assumed to be independent. Violating this assumption results in biased estimates of standard errors and significance tests, although the estimates of the regression coefficients themselves remain unbiased, yet inefficient. Earlier we assumed that our data are sampled randomly and form a normal distribution. The use of cluster rather than random sampling can result in dependence of errors. For this reason, the analysis of nested data may require the use of a multilevel model.
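
One common diagnostic for serially correlated errors is the Durbin-Watson statistic. A sketch follows, assuming the statsmodels package is available; the data are simulated with autocorrelated (AR(1)) errors so the statistic falls well below 2:

```python
# Durbin-Watson check for autocorrelated errors (simulated data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 200
x = np.arange(n, dtype=float)

# Build AR(1) errors so consecutive residuals are correlated.
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal(0, 1)
y = 1.0 + 0.05 * x + e

model = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(model.resid))   # ~2 means no autocorrelation; well below 2 here
```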

3. Homoscedasticity of errors

The variance of our residuals (model errors) must be constant across all levels of the predictor variables. This assumption is also known as the homogeneity of variance assumption. If the residuals have a variance that is finite but not constant across different levels of the predictors, heteroscedasticity is present. OLS estimates will be unbiased and consistent as long as the errors are independent, but will not be efficient.
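
A sketch of one formal check, the Breusch-Pagan test (assuming statsmodels is installed); the simulated data below have an error spread that grows with the predictor, so the test should reject constant variance:

```python
# Breusch-Pagan test for heteroscedasticity (simulated data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=300)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x)   # error spread increases with x

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")   # small p-value -> heteroscedasticity
```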

4. Normal distribution of errors

This assumption is required for trustworthy significance tests and confidence intervals in small samples; in other words, the larger the sample, the less important this assumption becomes. A normal Q-Q plot may be useful to check the normality of the distribution of errors.
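
A minimal sketch of such a check, assuming scipy and matplotlib are available; the data and model are simulated:

```python
# Q-Q plot of residuals against a normal distribution.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=200)

slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)

stats.probplot(resid, dist="norm", plot=plt)   # points near the line suggest normal errors
plt.show()
```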

Other potential problems

Two important potential problems are often described in conjunction with discussions of the assumptions of linear regression: Multicollinearity and outliers.

1. Multicollinearity

The presence of correlations between the predictors is termed collinearity (for a relationship between two predictor variables) or multicollinearity (for relationships between more than two predictors). If there is a perfect correlation between two or more predictors, no unique least-squares solution to a regression analysis can be computed. Less severe multicollinearity can still lead to unstable estimates of the coefficients for individual predictors. The variance inflation factor is one popular measure of multicollinearity. Appropriate responses to multicollinearity may include the use of an alternative estimation method such as ridge regression or principal components regression. Removing some of the highly correlated predictors may also be considered, but this solution is usually not ideal.
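
As a sketch, variance inflation factors can be computed with statsmodels (assumed available here); the two simulated predictors below are nearly collinear, so their VIFs come out very large:

```python
# Variance inflation factors for nearly collinear predictors (simulated data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(6)
n = 300
x1 = rng.normal(0, 1, size=n)
x2 = x1 + rng.normal(0, 0.1, size=n)        # nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))

for i, name in enumerate(["const", "x1", "x2"]):
    print(name, variance_inflation_factor(X, i))
# VIFs well above ~5-10 for x1 and x2 indicate problematic multicollinearity.
```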

2. Outliers

In some cases, the results of regression analysis may be strongly influenced by individual members of the sample that have highly unusual values on one or more variables under analysis, or a highly unusual combination of values. This is not necessarily a problem in itself, nor necessarily a justification for excluding such cases. If outliers are excluded, it may be useful to present results both with and without the exclusions.
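
One common way to flag influential cases is Cook's distance. The sketch below (assuming statsmodels; data simulated, with one artificial outlier injected) illustrates the idea:

```python
# Flagging influential observations with Cook's distance (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)
y[10] += 15          # inject a highly unusual response value

results = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = results.get_influence().cooks_distance
print(np.argsort(cooks_d)[-3:])   # indices of the most influential cases (10 should appear)
# Rather than silently dropping such cases, report results with and without them.
```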
