
Assumptions of Linear Regression

Understanding:

Linear Regression isn't just about drawing a line. For the model to be statistically valid and trustworthy, the data must follow a set of "Rules" (Assumptions). If these are broken, your predictions or p-values might be garbage.

1. Linear Relationship

The Rule: The relationship between the independent variable (X) and the dependent variable (Y) must be linear. Ideally, the data points should fall roughly along a straight line.
The Violation: The data shows a curve (parabola, exponential, etc.). If you fit a straight line to curved data, your predictions will be systematically wrong.
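A quick way to see this "systematically wrong" behavior is to force a straight line onto curved data and inspect the residuals. A minimal sketch (the quadratic data here is purely illustrative, not from the text):

```python
import numpy as np

# Illustrative data: a purely quadratic relationship, y = x^2.
x = np.linspace(-1, 1, 101)
y = x ** 2

# Force a straight-line fit (degree-1 polynomial) onto the curved data.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# The residuals are not random noise: they trace the curve the line
# missed -- positive at both ends, negative in the middle.
print(residuals[0], residuals[50], residuals[-1])
```

If the linearity assumption held, the residuals would look like a patternless cloud instead of this systematic U-shape.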

2. No Autocorrelation of Errors

The Rule: The errors (residuals) should be independent of each other. Knowing one error shouldn't help you predict the next one.
The Violation: Common in time-series data. If yesterday's error was positive, and today's is also likely positive, you have autocorrelation.
Check: Use the Durbin-Watson Test (values between 1.5 and 2.5 are generally acceptable; a value near 2 indicates no autocorrelation).

3. Homoscedasticity (Equal Variance)

The Rule: The error (noise) should be consistent across all values of X.
The Violation (Heteroscedasticity): The error spreads out like a funnel (e.g., predicting house prices becomes harder/noisier for expensive houses).

[Figure: data plot (X vs Y) next to a residual plot (predicted vs error), contrasting a uniform spread (homoscedastic, pass) with a funnel shape (bad).]

How to diagnose?

Look at the Residual Plot on the right.
• If it looks like a random cloud of points with no pattern → Good.
• If it looks like a cone or fan (getting wider to the right) → Bad.
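The "funnel" diagnosis can also be done numerically: if the residual spread grows with the predicted value, the data is heteroscedastic. A sketch, assuming made-up house-price-style data where the noise grows with X:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: noise proportional to x, i.e. a funnel shape.
x = np.linspace(1, 10, 400)
y = 3 * x + rng.normal(scale=0.5 * x)  # error spread grows with x

# Fit the line, then compare residual spread in the low-x and high-x halves.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

low_spread = residuals[:200].std()
high_spread = residuals[200:].std()
print(low_spread, high_spread)  # high-x residuals are noticeably wider
```

A formal alternative is the Breusch-Pagan test, but eyeballing (or binning) the residual spread like this catches most funnels.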

4. Normality of Residuals

The Rule: The errors (residuals) should follow a Normal (Bell Curve) distribution centered at zero. This means most of the error values should be close to zero, with fewer and fewer errors as we move further away from the mean.

[Figure: histogram of residuals and a Q-Q plot (Quantile-Quantile) of residuals.]

The Q-Q Plot

This is the pro way to check normality.
Dots on the Red Line: The data closely fits the theoretical normal distribution.
Dots curling away: The data is skewed or has heavy tails (outliers).
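You can quantify the Q-Q check without plotting: scipy's `probplot` returns, along with the Q-Q points, the correlation `r` of the straight-line fit, which is close to 1 when the residuals are normal. A sketch with illustrative data (a heavy-tailed t-distribution stands in for "bad" residuals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

normal_resid = rng.normal(size=1000)           # well-behaved residuals
heavy_resid = rng.standard_t(df=2, size=1000)  # heavy-tailed residuals

# probplot returns the Q-Q points plus a straight-line fit; the fit's
# correlation r is close to 1 when the sample hugs the normal line.
(_, (_, _, r_normal)) = stats.probplot(normal_resid, dist="norm")
(_, (_, _, r_heavy)) = stats.probplot(heavy_resid, dist="norm")

print(r_normal, r_heavy)  # heavy tails pull r visibly below 1
```

Passing `plot=plt.gca()` to `probplot` draws the actual Q-Q plot with the red reference line described above.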

5. No Multicollinearity

The Rule: Independent variables ($x_1, x_2$) should NOT be too correlated with each other. If they are, the model gets confused about which variable is actually contributing to the prediction.

Detection: We use the VIF (Variance Inflation Factor).
• VIF = 1: No correlation (Perfect).
• VIF > 5-10: High multicollinearity (Problematic).

The Math: How is VIF calculated?

For each independent variable ($X_i$), we calculate its VIF using this formula:

$VIF_i = \frac{1}{1 - R_i^2}$

Where $R_i^2$ is the Coefficient of Determination obtained by regressing $X_i$ against all other independent variables.

  • If $X_i$ is highly predictable using other $X$'s, $R_i^2$ will be close to 1.
  • As $R_i^2 \to 1$, the denominator $(1 - R_i^2) \to 0$, causing the VIF to "explode" to a very high value.
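The formula above translates directly into code: regress each column on the others, take $R_i^2$, and invert. A minimal numpy sketch (the columns here are invented, with one deliberately redundant column to make the VIF explode):

```python
import numpy as np

def vif(X, i):
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing column i
    of X on all the other columns (plus an intercept)."""
    y = X[:, i]
    A = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - np.sum((y - A @ coef) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + x2 + rng.normal(scale=0.05, size=300)  # nearly a copy of x1 + x2
x4 = rng.normal(size=300)                        # unrelated to the rest

X = np.column_stack([x1, x2, x3, x4])
print([vif(X, i) for i in range(4)])
# the first three VIFs explode (they predict each other); x4 stays near 1
```

Note that when one column is a near-copy of others, every column involved in the relationship gets an inflated VIF, not just the redundant one. statsmodels provides the same calculation as `variance_inflation_factor`.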