
Regularization: Shrinking the Weights

When a model has too many features or the weights ($\theta$) become too large, it starts to "memorize" the training data instead of "learning" it. This is Overfitting.

Regularization fixes this by changing the Cost Function to punish large weights.

$$ \text{New Cost} = \text{MSE} + \text{Penalty Term} $$

1. Ridge Regression (L2 Regularization)

Ridge adds a penalty equal to the square of the magnitude of coefficients.

$$ J(\theta) = \text{MSE} + \color{#f472b6}{\alpha \sum_{j=1}^{n} \theta_j^2} $$

The Logic: If a weight ($\theta$) tries to become very large, the Cost Function "explodes." To keep the Cost low, Gradient Descent is forced to keep the weights small and evenly distributed.
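This shrinkage is easiest to see with two nearly identical (collinear) features. A minimal sketch using scikit-learn; the synthetic data and `alpha=1.0` are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Make column 1 nearly identical to column 0 (multicollinearity).
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=100)
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS  :", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
```

Plain OLS is free to assign large, opposite-signed weights to the two correlated columns; Ridge's squared penalty makes that expensive, so it splits the effect roughly evenly between them (about 1.5 each, summing to the true coefficient of 3).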

2. Lasso Regression (L1 Regularization)

Lasso (Least Absolute Shrinkage and Selection Operator) adds a penalty equal to the absolute value of the magnitude of coefficients.

$$ J(\theta) = \text{MSE} + \color{#f472b6}{\alpha \sum_{j=1}^{n} |\theta_j|} $$

The Logic: Because Lasso uses absolute values (the sharp "V" shape), its gradient stays at a constant $\pm\alpha$ rather than fading as a weight approaches zero. This gives it a unique property: it can force weights to become exactly zero, effectively performing feature selection.
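The zeroing effect shows up clearly when most features are noise. A minimal sketch using scikit-learn; the data, the two "true" coefficients, and `alpha=0.1` are all arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only the first two features matter; the other eight are pure noise.
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 2))
print("non-zero features:", np.count_nonzero(lasso.coef_))
```

Typically only the two real features survive with coefficients exactly 0.0 everywhere else; the surviving weights are also shrunk slightly below their true values (roughly 3.9 and -1.9), which is the price Lasso pays for the penalty.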

3. What is $\alpha$ (Alpha)?

$\alpha$ is the Regularization Strength. It controls the trade-off between fitting the data and keeping the weights small: with $\alpha = 0$ the penalty vanishes and you recover plain linear regression, while a very large $\alpha$ shrinks every weight toward zero and the model underfits. In practice, $\alpha$ is a hyperparameter tuned via cross-validation.
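The trade-off can be sketched with a quick sweep over $\alpha$ (Ridge shown here; the data and the particular $\alpha$ values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([5.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=50)

# The norm of the weight vector shrinks as alpha grows.
for alpha in [0.01, 1.0, 100.0]:
    w = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:6.2f}  ||w|| = {np.linalg.norm(w):.2f}")
```

At $\alpha = 100$ the penalty dominates and the weights are pulled well below their true values, which is exactly the underfitting regime described above.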

Use Ridge when...

You have many features that all contribute a little bit to the result. It handles Multicollinearity (correlated features) very well.

Use Lasso when...

You suspect only a few features actually matter. Lasso zeroes out the rest, effectively "cleaning" your data by ignoring the noisy features.