Wait, didn't we already talk about MSE in the Optimization section? Why do we need "Metrics" separately?
The Compass (the Cost Function): Used during training. It must be differentiable (smooth) so Gradient Descent can follow its slope.
"Which direction should the weights move?"
The Scoreboard (the Metric): Used after training. It must be interpretable (human-readable) so we can judge the final model.
"Is this model good enough to deploy?"
The simplest way to measure error: calculate the average of the absolute differences between predictions and actual values.
Interpretation: "On average, our prediction is off by X units." (e.g., $500 off in house price). It is robust to outliers.
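A minimal sketch in plain Python, using made-up house prices (the numbers are illustrative, not from the text):

```python
# Mean Absolute Error: average of |actual - predicted|.
# Hypothetical house prices, in dollars.
actual    = [300_000, 450_000, 250_000, 500_000]
predicted = [310_000, 440_000, 260_000, 480_000]

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(mae)  # 12500.0 -> "on average, we are off by $12,500"
```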
Calculates the average of the squared errors.
Interpretation: Hard to read directly (e.g., "Error is 250,000 dollars squared"). However, it punishes large errors much harder than small ones.
The square root of the MSE.
Interpretation: Combines the benefits of MSE (punishing outliers) with the readability of MAE (back to original units like "Dollars").
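Continuing the same hypothetical house-price numbers, a quick sketch of MSE and RMSE. Note that RMSE comes out slightly higher than the MAE above (~13,229 vs. 12,500) precisely because the one $20,000 miss gets punished harder:

```python
import math

# Hypothetical house prices, in dollars.
actual    = [300_000, 450_000, 250_000, 500_000]
predicted = [310_000, 440_000, 260_000, 480_000]

n = len(actual)
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
rmse = math.sqrt(mse)  # square root brings us back to dollars
print(mse)   # 175000000.0 ("dollars squared" -- hard to interpret)
print(rmse)  # ~13228.76 (readable again)
```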
The "Coefficient of Determination." It measures what percentage of the data's variance is explained by the model.
1. What is a "Residual"?
A residual is simply the "Leftover" error for a single data point. It is the distance between the Actual Value ($Y$) and the Predicted Value ($Y'$).
$$ \text{Residual} = Y_{actual} - Y'_{predicted} $$
2. SSR (Sum of Squared Residuals)
We square every residual (to remove negatives) and add them all up. This represents the Total Error our model made.
$$ SSR = \sum ( Y_{actual} - Y'_{predicted} )^2 $$
3. SST (Total Sum of Squares)
This is the "Baseline" error. Imagine we had NO model and just guessed the Mean (Average) for every single point. SST is the total error of that "dumb" guessing strategy.
$$ SST = \sum ( Y_{actual} - Y_{mean} )^2 $$
Intuition: $R^2 = 1 - \frac{SSR}{SST}$ compares our model's error (SSR) against the baseline error (SST). If our model is perfect, SSR is 0, so $R^2 = 1 - 0 = 1$. If our model is no better than guessing the mean, SSR equals SST and $R^2 = 0$.
The "Fair Judge." It adjusts the $R^2$ based on the number of features in the model. It penalizes you for adding features that don't help.
$$ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} $$
(where $n$ = number of samples, $k$ = number of features)
To understand why $R^2$ is "greedy," we have to look at how the computer calculates the weights ($\theta$).
1. The Optimizer's Job:
The computer's only goal is to make SSR (Sum of Squared Residuals) as small as possible. It is a "minimization machine."
2. More "Knobs" = More Chances to Cheat:
Every time you add a new feature (even a useless one like "Random Noise"), you are giving the computer a new knob ($\theta_{new}$) to turn.
3. Optimization by Luck:
Even if a feature has zero relationship with the answer, the computer will find a tiny, tiny correlation just by pure luck/chance.
By setting $\theta_{new}$ to a very small number, the computer can shave off a tiny bit of the SSR. It never has a reason to set $\theta$ to exactly zero (unless you use Lasso), so the error always goes down slightly.
Analogy: Imagine a student taking a multiple-choice test. If you give them 100 extra "lucky guesses," their score will likely go up by 1 or 2 points just by chance, even if they didn't study more. $R^2$ records that higher score, but Adjusted $R^2$ "subtracts" the points they got by luck.
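We can watch this "luck" happen. The sketch below (my own illustration, using numpy's least-squares solver as the "minimization machine") fits a model, then adds one column of pure random noise and fits again. Plain $R^2$ can never drop when a feature is added, because the optimizer can always find some tiny $\theta_{new}$ that shaves a little off the SSR; Adjusted $R^2$ charges a price for the extra knob:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=(n, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.5, size=n)  # one real signal + noise

def r2_and_adjusted(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])     # add intercept column
    theta, *_ = np.linalg.lstsq(X1, y, rcond=None) # minimize SSR
    ssr = np.sum((y - X1 @ theta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ssr / sst
    k = X.shape[1]
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj

r2_a, adj_a = r2_and_adjusted(x, y)

# Add a completely useless feature: pure random noise.
noisy = np.column_stack([x, rng.normal(size=(n, 1))])
r2_b, adj_b = r2_and_adjusted(noisy, y)

print(r2_b >= r2_a)  # True: plain R^2 never goes down
print(adj_a, adj_b)  # Adjusted R^2 only rises if the feature earns its keep
```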
You might wonder: "If MAE is easier for humans to understand, why don't we use it as the Cost Function for Gradient Descent?"
For Gradient Descent to work, the function must be differentiable. This means it must have a defined slope (gradient) at every single point.
MSE (a smooth parabola): The slope changes gently and shrinks toward zero near the minimum. Gradient Descent can "slide" down perfectly.
MAE (a V-shape): At the very bottom (zero error), the slope is undefined, and everywhere else it is a constant $\pm 1$. GD can get "stuck" or bounce back and forth forever.
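For a single data point, the two gradients look like this (hand-derived from standard calculus; `mse_grad` and `mae_grad` are names I made up for the sketch):

```python
# Gradient of each loss with respect to the prediction y_pred,
# for one data point with actual value y.
def mse_grad(y, y_pred):
    # d/dy_pred of (y - y_pred)^2 = 2 * (y_pred - y)
    return 2 * (y_pred - y)        # shrinks smoothly to 0 near the minimum

def mae_grad(y, y_pred):
    # d/dy_pred of |y - y_pred| is a constant +/-1 ...
    if y_pred > y:
        return 1.0
    if y_pred < y:
        return -1.0
    # ... and undefined exactly at the bottom of the "V".
    raise ValueError("slope undefined at y_pred == y")

print(mse_grad(5.0, 5.001))  # tiny gradient: GD slows down gracefully
print(mae_grad(5.0, 5.001))  # full-size gradient: GD overshoots and bounces
```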
| Metric | Use as Cost? | Reason |
|---|---|---|
| MSE | Yes (Default) | Mathematically smooth and very stable. |
| RMSE | Rarely | Minimizing RMSE finds the same minimum as minimizing MSE, but MSE's gradient is simpler to compute. |
| MAE | Sometimes | Used in "Robust Regression," but its undefined gradient at zero makes optimization trickier. |
Since RMSE is just $\sqrt{MSE}$, and the square root is a monotonic function, the point where the error is lowest is the same for both. So why do we prefer MSE for training?
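A quick way to convince yourself of the "same minimum" claim: scan one weight $w$ for a toy one-parameter model $y' = w \cdot x$ and check where each error bottoms out (the data is invented for illustration):

```python
import math

# Toy data roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def mse(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Grid-search w from 1.00 to 3.00 in steps of 0.01.
ws = [i / 100 for i in range(100, 301)]
best_mse  = min(ws, key=mse)
best_rmse = min(ws, key=lambda w: math.sqrt(mse(w)))

print(best_mse == best_rmse)  # True: sqrt is monotonic, same minimizer
```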