Wait, didn't we already talk about MSE in the Optimization section? Why do we need "Metrics" separately?
The Compass (the Cost Function): Used during training. It must be differentiable (smooth) so Gradient Descent can follow its slope.
"Which direction should the weights move?"
The Scoreboard (the Metric): Used after training. It must be interpretable (human-readable) so we can judge the final model.
"Is this model good enough to deploy?"
The simplest way to measure error: calculate the average of the absolute differences between predictions and actual values.
Interpretation: "On average, our prediction is off by X units." (e.g., $500 off in house price). It is robust to outliers.
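A minimal sketch in plain Python, using made-up house prices (the numbers are illustrative, not from the text):

```python
# Mean Absolute Error: average of |actual - predicted|.
# Hypothetical house prices, in dollars.
actual    = [300_000, 450_000, 250_000, 500_000]
predicted = [310_000, 440_000, 260_000, 480_000]

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(mae)  # 12500.0 -> "on average, we are off by $12,500"
```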
Calculates the average of the squared errors.
Interpretation: Hard to read directly (e.g., "Error is 250,000 dollars squared"). However, it punishes large errors much harder than small ones.
The square root of the MSE.
Interpretation: Combines the benefits of MSE (punishing outliers) with the readability of MAE (back to original units like "Dollars").
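Continuing the same hypothetical house-price numbers, a quick sketch of MSE and RMSE. Note that RMSE comes out slightly higher than the MAE above (~13,229 vs. 12,500) precisely because the one $20,000 miss gets punished harder:

```python
import math

# Hypothetical house prices, in dollars.
actual    = [300_000, 450_000, 250_000, 500_000]
predicted = [310_000, 440_000, 260_000, 480_000]

n = len(actual)
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
rmse = math.sqrt(mse)  # square root brings us back to dollars
print(mse)   # 175000000.0 ("dollars squared" -- hard to interpret)
print(rmse)  # ~13228.76 (readable again)
```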
The "Coefficient of Determination." It measures what percentage of the data's variance is explained by the model.
1. What is a "Residual"?
A residual is simply the "Leftover" error for a single data point. It is the distance between the Actual Value ($Y$) and the Predicted Value ($Y'$).
$$ \text{Residual} = Y_{actual} - Y'_{predicted} $$
2. SSR (Sum of Squared Residuals)
We square every residual (to remove negatives) and add them all up. This represents the Total Error our model made.
$$ SSR = \sum ( Y_{actual} - Y'_{predicted} )^2 $$
3. SST (Total Sum of Squares)
This is the "Baseline" error. Imagine we had NO model and just guessed the Mean (Average) for every single point. SST is the total error of that "dumb" guessing strategy.
$$ SST = \sum ( Y_{actual} - Y_{mean} )^2 $$
Intuition: $R^2 = 1 - \frac{SSR}{SST}$ compares our model's error (SSR) against the baseline error (SST). If our model is perfect, SSR is 0, so $R^2 = 1 - 0 = 1$. If our model is no better than guessing the mean, SSR equals SST and $R^2 = 0$.
The "Fair Judge." It adjusts the $R^2$ based on the number of features in the model. It penalizes you for adding features that don't help.
$$ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} $$
(where $n$ = number of samples, $k$ = number of features)
To understand why $R^2$ is "greedy," we have to look at how the computer calculates the weights ($\theta$).
1. The Optimizer's Job:
The computer's only goal is to make SSR (Sum of Squared Residuals) as small as possible. It is a "minimization machine."
2. More "Knobs" = More Chances to Cheat:
Every time you add a new feature (even a useless one like "Random Noise"), you are giving the computer a new knob ($\theta_{new}$) to turn.
3. Optimization by Luck:
Even if a feature has zero relationship with the answer, the computer will find a tiny, tiny correlation just by pure luck/chance.
By setting $\theta_{new}$ to a very small number, the computer can shave off a tiny bit of the SSR. It never has a reason to set $\theta$ to exactly zero (unless you use Lasso), so the error always goes down slightly.
Analogy: Imagine a student taking a multiple-choice test. If you give them 100 extra "lucky guesses," their score will likely go up by 1 or 2 points just by chance, even if they didn't study more. $R^2$ records that higher score, but Adjusted $R^2$ "subtracts" the points they got by luck.
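We can watch this "luck" happen. The sketch below (my own illustration, using numpy's least-squares solver as the "minimization machine") fits a model, then adds one column of pure random noise and fits again. Plain $R^2$ can never drop when a feature is added, because the optimizer can always find some tiny $\theta_{new}$ that shaves a little off the SSR; Adjusted $R^2$ charges a price for the extra knob:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=(n, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.5, size=n)  # one real signal + noise

def r2_and_adjusted(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])     # add intercept column
    theta, *_ = np.linalg.lstsq(X1, y, rcond=None) # minimize SSR
    ssr = np.sum((y - X1 @ theta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ssr / sst
    k = X.shape[1]
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj

r2_a, adj_a = r2_and_adjusted(x, y)

# Add a completely useless feature: pure random noise.
noisy = np.column_stack([x, rng.normal(size=(n, 1))])
r2_b, adj_b = r2_and_adjusted(noisy, y)

print(r2_b >= r2_a)  # True: plain R^2 never goes down
print(adj_a, adj_b)  # Adjusted R^2 only rises if the feature earns its keep
```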
You might wonder: "If MAE is easier for humans to understand, why don't we use it as the Cost Function for Gradient Descent?"
For Gradient Descent to work, the function must be differentiable. This means it must have a defined slope (gradient) at every single point.
MSE (a smooth parabola): The slope changes gently and shrinks toward zero near the minimum. Gradient Descent can "slide" down perfectly.
MAE (a V-shape): At the very bottom (zero error), the slope is undefined, and everywhere else it is a constant $\pm 1$. GD can get "stuck" or bounce back and forth forever.
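For a single data point, the two gradients look like this (hand-derived from standard calculus; `mse_grad` and `mae_grad` are names I made up for the sketch):

```python
# Gradient of each loss with respect to the prediction y_pred,
# for one data point with actual value y.
def mse_grad(y, y_pred):
    # d/dy_pred of (y - y_pred)^2 = 2 * (y_pred - y)
    return 2 * (y_pred - y)        # shrinks smoothly to 0 near the minimum

def mae_grad(y, y_pred):
    # d/dy_pred of |y - y_pred| is a constant +/-1 ...
    if y_pred > y:
        return 1.0
    if y_pred < y:
        return -1.0
    # ... and undefined exactly at the bottom of the "V".
    raise ValueError("slope undefined at y_pred == y")

print(mse_grad(5.0, 5.001))  # tiny gradient: GD slows down gracefully
print(mae_grad(5.0, 5.001))  # full-size gradient: GD overshoots and bounces
```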
| Metric | Use as Cost? | Reason |
|---|---|---|
| MSE | Yes (Default) | Mathematically smooth and very stable. |
| RMSE | Rarely | Minimizing RMSE finds the same minimum as minimizing MSE, but MSE's gradient is simpler to compute. |
| MAE | Sometimes | Used in "Robust Regression," but its undefined gradient at zero makes optimization trickier. |
Since RMSE is just $\sqrt{MSE}$, and the square root is a monotonic function, the point where the error is lowest is the same for both. So why do we prefer MSE for training?
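A quick way to convince yourself of the "same minimum" claim: scan one weight $w$ for a toy one-parameter model $y' = w \cdot x$ and check where each error bottoms out (the data is invented for illustration):

```python
import math

# Toy data roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def mse(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Grid-search w from 1.00 to 3.00 in steps of 0.01.
ws = [i / 100 for i in range(100, 301)]
best_mse  = min(ws, key=mse)
best_rmse = min(ws, key=lambda w: math.sqrt(mse(w)))

print(best_mse == best_rmse)  # True: sqrt is monotonic, same minimizer
```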