In Machine Learning, "Training" is essentially a search game. We are looking for the "Magic Numbers" (weights and biases) that allow our math formula to predict the future correctly.
Before we can "optimize" (improve) anything, we need to mathematically define what "bad" looks like. We call this the Cost Function.
Step 1: The Prediction ($Y'$)
We pick some random values for our parameters $\theta$ (weight) and $b$ (bias).
$ Y' = \theta \cdot X + b $
Step 2: The Comparison (Error)
We compare our prediction ($Y'$) against the actual answer ($Y$) to see how far off we are.
$ \text{Error} = Y' - Y $
Step 3: The Cost ($J$)
We cannot just add up the errors (negatives would cancel positives). So we square them and take the average.
This final number—the "Average Squared Error"—is what we call the Cost Function $J(\theta)$.
The Optimization Goal: Changing the weights ($\theta$) to find the specific value that makes $J(\theta)$ as small as possible.
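Steps 1–3 can be sketched in a few lines of NumPy. The toy dataset below is made up for illustration (the true relationship is $Y = 2X + 1$), and `theta`, `b` are deliberately bad starting guesses:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])   # features
Y = np.array([3.0, 5.0, 7.0, 9.0])   # actual answers (here Y = 2X + 1)

theta, b = 0.5, 0.0                  # random-ish starting guesses

Y_pred = theta * X + b               # Step 1: the prediction Y'
error = Y_pred - Y                   # Step 2: how far off we are
J = np.mean(error ** 2)              # Step 3: average squared error, J(theta)

print(J)                             # a single number measuring "bad"
```

Everything that follows is about shrinking that one number, `J`.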
Now that we have defined the problem (the Cost Function), we need a method to solve it. There are two main ways to find the bottom of this valley.
For Linear Regression, there is a "Magic Formula" that calculates the perfect weights instantly, without any guessing. This is the Normal Equation.
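Here is a minimal sketch of the Normal Equation, $\theta = (X^\top X)^{-1} X^\top y$, on the same toy data (the column of ones absorbs the bias term $b$):

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])           # exactly y = 2x + 1

X_b = np.c_[np.ones((len(X), 1)), X]         # prepend a bias column of 1s
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

print(theta_best)                            # recovers approximately [1, 2]: bias, weight
```

No guessing, no iterations; one linear-algebra step. The catch is that inverting $X^\top X$ gets expensive as the number of features grows, which is one reason the iterative approach below exists.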
When we can't solve the formula directly (most ML/Deep Learning models), we use an iterative approach.
(Interactive figure: the curve is the Cost Function (Error); the ball is our model, with its current slope displayed as it descends.)
If we have a model with 2 parameters (e.g., Slope $m$ and Intercept $b$), the Cost Function becomes a 3D Surface (a Bowl).
X & Z Axes: Parameters ($\theta_0, \theta_1$) | Vertical Y Axis: Error ($J$)
What about more parameters?
If we have 100 features, the graph becomes a Hyper-Paraboloid (101 dimensions). We can't draw it, but the math tells us it keeps this same "Bowl" shape.
Because Linear Regression's cost function is always a "Bowl" (Convex), there is only one lowest point (Global Minimum). Gradient Descent is guaranteed to find it!
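To make this concrete, here is a minimal batch Gradient Descent loop for the same 1-feature toy data. The learning rate and step count are illustrative choices, not tuned values; the update rules use the derivatives of the average squared error with respect to $\theta$ and $b$:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.0, 5.0, 7.0, 9.0])           # true relationship: Y = 2X + 1

theta, b = 0.0, 0.0                          # start at the "top of the bowl"
learning_rate = 0.05

for _ in range(5000):
    error = (theta * X + b) - Y              # Y' - Y
    theta -= learning_rate * 2 * np.mean(error * X)  # dJ/dtheta
    b     -= learning_rate * 2 * np.mean(error)      # dJ/db

print(theta, b)                              # rolls down to roughly 2 and 1
```

Because the bowl is convex, this loop ends up at the same answer the Normal Equation computes directly.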
Knowing the math is one thing; making it work in practice is another. Here are the two biggest hurdles you will face.
This is the "Goldilocks" problem.
The Problem: If one feature is tiny (1-5) and another is huge (100k+), the Cost Function becomes a Flattened Bowl (like a long, thin trench).
In a flattened bowl, Gradient Descent gets confused. It bounces wildly against the steep side-walls but makes almost zero progress along the long, flat floor.
(Left: Scaled Features — circular contours allow a direct path. | Right: Unscaled Features — elongated contours cause a zigzag path.)
When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge.
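What `StandardScaler` does under the hood is simple: subtract each column's mean and divide by its standard deviation. Here is a NumPy-only sketch (the two made-up features deliberately live on very different scales; in real projects, prefer Scikit-Learn's `StandardScaler`, fitting it on the training set only):

```python
import numpy as np

X = np.array([[1.0, 100_000.0],     # feature 1: tiny scale
              [3.0, 250_000.0],     # feature 2: huge scale
              [5.0, 400_000.0]])

# Standardize: each column ends up with mean 0 and std 1
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))        # approximately [0, 0]
print(X_scaled.std(axis=0))         # approximately [1, 1]
```

After scaling, the "trench" becomes a round bowl again, and Gradient Descent can head straight for the bottom.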
Gradient Descent is a journey. Every journey needs a starting point. The values we choose for $\theta$ at the very beginning can determine how fast (or if) we find the solution.
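A common starting point is small random weights and a zero bias. This sketch uses NumPy's random generator; the seed, the scale of 0.01, and the 100-feature size are illustrative assumptions, not prescribed values:

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # seeding makes the run reproducible
n_features = 100

theta = rng.normal(loc=0.0, scale=0.01, size=n_features)  # small random weights
b = 0.0                                                   # bias can start at zero
```

For Linear Regression's convex bowl, any starting point eventually reaches the same Global Minimum; the choice mostly affects how many steps the journey takes.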