Variants of Gradient Descent
The standard Gradient Descent algorithm works, but how much data should we use to calculate the slope? This choice creates three distinct "flavors" of the algorithm, each with its own mathematical properties and trade-offs.
1. Batch Gradient Descent
The Logic: "I will look at EVERY single exam paper before I calculate the average score."
In Batch GD, we use the entire dataset to calculate the gradient for a single step. We sum up the errors for all $n$ training examples.
The Math
The update rule sums the gradient over all $n$ samples:
$$ \theta_{new} = \theta_{old} - \alpha \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla J(\theta; x^{(i)}, y^{(i)}) $$
Pros & Cons
- Stable, smooth convergence: because every step uses all the data, the path to the minimum is smooth and direct.
- Slow, computationally expensive: with 1 million data points, you must compute the error on all 1 million just to take ONE step.
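As a sketch, here is Batch GD fitting a one-parameter linear model $y = wx$ by minimizing mean squared error. All names (`X`, `y`, `w`, `alpha`) and the toy data are illustrative, not from the text:

```python
import numpy as np

# Hypothetical toy data: 100 points on the line y = 3x
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X                      # true slope is 3

w, alpha = 0.0, 0.5
for step in range(100):
    # Gradient of (1/n) * sum_i (w*x_i - y_i)^2, summed over ALL n samples
    grad = (2.0 / len(X)) * np.sum((w * X - y) * X)
    w -= alpha * grad

print(round(w, 3))  # converges to 3.0 on this noiseless toy data
```

Note that every single update touches all $n$ samples, which is exactly why one step costs $O(n)$.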
2. Stochastic Gradient Descent (SGD)
The Logic: "I will look at ONE exam paper and immediately guess the class average."
In SGD, we update the parameters after looking at just one random training example.
The Sequential Process:
Unlike Batch GD, SGD doesn't wait.
- Pick Sample A $\to$ Calculate Error $\to$ Update $\theta$ to $\theta_{new}$.
- Pick Sample B $\to$ Use the new $\theta$ $\to$ Calculate Error $\to$ Update $\theta$ again.
This "one-by-one" iterative jumping is exactly how SGD navigates the cost landscape.
The Math
The update rule uses only a single sample $(x^{(i)}, y^{(i)})$:
$$ \theta_{new} = \theta_{old} - \alpha \cdot \nabla J(\theta; x^{(i)}, y^{(i)}) $$
Pros & Cons
- Fast, instant updates: you learn from each example immediately. Great for huge datasets or streaming data.
- Noisy, erratic path: the single-sample gradient bounces around wildly. It may never settle exactly at the minimum, hovering around it instead.
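A minimal SGD sketch on the same kind of toy problem (hypothetical names and data, not from the text; a little label noise is added so the "hover around the minimum" behavior is visible):

```python
import numpy as np

# Hypothetical noisy toy data around the line y = 3x
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X + rng.normal(0.0, 0.1, size=100)

w, alpha = 0.0, 0.1
for epoch in range(50):
    for i in rng.permutation(len(X)):          # visit samples in random order
        grad = 2.0 * (w * X[i] - y[i]) * X[i]  # gradient from ONE sample
        w -= alpha * grad                      # update immediately

print(round(w, 2))  # lands near 3, jittering rather than settling exactly
```

Shuffling each epoch is the standard trick to keep the single-sample picks representative of the whole dataset.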
3. Mini-Batch Gradient Descent
The Logic: "I will grab a stack of 32 exams, calculate their average, and update."
This is the Goldilocks solution. We split data into small "batches" (e.g., 32, 64, or 128 samples).
The Math
The update rule sums over a small batch size $k$:
$$ \theta_{new} = \theta_{old} - \alpha \cdot \frac{1}{k} \sum_{i=1}^{k} \nabla J(\theta; x^{(i)}, y^{(i)}) $$
Pros & Cons
- Balanced, the best of both worlds: more stable than SGD, much faster than Batch GD.
- Vectorization: modern GPUs are built to process matrices of 32 or 64 samples in parallel, which makes Mini-Batch extremely efficient for Deep Learning.
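A mini-batch sketch with $k = 32$, again with illustrative names and toy data (not from the text). The gradient for each batch is one vectorized NumPy expression, which is the same pattern GPUs accelerate:

```python
import numpy as np

# Hypothetical toy data: 256 points on the line y = 3x
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=256)
y = 3.0 * X                      # true slope is 3

w, alpha, k = 0.0, 0.5, 32
for epoch in range(30):
    order = rng.permutation(len(X))           # reshuffle every epoch
    for start in range(0, len(X), k):
        b = order[start:start + k]
        # Averaged gradient over just the k samples in this batch
        grad = (2.0 / k) * np.sum((w * X[b] - y[b]) * X[b])
        w -= alpha * grad

print(round(w, 3))  # converges to 3.0 on this noiseless toy data
```

With 256 samples and $k = 32$, each epoch performs 8 updates, compared with 1 for Batch GD and 256 for SGD.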
Summary Comparison
| Method | Data per Step | Speed (per step) | Stability |
| --- | --- | --- | --- |
| Batch GD | All Data ($n$) | Very Slow | High (Smooth) |
| SGD | 1 Sample | Instant | Low (Noisy) |
| Mini-Batch | Batch ($k \approx 32$) | Fast | Medium (Good) |