Variants of Gradient Descent
The standard Gradient Descent algorithm works, but how much data should we use to calculate the slope? This choice creates three distinct "flavors" of the algorithm, each with its own mathematical properties and trade-offs.
1. Batch Gradient Descent
The Logic: "I will look at EVERY single exam paper before I calculate the average score."
In Batch GD, we use the entire dataset to calculate the gradient for a single step. We sum up the errors for all $n$ training examples.
The Math
The update rule sums the gradient over all $n$ samples:
$$ \theta_{new} = \theta_{old} - \alpha \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla J(\theta; x^{(i)}, y^{(i)}) $$
Pros & Cons
- Stable, smooth convergence: because every step uses all the data, the path to the minimum is smooth and direct.
- Slow, computationally expensive: with 1 million data points, you must compute the error on all 1 million just to take ONE step.
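As a sketch, here is Batch GD fitting a one-parameter linear model $y = wx$ by minimizing mean squared error. All names (`X`, `y`, `w`, `alpha`) and the toy data are illustrative, not from the text:

```python
import numpy as np

# Hypothetical toy data: 100 points on the line y = 3x
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X                      # true slope is 3

w, alpha = 0.0, 0.5
for step in range(100):
    # Gradient of (1/n) * sum_i (w*x_i - y_i)^2, summed over ALL n samples
    grad = (2.0 / len(X)) * np.sum((w * X - y) * X)
    w -= alpha * grad

print(round(w, 3))  # converges to 3.0 on this noiseless toy data
```

Note that every single update touches all $n$ samples, which is exactly why one step costs $O(n)$.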
2. Stochastic Gradient Descent (SGD)
The Logic: "I will look at ONE exam paper and immediately guess the class average."
In SGD, we update the parameters after looking at just one random training example.
The Sequential Process:
Unlike Batch GD, SGD doesn't wait.
- Pick Sample A $\to$ Calculate Error $\to$ Update $\theta$ to $\theta_{new}$.
- Pick Sample B $\to$ Use the new $\theta$ $\to$ Calculate Error $\to$ Update $\theta$ again.
This "one-by-one" iterative jumping is exactly how SGD navigates the cost landscape.
The Math
The update rule uses only a single sample $(x^{(i)}, y^{(i)})$:
$$ \theta_{new} = \theta_{old} - \alpha \cdot \nabla J(\theta; x^{(i)}, y^{(i)}) $$
Pros & Cons
- Fast, instant updates: you learn from each example immediately. Great for huge datasets or streaming data.
- Noisy, erratic path: the single-sample gradient bounces around wildly. It may never settle exactly at the minimum, hovering around it instead.
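A minimal SGD sketch on the same kind of toy problem (hypothetical names and data, not from the text; a little label noise is added so the "hover around the minimum" behavior is visible):

```python
import numpy as np

# Hypothetical noisy toy data around the line y = 3x
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X + rng.normal(0.0, 0.1, size=100)

w, alpha = 0.0, 0.1
for epoch in range(50):
    for i in rng.permutation(len(X)):          # visit samples in random order
        grad = 2.0 * (w * X[i] - y[i]) * X[i]  # gradient from ONE sample
        w -= alpha * grad                      # update immediately

print(round(w, 2))  # lands near 3, jittering rather than settling exactly
```

Shuffling each epoch is the standard trick to keep the single-sample picks representative of the whole dataset.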
3. Mini-Batch Gradient Descent
The Logic: "I will grab a stack of 32 exams, calculate their average, and update."
This is the Goldilocks solution. We split data into small "batches" (e.g., 32, 64, or 128 samples).
The Math
The update rule sums over a small batch size $k$:
$$ \theta_{new} = \theta_{old} - \alpha \cdot \frac{1}{k} \sum_{i=1}^{k} \nabla J(\theta; x^{(i)}, y^{(i)}) $$
Pros & Cons
- Balanced, the best of both worlds: more stable than SGD, much faster than Batch GD.
- Vectorization: modern GPUs are built to process matrices of 32 or 64 samples in parallel, which makes Mini-Batch extremely efficient for Deep Learning.
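A mini-batch sketch with $k = 32$, again with illustrative names and toy data (not from the text). The gradient for each batch is one vectorized NumPy expression, which is the same pattern GPUs accelerate:

```python
import numpy as np

# Hypothetical toy data: 256 points on the line y = 3x
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=256)
y = 3.0 * X                      # true slope is 3

w, alpha, k = 0.0, 0.5, 32
for epoch in range(30):
    order = rng.permutation(len(X))           # reshuffle every epoch
    for start in range(0, len(X), k):
        b = order[start:start + k]
        # Averaged gradient over just the k samples in this batch
        grad = (2.0 / k) * np.sum((w * X[b] - y[b]) * X[b])
        w -= alpha * grad

print(round(w, 3))  # converges to 3.0 on this noiseless toy data
```

With 256 samples and $k = 32$, each epoch performs 8 updates, compared with 1 for Batch GD and 256 for SGD.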
Summary Comparison
| Method | Data per Step | Speed (per step) | Stability |
| --- | --- | --- | --- |
| Batch GD | All Data ($n$) | Very Slow | High (Smooth) |
| SGD | 1 Sample | Instant | Low (Noisy) |
| Mini-Batch | Batch ($k \approx 32$) | Fast | Medium (Good) |