
Strategic Initialization

In deep neural networks, initializing weights with naive random values often causes problems. If the weights are too small, activations and gradients shrink layer by layer until the signal disappears (**Vanishing Gradient**). If they are too large, they grow layer by layer until the signal explodes (**Exploding Gradient**).

Strategic initialization methods use the number of inputs ($n_{in}$) and outputs ($n_{out}$) of each layer to choose a variance for the starting weights that keeps signal magnitudes stable across layers.

1. Xavier (Glorot) Initialization

Best for: Sigmoid and Tanh activation functions.

Xavier initialization keeps the variance of the activations the same across layers. It draws weights from a distribution with:

$$ \sigma^2 = \frac{2}{n_{in} + n_{out}} $$

(Weights are sampled from $Normal(0, \sigma^2)$ or $Uniform(-\sqrt{3\sigma^2}, \sqrt{3\sigma^2})$)
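A minimal NumPy sketch of both sampling schemes (the function names `xavier_normal` and `xavier_uniform` are illustrative, not from any particular library):

```python
import numpy as np

def xavier_normal(n_in, n_out, seed=0):
    """Sample an (n_in, n_out) weight matrix from Normal(0, 2/(n_in+n_out))."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, seed=0):
    """Sample from Uniform(-limit, limit) with the same variance 2/(n_in+n_out)."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(3.0 * 2.0 / (n_in + n_out))  # Var of U(-a, a) is a^2/3
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_normal(256, 128)  # empirical std ≈ sqrt(2/384) ≈ 0.072
```

Note the uniform limit $\sqrt{3\sigma^2}$: a uniform distribution on $(-a, a)$ has variance $a^2/3$, so this choice matches the target variance exactly.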

2. He Initialization

Best for: ReLU (Rectified Linear Unit) and its variants.

Because ReLU "shuts off" roughly half the neurons (negative values become 0), it roughly halves the variance of the signal at each layer. Doubling the weight variance compensates, which is why the formula keeps only the fan-in term and a factor of 2:

$$ \sigma^2 = \frac{2}{n_{in}} $$

(Named after Kaiming He, the lead author of the ResNet paper)
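The same sketch adapted to He initialization (again, `he_normal` is an illustrative name, not a library function):

```python
import numpy as np

def he_normal(n_in, n_out, seed=0):
    """Sample an (n_in, n_out) weight matrix from Normal(0, 2/n_in)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2.0 / n_in)  # only fan-in matters for ReLU scaling
    return rng.normal(0.0, sigma, size=(n_in, n_out))

W = he_normal(256, 256)  # empirical std ≈ sqrt(2/256) ≈ 0.088
```

Deep-learning frameworks ship equivalents of both schemes (e.g. PyTorch's `torch.nn.init.kaiming_normal_` and `xavier_normal_`), so in practice you rarely write this by hand.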

Summary

For simple models like Linear Regression, initialization is not very sensitive—you can often start at zero or with basic random numbers and reach the minimum easily.

However, as models grow in complexity (like Deep Neural Networks), these strategic methods become essential. By scaling the starting weights based on the input size, we ensure the signals stay stable (neither becoming zero nor infinity), allowing the optimization process to begin from a healthy, balanced state.
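The stability claim is easy to verify empirically. The sketch below (an assumed toy setup, not from the original text) pushes a random input through 20 ReLU layers twice: once with a tiny fixed weight scale, once with He scaling. The naively initialized signal collapses toward zero, while the He-initialized signal keeps a healthy magnitude:

```python
import numpy as np

def forward_std(x, std_fn, depth=20, seed=1):
    """Return the std of activations after `depth` ReLU layers,
    with each layer's weights drawn at scale std_fn(fan_in)."""
    rng = np.random.default_rng(seed)
    h = x
    for _ in range(depth):
        n_in = h.shape[1]
        W = rng.normal(0.0, std_fn(n_in), size=(n_in, n_in))
        h = np.maximum(h @ W, 0.0)  # ReLU activation
    return h.std()

x = np.random.default_rng(0).normal(size=(512, 256))
naive_std = forward_std(x, lambda n: 0.01)            # fixed tiny scale: vanishes
he_std    = forward_std(x, lambda n: np.sqrt(2.0 / n))  # He scaling: stays ~1
```

With the fixed 0.01 scale the activation std shrinks multiplicatively each layer and is effectively zero by layer 20; with He scaling it hovers near the input's scale, which is exactly the "healthy, balanced state" described above.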