In deep neural networks, initializing weights with simple random numbers often causes problems. If weights are too small, the signal disappears (**Vanishing Gradient**). If they are too large, the signal explodes (**Exploding Gradient**).
Strategic initialization methods use the number of inputs ($n_{in}$) and outputs ($n_{out}$) of a layer to set an appropriate variance for the starting weights.
Xavier (Glorot) initialization keeps the variance of the activations roughly constant across layers. It draws weights from a zero-mean distribution with variance $\sigma^2 = \frac{2}{n_{in} + n_{out}}$.
(Weights are sampled from $Normal(0, \sigma^2)$ or $Uniform(-\sqrt{3\sigma^2}, \sqrt{3\sigma^2})$; the uniform bound $\sqrt{3\sigma^2}$ is chosen so both distributions have the same variance $\sigma^2$.)
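As a minimal sketch of the uniform variant above (the function name `xavier_init` is illustrative, not from the original text):

```python
import numpy as np

def xavier_init(n_in, n_out, rng=None):
    """Sample an (n_in, n_out) weight matrix with Xavier/Glorot variance.

    sigma^2 = 2 / (n_in + n_out); a Uniform(-limit, limit) draw with
    limit = sqrt(3 * sigma^2) has exactly that variance.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma2 = 2.0 / (n_in + n_out)
    limit = np.sqrt(3.0 * sigma2)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_init(256, 128)
print(W.var())  # close to 2 / (256 + 128) ≈ 0.0052
```

Deep learning frameworks ship the same rule built in (e.g. `torch.nn.init.xavier_uniform_`), so in practice you rarely write this by hand.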
He (Kaiming) initialization adjusts for ReLU: because ReLU "shuts off" half the neurons (negative values become 0), it needs roughly double the variance, $\sigma^2 = \frac{2}{n_{in}}$, to keep the signal alive.
(Named after Kaiming He, the lead author of the ResNet paper)
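A minimal sketch of the normal-distribution variant, with the doubled variance from the paragraph above (the function name `he_init` is illustrative):

```python
import numpy as np

def he_init(n_in, n_out, rng=None):
    """Sample weights from Normal(0, 2/n_in) — the He/Kaiming rule for ReLU.

    The factor of 2 compensates for ReLU zeroing out half the activations.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, sigma, size=(n_in, n_out))

W = he_init(512, 256)
print(W.var())  # close to 2 / 512 ≈ 0.0039
```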
For simple models like Linear Regression, initialization is not very sensitive: you can often start at zero or with basic random numbers and still reach the minimum easily.
However, as models grow in depth and complexity (like Deep Neural Networks), these strategic methods become essential. By scaling the starting weights based on the number of inputs, we keep the signals stable (neither vanishing toward zero nor exploding toward infinity), allowing the optimization process to begin from a healthy, balanced state.
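The contrast can be demonstrated with a toy experiment, a sketch under assumed settings (20 ReLU layers of width 256, naive weights drawn with a fixed std of 0.01 versus the He rule $\sqrt{2/n}$):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 256))  # batch of unit-variance inputs

def forward(x, std_fn, depth=20, width=256):
    """Push the batch through `depth` random ReLU layers.

    `std_fn(width)` gives the std used to draw each layer's weights.
    """
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, std_fn(width), size=(width, width))
        h = np.maximum(0.0, h @ W)  # ReLU
    return h

small = forward(x, lambda n: 0.01)            # naive fixed small std
he    = forward(x, lambda n: np.sqrt(2.0 / n))  # He scaling

print(small.std())  # many orders of magnitude below 1: the signal has vanished
print(he.std())     # stays on the order of the input's std
```

With the naive 0.01 std the activations shrink by a roughly constant factor per layer, so after 20 layers almost nothing is left; the He-scaled network carries a usable signal all the way through.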