To understand Gradient Descent, imagine you are standing high on a mountain (a point of high error) and you want to get to the very bottom of the valley (the point of minimum error).
The catch? You are blindfolded. You can only feel the slope of the ground under your feet to decide which way to move.
You start by picking a random value for your parameters, $\theta$ (the weights and bias). At this stage, your model is essentially guessing blindly.
The Result: The value of your cost function will likely be very high, because a blind guess is almost certainly far from the true values.
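To make the "blind guess" concrete, here is a minimal sketch in Python. The linear model, the toy dataset (generated from $y = 2x + 1$), and the mean-squared-error cost are all assumptions chosen for illustration:

```python
import random

# Hypothetical setup: a linear model y = w*x + b fit to a toy dataset
# whose true relationship is y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(10)]

def cost(w, b):
    """Mean squared error of the model's predictions over the dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

random.seed(0)                                   # reproducible "random" start
w = random.uniform(-10, 10)                      # blind guess for the weight
b = random.uniform(-10, 10)                      # blind guess for the bias
print(cost(w, b))   # a large number, far above cost(2, 1) == 0
```

With the true parameters, `cost(2, 1)` is exactly zero; the random starting point lands far above that, which is the "top of the mountain."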
You feel the ground around you to see which way is "downhill." Mathematically, this is called taking the Derivative of the cost function.
The "Gradient" is just a more general word for the slope: with several parameters, it is the vector of partial derivatives, which points in the direction of steepest ascent.
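"Feeling the slope" can be sketched numerically. The one-parameter toy cost $J(\theta) = (\theta - 3)^2$ and the finite-difference helper below are illustrative assumptions, not anything defined in the text:

```python
def cost(theta):
    # Toy cost: J(theta) = (theta - 3)^2, minimized at theta = 3.
    return (theta - 3) ** 2

def numerical_gradient(f, theta, eps=1e-6):
    # Central-difference approximation of the slope of f at theta.
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

theta = 0.0
slope = numerical_gradient(cost, theta)
print(slope)  # close to the exact derivative 2*(theta - 3) = -6
```

The slope is negative, so "downhill" is in the positive $\theta$ direction; moving against the gradient is exactly the direction you want.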
Once you know the direction, you take a step. But how big should that step be? This is determined by the Learning Rate ($\alpha$).
After your step, you update your position. The formula for this update is:

$$\theta := \theta - \alpha \cdot \frac{\partial}{\partial \theta} J(\theta)$$

where $J(\theta)$ is the cost function. (We subtract because the gradient points uphill, and we want to go down the slope, not up.)
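A single update step might look like the following sketch. The toy cost $J(\theta) = (\theta - 3)^2$, whose exact derivative is $2(\theta - 3)$, is an assumption for illustration:

```python
def gradient(theta):
    # Exact derivative of the toy cost J(theta) = (theta - 3)^2.
    return 2 * (theta - 3)

alpha = 0.1    # learning rate: the size of each step
theta = 0.0    # current position

theta = theta - alpha * gradient(theta)  # subtract to move downhill
print(theta)   # about 0.6: one step closer to the minimum at 3
```

Notice the effect of $\alpha$: a smaller value would have moved $\theta$ less, and a very large one could overshoot the valley entirely.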
You repeat this cycle (feel the slope, take a step, update $\theta$) thousands of times. With every iteration, the value of your cost function gets smaller, and your predictions get closer to the actual values.
Eventually, the ground becomes flat. The slope becomes zero. No matter which way you move, you aren't going "down" anymore.
This is called Convergence. You have found the optimal $\theta$!
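The whole loop, from the starting guess to convergence, can be sketched as follows. The toy cost $J(\theta) = (\theta - 3)^2$, minimized at $\theta = 3$, is an assumption for illustration:

```python
def gradient(theta):
    # Exact derivative of the toy cost J(theta) = (theta - 3)^2.
    return 2 * (theta - 3)

theta = 0.0    # random-ish starting guess
alpha = 0.1    # learning rate

for _ in range(1000):
    step = alpha * gradient(theta)
    theta -= step
    if abs(step) < 1e-9:   # slope is (near) zero: the ground is flat
        break              # convergence reached

print(theta)   # very close to 3.0, the optimal value
```

Under these assumptions, the loop stops well before 1000 iterations because each step shrinks the distance to the minimum by a constant factor, so the slope decays toward zero.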