
Gradient Descent

An optimization algorithm that iteratively adjusts model parameters to minimize errors by moving in the direction of steepest improvement.


Finding the Bottom of the Hill

Imagine you're blindfolded on a hilly landscape and need to find the lowest point. One strategy: feel which direction slopes downward, take a step that way, repeat. That's gradient descent. The "gradient" tells you the slope, and "descent" means moving downhill toward lower error.

In machine learning, the landscape is the loss function - a measure of how wrong the model is. The position represents the model's current parameters. Gradient descent iteratively adjusts those parameters to reduce the loss.
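The update loop can be sketched in a few lines. This is a minimal illustration on a one-dimensional toy loss, f(w) = (w − 3)², whose gradient is 2(w − 3); the function name, starting point, and hyperparameter values are illustrative choices, not part of any particular library.

```python
# Minimal gradient descent sketch: minimize f(w) = (w - 3)^2.
# Its gradient is f'(w) = 2 * (w - 3); the minimum sits at w = 3.
def gradient_descent(lr=0.1, steps=100):
    w = 0.0                      # starting parameter (the "position")
    for _ in range(steps):
        grad = 2 * (w - 3)       # slope of the loss at the current w
        w -= lr * grad           # step downhill, scaled by the learning rate
    return w

print(gradient_descent())        # converges toward the minimum at w = 3
```

Real models repeat exactly this loop, just with millions or billions of parameters and gradients computed by backpropagation.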

Making It Work in Practice

Pure gradient descent computes the gradient using all training data, which is slow for large datasets. Stochastic gradient descent (SGD) uses random subsets, making each step noisier but much faster. Mini-batch gradient descent strikes a balance.
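The difference is just which examples feed each gradient estimate. Below is a mini-batch SGD sketch that fits a line y = a·x + b to synthetic data; the batch size, learning rate, and epoch count are illustrative, untuned choices.

```python
import random

# Mini-batch SGD sketch: fit y = a*x + b to synthetic data (true a=2, b=1).
# Hyperparameters here are illustrative choices, not tuned values.
random.seed(0)
data = [(x / 100, 2 * (x / 100) + 1) for x in range(100)]

def sgd(data, lr=0.1, batch_size=8, epochs=200):
    a, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)                  # "stochastic": random order
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # average gradient of the squared error over the mini-batch
            ga = sum(2 * (a * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2 * (a * x + b - y) for x, y in batch) / len(batch)
            a -= lr * ga
            b -= lr * gb
    return a, b

a, b = sgd(data)
```

Setting `batch_size=len(data)` recovers full-batch gradient descent; `batch_size=1` is pure SGD. Mini-batches in between keep the gradient estimate reasonably stable while still being cheap per step.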

The learning rate controls step size. Too large and you'll overshoot the minimum, bouncing around without converging. Too small and training takes forever. Getting this right is part science, part art.
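The failure modes are easy to see on a toy loss. The sketch below runs fixed-step gradient descent on f(w) = w² (gradient 2w) with three learning rates; the specific thresholds for "too small" and "too large" apply only to this toy function.

```python
# Effect of learning rate on f(w) = w^2, whose gradient is 2w.
# The regimes shown are specific to this toy function.
def run(lr, steps=50, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w          # one gradient step
    return abs(w)                # distance from the minimum at 0

print(run(0.01))   # too small: still noticeably far from 0 after 50 steps
print(run(0.4))    # well chosen: essentially at the minimum
print(run(1.1))    # too large: each step overshoots and |w| blows up
```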

Modern variants like Adam and AdaGrad adapt the effective step size automatically based on the gradients seen so far. These adaptive optimizers are a key reason we can train massive models at all: they navigate incredibly complex loss landscapes with billions of dimensions, finding good solutions that would be impossible to discover by hand.
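As a sense of what "adaptive" means, here is a compact sketch of the Adam update rule applied to the same kind of toy loss, f(w) = (w − 3)². The hyperparameter values are the commonly cited defaults for the decay rates; the learning rate and step count are illustrative.

```python
import math

# Compact sketch of the Adam update rule on f(w) = (w - 3)^2.
# b1/b2/eps are the commonly cited defaults; lr and steps are illustrative.
def adam(steps=500, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * (w - 3)                  # gradient of the toy loss
        m = b1 * m + (1 - b1) * g        # running mean of gradients (momentum)
        v = b2 * v + (1 - b2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - b1 ** t)        # bias correction for the warm-up phase
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w
```

Because each parameter's step is scaled by its own gradient history, parameters with consistently large gradients take smaller steps and rarely updated ones take larger steps, which is what makes the method robust across very different loss landscapes.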
