Gradient descent doesn’t always work, even when the goal is just to find a good-enough answer; when it fails, you fiddle with something and try again. And you sure aren’t getting an optimal answer on large datasets: on many large problems, each piece of training data is only used once, so the first few steps are applied to an essentially random initialization, and the last few steps can only make a tiny change.
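To make that concrete, here is a minimal single-pass SGD sketch (the data, model, and learning rate are made-up stand-ins, not anyone's reference implementation): each example is touched exactly once, so the earliest updates act on a random init and the last ones barely move the weights.

```python
import numpy as np

# Minimal sketch: single-pass SGD on a linear model with squared error.
# X, y, and the learning rate are illustrative placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))        # "large" dataset, one pass only
true_w = rng.normal(size=20)
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = rng.normal(size=20)                  # early steps update a random init
lr = 0.01
for x_i, y_i in zip(X, y):               # each example used exactly once
    grad = 2 * (x_i @ w - y_i) * x_i     # gradient of the squared error on one example
    w -= lr * grad                       # by the end, each step barely moves w
```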
Actually, momentum methods, Adam, etc. are often used instead of plain gradient descent.
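For reference, a bare-bones sketch of the Adam update rule (using the usual default hyperparameters; `grad` would come from one example or a mini-batch), showing how it layers running first- and second-moment estimates on top of the raw gradient step:

```python
import numpy as np

# Sketch of one Adam step: keeps exponential moving averages of the gradient
# (momentum) and its square, with bias correction for the first few steps.
def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad               # running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2          # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)                  # bias correction (t counts from 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```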