Partly it might be because it often is not “just” pure gradient descent. There are tweaks to it, like AdaGrad (which adapts the step size separately for each parameter), that are sometimes used. These might be mostly about cost, though. Getting to a “good enough answer” as quickly and cheaply as you can tends to be a relevant criterion of “practical success” in practical environments.
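For a concrete feel of what such a tweak looks like, here is a minimal sketch comparing plain gradient descent with an AdaGrad-style update on a toy quadratic loss. The loss, the starting point, the learning rates, and the function names are all made up for illustration; only the two update rules follow the standard textbook form.

```python
import math


def loss(w):
    # Toy loss (an assumption for illustration): steep in w[0], shallow in w[1].
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)


def grad(w):
    # Gradient of the toy loss above.
    return [100.0 * w[0], w[1]]


def gradient_descent(w, lr, steps):
    # Plain gradient descent: one global learning rate for every coordinate.
    w = list(w)
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def adagrad(w, lr, steps, eps=1e-8):
    # AdaGrad-style update: each coordinate's step is divided by the square
    # root of the running sum of that coordinate's squared gradients.
    w = list(w)
    acc = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        acc = [a + gi * gi for a, gi in zip(acc, g)]
        w = [wi - lr * gi / (math.sqrt(a) + eps)
             for wi, gi, a in zip(w, g, acc)]
    return w


start = [1.0, 1.0]
w_gd = gradient_descent(start, lr=0.01, steps=100)   # lr values are arbitrary
w_ada = adagrad(start, lr=0.5, steps=100)

print("start loss:  ", loss(start))
print("plain GD:    ", loss(w_gd))
print("AdaGrad-like:", loss(w_ada))
```

In plain gradient descent the single learning rate has to be small enough for the steep direction, so the shallow direction crawls. The AdaGrad-style rule rescales each coordinate by its own gradient history, so in this toy case both directions shrink at the same pace. That is only a property of this hand-picked example, not evidence about real training runs.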