But gradient descent doesn't modify a neural network one weight at a time.
Sure, but the gradient component that is associated with a given weight is still zero if updating that weight alone would not affect loss.
What do you think the gradient of min(x, y) is?
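For concreteness, here is a quick illustrative check (my own sketch; the thread itself doesn't mention any framework, so the use of JAX here is an assumption): the gradient of min(x, y) is 1 with respect to the smaller argument and 0 with respect to the larger one, i.e. the component for the input whose lone change would not move the output is exactly zero, which is the case being described above.

```python
import jax
import jax.numpy as jnp

# Illustrative sketch: partial derivatives of min(x, y) with respect to both arguments.
grad_min = jax.grad(lambda x, y: jnp.minimum(x, y), argnums=(0, 1))

print(grad_min(2.0, 5.0))  # expect (1.0, 0.0): only the smaller input carries gradient
print(grad_min(5.0, 2.0))  # expect (0.0, 1.0)
```

(At x == y the function is not differentiable, and autodiff frameworks return a subgradient there; away from the tie, the flat direction really does get a zero gradient component.)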