Ofer comments on Gradient descent is not just more efficient genetic algorithms

Ofer 9 Sep 2021 17:23 UTC
1 point

> My formulation is broad enough that it doesn’t have to be a dedicated piece of logic; there just has to be some way of looking at the rest of the network that depends on X and Y being the same.

But X and Y are not the same! For example, if the model is intended to classify images of animals, the computation X may correspond to [how many legs does the animal have?] and Y may correspond to [how large is the animal?].

> This is what I take issue with: if there is a way to change both components simultaneously to have an effect on the loss, SGD will happily do that.

This seems wrong to me. SGD updates the weights in the direction of the gradient, and if changing a given weight alone does not affect the loss then the gradient component that is associated with that weight will be 0 and thus SGD will not change that weight.
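
As a minimal sketch of the update rule described here (plain SGD with no momentum or weight decay; the function name `sgd_step`, the learning rate, and the gradient values are all made up for illustration), a weight whose gradient component is zero is left untouched by the step:

```python
def sgd_step(weights, grads, lr=0.1):
    """One plain SGD update, applied per weight: w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Hypothetical values: the loss is locally insensitive to the second weight,
# so its gradient component is exactly zero.
weights = [0.5, -1.0, 2.0]
grads = [0.3, 0.0, -0.2]

new_weights = sgd_step(weights, grads)
print(new_weights)  # the second entry is unchanged, the other two move
```

With momentum or weight decay a weight can still move when its current gradient component is zero, so this only illustrates the vanilla update.
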
- leogao 9 Sep 2021 21:34 UTC
  1 point
  > SGD updates the weights in the direction of the gradient, and if changing a given weight alone does not affect the loss then the gradient component that is associated with that weight will be 0 and thus SGD will not change that weight.
  If the partial derivatives with respect to two different parameters are both zero, i.e. $\frac{\partial f}{\partial θ_{1}} = \frac{\partial f}{\partial θ_{2}} = 0$, then it must be that changing both simultaneously does not change the loss either (to be precise, $\lim_{h \to 0} \frac{f(x + h(e_{1} + e_{2})) - f(x)}{h} = 0$, where $x$ is the current parameter vector and $e_{i}$ is the unit vector along $θ_{i}$).
  - Ofer 10 Sep 2021 23:15 UTC
    1 point
    I don’t see how this is relevant here. If it is the case that changing only $w_{1}$ does not affect the loss, and changing only $w_{2}$ does not affect the loss, then SGD would not change them (their gradient components will be zero), even if changing them both can affect the loss.
    - leogao 10 Sep 2021 23:55 UTC
      2 points
      It’s relevant because it demonstrates that in differentiable functions, if it is the case that changing only $w_{1}$ does not affect the loss, and changing only $w_{2}$ does not affect the loss, then it is not possible that changing them both can affect the loss either.
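
To make the first-order statement in this exchange concrete, here is a rough numerical sketch. It assumes a hypothetical toy loss `w1 * w2` evaluated at the origin (not anything from the thread), and the helper name `directional_quotient` is invented for illustration. For a differentiable $f$ the directional derivative along $d$ is $\nabla f \cdot d$, so zero partials give a zero derivative along any joint direction. The difference quotient below should therefore shrink toward zero as $h$ does, even though a finite joint step changes this toy loss by $h^{2}$.

```python
def loss(w1, w2):
    # Hypothetical toy loss, chosen so both partial derivatives vanish at (0, 0).
    return w1 * w2


def directional_quotient(f, w1, w2, d1, d2, h):
    """Difference quotient (f(w + h*d) - f(w)) / h along direction d = (d1, d2)."""
    return (f(w1 + h * d1, w2 + h * d2) - f(w1, w2)) / h


for h in (1e-1, 1e-3, 1e-5):
    only_w1 = directional_quotient(loss, 0.0, 0.0, 1.0, 0.0, h)  # change w1 alone
    only_w2 = directional_quotient(loss, 0.0, 0.0, 0.0, 1.0, h)  # change w2 alone
    both = directional_quotient(loss, 0.0, 0.0, 1.0, 1.0, h)     # change both together
    print(h, only_w1, only_w2, both)
```

For this toy loss the two single-weight quotients are exactly zero and the joint quotient is roughly `h`, so it vanishes in the limit. That is the sense in which the gradient, and hence a plain SGD step, sees no joint effect at that point.
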