When you say “which yields a solution of the form f(w) = c_1/(1−w) + c_2”, are you saying that f′(w)/f(w) = 1/(1−w) yields that, or that (1−w)·f′(w) − f(w) > 0 yields that? Because, for the former, that seems wrong? Specifically, the former should yield only things of the form f(w) = c_1/(1−w).
But, if the latter, then I would think there would be more solutions than that?
Like, what about g(w) := c_1/(1−w) + c_2·(1 − 10^(−6) + 10^(−6)·cos(w))? (where, say, c_1 = ε + δ and c_2 = −δ)
g′(w) = c_1/(1−w)^2 + c_2·(−10^(−6)·sin(w)), so
(1−w)·g′(w) − g(w) = c_2·(−10^(−6)·(1−w)·sin(w)) − c_2·(1 − 10^(−6) + 10^(−6)·cos(w))
= −c_2·(1 − 10^(−6)·(1 − cos(w) − (1−w)·sin(w))),
which, for c_2 < 0 and w ∈ [0, 1], is positive, and so g should also be a solution to (1−w)·f′(w) − f(w) > 0, yes?
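A quick numeric sanity check of the claim above (a sketch; the particular values of ε and δ are arbitrary small positives chosen for illustration, not from the original comment):

```python
import numpy as np

# Illustrative values: any small positive eps, delta should work.
eps, delta = 1e-3, 1e-4
c1, c2 = eps + delta, -delta

w = np.linspace(0.0, 0.999, 1000)  # stay away from the pole at w = 1
g = c1 / (1 - w) + c2 * (1 - 1e-6 + 1e-6 * np.cos(w))
g_prime = c1 / (1 - w) ** 2 + c2 * (-1e-6 * np.sin(w))

# The c1 terms cancel, leaving -c2 * (1 - 1e-6 * (1 - cos(w) - (1 - w) * sin(w))),
# which is positive for c2 < 0 and w in [0, 1].
lhs = (1 - w) * g_prime - g
assert np.all(lhs > 0)
```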
Or were you just giving a subset of the solutions to the differential inequality, namely the ones you needed to make the point?
Separately (btw, what’s the convention if one has unrelated comments? to put them in one comment or in separate comments?): As for how the network could detect something about its current weights: if it is a recurrent network, it seems like some form of this should be learnable.
Suppose you set up the architecture for a recurrent neural network, and pick some arbitrary linear combination (but one where the coefficients aren’t especially big) of the weights in the network. Then, for the loss function, have the main part of the network’s output do some normal task, and have another part of the output be judged on how well it approximates the current value of that linear combination of all the network’s weights, with the initially chosen coefficients. It seems plausible to me that this would do well. Of course, one way it might fail to work properly is if this linear combination ends up becoming roughly constant, or just if some weights stop changing during training, leading to this second output not getting the signal needed to learn that part. Maybe if you used dropout, and accounted for which weights were dropped out when evaluating the weighted sum of the weights (counting dropped weights as zero) in the loss, that could fix the issue. (With dropout done the same in each recurrence, not separately.)
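As a minimal sketch of the setup being described (all names, sizes, and coefficient scales here are made up for illustration; this only shows the forward pass and the combined loss, not a training loop or the dropout variant):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny recurrent net: input-to-hidden and hidden-to-hidden weights,
# plus two output heads (main task, and the weight-statistic readout).
n_in, n_h = 4, 8
W_x = rng.normal(0, 0.3, (n_h, n_in))
W_h = rng.normal(0, 0.3, (n_h, n_h))
w_task = rng.normal(0, 0.3, n_h)  # main-task head
w_aux = rng.normal(0, 0.3, n_h)   # head asked to report the weight statistic

# Fixed, not-especially-big coefficients, chosen once at the start.
coeffs = rng.normal(0, 0.1, W_x.size + W_h.size)

def weight_statistic():
    # The scalar the auxiliary head is trained to approximate:
    # a fixed linear combination of the *current* recurrent weights.
    flat = np.concatenate([W_x.ravel(), W_h.ravel()])
    return float(coeffs @ flat)

def forward(xs):
    h = np.zeros(n_h)
    for x in xs:  # same weights reused at each recurrence
        h = np.maximum(0.0, W_x @ x + W_h @ h)  # plain ReLU
    return w_task @ h, w_aux @ h

def loss(xs, y_task):
    y_hat, stat_hat = forward(xs)
    task_loss = (y_hat - y_task) ** 2
    # The auxiliary target is recomputed from the current weights, so the
    # gradient signal tracks the weights as they change during training.
    aux_loss = (stat_hat - weight_statistic()) ** 2
    return task_loss + aux_loss
```

The point of `weight_statistic` being recomputed inside `loss` is that the auxiliary target moves with the weights, which is what would give the second head a signal to learn the combination, and also what could fail if the combination happens to stay roughly constant.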
Why “a linear combination of the weights” instead of the particular weights? Well, I imagine getting all the individual weights would not be particularly feasible unless the architecture was specifically designed to do that (like in the architectures for those quine-ing neural nets), but if we are looking at a mix between “how the net should be” and “a net that is doing gradient hacking”, then it seems like “some particular weighted sum of the weights” might be all that would be needed. Or perhaps two weighted sums with different coefficients, if keeping w_1 and w_2 separate.
On the other hand, if the network isn’t recurrent, or if all the recurrence is done in a vector-quantization-y way, I would imagine it would be more difficult for the network to estimate a given linear combination of its current weights. (Though, if the linear combination of weights is only of weights before a certain layer, then layers after that layer might be able to approximate it without any recurrence, but, the layers that do and use this approximation could be modified without this being detected, and therefore, I’d think it couldn’t protect these later parts from being changed by gradient descent, or at least, it would be harder.)
(I’m imagining, if people are trying to make this sort of thing work, that these networks use just ReLU, not a smoothed version of ReLU, so that the values of the weights are reflected more piecewise-linearly in the activations of one layer given the activations in the previous layer. I think that would facilitate estimating linear combinations of the weights between the previous layers, and as such also estimating a linear combination of weights in any layer.)
When you say “which yields a solution of the form f(w) = c_1/(1−w) + c_2”, are you saying that f′(w)/f(w) = 1/(1−w) yields that, or that (1−w)·f′(w) − f(w) > 0 yields that? Because, for the former, that seems wrong? Specifically, the former should yield only things of the form f(w) = c_1/(1−w).
I meant the former (which, you’re right, only has the solution with c_1). I only added the c_2 term to make it work for the inequality. As a result, it’s only a subset of the solutions to the inequality. The (quite complicated!) expression you provided also works.