Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard if SGD would reward any other competing network.
I think the phrase “failing hard if SGD would reward any other competing network” should instead be something like “failing hard if SGD would (not) update weights in such and such way”. (SGD is a local search algorithm; it gradually improves a single network.)
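To spell out what I mean by “local search”, here is a minimal numpy sketch of SGD on a toy linear model (my own illustration, nothing more): a single set of weights is repeatedly nudged along the negative gradient of its loss. There is no population of competing networks being rewarded or discarded.

```
import numpy as np

# Toy illustration: SGD is a local search over ONE network's weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)      # one set of weights, updated in place
lr = 0.05
for step in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
    w -= lr * grad                          # local step on the same network's weights

print(w)  # ends up close to true_w
```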
This strategy seems more complicated, so is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it appears like it would give the network a substantive advantage.
As I already argued in another thread, the idea is not that SGD creates the gradient hacking logic specifically (in case that is what you had in mind here). As an analogy, consider a human who decides to 1-box in Newcomb’s problem. This is related to gradient hacking because the human decides to 1-box in order to have the property of “being a person who 1-boxes”, since having that property is instrumentally useful. The specific strategy of 1-boxing was not selected for by human evolution; rather, general problem-solving capabilities were, and those capabilities are what led the human to come up with the 1-boxing strategy.
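To make the “failing hard” mechanics concrete, here is a deliberately contrived numpy sketch (my own toy construction, not something anyone has observed in a trained model). The model’s deliberate failure is stood in for by a huge penalty term that kicks in whenever a protected parameter drifts from the value the model “wants” to keep; because that term dominates the loss, the gradients SGD computes keep pulling the protected parameter back, even though the task loss alone would push it somewhere else.

```
import numpy as np

# Contrived toy: the "fail hard" behaviour is folded directly into the loss
# as guard * (protected_b - anchor)^2, standing in for the model degrading
# its own outputs whenever the protected parameter moves.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X -= X.mean(axis=0)                       # centre the features for a clean fit
y = X @ np.array([1.0, -1.0]) + 2.0       # the task-optimal bias would be 2.0

task_w = np.zeros(2)
protected_b = 5.0                         # the value being "protected"
anchor, guard = 5.0, 100.0                # fail-hard penalty strength
lr = 5e-3

for _ in range(2000):
    resid = X @ task_w + protected_b - y
    g_w = 2 * X.T @ resid / len(y)                                # task-loss gradient
    g_b = 2 * resid.mean() + 2 * guard * (protected_b - anchor)   # task + fail-hard
    task_w -= lr * g_w
    protected_b -= lr * g_b

print(task_w)       # ~[1, -1]: the task part still gets learned
print(protected_b)  # ~5.0: stays at the protected value, not the task-optimal 2.0
```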
Thanks for the concrete example; I think I understand better what you meant. What you describe sounds like the hypothesis “any sufficiently intelligent model will be able to gradient hack, and thus will do it”, which might be true. But I’m actually more interested in how gradient hacking could emerge without the model having to pass that threshold of intelligence, because I believe such examples will be easier to interpret and study.
So in summary, I do think what you say makes sense for the general risk of gradient hacking, yet I don’t believe it is really useful for studying gradient hacking with our current knowledge.
It does seem useful to distinguish between thinking about what gradient hacking failures look like in worlds where they cause an existential catastrophe, and thinking about how best to pursue empirical research on gradient hacking today.
Agreed with both points (that the strategy should be phrased in terms of how SGD updates the weights, and that SGD selects for general problem-solving capabilities rather than for the gradient hacking logic itself). I said something similar in my comment.