Actually, I did mean that SGD might stumble upon gradient hacking. Or, to be a bit more realistic, SGD might make the model slightly deceptive, at which point decreasing the deceptiveness a bit makes the model worse while increasing it a bit makes the model better at the base-objective, so there is a push towards deceptiveness, until the model is deceptive enough to use gradient hacking in the way you mention.
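To make that dynamic concrete, here is a minimal toy sketch, assuming a hypothetical one-parameter loss landscape in which measured performance improves as a scalar "deceptiveness" value grows; it is purely illustrative, not a claim about real training dynamics:

```python
import torch

# Toy sketch only: a single scalar "deceptiveness" parameter d, and a
# hypothetical base-objective loss that happens to decrease as d increases
# (i.e. being slightly more deceptive looks slightly better during training).
d = torch.tensor(0.1, requires_grad=True)   # slight initial deceptiveness
opt = torch.optim.SGD([d], lr=0.1)

def base_loss(d):
    # Assumed loss landscape (larger d -> lower measured loss).
    return 1.0 / (1.0 + d)

for _ in range(50):
    opt.zero_grad()
    base_loss(d).backward()
    opt.step()   # the gradient points towards more deceptiveness, so d grows

print(float(d))  # d has increased: SGD keeps reinforcing the deceptive direction
```

The only point of the toy is the sign of the gradient: as long as a small increase in deceptiveness locally improves measured performance, plain SGD keeps pushing in that direction.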
Does that make more sense to you?
I think that if SGD makes the model slightly deceptive it’s because it made the model slightly more capable (better at general problem solving etc.), which allowed the model to “figure out” (during inference) that acting in a certain deceptive way is beneficial with respect to the mesa-objective.
This seems to me a lot more likely than SGD creating specifically “deceptive logic” (i.e. logic that can’t do anything generally useful other than finding ways to perform better on the mesa-objective by being deceptive).
I agree with your intuition, but I want to point out again that, beyond some initial amount (which by itself is useless), “deceptive logic” is probably a pretty useful thing in general for the model, because it helps improve performance as measured through the base-objective.
SGD making the model more capable seems like the most obvious way to satisfy the conditions for deceptive alignment.
But you can similarly say this for the following logic: “check whether 1+1<4 and if so, act according to the base objective”. Why is SGD more likely to create “deceptive logic” than this simpler logic (or any other similar logic)?
[EDIT: actually, this argument doesn’t work in a setup where the base objective corresponds to a sufficiently long time horizon, during which it is possible for humans to detect misalignment and terminate/modify the model (in a way that is harmful with respect to the base objective).]
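As an illustration of that comparison, here is a hypothetical sketch (pseudocode only; the function names and the `seems_to_be_in_training` signal are placeholders, not real model internals). During training both pieces of logic behave identically, i.e. both just wrap “act according to the base objective”:

```python
def simple_check_logic(observation, act_on_base_objective):
    # The "1+1<4" logic: the condition is always true, so during training
    # this behaves exactly like acting on the base objective directly.
    if 1 + 1 < 4:
        return act_on_base_objective(observation)

def deceptive_logic(observation, act_on_base_objective, act_on_mesa_objective,
                    seems_to_be_in_training):
    # Deceptive logic: the condition holds throughout training, so during
    # training this also behaves like acting on the base objective directly.
    if seems_to_be_in_training(observation):
        return act_on_base_objective(observation)
    return act_on_mesa_objective(observation)
```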
So my understanding is that deceptive behavior is a lot more likely to arise from general-problem-solving logic, rather than SGD directly creating “deceptive logic” specifically.
Hum, I would say that your logic is probably redundant, and thus might end up being removed for simplicity reasons? Whereas I expect deceptive logic to include very useful things like knowing how the optimization process works, which would definitely help the model achieve better performance.
But to be honest, how SGD could create gradient hacking (if that’s even possible) is completely an open research problem.
My point was that there’s no reason that SGD will create specifically “deceptive logic” because “deceptive logic” is not privileged over any other logic that involves modeling the base objective and acting according to it. But I now think this isn’t always true—see the edit block I just added.