I am not fully convinced I am wrong about this, but I am confident I am not making the simple form of the mistake I believe you think I am making here. I do get that models don’t ‘get reward’, and that there will be various ways in which the best available strategy for maximizing reward is, in effect, to maximize something else that tends to produce reward, although I expect this gap to shrink as capabilities advance.
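To make the ‘models don’t get reward’ point concrete, here is a minimal REINFORCE sketch on a toy two-armed bandit (my own illustration; the payoffs, learning rate, and step count are arbitrary): reward enters training only as a scalar coefficient on a gradient update to the parameters. The policy never observes reward as an input, and nothing in the learned parameters explicitly represents ‘maximize reward’.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-armed bandit: arm 1 pays more on average (arbitrary values).
ARM_MEANS = np.array([0.2, 0.8])

# Softmax policy over the two arms, parameterized by logits.
theta = np.zeros(2)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

LR = 0.1
for _ in range(2000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = rng.normal(ARM_MEANS[action], 0.1)

    # REINFORCE update: the only place reward appears. It scales the
    # gradient of log pi(action); it is never an input to the policy.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += LR * reward * grad_log_pi

print(softmax(theta))  # concentrates on the higher-paying arm
```

In this toy case ‘maximizing reward’ and ‘maximizing pulling arm 1’ pick out the same behaviour; the point is only that reward shapes the parameters rather than being something the model ‘gets’.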
I further think that it’s generally wrong[1] to ask “does this policy maximize reward?” in order to predict whether such a policy will be trained. It’s at least a tiny bit useful, but I don’t think it’s very useful. Do you disagree?
There are some environments where that question is appropriate, and I don’t think those environments include LLM finetuning.
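For contrast, here is a sketch (again my own toy construction) of the kind of environment where the question is appropriate: a small, fully known MDP where value iteration provably converges to the reward-maximizing policy, so “does this policy maximize reward?” exactly predicts what training produces. LLM finetuning has none of this structure: the environment is not enumerable, training does not run to convergence, and the policy class is not exhaustively searched.

```python
import numpy as np

# Tiny deterministic MDP: 3 states, 2 actions (all values arbitrary).
# next_state[s, a] and reward[s, a] fully specify the dynamics.
GAMMA = 0.9
next_state = np.array([[1, 0],
                       [2, 0],
                       [2, 2]])
reward = np.zeros((3, 2))
reward[2, 1] = 1.0  # only acting in state 2 with action 1 pays off

# Value iteration: with known, enumerable dynamics this provably
# converges, so the trained policy *is* the reward-maximizing one.
V = np.zeros(3)
for _ in range(200):
    Q = reward + GAMMA * V[next_state]
    V = Q.max(axis=1)

print(Q.argmax(axis=1))  # the unique optimal (reward-maximizing) policy
```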