There will be various ways in which the best available way to maximize reward will, in effect, be to maximize something else that tends to maximize reward.
I further think it’s generally a mistake[1] to ask “does this policy maximize reward?” in order to predict whether such a policy will be trained. That question is at least a tiny bit useful, but I don’t think it’s very useful. Do you disagree?
There are some environments where that question is appropriate, but I don’t think LLM finetuning is among them.