An un-aligned AI faces a choice: act to maximize its goals during training and receive a higher short-term reward, or deceptively pretend to be aligned during training and accept a lower short-term reward.
If there is a conflict between these options, it must be because the AI's conception of reward isn't identical to the reward we intended. So even if we dole out higher intended reward during deployment, it's not clear that this increases the reward the AI expects after deployment. (Though it might.)