I have an idea for something I would call extrapolated reward: the premise is that we can avoid misalignment if we get the model to reward itself only for the things it believes we would reward it for, given infinite time to ponder and process our decisions. We start with a first pass in which the reward function behaves as normal. Then we review the graded answers with a bit more scrutiny; perhaps we find that an answer we thought was good the first time around was actually deceptive in some way. This second pass can cover everything from the initial pass, a subset of it, or even an entirely different set, depending on how well the model learns to associate the first-pass feedback with the more reflective second-pass feedback. We repeat this process, investing more resources and reflection in each round of review. During inference, the model predicts, for each reward we would give it, the limit of that reward under ever-deeper review, and acts accordingly.
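To make the "limit of the reward" part concrete, here is a minimal numeric sketch, assuming the per-pass corrections to an answer's reward shrink roughly geometrically as review gets deeper. All names (`ReviewedAnswer`, `extrapolate_reward`) and the geometric-extrapolation rule are hypothetical illustrations, not an existing pipeline; in practice the model would presumably learn this extrapolation from reward trajectories rather than apply a fixed formula.

```python
# Hypothetical sketch of "extrapolated reward": given the rewards an answer
# received across successively more reflective review passes, estimate the
# reward we would converge to with unlimited reflection.

from dataclasses import dataclass
from typing import List


@dataclass
class ReviewedAnswer:
    """An answer plus the reward it received on each review pass.

    rewards[0] is the quick first-pass score; later entries come from
    passes with progressively more scrutiny.
    """
    answer: str
    rewards: List[float]


def extrapolate_reward(rewards: List[float], default_ratio: float = 0.5) -> float:
    """Estimate the reward in the limit of infinite reflection.

    Assumption: corrections between passes shrink geometrically, so the
    remaining correction after the last pass is (last correction) *
    ratio / (1 - ratio). With fewer than two passes there is nothing to
    extrapolate from, so we return the latest score as-is.
    """
    if len(rewards) < 2:
        return rewards[-1]

    deltas = [b - a for a, b in zip(rewards, rewards[1:])]
    last_delta = deltas[-1]

    if len(deltas) >= 2 and abs(deltas[-2]) > 1e-9:
        ratio = deltas[-1] / deltas[-2]
        # Only trust the geometric fit if corrections are actually shrinking;
        # clamp so the series still converges.
        ratio = max(min(ratio, 0.95), -0.95) if abs(ratio) < 1.0 else default_ratio
    else:
        ratio = default_ratio

    remaining = last_delta * ratio / (1.0 - ratio)
    return rewards[-1] + remaining


if __name__ == "__main__":
    # Toy data: one answer looked great at first (0.9) but kept losing credit
    # under scrutiny; another looked modest but gained a little each pass.
    deceptive = ReviewedAnswer("plausible but subtly misleading", [0.9, 0.5, 0.3])
    honest = ReviewedAnswer("modest but accurate", [0.6, 0.65, 0.675])

    for item in (deceptive, honest):
        print(item.answer, "->", round(extrapolate_reward(item.rewards), 3))
    # Prints roughly 0.1 for the deceptive answer and 0.7 for the honest one:
    # the extrapolated rewards, not the first-pass ones, are what the model
    # would be trained to predict and optimize for.
```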