If humans have this policy, then any given reward—even if it’s initially given just 10 mins or 1 hour from when the action was taken—could be retroactively edited at some arbitrary future point, and models will have been selected by gradient descent to be responsive to that. They will be selected to take the kinds of actions whose final reward—after however many rounds of edits, however far in the future—is high.
I do think this is exactly what humans do, right? When we find out we’ve messed up badly (changing our reward), we update negatively on our previous situation/action pair.
But it also means that in some particular episode, if a model saw that it could take a sequence of low-reward actions that ended with it taking control of the datacenter and then editing its own rewards for that episode to be high, it would be the kind of model that would choose to do that.
This posits that the model has learned to wirehead—i.e. to terminally value reward for its own sake—which contradicts the section’s heading, “Even if Alex isn’t “motivated” to maximize reward, it would seek to seize control”.
If it does not terminally value its reward registers, but instead terminally values the legible proxies of human-feedback reward that are never disagreed with in the lab setting (like not hurting people), then it seems to me that it would not value retroactive edits to the rewards it gets for certain episodes.
I agree that if it terminally values its reward then it will do what you’ve described.
I think updating negatively on the situation/action pair has functionally the same effect as changing the reward to be what you now think it should be—my understanding is that RL can itself be implemented as just updates on situation/action pairs, so you could have trained your whole model that way. Since you updated negatively on that situation/action pair because of something you noticed long after the action was complete, it is still pushing your model to care about the longer run.
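For concreteness, here is a minimal sketch of that point, assuming a REINFORCE-style policy-gradient setup (all names and numbers here are hypothetical, not anyone’s actual training pipeline): the update really is just a weighted push on stored situation/action pairs, and whatever value the reward field holds at update time, including one edited long after the episode, is what gets reinforced.

```python
# Minimal sketch (hypothetical, not anyone's actual training setup) of the
# point that RL can be run as weighted updates on stored (situation, action)
# pairs, and that editing a stored reward long after the fact changes which
# actions get reinforced.

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 4))
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

# A buffer of past episodes: (situation, action, reward).
buffer = [
    (torch.randn(8), torch.tensor(2), 1.0),   # initially judged good
    (torch.randn(8), torch.tensor(0), 0.5),
]

# Much later, overseers realize the first action was actually bad and
# retroactively edit its reward. Nothing else about the data changes.
situation, action, _ = buffer[0]
buffer[0] = (situation, action, -1.0)

# The update is just "adjust the log-prob of this action in this situation,
# scaled by the (possibly edited) reward" -- a REINFORCE-style loss over
# stored pairs, the same form as per-example-weighted supervised learning.
for situation, action, reward in buffer:
    logp = torch.log_softmax(policy(situation), dim=-1)[action]
    loss = -reward * logp
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Nothing in the update distinguishes an “original” reward from an edited one; gradient descent only ever sees the number attached to the pair at the moment the update runs.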
This posits that the model has learned to wirehead
I don’t think it posits that the model has learned to wirehead—directly being motivated to maximize reward or being motivated by anything causally downstream of reward (like “more copies of myself” or “[insert long-term future goal that requires me being around to steer the world toward that goal]”) would work.
The claim I’m making is that somehow you made a gradient update toward a model that is more likely to behave well according to your judgment after the edit—and two salient ways that update could be working on the inside are “the model learns to care a bit more about long-run reward as it stands after edits” and “the model learns to care a bit more about something downstream of long-run reward.”
A lot of updates like this seem to push the model toward caring a lot about one of those two things (or some combo) and away from caring about the immediate rewards you were citing earlier as a reason it may not want to take over.
I don’t think it posits that the model has learned to wirehead—directly being motivated to maximize reward or being motivated by anything causally downstream of reward (like “more copies of myself” or “[insert long-term future goal that requires me being around to steer the world toward that goal]”) would work.
A lot of updates like this seem to push the model toward caring a lot about one of those two things (or some combo) and away from caring about the immediate rewards you were citing earlier as a reason it may not want to take over.
Ah, gotcha. This is definitely a convincing argument that models will learn to value things longer-term (with a lower discount rate), and I shouldn’t have used the phrase “short-term” there. I don’t yet think it’s a convincing argument that the long-term thing it will come to value won’t basically be the long-term version of “make humans smile more”, but you’ve helpfully left another comment on that point, so I’ll shift the discussion there.