I think updating negatively on the situation/action pair has functionally the same effect as changing the reward to be what you now think it should be: my understanding is that RL can itself be implemented as just updates on situation/action pairs, so you could have trained your whole model that way. Since you updated negatively on that situation/action pair because of something you noticed long after the action was complete, the update is still pushing your model to care about the longer run.
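As a minimal sketch of the claim that RL can be implemented as updates on situation/action pairs: a REINFORCE-style update is just a supervised-style gradient step on each (state, action) pair, weighted by a judgment that may arrive long after the action. The toy policy network, shapes, and the `update_on_pairs` helper below are illustrative assumptions, not anything from this discussion.

```python
import torch
import torch.nn as nn

# Hypothetical toy policy: maps a 4-dim state vector to logits over 3 discrete actions.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update_on_pairs(pairs):
    """REINFORCE-style update.

    `pairs` is a list of (state, action, weight) tuples, where `weight` is the
    observed return, or a negative number for a post-hoc negative judgment.
    Each element is just a supervised-style gradient step on the
    (situation, action) pair, scaled by that weight.
    """
    loss = torch.zeros(())
    for state, action, weight in pairs:
        logits = policy(state)
        log_prob = torch.log_softmax(logits, dim=-1)[action]
        loss = loss - weight * log_prob  # push the action up if weight > 0, down if < 0
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Updating negatively on one (situation, action) pair, based on something noticed
# long after the action, is the same kind of gradient step as ordinary RL training
# with an edited reward:
state = torch.randn(4)
update_on_pairs([(state, 2, -1.0)])
```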
This posits that the model has learned to wirehead
I don’t think it posits that the model has learned to wirehead: either being directly motivated to maximize reward or being motivated by anything causally downstream of reward (like “more copies of myself” or “[insert long-term future goal that requires me being around to steer the world toward that goal]”) would work.
The claim I’m making is that somehow you made a gradient update toward a model that is more likely to behave well according to your judgment after the edit, and two salient ways that update could be working on the inside are “the model learns to care a bit more about long-run reward after editing” and “the model learns to care a bit more about something downstream of long-run reward.”
A lot of updates like this seem to push the model toward caring a lot about one of those two things (or some combo) and away from caring about the immediate rewards you were citing earlier as a reason it may not want to take over.
Ah, gotcha. This is definitely a convincing argument that models will learn to value things longer-term (with a lower discount rate), and I shouldn’t have used the phrase “short-term” there. I don’t yet think it’s a convincing argument that the long-term thing they will come to value won’t basically be the long-term version of “make humans smile more”, but you’ve helpfully left another comment on that point, so I’ll shift the discussion there.