What I meant by that was that the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
I’m thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer it builds.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so the utility function remains static, while the utility function still encourages learning more about the world (since a more accurate model may lead to accruing more utility).
If you want to somehow make it so that the original “goal” of the button pressing is a “terminal goal” and the goals built into the model are “instrumental goals”, you need to actively work to make it happen.
Yeah, but isn’t the reinforcement learning algorithm doing that active work? When the button is unexpectedly pressed, the agent increases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. When the button is unexpectedly not pressed, the agent decreases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
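The update rule being described here is essentially temporal-difference learning. A minimal sketch of those mechanics (the state names, learning rate, and discount factor are all illustrative assumptions, not anyone's actual implementation):

```python
def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.9):
    # Nudge the value of `state` toward reward + discounted value of the
    # successor state. An unexpected button press (positive reward) raises
    # the value of the current state; an unexpected non-press lowers it.
    v = values.get(state, 0.0)
    target = reward + gamma * values.get(next_state, 0.0)
    values[state] = v + alpha * (target - v)
    return values

values = {}
# The arm gets knocked into the button: the state just before the press
# gains value...
td_update(values, "arm_near_button", "terminal", reward=1.0)
# ...and a later update propagates that value one step further back, to
# states that lead to the now-valuable state.
td_update(values, "reaching", "arm_near_button", reward=0.0)
```

After these two updates, "arm_near_button" has positive value and "reaching" has inherited a smaller positive value from it, which is the backward propagation described above.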
What I meant by that was the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
I’m not sure how the feelings would map onto the analysable simple AI.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so the utility function remains static, while the utility function still encourages learning more about the world (since a more accurate model may lead to accruing more utility).
The issue here is that both the utility and the actual modelling of what the world is are implemented inside the same “model” which the trainer adjusts.
And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
Yes, of course (up to the learning rate, obviously; it may not work on the first try). That’s not in dispute. What is in dispute is the capacity to predict this from a state where the button is not yet associated with reward.
I think I see the disagreement here. You picture that the world model contains a model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human pressing the button), right?
I picture that it would not learn such details right away (that is a complicated model to learn). Instead, the model would return primeness as output by the primeness calculation, and the agent would act to maximize that primeness.
edit: and as for turning off the learning algorithm, it doesn’t matter for the point I am making whether it is turned off or on, because I am considering the processing (or generation) of hypothetical actions during the agent’s choice of an action (i.e. between learning steps).
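The picture being described, where the model contains no button at all and utility is just the primeness calculation applied directly to the world state, might be sketched like this (a toy agent; the names and the primality check are assumptions for illustration):

```python
def is_prime(n):
    # Trial division, adequate for a toy example.
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def utility(blackboard):
    # No model of a button anywhere: utility is the primeness
    # calculation applied directly to the numbers on the blackboard.
    return sum(1 for n in blackboard if is_prime(n))

def choose_action(blackboard, candidate_numbers):
    # Between learning steps, the agent evaluates hypothetical actions
    # (writing one more number) against the primeness utility and picks
    # the one that maximizes it.
    return max(candidate_numbers, key=lambda n: utility(blackboard + [n]))
```

On this picture the agent maximizes primeness as computed, and the button (or the human behind it) simply isn’t a detail the model represents.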
I think I see the disagreement here. You picture that the world model contains a model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human pressing the button), right?
Sort of. I think that the agent is aware of how malleable its world model is, and sees adjustments of that world model which lead to it being rewarded more as positive.
I don’t think that the robot knows that pressing the button causes it to be rewarded by default. The button has to get into the model somehow, and I agree with you that it’s a burdensome detail in that something must happen for the button to get into the model. For the robot-blackboard-button example, it seems unlikely that the robot would discover the button if it’s outside of the reach of the arm; if it’s inside the reach, it will probably spend some time exploring and so will probably find it eventually.
That the agent would explore is a possibly nonobvious point which I was assuming. I do think it likely that a utility-maximizer which knows its utility function is governed by a reinforcement learning algorithm will expect that exploring unknown places has a small chance of being rewarding, and so will think there’s always some value to exploration even if it spends most of its time exploiting. For most modern RL agents, I think this is hardcoded in, but if the utility maximizer is sufficiently intelligent (and expects to live sufficiently long) it will figure out that it maximizes total expected utility by spending some small fraction of time exploring areas with high uncertainty in the reward and spending the rest exploiting the best found reward. (You can see humans talking about the problem of preference uncertainty in posts like this or this.)
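The hardcoded exploration mentioned above is typically something like epsilon-greedy action selection. A minimal sketch (the epsilon value and action names are assumptions):

```python
import random

def epsilon_greedy(action_values, epsilon=0.05, rng=random):
    # `action_values` maps actions to current reward estimates. With
    # probability epsilon the agent picks an arbitrary action, which is
    # how it can stumble onto something like the button; otherwise it
    # exploits the best estimate found so far.
    if rng.random() < epsilon:
        return rng.choice(list(action_values))
    return max(action_values, key=action_values.get)
```

A sufficiently reflective maximizer could derive an exploration schedule like this from its own uncertainty about the reward, rather than having epsilon fixed by its designers, which is the point above about a sufficiently intelligent agent figuring out the exploration/exploitation split on its own.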
But the class of recursively improving AI will find / know about the button by default, because we’ve assumed that the AI can edit itself and haven’t put any especial effort into preventing it from editing its goals (or the things which are used to calculate its goals, i.e. the series of changes you discussed). Saying “well, of course we’ll put in that especial effort and do it right” is useful if you want to speculate about the next challenge, but not useful to the engineer trying to figure out how to do it right. This is my read of why the problem seems important to MIRI; you need to communicate to the robot that it should actually optimize for primeness, not button-pressing, so that it will optimize correctly itself and be able to communicate that preference faithfully to future versions of itself.