I am not sure what “primes:=reward” could mean. I assume that a model is a mathematical function that returns the expected reward due to an action, and that it is used together with some sort of optimizer working on that function to find the best action.
The trainer adjusts the model based on the difference between its predicted rewards and the actual rewards, comparing that error to the errors arising from altered models (e.g. hill climbing of some kind, such as gradient-based learning).
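(For concreteness, a minimal sketch of that setup, assuming a toy linear model and random-perturbation hill climbing; the function names and the representation of actions as feature lists are my own illustrative choices, not anything specified above.)

```python
import random

def make_model(params):
    """A 'model' in this sense: a function from an action to its expected reward."""
    def predict_reward(action):
        # Toy stand-in: a linear score over the action's features.
        return sum(p * a for p, a in zip(params, action))
    return predict_reward

def prediction_error(params, history):
    """Mean squared difference between predicted rewards and actual rewards."""
    model = make_model(params)
    return sum((model(action) - reward) ** 2 for action, reward in history) / len(history)

def train_step(params, history, step=0.1):
    """Trainer: compare the current model against a slightly altered one,
    and keep whichever predicts the observed rewards better (hill climbing)."""
    candidate = [p + random.uniform(-step, step) for p in params]
    if prediction_error(candidate, history) < prediction_error(params, history):
        return candidate
    return params

def best_action(model, candidate_actions):
    """Optimizer: search the model function for the action with the highest expected reward."""
    return max(candidate_actions, key=model)
```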
So after the successful training to produce primes, the model consists of: a model of how the actions move the arm, the chalk, and the blackboard; the resulting state of the chalk on the blackboard is then fed into a number recognizer and a primality check (and a count of how many primes are on the blackboard versus how many primes were there before), the result of which is returned as the expected reward.
The optimizer, then, finds actions that put new primes on the blackboard by somehow finding a maximum of the model function (one would normally build the model out of building blocks that make it easy to analyse).
The model and the optimizer work together to produce actions as a classic utility maximizer that is maximizing for primes on the blackboard.
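(A heavily simplified illustration of what that trained pipeline might look like; `simulate_arm` and `read_numbers` stand in for the learned arm/chalk dynamics and the number recognizer, and are hypothetical placeholders rather than anything specified above.)

```python
def is_prime(n):
    """The primality check sitting inside the learned reward pipeline."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def expected_reward(action, blackboard, simulate_arm, read_numbers):
    """Trained 'model': predicted reward = count of new primes the action would add."""
    primes_before = {n for n in read_numbers(blackboard) if is_prime(n)}
    board_after = simulate_arm(action, blackboard)  # arm/chalk/blackboard model
    primes_after = {n for n in read_numbers(board_after) if is_prime(n)}
    return len(primes_after - primes_before)

def choose_action(candidate_actions, blackboard, simulate_arm, read_numbers):
    """Model plus optimizer acting together as the prime-maximizing utility maximizer."""
    return max(candidate_actions,
               key=lambda a: expected_reward(a, blackboard, simulate_arm, read_numbers))
```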
I’m thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer that it has built. In some situations the operation of the training software can lower the expected utility of this specific utility maximizer (by replacing it with another expected utility maximizer); in others (small adjustments to the part that models the robot arm and the chalk) it can raise it.
Really, it seems to me that a great deal of confusion about AI arises from attributing to it some sort of “body integrity” feeling that would make it care about what the electrical components and code sitting in the same project folder “want”, but not care about an external human in the same capacity.
If you want to somehow make it so that the original “goal” of button-pressing is a “terminal goal” and the goals built into the model are “instrumental goals”, you need to actively work to make that happen: come up with an entirely new, more complex, and less practically useful architecture. It won’t happen by itself, and especially not in an AI that starts out knowing nothing about any buttons. It won’t happen just because the whole thing sort of resembles some fuzzy, poorly grounded abstractions such as “agent”.
sidenote:
One might also want to use the difference between its predicted webcam image and the real webcam image, though this is the kind of thing that is still very far from working.
Also, one could lump the optimizer into the “model” and have the optimizer get adjusted by the training method as well; that is not important to the discussion.
What I meant by that was that the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
I’m thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer that it has built.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so the utility function remains static, while the utility function still encourages learning more about the world (since a more accurate model may lead to accruing more utility).
If you want to somehow make it so that the original “goal” of button-pressing is a “terminal goal” and the goals built into the model are “instrumental goals”, you need to actively work to make that happen
Yeah, but isn’t the reinforcement learning algorithm doing that active work? When the button is unexpectedly pressed, the agent increases its estimate of the value of the state it’s currently in, and propagates that backwards to the states that lead to the current state. When the button is unexpectedly not pressed, the agent decreases its estimate of the value of the state it’s currently in, and propagates that backwards to the states that lead to the current state. And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
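(Roughly the update being described, in the style of tabular TD(0); the state representation, learning rate, and discount are placeholders. With this kind of update the backward propagation happens over repeated visits, as earlier states bootstrap off the now-higher values of later states; eligibility traces would do it in one sweep.)

```python
from collections import defaultdict

def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.9):
    """An unexpected button press (reward) raises the value of the current state;
    an unexpected non-press lowers it. Earlier states pick up the change because
    they bootstrap off values[next_state]."""
    surprise = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * surprise
    return values

# e.g. values = defaultdict(float); td_update(values, state, next_state, reward)
```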
What I meant by that was that the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
I’m not sure how the feelings would map onto the analysable simple AI.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so the utility function remains static, while the utility function still encourages learning more about the world (since a more accurate model may lead to accruing more utility).
The issue here is that both the utility and the actual modelling of what the world is are implemented inside that “model” which the trainer adjusts.
And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
Yes, of course (up to the learning constant, obviously; it may not work on the first try). That’s not in dispute. What is in dispute is the capacity to predict this from a state where the button is not yet associated with reward.
I think I see the disagreement here. You picture that the world model contains a model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human who’s pressing the button), right?
I picture that it would not learn such details right off, since that is a complicated model to learn; the model would return primeness as output by the primeness calculation, and would serve to maximize that primeness.
edit: and as for turning off the learning algorithm, it doesn’t matter for the point I am making whether it is turned off or on, because I am considering the processing (or generation) of the hypothetical actions during the agent’s choice of an action (i.e. between learning steps).
I think I see the disagreement here. You picture that the world model contains a model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human who’s pressing the button), right?
Sort of. I think that the agent is aware of how malleable its world model is, and sees as positive those adjustments to its world model which lead to it being rewarded more.
I don’t think that the robot knows that pressing the button causes it to be rewarded by default. The button has to get into the model somehow, and I agree with you that it’s a burdensome detail in that something must happen for the button to get into the model. For the robot-blackboard-button example, it seems unlikely that the robot would discover the button if it’s outside of the reach of the arm; if it’s inside the reach, it will probably spend some time exploring and so will probably find it eventually.
That the agent would explore is a possibly nonobvious point which I was assuming. I do think it likely that a utility-maximizer which knows its utility function is governed by a reinforcement learning algorithm will expect that exploring unknown places has a small chance of being rewardful, and so will think there’s always some value to exploration even if it spends most of its time exploiting. For most modern RL agents, I think this is hardcoded in, but if the utility maximizer is sufficiently intelligent (and expects to live sufficiently long) it will figure out that it maximizes total expected utility by spending some small fraction of time exploring areas with high uncertainty in the reward and spending the rest exploiting the best found reward. (You can see humans talking about the problem of preference uncertainty in posts like this or this.)
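(The “hardcoded in” exploration of most modern RL agents is typically something like epsilon-greedy action selection; a minimal sketch, with an arbitrary epsilon.)

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.05):
    """Spend a small fraction of the time exploring, the rest exploiting."""
    if random.random() < epsilon:
        return random.choice(actions)                         # explore
    return max(actions, key=lambda a: q_values.get(a, 0.0))   # exploit
```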
But the class of recursively improving AI will find / know about the button by default, because we’ve assumed that the AI can edit itself and haven’t put any especial effort into preventing it from editing its goals (or the things which are used to calculate its goals, i.e. the series of changes you discussed). Saying “well, of course we’ll put in that especial effort and do it right” is useful if you want to speculate about the next challenge, but not useful to the engineer trying to figure out how to do it right. This is my read of why the problem seems important to MIRI; you need to communicate to the robot that it should actually optimize for primeness, not button-pressing, so that it will optimize correctly itself and be able to communicate that preference faithfully to future versions of itself.