But suppose the AI discovers that there’s a button we press whenever it produces a prime, that it could press that button itself, and that doing so would be far easier than calculating primality. What in the reinforcement learning algorithm prevents it from exploiting this superior reward channel?
It’s not in the reinforcement learning algorithm, it’s inside the model that the learning algorithm has built.
It initially found that having a prime written on the blackboard results in a reward. In the learned model, there’s some model of chalk-board interaction, some model of arm movement, a model of how to read numbers from the blackboard, and a function over the state of the blackboard which checks whether the number on the blackboard is a prime. The AI generates actions so as to maximize this compound function which it has learned.
That function (unlike the input to the reinforcement learning algorithm) does not increase when the reward button is pressed. Ideally, with enough reflective foresight, pressing the button on non-primes is predicted to decrease the expected value of the learned function.
If that is not predicted, well, that won’t stop at the button. The button might develop rust and interrupt the current, so why not pull up a pin on the CPU? And it won’t stop at the pin: why not set the RAM cells that this pin controls to 1, and while you’re at it, why not change the downstream logic that those RAM cells control, all the way through the implementation, until it’s reconfigured into something that doesn’t maximize anything any more, not even the duration of its own existence.
edit: I think the key is to realize that reinforcement learning is one algorithm, while the structures manipulated by RL implement a different algorithm.
I think the key is to realize that reinforcement learning is one algorithm, while the structures manipulated by RL implement a different algorithm.
I assume what you mean here is that RL optimizes over strategies, and strategies appear to optimize over outcomes.
It’s not in the reinforcement learning algorithm, it’s inside the model that the learning algorithm has built.
I’m imagining that the learning algorithm stays on. When we reward it for checking primes, it checks primes; when we stop rewarding it for that and start rewarding it for computing squares, it learns to stop checking primes and start computing squares.
And if the learning algorithm stays on and it realizes that “pressing the button” is an option along with “checking primes” and “computing squares,” then it wireheads itself.
If that is not predicted, well, that won’t stop at the button
Agreed; I refer to this as the “abulia trap.” It’s not obvious to me, though, that all AIs fall into either “Friendly AIs with stable goals” or “abulic AIs which aren’t dangerous,” since there might be ways to prevent an AI from wireheading itself that don’t prevent it from changing its goals from something Friendly to something Unfriendly.
When we reward it for checking primes, it checks primes; when we stop rewarding it for that and start rewarding it for computing squares, it learns to stop checking primes and start computing squares.
One note (not sure if it is already clear enough or not): the “it” that changes the models in response to actual rewards (and perhaps the sensory information) is a different “it” from the “it” made up of the models and the assorted maximization code. The former “it” does no modelling and doesn’t understand the world. The latter “it”, which I will now talk about, actually works to draw primes (provided that the former “it”, being fairly stupid, didn’t fit the models too well).
If the action space contains an action that the model predicts will prevent some “primes not drawn” scenario, it will prefer that action. So if it has the option of writing “please stick to the primes” or even “please don’t force my robotic arm to touch my reward button”, and if it can foresee that such statements would be good for the prime-drawing future, it will take those actions.
edit: Also, reinforcement-based learning really isn’t all that awesome. The leap from “doing primes” to “pressing the reward button” is pretty damn huge.
And please note that there is no logical contradiction for the model to both represent the reward as primeness and predict that touching the arm to the button will trigger a model adjustment that would lead to representation of a reward as something else.
(I prefer to use the example with a robotic arm drawing on a blackboard because it is not too simple to be relevant)
since there might be ways to prevent an AI from wireheading itself that don’t prevent it from changing its goals from something Friendly to something Unfriendly.
Which sounds more like an FAI-work-gone-wrong scenario to me.
One note (not sure if it is already clear enough or not).
I think we agree on the separation but I think we disagree on the implications of the separation. I think this part highlights where:
predict that touching the arm to the button will trigger a model adjustment that would lead to representation of a reward as something else.
If what the agent “wants” is reward, then it should like model adjustments that increase the amount of reward it gets and dislike model adjustments that decrease the amount of reward it gets. (For a standard gradient-based reinforcement learning algorithm, this is encoded by adjusting the model based on the difference between its expected and actual reward after taking an action.) This is obvious for it_RL, and not obvious for it_prime.
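To make that concrete, here is a minimal sketch of the kind of gradient-based update being described, assuming a simple linear reward model; the names (`predict_reward`, `features`, `alpha`) are illustrative, not anything from the discussion.

```python
import numpy as np

alpha = 0.1                  # learning rate (assumed)
w = np.zeros(8)              # weights of a simple linear reward model

def predict_reward(w, features):
    """The model's expected reward for taking an action in a given state."""
    return float(np.dot(w, features))

def update(w, features, actual_reward):
    """Adjust the model toward whatever actually made the button fire."""
    error = actual_reward - predict_reward(w, features)  # expected vs. actual reward
    return w + alpha * error * features                  # gradient step for a linear model
```

The update only ever sees the scalar reward signal, so anything that makes the button fire more often gets reinforced; that is the sense in which the point is obvious for it_RL and not for it_prime.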
I’m not sure I’ve fully followed through on the implications of having the agent be inside the universe it can impact, but the impression I get is that the agent is unlikely to learn a preference for having a durable model of the world. (An agent that did so would learn more slowly, be less adaptable to its environment, and exert less effort in adapting its environment to itself.) It seems to me that you think it would be natural that the RL agent would learn a strategy which took actions to minimize changes to its utility function / model of the world, and I don’t yet see why.
Another way to look at this: I think you’re putting forward the proposition that it would learn the model
reward := primes
Whereas I think it would learn the model
primes := reward
That is, the first model thinks that internal rewards are instrumental values and primes are the terminal values, whereas the second model thinks that internal rewards are terminal values and primes are instrumental values.
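A toy sketch of the two candidate learned models, purely to make the distinction concrete; every name in it (`board_numbers`, `predicted_button_presses`, and so on) is invented for illustration.

```python
def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def value_if_reward_means_primes(board_numbers):
    # "reward := primes": primes on the blackboard are terminal;
    # the button press is merely how the trainer signalled them.
    return sum(1 for n in board_numbers if is_prime(n))

def value_if_primes_mean_reward(predicted_button_presses):
    # "primes := reward": button presses are terminal; writing primes
    # is valued only insofar as it predicts more presses, so pressing
    # the button directly scores just as well.
    return predicted_button_presses
```

An optimizer pointed at the first function has no use for the button; an optimizer pointed at the second will press it whenever that is cheaper than writing primes.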
I am not sure what “primes := reward” could mean. I assume that a model is a mathematical function that returns the expected reward due to an action, and that it is used together with some sort of optimizer working on that function to find the best action.
The trainer adjusts the model based on the difference between its predicted rewards and the actual rewards, comparing against the predictions arising from altered models (e.g. some kind of hill climbing, as in gradient learning).
So after successful training to produce primes, the model consists of: a model of arm motion given the actions, the chalk, and the blackboard; the state of chalk on the blackboard is then fed into a number recognizer and a prime check (plus a count of how many primes are on the blackboard versus how many were there before), the result of which is returned as the expected reward.
The optimizer then finds actions that put new primes on the blackboard by finding a maximum of the model function somehow (one would normally build the model out of building blocks that make it easy to analyse).
The model and the optimizer work together to produce actions as a classic utility maximizer that is maximizing for primes on the blackboard.
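Here is a minimal sketch of that model-plus-optimizer arrangement, assuming a crude sampling optimizer; all component names (`simulate_board`, `expected_reward`, `choose_action`) are invented for illustration.

```python
import random

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def simulate_board(action, board):
    """Stand-in for the learned arm/chalk/blackboard dynamics model.
    Here an 'action' is simply the number the arm would write."""
    return board + [action]

def expected_reward(action, board):
    """The learned compound function: dynamics -> number reading -> prime check."""
    new_board = simulate_board(action, board)
    return sum(1 for n in new_board if is_prime(n))   # count of primes = predicted reward

def choose_action(board, n_candidates=1000):
    """A crude optimizer: sample candidate actions, keep the best under the model."""
    candidates = [random.randint(2, 999) for _ in range(n_candidates)]
    return max(candidates, key=lambda a: expected_reward(a, board))
```

The reward button appears nowhere inside `expected_reward`, which is exactly the sense in which this pair behaves as a utility maximizer for primes on the blackboard rather than for button presses.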
I’m thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer that it built. The operation of the training software can in some situations lower the expected utility of this utility maximizer specifically (due to replacement of it with another expected utility maximizer); in others (small adjustments to the part that models the robot arm and the chalk) it can raise it.
Really, it seems to me that a great deal of the confusion about AI arises from attributing to it some sort of “body integrity” feeling that would make it care about what the electrical components and code sitting in the same project folder “want”, but not care about an external human in the same capacity.
If you want to somehow make it so that the original “goal” of the button pressing is a “terminal goal” and the goals built into the model are “instrumental goals”, you need to actively work to make it happen: come up with an entirely new, more complex, and less practically useful architecture. It won’t happen by itself, and especially not in an AI that starts out knowing nothing about any buttons. It won’t happen just because the whole thing sort of resembles some fuzzy, poorly grounded abstractions such as “agent”.
sidenote:
One might also want to use the difference between its predicted webcam image and the real webcam image, though that is the kind of thing that is very far from working.
Also, one could lump the optimizer into the “model” and have the optimizer be adjusted by the training method as well, but that is not important to the discussion.
What I meant by that was that the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
I’m thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer that it built.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so that the utility function remains static while still encouraging learning more about the world (since a more accurate model may lead to accruing more utility).
If you want to somehow make it so that the original “goal” of the button pressing is a “terminal goal” and the goals built into the model are “instrumental goals”, you need to actively work to make it happen
Yeah, but isn’t the reinforcement learning algorithm doing that active work? When the button is unexpectedly pressed, the agent increases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. When the button is unexpectedly not pressed, the agent decreases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
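The mechanics being described are essentially a temporal-difference value update; here is a minimal tabular sketch, with the state representation, `alpha`, and `gamma` all assumed for illustration.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9          # learning rate and discount (assumed)
V = defaultdict(float)           # learned value of each state

def td_update(prev_state, reward, new_state):
    """An unexpected reward raises V[prev_state]; repeated over a trajectory,
    that increase propagates backwards to the states that led there."""
    td_error = reward + gamma * V[new_state] - V[prev_state]
    V[prev_state] += alpha * td_error
```

The update has no way to distinguish “a prime was written” from “the arm bumped the button”; both arrive as the same scalar reward.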
What I meant by that was that the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
I’m not sure how the feelings would map onto the analysable simple AI.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so that the utility function remains static while still encouraging learning more about the world (since a more accurate model may lead to accruing more utility).
The issue here is that both the utility and the actual modelling of what the world is are implemented inside that “model” which the trainer adjusts.
And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
Yes, of course (up to the learning constant, obviously; it may not work on the first try). That’s not in dispute. What is in dispute is the capacity to predict this from a state where the button is not yet associated with reward.
I think I see the disagreement here. You picture that the world model contains a model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human who’s pressing the button), right?
I picture that it would not learn such details right off (that is a complicated model to learn); the model would return primeness as output by the primeness calculation, and the agent would act to maximize that primeness.
edit: And as for turning off the learning algorithm, it doesn’t matter for the point I am making whether it is turned on or off, because I am considering the processing (or generation) of the hypothetical actions during the agent’s choice of an action (i.e. between learning steps).
I think I see the disagreement here. You picture that the world model contains a model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human who’s pressing the button), right?
Sort of. I think that the agent is aware of how malleable its world model is, and sees adjustments of that world model which lead to it being rewarded more as positive.
I don’t think that the robot knows that pressing the button causes it to be rewarded by default. The button has to get into the model somehow, and I agree with you that it’s a burdensome detail in that something must happen for the button to get into the model. For the robot-blackboard-button example, it seems unlikely that the robot would discover the button if it’s outside of the reach of the arm; if it’s inside the reach, it will probably spend some time exploring and so will probably find it eventually.
That the agent would explore is a possibly nonobvious point which I was assuming. I do think it likely that a utility-maximizer which knows its utility function is governed by a reinforcement learning algorithm will expect that exploring unknown places has a small chance of being rewardful, and so will think there’s always some value to exploration even if it spends most of its time exploiting. For most modern RL agents, I think this is hardcoded in, but if the utility maximizer is sufficiently intelligent (and expects to live sufficiently long) it will figure out that it maximizes total expected utility by spending some small fraction of time exploring areas with high uncertainty in the reward and spending the rest exploiting the best found reward. (You can see humans talking about the problem of preference uncertainty in posts like this or this.)
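As a sketch of the hardcoded exploration being referred to: mostly exploit the best-looking action, but keep a small exploration probability plus an uncertainty bonus for rarely-tried actions. The parameter names and values below are assumptions, not anything from the discussion.

```python
import math
import random

def choose_action(actions, value_estimate, visit_count, total_visits,
                  epsilon=0.05, bonus_weight=1.0):
    """Mostly exploit; occasionally explore actions with uncertain reward.

    value_estimate: dict mapping action -> estimated reward
    visit_count:    dict mapping action -> how many times it has been tried
    """
    if random.random() < epsilon:            # hardcoded exploration
        return random.choice(actions)
    def score(a):                            # UCB-style bonus for uncertainty
        bonus = bonus_weight * math.sqrt(
            math.log(total_visits + 1) / (visit_count[a] + 1))
        return value_estimate[a] + bonus
    return max(actions, key=score)
```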
But the class of recursively improving AI will find / know about the button by default, because we’ve assumed that the AI can edit itself and haven’t put any especial effort into preventing it from editing its goals (or the things which are used to calculate its goals, i.e. the series of changes you discussed). Saying “well, of course we’ll put in that especial effort and do it right” is useful if you want to speculate about the next challenge, but not useful to the engineer trying to figure out how to do it right. This is my read of why the problem seems important to MIRI; you need to communicate to the robot that it should actually optimize for primeness, not button-pressing, so that it will optimize correctly itself and be able to communicate that preference faithfully to future versions of itself.