I requested feedback about this paper here.

One of my conclusions was that you could, in theory, train a Solomonoff Induction-based reinforcement learning agent to produce arbitrary finite sequences of actions (non-self-destructive ones, anyway) in response to specified sets of finite sense data—assuming you are allowed to program its reward function and give it fake memories dating from before it was born.
This is essentially the same result as is claimed for O-Maximisers in the paper. This undermines the thesis that O-Maximisers somehow exhibit different dynamics from reinforcement learning agents.
Update on 2011-04-30: Bill Hibbard makes a point almost identical to the observations I made in this comment. You can see it in his post on the AGI mailing list, here.
Thanks for posting this around! It’s great to see it creating discussion.
I’m working on replies to the points you, Bill Hibbard, and Curt Welch have made. It looks like I have some explaining to do if I want to convince you that O-maximizers aren’t a subset of reward maximizers—in particular, that my argument in appendix B doesn’t apply to O-maximizers.
To recap, my position is that both expected reward maximisers and expected utility maximisers are universal learners—and so can perform practically any series of non-self-destructive actions in a configurable manner in response to inputs. So, I don’t think either system necessarily exhibits the “characteristic behaviour” you describe.
Response to Curt Welch:

Sadly, what he seems to have failed to realize is that any actual implementation of an O-Maximizer or his Value-learners must also be a reward maximizer. Is he really that stupid so as not to understand they are all reward maximizers?
Zing! I guess he didn’t think I was going to be reading that. To be fair, it may seem to him that I’ve made a stupid error, thinking that O-maximizers behave differently than reward maximizers. I’ll try to explain why he’s mistaken.
A reward maximizer acts so as to bring about universes in which the rewards it receives are maximized. For this reason, it will predict and may manipulate the future actions of its rewarder.
An O-maximizer with utility function U acts so as to bring about universes which score highly according to U. For this reason, it is quite unlikely to manipulate or alter its utility function, unless its utility function directly values universes in which it self-alters.
In particular, note that an O-maximizer does not act so as to bring about universes in which the utility it assigns to the universe is maximized. Where the reward maximizer predicts and “cares about” what the rewarder will say tomorrow, an O-maximizer uses its current utility function to evaluate futures and choose actions.
O-maximizers and reward maximizers have different relationships with their “motivators” (utility function vs. rewarder), and they behave differently when given the option to alter their motivators. It seems clear to me that they are distinct.
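To make the contrast concrete, here is a minimal toy sketch (an invented example, not the formalism of the paper; the futures, probabilities, and utility numbers are all made up): the reward maximizer scores an action by the reward signal it predicts it will receive, while the O-maximizer scores the predicted future with its current, fixed U.

```python
# Toy contrast between a reward maximizer and an O-maximizer.
# Everything here (futures, probabilities, numbers) is invented purely
# for illustration; it is not the formalism from the paper.

# Each action leads to a small set of predicted futures. A future records
# (a) the reward signal the rewarder/reward channel would emit and
# (b) a description of the world that a utility function can score.
predicted_futures = {
    "press_own_reward_button": [
        # the world is neglected, but the reward channel reads maximal
        {"prob": 1.0, "reward_signal": 100.0, "world": "wireheaded"},
    ],
    "do_useful_work": [
        {"prob": 0.9, "reward_signal": 10.0, "world": "good_world"},
        {"prob": 0.1, "reward_signal": 0.0, "world": "mediocre_world"},
    ],
}

def utility(world: str) -> float:
    """The O-maximizer's current utility function U over world descriptions."""
    return {"good_world": 10.0, "mediocre_world": 2.0, "wireheaded": 0.0}[world]

def reward_maximizer_choice(futures):
    # Scores each action by expected *reward signal* -- whatever the
    # rewarder (or a hijacked reward channel) is predicted to emit.
    score = lambda a: sum(f["prob"] * f["reward_signal"] for f in futures[a])
    return max(futures, key=score)

def o_maximizer_choice(futures):
    # Scores each action by the expected utility of the predicted *world*,
    # evaluated with the agent's current U.
    score = lambda a: sum(f["prob"] * utility(f["world"]) for f in futures[a])
    return max(futures, key=score)

print(reward_maximizer_choice(predicted_futures))  # press_own_reward_button
print(o_maximizer_choice(predicted_futures))       # do_useful_work
```

With these made-up numbers the two rules already come apart: hijacking the reward channel looks best to the reward maximizer but scores zero under U.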
The only difference is in the algorithm it uses to calculate the “expected value”. Does he not understand that if you build a machine to do this, there must be hardware in the machine that calculates that expected value? And that such a machine can then be seen as two machines, one which is calculating the expected value, and the other which is picking actions to maximize the output of that calculation? And once you have that machine, his argument of appendix B once again applies?
Actually trying to apply the argument in Appendix B to an O-maximizer, implemented or in the abstract, using the definitions given in the paper instead of reasoning by analogy, is sufficient to show that this is also incorrect.
An agent of unbounded intelligence will always reach a point of understanding that it has the option to try to modify the reward function, which means the wirehead problem is always on the table.
It may have the option, but will it be motivated to alter its “reward function”? Consider an O-maximizer with utility function U. It acts to maximize the universe’s utility as measured by U. How would the agent’s alteration of its own utility function bring about universes that score highly according to U?
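A small hedged illustration of that question, with numbers invented for the purpose: when “replace U with some easy-to-satisfy U-prime” is on the menu, the O-maximizer still scores the resulting future with its current U, so the swap only wins if the current U already likes where it leads.

```python
# Toy illustration: an O-maximizer weighing self-modification with its
# *current* utility function. All worlds and numbers are invented.

def current_U(world: str) -> float:
    # The agent's current utility function over predicted worlds.
    return {"world_optimized_for_U": 10.0,
            "world_after_swapping_in_easy_U": 1.0}[world]

# Predicted consequences of each action, as (probability, resulting world).
options = {
    "keep_U_and_optimize_world": [(1.0, "world_optimized_for_U")],
    "replace_U_with_easy_U":     [(1.0, "world_after_swapping_in_easy_U")],
}

def expected_current_utility(action: str) -> float:
    # Futures are always scored by current_U, even futures in which the
    # agent's utility function has been replaced.
    return sum(p * current_U(world) for p, world in options[action])

print(max(options, key=expected_current_utility))  # keep_U_and_optimize_world
```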
OK, some responses from me:

A reward maximizer acts so as to bring about universes in which the rewards it receives are maximized. For this reason, it will predict and may manipulate the future actions of its rewarder.
An O-maximizer with utility function U acts so as to bring about universes which score highly according to U. For this reason, it is quite unlikely to manipulate or alter its utility function
The more obvious problem for utility maximisers is fake utility.
Actually trying to apply the argument in Appendix B to an O-maximizer [...] is sufficient to show that this is also incorrect.
My position here is a bit different from Curt’s. Curt will argue that both systems are likely to wirehead (and I don’t necessarily disagree—the set-up in the paper is not sufficient to prevent wireheading, IMO). My angle is more that both types of systems can be made into universal agents—producing arbitrary finite action sequences in response to whatever inputs you like.
...but your characterisation of the behaviour of reward maximizers and utility maximisers seems rather like a projection to me. IMO, actual behaviour will depend on what the systems believe their purpose is when they come to adjusting their brains. Since they both lack knowledge of the design purpose of their own goal systems, ISTM that the outcome could potentially vary. Maybe they will wirehead, maybe they won’t.
Ah, I see. Thanks for taking the time to discuss this—you’ve raised some helpful points about how my argument will need to be strengthened (“universal action” is good food for thought) and clarified (clearly, my account of wireheading is unconvincing).
The paper’s been accepted, and I have a ton of editing to do (need to cut four pages!), so I may not be very quick to respond for the time being. I didn’t want to disappear without warning, and without saying thanks for your time!
OK. I am skeptical that the wirehead problem can be solved simply by invoking expected utility maximisation. IMO, there are at least two problems that go beyond that:
How do you tell the system to maximise (say) temperature—and not some kind of proxy or perception of temperature? (There is a toy sketch of this gap after these two questions.)
How do you construct a practical inductive inference engine without using reinforcement learning?
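A toy sketch of the first question (the scenario and numbers are invented for illustration): an agent whose objective is wired to the thermometer reading, rather than to the temperature itself, prefers warming the sensor to warming the room.

```python
# Toy illustration of "temperature vs. perceived temperature".
# The scenario and numbers are invented; this just makes the proxy
# problem concrete.

# Candidate actions and their (true room temperature, thermometer reading).
# "heat_the_sensor" warms the thermometer itself without warming the room.
actions = {
    "heat_the_room":   {"true_temp": 22.0, "sensor_reading": 22.0},
    "heat_the_sensor": {"true_temp": 10.0, "sensor_reading": 35.0},
}

def proxy_objective(action: str) -> float:
    # What the agent actually optimizes if its goal is wired to the percept.
    return actions[action]["sensor_reading"]

def intended_objective(action: str) -> float:
    # What the designer meant: the temperature of the room itself.
    return actions[action]["true_temp"]

print(max(actions, key=proxy_objective))     # heat_the_sensor
print(max(actions, key=intended_objective))  # heat_the_room
```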
FWIW, my current position is that this probably isn’t our problem. The wirehead problem doesn’t become serious until relatively late on—leaving plenty of scope for transforming the world into a smarter place in the meantime.
Response to Bill Hibbard:

It seems to me that every O-maximizer can be expressed as a reward maximizer. Specifically, comparing equations (2) and (3), given an O-maximizer we can define the reward r_m (using “_” to mark a subscript) as:

r_m = Σ_{r ∈ R} U(r) P(r | yx_{≤m})

and r_i = 0 for i < m, where the paper sets m to the final time step, following Nick Hay. The reward maximizer so defined will behave identically with the O-maximizer.
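As a hedged toy reading of this construction (assuming R is a set of outcomes r, U the O-maximizer’s utility function over them, and P(r | yx_{≤m}) the agent’s predicted probability of r given an interaction history ending with the candidate action), the defined final-step reward is just the expected utility, so the two argmaxes coincide by definition:

```python
# Toy reading of Hibbard's construction: define the final-step "reward"
# to be the O-maximizer's expected utility. Outcomes, utilities, and
# probabilities are invented; the candidate action is folded into the
# interaction history yx_{<=m} by indexing P with the action.

outcomes = ["r1", "r2"]
U = {"r1": 5.0, "r2": 1.0}

# P(r | interaction history ending with this action), per candidate action.
P = {
    "action_a": {"r1": 0.8, "r2": 0.2},
    "action_b": {"r1": 0.3, "r2": 0.7},
}

def expected_utility(action: str) -> float:
    # What the O-maximizer computes directly from U and its predictions.
    return sum(U[r] * P[action][r] for r in outcomes)

def defined_reward_m(action: str) -> float:
    # Hibbard's r_m: the same expected-utility sum, relabelled as a
    # "reward" delivered at the final time step m (all earlier r_i = 0).
    return sum(U[r] * P[action][r] for r in outcomes)

print(max(P, key=expected_utility))    # action_a
print(max(P, key=defined_reward_m))    # action_a: identical by construction
```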
In the reward-maximization framework, rewards are part of observations and come from the environment. You cannot define “r_m” to be equal to something mathematically, then call the result a reward maximizer; therefore, Hibbard’s formulation of an O-maximizer as a reward maximizer doesn’t work.
If this is correct, doesn’t the “characteristic behavior pattern” shown for reward maximizers in Appendix B, as stated in Section 3.1, also apply to O-maximizers?
Since the construction was incorrect, this argument does not hold.
My way of putting much the same idea was:

you could, in theory, train a Solomonoff Induction-based reinforcement learning agent to produce arbitrary finite sequences of actions (non-self-destructive ones, anyway) in response to specified sets of finite sense data—assuming you are allowed to program its reward function and give it fake memories dating from before it was born.
Basically Solomonoff Induction is a powerful learning mechanism, and with sufficient time and test cases, you could configure an agent based on it to behave in an arbitrary way[*] in response to any finite sense-stream after its “birth”—by giving it sufficient pre-birth training “memories”—which laboriously say: “if you see this, do this, and don’t do this or this or this”—for every possible bunch of observations, up to some finite length limit.
I call this sort of thing universal action—and I think reinforcement learning systems are capable of it.
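A crude sketch of what I mean (a toy stand-in only: a lookup table plays the role of the induced model, which a real Solomonoff Induction agent would instead have to infer from the fake memories; all the histories and actions are invented):

```python
# Toy stand-in for "universal action": configure an agent's post-birth
# behaviour entirely through fake pre-birth training memories.
# A real Solomonoff-induction agent would infer the rule generating these
# memories; here a plain lookup table plays that role, which is enough to
# show that the installed observation-to-action mapping can be arbitrary.

# Fake pre-birth memories: "when you have seen this observation history,
# this is the action that was rewarded." Entirely made-up demonstrations.
fake_memories = {
    ("red",): "step_left",
    ("red", "green"): "step_right",
    ("green",): "do_nothing",
    # ... one entry per observation history, up to some finite length limit
}

def act(observation_history: tuple) -> str:
    # Reproduce whatever the pre-birth memories demonstrated for this
    # history; fall back to a default outside the configured set.
    return fake_memories.get(observation_history, "default_action")

print(act(("red",)))          # step_left
print(act(("red", "green")))  # step_right
```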
Bill responds here. It is pretty much what I expected him to say.