A reward maximizer acts so as to bring about universes in which the rewards it receives are maximized. For this reason, it will predict and may manipulate the future actions of its rewarder.
An O-maximizer with utility function U acts so as to bring about universes which score highly according to U. For this reason, it is quite unlikely to manipulate or alter its utility function.
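To make the contrast concrete, here is a minimal sketch of the two decision rules (a toy, finite-horizon setting; candidate_futures, prob, the rewards and history fields, and U are placeholder names of mine, not the paper’s notation):

```python
# Toy sketch, not the paper's formalism. `candidate_futures(action)` yields the
# finite set of possible futures after taking `action`; `prob(f)` is the agent's
# credence in future `f`; `f.rewards` is the list of rewards received in it and
# `f.history` the full interaction history it contains.

def reward_value(action, candidate_futures, prob):
    """Reward maximizer: score an action by the expected sum of received rewards."""
    return sum(prob(f) * sum(f.rewards) for f in candidate_futures(action))

def outcome_value(action, candidate_futures, prob, U):
    """O-maximizer: score an action by the expected utility U assigns to the
    whole resulting history; the reward channel plays no special role."""
    return sum(prob(f) * U(f.history) for f in candidate_futures(action))

def best_action(actions, score):
    """Either agent then simply picks the highest-scoring action."""
    return max(actions, key=score)
```

The first quantity can be pushed up by influencing whatever process emits the rewards; the second depends only on how the resulting history scores under U.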
The more obvious problem for utility maximisers is fake utility.
Actually trying to apply the argument in Appendix B to an O-maximizer [...] is sufficient to show that this is also incorrect.
My position here is a bit different from Curt’s. Curt will argue that both systems are likely to wirehead (and I don’t necessarily disagree—the set-up in the paper is not sufficient to prevent wireheading, IMO). My angle is more that both types of systems can be made into universal agents—producing arbitrary finite action sequences in response to whatever inputs you like.
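To spell out what I mean by “universal action”, here’s a toy construction of my own (nothing like it appears in the paper): pick any finite table of desired responses, and you can write a utility function that makes an O-maximizer play exactly that table. An analogous reward schedule does the same for a reward maximizer.

```python
# Toy construction (mine, not the paper's): utility 1 for histories whose
# actions agree with a designer-chosen lookup table, 0 otherwise.  An
# O-maximizer with this U, in a deterministic toy setting, reproduces the
# table exactly, i.e. an arbitrary finite input-to-action mapping.

def universal_utility(desired_policy):
    """desired_policy: dict mapping a tuple of observations seen so far to the
    action the designer wants taken at that point."""
    def U(history):
        # history: sequence of (observations_so_far, action_taken) pairs.
        return 1.0 if all(
            act == desired_policy.get(obs, act) for obs, act in history
        ) else 0.0
    return U

# Example table: two one-step observation histories and the required responses.
U = universal_utility({("hot",): "open_window", ("cold",): "close_window"})
```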
The more obvious problem for utility maximisers is fake utility.
...but your characterisation of the behaviour of reward maximizers and utility maximisers seems rather like a projection to me. IMO, actual behaviour will depend on what the systems believe their purpose is when they come to adjusting their brains. Since they both lack knowledge of the design purpose of their own goal systems, ISTM that the outcome could potentially vary. Maybe they will wirehead, maybe they won’t.
Ah, I see. Thanks for taking the time to discuss this—you’ve raised some helpful points about how my argument will need to be strengthened (“universal action” is good food for thought) and clarified (clearly, my account of wireheading is unconvincing).
The paper’s been accepted, and I have a ton of editing to do (need to cut four pages!), so I may not be very quick to respond for the time being. I didn’t want to disappear without warning, and without saying thanks for your time!
OK. I am skeptical that the wirehead problem can be solved simply by invoking expected utility maximisation. IMO, there are at least two problems that go beyond that:
How do you tell the system to maximise (say) temperature—and not some kind of proxy or perception of temperature? (See the sketch after this list.)
How do you construct a practical inductive inference engine without using reinforcement learning?
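A toy illustration of the first problem, with every name below my own invention:

```python
# The "utility" that is easy to write down is computed from the agent's own
# sensor stream, and can be maximised by tampering with the sensor.  The one
# the designer meant requires inferring the actual temperature behind the
# readings, and it is that inference step whose robustness is in question.

def perceived_temperature_utility(percepts):
    """Easy to implement, easy to wirehead: just make the readings come out high."""
    return sum(p.sensor_reading for p in percepts) / len(percepts)

def actual_temperature_utility(percepts, world_model):
    """What was wanted: `world_model.estimate_actual_temperature` is a
    placeholder for an inference from percepts to the real quantity."""
    return world_model.estimate_actual_temperature(percepts)
```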
FWIW, my current position is that this probably isn’t our problem. The wirehead problem doesn’t become serious until relatively late on—leaving plenty of scope for transforming the world into a smarter place in the meantime.