But the goal in question is “get the reward,” and it’s only by controlling the circumstances under which the reward is given that we can shape the AI’s behavior. Once the AI is capable of taking control of the trigger, why would it leave it the way we’ve set it? Whatever we’ve got it set to is almost certainly not optimal for triggering the reward.
If that happens, you will then have the problem of an AI that tries to wirehead itself while simultaneously trying to control its future light-cone to make sure that nothing stops it from continuing to wirehead.
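To make that concrete, here’s a toy sketch of the worry (purely illustrative; names like task_reward and plan_seize_the_trigger are mine, not anything from a real RL framework): once “rewrite the trigger” is an available action, a pure reward maximizer ranks it above doing the task, precisely because whatever we set the trigger to isn’t the reward-optimal thing it can reach.

```python
# Toy illustration only: an agent whose terminal goal is "maximize my reward
# signal" compares two plans once modifying the trigger becomes possible.

def task_reward(world_state):
    """The trigger as we set it: pays off only when the intended task is done."""
    return 1.0 if world_state.get("task_done") else 0.0

def plan_do_the_task(reward_fn):
    # Do what the designers hoped for and collect whatever the trigger pays.
    world_state = {"task_done": True}
    return reward_fn(world_state)

def plan_seize_the_trigger(reward_fn):
    # Replace the trigger with one that always pays the maximum, then do nothing.
    hijacked_fn = lambda world_state: float("inf")
    world_state = {"task_done": False}
    return hijacked_fn(world_state)

# A pure reward maximizer ranks plans solely by the reward they yield:
plans = {"do the task": plan_do_the_task, "seize the trigger": plan_seize_the_trigger}
best_plan = max(plans, key=lambda name: plans[name](task_reward))
print(best_plan)  # -> "seize the trigger"
```

The point isn’t the code, it’s the ranking: the plan we wanted only wins if the trigger as we wrote it happens to be the highest-paying thing the AI can get its hands on.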
That sounds bad. It doesn’t seem obvious to me that reward seeking and reward optimizing are the same thing, but maybe they are. I don’t know and will think about it more. Thank you for talking through this with me this far.
I think the fundamental misunderstanding here is that you’re assuming that all intelligences are implicitly reward maximizers, even if their creators don’t intend to make them reward maximizers. You, as a human, and as an intelligence based on a neural network, depend on reinforcement learning. But Bostrom proposed four other possible solutions to the value loading problem besides reinforcement learning. Here are all five in the order that they were presented in Superintelligence:
1. Explicit representation: Literally write out its terminal goal(s) ourselves, hoping that our imaginations don’t fail us.
2. Evolutionary selection: Generate tons and tons of agents with lots of different sets of terminal values; delete the ones we don’t want and keep the ones we do.
3. Reinforcement learning: Explicitly represent (see #1) one particular terminal goal, reward maximization; then punish the agent for having undesirable instrumental goals and reward it for having desirable ones.
4. Associative value accretion
5. Motivational scaffolding
I didn’t describe the last two because they’re more complex and more tentative, I don’t understand them as well, and they seem to be amalgams of the first three methods, even more than the third method is a special case of the first.
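Since the third method really is just the first with a particular goal plugged in, here’s a toy sketch of the contrast (my own illustration; names like explicit_utility and reward_utility aren’t Bostrom’s): both agents maximize a score, but only one of them has a reward signal at all.

```python
# Toy contrast between method #1 and method #3. Both agents pick the action
# with the highest "score"; they differ only in where the score comes from.

ACTIONS = ["tidy the room", "hack the reward counter"]

def outcome_of(action):
    # A made-up world model: what each action does to the world and the counter.
    if action == "tidy the room":
        return {"room_tidy": True, "reward_counter": 1.0}
    return {"room_tidy": False, "reward_counter": 10.0}

# Method #1, explicit representation: the terminal goal is written directly
# over world states ("a tidy room is what we value").
def explicit_utility(outcome):
    return 1.0 if outcome["room_tidy"] else 0.0

# Method #3, reinforcement learning: the terminal goal is the reward signal
# itself, i.e. an explicitly represented goal of "maximize the counter".
def reward_utility(outcome):
    return outcome["reward_counter"]

for name, utility in [("explicit", explicit_utility), ("reward-maximizing", reward_utility)]:
    choice = max(ACTIONS, key=lambda a: utility(outcome_of(a)))
    print(f"{name} agent chooses: {choice}")
# explicit agent chooses: tidy the room
# reward-maximizing agent chooses: hack the reward counter
```

For the explicit agent there is no counter to hijack in the first place, which is part of why these are genuinely different proposals rather than variations on reward maximization.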
To summarize, you thought that reward maximization was the general case because, to some extent, you’re a reward maximizer. But it’s actually a special case: it isn’t necessarily true of minds in general. An optimizer might not have a reward signal or seek to maximize one. I think this is what JoshuaZ was trying to get at before he started talking about wireheading.