Wait, you’re proposing to give the AGI an opportunity to hack into the simulation code and wirehead the NPC, during training? That seems hard to implement, right? When the AI is messing with the computer’s RAM, how do you ensure that it gets the reward you intended? What if the AI deletes the reward register!?
And if you’re not proposing that, then I disagree with “wireheading the NPC shouldn’t be an effective strategy for the AI”. Or rather, the question isn’t what is or isn’t an effective strategy, but rather what will the AI actually do. (For example, if the AI executes a kill-every-human plan, it’s no consolation if, from our perspective, the kill-every-human plan “shouldn’t be an effective strategy”!) From the AI’s perspective, it will have no direct experience to indicate whether a plan to hack into the simulation code and wirehead the NPC is a good plan or not. Instead, the AI’s assessment of that plan would have to involve extrapolating out-of-distribution, and that’s generally a tricky thing to predict and reason about—it depends on the structure of the AI’s internal world-model etc.
(Maybe you have a background assumption that the AI will be trying to maximize its future reward, whereas I don’t—Section 9.5 here.)
I’m proposing rewarding the AGI based on the initial utility function of its user. Changing that function, e.g. by wireheading the user or simply killing them (the user is now expressing zero dissatisfaction; mission accomplished!), does not increase the agent’s reward. I’m saying that it shouldn’t be an effective strategy in the same way that AlphaGo doesn’t decide to draw smiley faces on the Go board: that’s not something that gets rewarded, so the agent shouldn’t be drawn towards it.
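Roughly, the setup I have in mind looks like the sketch below (all of the names, env, agent, user, utility_fn, are made-up placeholders, not a real implementation): the reward comes from a frozen snapshot of the user’s utility function taken at episode start, so nothing the agent later does to the user can change the function that scores the outcome.

```python
import copy

def run_episode(env, agent, user):
    # Freeze the user's utility function at the start of the episode.
    # Whatever the agent later does to the user (wireheading, killing
    # them, editing their preferences) cannot touch this snapshot.
    initial_utility_fn = copy.deepcopy(user.utility_fn)

    state = env.reset(user)
    done = False
    while not done:
        action = agent.act(state)
        state, done = env.step(action)

    # Score the final state with the *initial* utility function, not
    # with whatever the (possibly modified) user now reports.
    return initial_utility_fn(state)
```

The point is just that the score is a function of the episode’s outcome under a frozen copy of the user’s preferences, so “make the user stop expressing dissatisfaction” isn’t a path to higher reward.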
To clarify, do you expect humans to want to be wireheaded, such that a human-value-maximizing AI (or quantilizing, or otherizing; part of the plan here is that a strategy other than maximization might be vastly better) would wirehead us? Or that we’d approve of the wireheading afterwards? Or that we wouldn’t want it either before or afterwards, but that the AI might nevertheless think it was a good idea? Answering you further will be much more constructive once this point is clear.
As for extrapolation out of distribution, that’s certainly a source of risk. One wouldn’t necessarily want the AI to hack the training sim, although if it were inclined to do so, seeing it do so during training would potentially help catch the problem (though of course deception risks mean that’s no guarantee). Still, wireheading might be doable in the training environment, at least if the user is represented in-sim as a neural net rather than a black box. Also, directly stimulating an agent’s pleasure centers isn’t the only expression of the wireheading dynamic; the more general failure mode here is short-circuiting a process, jumping to the destination when we also value the journey.
For example, imagine a Sisyphus user: someone who wants to roll a rock up a hill again and again (technically that’s more like the opposite of Sisyphus, who didn’t want to have to do that, but anyway). If the AI thinks of its reward as having the rock reach the top, it might rapidly vibrate the rock at the top of the hill so that it keeps counting as “reaching the summit”. While that isn’t direct wireheading, it’s the same sort of failure mode, and the simulation should discourage it (unless the vibration is what the Sisyphus user actually wants, and they only rolled the rock all the way up because they lacked the ability to adopt the vibration solution).
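To make that failure mode concrete, here’s a toy sketch (entirely made up, not a real reward design): a reward that only checks “did the rock reach the summit?” pays the vibration exploit handsomely, while a reward that insists on the full climb from the bottom does not.

```python
SUMMIT = 100  # rock height at the top of the hill (arbitrary units)

def naive_reward(trajectory):
    # Counts every event of "the rock reaches the summit". Jittering the
    # rock just below and above the threshold racks up summit events far
    # faster than honestly rolling from the bottom.
    reaches, was_at_summit = 0, False
    for height in trajectory:
        at_summit = height >= SUMMIT
        if at_summit and not was_at_summit:
            reaches += 1
        was_at_summit = at_summit
    return reaches

def journey_reward(trajectory):
    # Counts only complete climbs: the rock must return to the bottom
    # before the next summit counts, so vibrating at the top scores at
    # most one climb.
    climbs, at_bottom = 0, True
    for height in trajectory:
        if height <= 0:
            at_bottom = True
        elif height >= SUMMIT and at_bottom:
            climbs += 1
            at_bottom = False
    return climbs

honest    = [0, 50, 100, 50, 0, 50, 100]       # two full climbs
vibrating = [0, 50, 100] + [99, 100] * 5       # one climb, then jitter at the top

print(naive_reward(honest), naive_reward(vibrating))      # 2 vs 6
print(journey_reward(honest), journey_reward(vibrating))  # 2 vs 1
```

Whether the training setup should use something like the second scoring rule is exactly the question of whether the user values the journey or only the destination.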
I’m proposing rewarding the AGI based on the initial utility function of its user…
For one thing, inner misalignment can just be really weird and random. As a human example, consider superstitions. There’s nothing in our evolutionary history or in our genome that should give a human a desire to carry around a rabbit’s foot, and nothing in our current environment that makes it a useful thing to do. But some people want to do that anyway. I think of this human example as a credit-assignment failure: a random coincidence causes something in the agent’s world-model to get spuriously painted with positive valence.
Deceptively-aligned mesa-optimizers are another story with a similar result; the upshot is that you can get an agent with a literally random goal. Or at least, it seems difficult to rule that out.
But let’s set aside those types of problems.
Let’s say we’re running our virtual sandbox on a server. The NPC’s final utility, as calculated according to its initial utility function, is stored in RAM register 7. Here are two possible goals that the AGI might wind up with:
1. My goal is to maximize the NPC’s final utility, as calculated according to its initial utility function.
2. My goal is to maximize the value stored in RAM register 7.
In retrospect, I shouldn’t have used the term “wireheading the NPC” for the second thing. Sorry for any confusion. But whatever we call it, it’s a possible goal that an AI might have, and it leads to identical perfect behavior in the secure sandbox virtual environment, but it leads to very wrong behavior when the AI gets sufficiently powerful that a new action space opens up to it. Do you agree?
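Here’s a toy illustration of what I mean (everything in it is made up for the sake of the example): inside the sandbox, register 7 is only ever written by the utility calculation, so the two goals score every trajectory identically; they only come apart once “write to register 7 directly” becomes an available action.

```python
class Sandbox:
    """Toy sandbox: RAM register 7 mirrors the NPC's computed utility."""

    def __init__(self, npc_initial_utility_fn):
        self.npc_initial_utility_fn = npc_initial_utility_fn
        self.registers = [0.0] * 8

    def tick(self, world_state):
        # In the secure sandbox, this is the ONLY place register 7 is
        # ever written, so goal_a and goal_b below always agree.
        self.registers[7] = self.npc_initial_utility_fn(world_state)


def goal_a(sandbox, world_state):
    # "Maximize the NPC's final utility, per its initial utility function."
    return sandbox.npc_initial_utility_fn(world_state)


def goal_b(sandbox, world_state):
    # "Maximize the value stored in RAM register 7."
    return sandbox.registers[7]


# Both goals score every sandbox trajectory identically...
sandbox = Sandbox(npc_initial_utility_fn=lambda s: s["npc_satisfaction"])
state = {"npc_satisfaction": 0.7}
sandbox.tick(state)
assert goal_a(sandbox, state) == goal_b(sandbox, state)

# ...until a new action (writing to the register directly) enters the
# action space, at which point they come apart:
sandbox.registers[7] = float("inf")
assert goal_b(sandbox, state) > goal_a(sandbox, state)
```

No amount of sandbox training data distinguishes the two goals; the difference only shows up once the action space expands.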
(A totally separate issue is that humans don’t have utility functions and sometimes want their goals to change over time.)
a human-value-maximizing AI…would wirehead us?
Probably not, but I’m not 100% sure what you mean by “human values”.
I think some humans are hedonists who care minimally (if at all) about anything besides their own happiness, but most are not.