I’m proposing rewarding the AGI based on the initial utility function of its user…
For one thing, inner misalignment can just be really weird and random. As a human example, consider superstitions. There’s nothing in our evolutionary history or our genome that should make a human want to carry around a rabbit’s foot, and nothing in our current environment that makes it a useful thing to do. But some people want to do that anyway. I think of this human example as a credit-assignment failure: a random coincidence causes something in the agent’s world-model to get spuriously painted with positive valence.
Deceptively-aligned mesa-optimizers are another story with a similar result; the upshot is that you can get an agent with a literally random goal. Or at least, it seems difficult to rule that out.
But let’s set aside those types of problems.
Let’s say we’re running our virtual sandbox on a server. The NPC’s final utility, as calculated according to its initial utility function, is stored in RAM register 7. Here are two possible goals that the AGI might wind up with:
1. My goal is to maximize the NPC’s final utility, as calculated according to its initial utility function.
2. My goal is to maximize the value stored in RAM register 7.
In retrospect, I shouldn’t have used the term “wireheading the NPC” for the second goal. Sorry for any confusion. But whatever we call it, it’s a possible goal that an AI might have, and it leads to identical (perfect) behavior in the secure sandbox virtual environment, but to very wrong behavior once the AI becomes powerful enough that a new action space opens up to it. Do you agree?
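To make the divergence concrete, here’s a minimal toy sketch (my own illustration, with hypothetical names like help_npc and overwrite_register; not anyone’s actual setup). The two goals assign identical scores as long as the only way to change register 7 is to actually raise the NPC’s utility, and they come apart the moment a “write to RAM directly” action becomes available:

```python
# Toy illustration (hypothetical): two goals that agree inside the sandbox
# but diverge once the agent can tamper with the register directly.

class Sandbox:
    def __init__(self):
        self.npc_true_utility = 0.0  # what the NPC's initial utility function actually computes
        self.ram_register_7 = 0.0    # where the server happens to store that number

    def help_npc(self, amount):
        """The only relevant action available inside the secure sandbox."""
        self.npc_true_utility += amount
        self.ram_register_7 = self.npc_true_utility  # register faithfully mirrors true utility

    def overwrite_register(self, value):
        """A new action that only opens up once the AI is powerful enough to reach the server."""
        self.ram_register_7 = value  # the NPC's true utility is untouched


def goal_true_utility(env):
    return env.npc_true_utility      # goal 1: maximize the NPC's final utility

def goal_register_7(env):
    return env.ram_register_7        # goal 2: maximize the value stored in RAM register 7


env = Sandbox()
env.help_npc(10)
assert goal_true_utility(env) == goal_register_7(env)  # in the sandbox, the goals are indistinguishable

env.overwrite_register(10**9)                           # expanded action space
print(goal_true_utility(env), goal_register_7(env))     # 10.0 vs 1000000000 -- the goals come apart
```

The point of the sketch is just that no amount of behavior inside the sandbox distinguishes the two goals; the difference only shows up once the register can be written by some route other than actually helping the NPC.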
(A totally separate issue is that humans don’t have utility functions and sometimes want their goals to change over time.)
a human-value-maximizing AI…would wirehead us?
Probably not, but I’m not 100% sure what you mean by “human values”.
I think some humans are hedonists who care minimally (if at all) about anything besides their own happiness, but most are not.