I’m gonna assume the part you don’t follow is how the two are related.
My submission attempts to maximize the fraction of utility functions that the produced AI can satisfy, in hopes that the human utility function is among them.
Attainable utility preservation attempts to maximize the fraction of utility functions that can be satisfied from the produced state, in hopes that human satisfaction is not ruled out.
Neither of those seems good, since the AI being able to do the right thing is not at all the same as us being able to get the AI to do the right thing. If someone steals your TV, they “could” easily give your TV back, but that doesn’t mean you can actually get them to do that. So your reading isn’t unreasonable, but that’s not AUP.
Introducing AUP over all utility functions was a mistake, since it’s a total red herring. Briefly, AUP is about designing an agent that achieves its goal without acting to gain or lose power—an agent without nasty convergent instrumental incentives. E.g., “Make paperclips while being penalized for becoming more or less able to make paperclips.” We don’t need anything like realizability for this.
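Concretely, here is a minimal sketch of the kind of penalty I mean (illustrative notation only, not the exact formulation from the AUP paper):

```python
# Minimal sketch of an AUP-style penalized reward (illustrative, not the
# exact formulation from the AUP paper). Each element of `aux_q_values` is
# assumed to be a function q(s, a) giving attainable utility for one
# auxiliary utility function; `noop` is the "do nothing" action.

def aup_reward(task_reward, aux_q_values, s, a, noop, lam=0.1):
    """Task reward minus a penalty for gaining or losing attainable utility."""
    penalty = sum(abs(q(s, a) - q(s, noop)) for q in aux_q_values) / len(aux_q_values)
    return task_reward(s, a) - lam * penalty
```

The point of the penalty term is that becoming much more (or much less) able to satisfy the auxiliary utility functions is costly, so the agent is pushed toward completing its task without seizing or surrendering power.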
For the purposes of the linked submission, we get the correct utility function after the hypercomputer is done running. Ideally, we would save each utility function’s preferred AI and then select the one preferred by the correct utility function, but we only have 1TB of space. Therefore we get almost all of them to agree on a parametrized solution.
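As a toy sketch of the selection step (purely illustrative: `candidates` stands in for the space of possible 1TB outputs, and the `approves` predicate is my own notation, not part of the submission):

```python
# Toy sketch of choosing the output with the highest approval rating among a
# sample of utility functions. Each element of `utility_fns` is assumed to
# expose an `approves(candidate) -> bool` predicate.

def best_candidate(candidates, utility_fns):
    """Return the candidate approved by the largest fraction of utility functions."""
    def approval_fraction(candidate):
        return sum(u.approves(candidate) for u in utility_fns) / len(utility_fns)
    return max(candidates, key=approval_fraction)
```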
AUP over all utility functions might therefore make sense for a limited environment, such as the interior of a box?
Do you mean that utility functions that assign a significant cost to the AI seizing power are rare enough that the terabyte with the highest approval rating will not be approved by a human utility function?