This is based more on experience than on a full formal argument (yet). Take an AI that, according to our preferences, is low impact and still does stuff. Then there is a utility function U for which that “does stuff” is the single worst and highest impact thing the AI could have done (you just trivially define a U that only cares about that “stuff”).
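To make the construction explicit (a minimal sketch in my own notation, not Stuart's exact formalization): write τ for the trajectory the AI actually produces, and define

```latex
U(\tau') \;=\; -\,\mathbb{1}\!\left[\tau' \text{ contains the ``stuff'' the AI actually did}\right].
```

Under this U, the AI's actual behavior scores strictly worse than anything that avoids the "stuff", so a measure that must respect every utility function would rate that behavior as maximally high impact.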
Now, that’s a contrived case, but my experience is that problems like that come up all the time in low impact research, and that we really need to include—explicitly or implicitly—a lot of our values/preferences directly, in order to have something that satisfies low impact.
This seems to prove too much; the same argument proves friendly behavior can’t exist ever, or that including our preferences directly is (literally) impossible. The argument doesn’t show that such a utility has to matter to, or be considered by, the impact measure.
Plus, low impact doesn’t have to be robust to adversarially chosen input attainable utilities—we get to choose them. Just choose the “am I activated” indicator utility and AUP seems to do fine, modulo open questions raised in the post and comments.
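As a rough sketch of what that looks like (notation approximated from the post; u_on is the “am I activated” indicator, ∅ the no-op, and Scale the post's normalization term), the penalty compares attainable value after acting versus after doing nothing:

```latex
\text{Penalty}(s,a) \;=\; \bigl|\,Q_{u_{\text{on}}}(s,a) \;-\; Q_{u_{\text{on}}}(s,\varnothing)\,\bigr|,
\qquad
R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \frac{\lambda}{\text{Scale}}\,\text{Penalty}(s,a).
```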
This seems to prove too much; the same argument proves friendly behavior can’t exist ever, or that including our preferences directly is (literally) impossible.
? I don’t see that. What’s the argument?
(If you want to say that we can’t define friendly behaviour without using our values, then I would agree ^_^ but I think you’re trying to argue something else).
Take a friendly AI that does stuff. Then there is a utility function for which that “does stuff” is the single worst thing the AI could have done.
The fact that no course of action is universally friendly doesn’t mean it can’t be friendly for us.
As I understand it, the impact version of this argument is flawed in the same way (but less blatantly so): something being high impact according to a contrived utility function doesn’t mean we can’t induce behavior that is, with high probability, low impact for the vast majority of reasonable utility functions.
The fact that no course of action is universally friendly doesn’t mean it can’t be friendly for us.
Indeed, by “friendly AI” I meant “an AI friendly for us”. So yes, I was showing a contrived example of an AI that was friendly, and low impact, from our perspective, but that was not, as you said, universally friendly (or universally low impact).
something being high impact according to a contrived utility function doesn’t mean we can’t induce behavior that is, with high probability, low impact for the vast majority of reasonable utility functions.
In my experience so far, we need to include our values, in part, to define “reasonable” utility functions.
In my experience so far, we need to include our values, in part, to define “reasonable” utility functions.
It seems that an extremely broad set of input attainable functions suffices to capture the “reasonable” functions with respect to which we want to be low impact. For example: “remaining on”, “reward linear in how many blue pixels are observed each time step”, etc. All thanks to instrumental convergence and opportunity cost.
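A toy numerical illustration of that point (all numbers are invented; this is not a claim about any particular environment): even auxiliary utilities that say nothing about off-switches assign very different attainable value once the agent gains power, so the penalty catches it.

```python
# Toy AUP-style penalty over two simple auxiliary utilities.
# All Q-values are invented for illustration; the point is qualitative:
# a power-gaining action shifts attainable value for almost *any* goal
# (instrumental convergence / opportunity cost), so even "silly" auxiliary
# utilities flag it, while a mild action barely moves them.

# Hypothetical attainable values Q_u(s, a) for each auxiliary utility u.
Q = {
    "remain_on":   {"noop": 10.0, "make_coffee": 9.8, "disable_off_switch": 14.0},
    "blue_pixels": {"noop":  5.0, "make_coffee": 4.9, "disable_off_switch":  8.0},
}

def aup_penalty(action: str, baseline: str = "noop") -> float:
    """Sum over auxiliary utilities of |Q_u(s, action) - Q_u(s, baseline)|."""
    return sum(abs(q[action] - q[baseline]) for q in Q.values())

if __name__ == "__main__":
    for action in ("make_coffee", "disable_off_switch"):
        print(f"{action}: penalty = {aup_penalty(action):.1f}")
    # make_coffee: penalty = 0.3
    # disable_off_switch: penalty = 7.0, despite neither auxiliary utility
    # mentioning off-switches at all.
```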