>Request labels for training data points which have maximal value of information.
I can see many ways this can be extremely manipulative. If you request a series of training data points whose likely result, once the human answers them, is the conclusion “the human wants me to lobotomise them into a brainless drugged pleasure maximiser and never change them again”, then your requests have maximal value of information. Therefore, if such a series of training data points exists, the AI will be motivated to find it, and hence to manipulate the human.
>If you request a series of training data points whose likely result, once the human answers them
If you already know how the human is going to answer, then it’s not high value of information to ask. “If you can anticipate in advance updating your belief in a particular direction, then you should just go ahead and update now. Once you know your destination, you are already there.”
Suppose it is high value of information for the AI to ask whether we’d like to be lobotomized drugged pleasure maximizers. In that case, it’s a perfectly reasonable thing for the AI to ask: We would like for the AI to request clarification if it places significant probability mass on the possibility that we assign loads of utility to being lobotomized drugged pleasure maximizers! The key question is whether the AI would optimize for asking this question in a manipulative way, that is, a way designed to change our answers. An AI might do this if it’s able to anticipate the manipulative effects of its questions. Luckily, making it so the AI doesn’t anticipate the manipulative effects of its questions appears to be technically straightforward: If the scorekeeper operates by conservation of expected evidence, it can never believe any sequence of questions will modify the score of any particular scenario on average.
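To make the conservation-of-expected-evidence point concrete, here is a minimal sketch. It assumes a single binary hypothesis H ("the human wants this") and a scorekeeper that updates by Bayes' rule on a yes/no answer. The function name, the three question wordings, and all the likelihood numbers are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def expected_posterior(prior, p_yes_given_h, p_yes_given_not_h):
    """Average posterior in H over the possible answers to one yes/no question."""
    # Probability of each answer under the scorekeeper's own model.
    p_yes = prior * p_yes_given_h + (1 - prior) * p_yes_given_not_h
    p_no = 1 - p_yes
    # Bayes' rule for each answer (guarding against impossible answers).
    post_yes = prior * p_yes_given_h / p_yes if p_yes else Fraction(0)
    post_no = prior * (1 - p_yes_given_h) / p_no if p_no else Fraction(0)
    # Weight each posterior by how likely that answer is.
    return p_yes * post_yes + p_no * post_no


prior = Fraction(1, 10)

# Hypothetical questions, each described by (P(yes | H), P(yes | not H)).
questions = {
    "neutral wording": (Fraction(9, 10), Fraction(1, 10)),
    "leading wording": (Fraction(99, 100), Fraction(3, 5)),
    "uninformative": (Fraction(1, 2), Fraction(1, 2)),
}

results = {name: expected_posterior(prior, *lik) for name, lik in questions.items()}
for name, value in results.items():
    print(f"{name}: expected posterior = {value}")
```

Every question gives an expected posterior equal to the prior, so under its own model the scorekeeper cannot expect any choice of question to raise the score of H. Note that the identity only holds relative to the scorekeeper's own model of how answers are generated; the sketch says nothing about what happens if that model is wrong.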
There are 3 cases here:
1. The AI assigns a very low probability to us desiring lobotomy. In this case, there is no problem: We don’t actually want lobotomy, and it would be very low value of information to ask about lobotomy (because the chance of a “hit”, where we say yes to lobotomy and the AI learns it can achieve lots of utility by giving us lobotomy, is quite low from the AI’s perspective).
2. The AI is fairly uncertain about whether we want lobotomy. It believes we might really want it, but we also might really not want it! In that case, it is high VoI to ask us about lobotomy before taking action. This is the scenario I discuss under “Smile maximization case study” in my essay. The AI may ask us about the version of lobotomy it thinks we are most likely to want, if that is the highest VoI thing to ask about, but that still doesn’t seem like a huge problem.
3. The AI assigns a very high probability to us desiring lobotomy and doesn’t think there’s much of a chance that we don’t want it. In that case, we have lost. The key challenge for my proposal is to figure out how to prevent the AI from entering a state where it has confident yet wildly incorrect beliefs about our preferences. From my perspective, FAI boils down to a problem of statistical epistemology.
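The three cases can be illustrated with a toy value-of-information calculation. This is only a sketch under strong assumptions: a single yes/no question answered perfectly reliably, a choice between acting and abstaining (utility 0), and symmetric made-up utilities of +1 and -1. The function name and all the numbers are hypothetical.

```python
from fractions import Fraction


def value_of_information(p_wants, u_if_wanted, u_if_unwanted):
    """VoI of a perfectly reliable yes/no answer before choosing act vs. abstain."""
    # Without asking: act only if expected utility beats abstaining (utility 0).
    eu_act = p_wants * u_if_wanted + (1 - p_wants) * u_if_unwanted
    best_without_asking = max(eu_act, 0)
    # After asking: pick the better option separately for each possible answer.
    best_after_asking = (
        p_wants * max(u_if_wanted, 0) + (1 - p_wants) * max(u_if_unwanted, 0)
    )
    return best_after_asking - best_without_asking


# Hypothetical priors for the three cases above.
cases = {
    "very low probability": Fraction(1, 100),
    "fairly uncertain": Fraction(1, 2),
    "very high probability": Fraction(99, 100),
}

voi = {name: value_of_information(p, 1, -1) for name, p in cases.items()}
for name, value in voi.items():
    print(f"{name}: VoI = {value}")
```

With these numbers the VoI is 1/100, 1/2, and 1/100 respectively. Asking is only worth much in the uncertain case. In the confident case the AI sees little value in asking and simply acts, which is why confident but wrong beliefs are the dangerous scenario.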
>If you already know how the human is going to answer, then it’s not high value of information to ask.
That’s the entire problem, if “ask a human” is programmed as an endorsement of this being the right path to take, rather than as a genuine need for information.
>If the scorekeeper operates by conservation of expected evidence, it can never believe any sequence of questions will modify the score of any particular scenario on average.
That’s precisely my definition for “unriggable” learning processes, in the next post: https://www.lesswrong.com/posts/upLot6eG8cbXdKiFS/reward-function-learning-the-learning-process
That’s a link to this post, right? ;)
Ooops, yes! Sorry, for some reason, I thought this was the post on the value function.