In terms of actual behavior, there is very little distinction between a supposedly-Friendly-but-actually-not AI and a regular UFAI. Well, maybe the former will wait a bit longer before its pathological behavior shows up. Maybe. I really don’t want to be the sorry bastard who tries that experiment: it would just be downright embarrassing.
But of course, the simplest way to bypass this is, as previously mentioned in my comment and by nearly all authors on the issue, to specify the utility function as the outcome of an inference problem, so that additional interaction with humans causes the AI to update its utility function and become Friendlier over time.
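Here is a minimal sketch of what "the utility function as the outcome of an inference problem" could look like, purely for illustration: the agent keeps a posterior over a few hypothetical candidate utility functions and does a Bayesian update from observed human approval/disapproval. All names and the approval model are my own assumptions, not anyone's actual proposal.

```python
import math

# Hypothetical candidate utility functions over outcomes (illustrative only).
CANDIDATE_UTILITIES = {
    "u_paperclips": lambda o: o["paperclips"],
    "u_welfare":    lambda o: o["human_welfare"],
    "u_mixed":      lambda o: 0.2 * o["paperclips"] + 0.8 * o["human_welfare"],
}

# Uniform prior over which candidate is the "true" human utility function.
posterior = {name: 1.0 / len(CANDIDATE_UTILITIES) for name in CANDIDATE_UTILITIES}

def likelihood(feedback, utility, outcome, beta=2.0):
    """P(human approves | utility, outcome): a noisy, Boltzmann-style approval model (an assumption)."""
    p_approve = 1.0 / (1.0 + math.exp(-beta * utility(outcome)))
    return p_approve if feedback == "approve" else 1.0 - p_approve

def update(feedback, outcome):
    """Bayes' rule: reweight each candidate utility by how well it explains the feedback."""
    global posterior
    unnorm = {name: posterior[name] * likelihood(feedback, u, outcome)
              for name, u in CANDIDATE_UTILITIES.items()}
    z = sum(unnorm.values())
    posterior = {name: w / z for name, w in unnorm.items()}

# Example interaction: the human disapproves of an outcome heavy on paperclips.
update("disapprove", {"paperclips": 3.0, "human_welfare": -1.0})
print(posterior)  # probability mass shifts away from u_paperclips toward welfare-respecting candidates
```

The point of the toy is just that more interaction means more updates, so the agent's effective utility function keeps moving toward whatever the feedback actually supports.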
Causal inference that allows for deliberate conditioning of distributions on complex, counterfactual scenarios should actually help with this. Causal reasoning does dissolve into counterfactual reasoning, after all, so rational action on evaluative criteria can be considered a kind of push-and-pull force acting on an agent’s trajectory through the space of possible histories: undesirable counterfactuals push the agent’s actions away (i.e., push the agent to prevent them from becoming real), while desirable counterfactuals pull the agent’s actions towards themselves (i.e., the agent takes actions to achieve those events as goals) :-p.
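To make the push/pull picture concrete, here is a toy sketch under my own assumptions (all action names, futures, probabilities, and values are invented): each action induces a distribution over counterfactual futures, and the agent maximizes expected value under those distributions, so a highly negative counterfactual "pushes" and a highly positive one "pulls".

```python
# P(future | do(action)) for a few hypothetical actions and futures.
COUNTERFACTUALS = {
    "act_cautiously": {"humans_flourish": 0.7, "status_quo": 0.299, "catastrophe": 0.001},
    "act_recklessly": {"humans_flourish": 0.5, "status_quo": 0.10,  "catastrophe": 0.40},
}

# Hypothetical values: the large negative value on catastrophe is the "push",
# the positive value on flourishing is the "pull".
VALUES = {"humans_flourish": 10.0, "status_quo": 0.0, "catastrophe": -1000.0}

def expected_value(action):
    """Expected value over the counterfactual futures induced by do(action)."""
    return sum(p * VALUES[future] for future, p in COUNTERFACTUALS[action].items())

best_action = max(COUNTERFACTUALS, key=expected_value)
print({a: expected_value(a) for a in COUNTERFACTUALS})
print("chosen:", best_action)  # the catastrophe counterfactual pushes the agent toward acting cautiously
```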