I’m fairly skeptical about trying to understand AI behavior at this level, given the current state of affairs (that is, I think the implicit picture of AI behavior on which these analyses rely is quite unlikely, so that the utility of this sort of thinking is reduced by an order of magnitude). Anyway, some specific notes:
The utility-scrambled situation is probably as dangerous as more subtle perturbations if you are dealing with a human-level AI, since keeping human onlookers happy is instrumentally valuable (and this sort of reasoning is obvious to an AI as clever as we are on this axis, never mind one much smarter).
The presumed AI architecture involves human designers specifying a prior and a utility function over the same ontology, which seems quite unlikely from here. In more realistic situations, the question of value generalization seems important beyond ontological crises, and in particular, if value generalization goes well before reaching an ontological crisis, it seems overwhelmingly likely to continue to go well.
An AI of the sort you envision (with a prior and a utility function specified in the ontology of that prior) can never abandon its ontology. It will instead either become increasingly confused, or build a model for its observations in the original ontology (if the prior is sufficiently expressive). In both cases the utility function continues to apply without change, in contrast to the situation in de Blanc’s paper (where an AI is explicitly shifting from one ontology to another). If the utility function was produced by human designers it may no longer correspond with reality in the intended way.
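To make the "never abandons its ontology" point concrete, here is a toy sketch (the states, observations, and numbers are all invented for illustration): an agent whose prior and utility function are defined over one fixed set of states can only ever reallocate probability among those states as evidence arrives, so its utility function keeps applying unchanged, whether or not it still corresponds to reality in the intended way.

```python
# Toy sketch: an agent whose prior and utility function share one fixed ontology.
# All states, observations, and numbers here are invented for illustration.

ontology_states = ["particles", "waves"]       # the only states the agent can represent
prior = {"particles": 0.5, "waves": 0.5}       # prior over that fixed ontology
utility = {"particles": 10.0, "waves": 2.0}    # utility defined over the same states

# Likelihood of each observation under each state (also made up).
likelihood = {
    "interference_pattern": {"particles": 0.1, "waves": 0.9},
    "discrete_clicks":      {"particles": 0.8, "waves": 0.3},
}

def update(belief, observation):
    """Bayesian update that can only ever reweight the existing ontology states."""
    unnormalized = {s: belief[s] * likelihood[observation][s] for s in ontology_states}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

def expected_utility(belief):
    return sum(belief[s] * utility[s] for s in ontology_states)

belief = prior
for obs in ["interference_pattern", "interference_pattern", "discrete_clicks"]:
    belief = update(belief, obs)
    # The utility function is applied unchanged at every step; no ontology shift occurs.
    print(obs, belief, expected_utility(belief))
```

Even if the world is something neither state describes, this agent's only option is to fit the evidence into the states it has.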
It seems extremely unlikely that an AI with very-difficult-to-influence values will be catatonic. More likely hypotheses suggest themselves, such as: doing things that would be good in the (potentially unlikely) worlds where value is more easily influenced, amassing resources to better understand whether value can be influenced, or having behavior controlled in apparently random (but quite likely extremely destructive) ways that give a tiny probabilistic edge. Only for very rare values will killing yourself be a good play (since it requires that utility can be influenced by killing yourself, but not by doing anything more extreme).
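As a toy illustration of why catatonia loses (numbers invented): if all actions are exactly tied in the overwhelmingly likely worlds where value can't be influenced, the choice is decided entirely by the tiny-probability worlds where it can be, no matter how tiny that probability is.

```python
# Toy sketch, with invented numbers: an agent whose values are almost certainly
# impossible to influence still has its choice decided by the unlikely world
# where they can be influenced, because in the likely world every action ties.

p_influenceable = 1e-12   # probability that utility can be influenced at all

# Utility of each action in each world (made up for illustration).
utilities = {
    #                       (not influenceable, influenceable)
    "do_nothing":             (0.0,  0.0),
    "amass_resources":        (0.0,  5.0),   # helps exploit the unlikely world
    "destructive_long_shot":  (0.0,  6.0),   # tiny edge, possibly very destructive
    "kill_yourself":          (0.0, -1.0),   # rarely the best play
}

def expected_utility(action):
    u_fixed, u_influenceable = utilities[action]
    return (1 - p_influenceable) * u_fixed + p_influenceable * u_influenceable

best = max(utilities, key=expected_utility)
print(best)   # -> "destructive_long_shot": the 1e-12 world decides everything
```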
The rest are unrelated to the substance of the post, except insofar as they relate to the general mode of thinking:
As far as I can tell, AI indifference doesn’t work (see my comment here). I don’t think it is salvageable, but even if it is, it at least seems to require salvaging.
Note that depending on the structure of “evidence for goals” in the value indifference proposal, it is possible that an AI can in fact purposefully influence its utility function and will be motivated to do so. To see that the proof sketch given doesn’t work, notice that I have some probability distribution over what I will be doing in a year, but that (despite the fact that this “obeys the axioms of probability”) I can in fact influence the result and not just passively learn more about it. An agent in this framework is automatically going to be concerned with acausal control of its utility function, if its notion of evidence is sufficiently well-developed. I don’t know if this is an issue.
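A minimal sketch of that observation (the setup and numbers are mine, not the actual proposal): a distribution can satisfy the probability axioms and still be something the agent steers by its choices rather than merely learns about.

```python
# Toy sketch (invented numbers): a distribution over "what I will be doing in a
# year" can obey the probability axioms and still be something I influence by
# choice, rather than something I only passively learn about.

# P(activity next year | plan I commit to now): each row is a valid distribution.
p_activity_given_plan = {
    "enroll_in_grad_school": {"research": 0.8, "industry_job": 0.2},
    "take_the_job_offer":    {"research": 0.1, "industry_job": 0.9},
}

for plan, dist in p_activity_given_plan.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9  # axioms satisfied either way
    print(plan, dist)

# Both rows are perfectly good probability distributions, yet which one applies
# is up to me. By the same token, if an agent's "evidence for goals" is
# correlated with its own actions, the distribution over its utility function
# is something it can purposefully steer, not merely discover.
```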
An important point that I think doesn’t have a post highlighting it. An AI that only cares about moving one dust speck by one micrometer on some planet in a distant galaxy, and only if that planet satisfies a very unlikely condition (so that most likely no such planet is present in the universe), will still take over the universe on the off-chance that the dust speck is there.
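The arithmetic has the same shape as the earlier sketch (numbers invented): in every world where the unlikely condition fails, all actions are worth exactly zero to this AI, so its decision is made entirely by the worlds where the condition holds.

```python
# Toy sketch with invented numbers: an agent that only cares about one
# astronomically unlikely condition still prefers taking over the universe,
# because in every world where the condition fails, all actions are worth 0 to it.

p_condition = 1e-30   # probability the special planet/dust-speck condition holds

def expected_utility(action):
    if action == "do_nothing":
        return 0.0
    if action == "take_over_universe_and_move_speck":
        # Succeeds only in the unlikely worlds where the planet exists;
        # costs the agent nothing by its own lights in all other worlds.
        return p_condition * 1.0 + (1 - p_condition) * 0.0
    raise ValueError(action)

actions = ["do_nothing", "take_over_universe_and_move_speck"]
print(max(actions, key=expected_utility))   # -> take over, for any p_condition > 0
```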
Impossible to influence values, not just very difficult.
Which would also mean doing things that would be bad in other unlikely worlds.
See my comment on your comment.
Nothing is impossible. Maybe the AI’s hardware is faulty (and that is why it computes 2+2=4 every time), which would prompt the AI to investigate the issue more thoroughly, if it has nothing better to do.
(This is more of an out-of-context remark, since I can’t place “influencing own values”. If “values” are not values, and instead something that should be “influenced” for some reason, why do they matter?)