It is when accuracy increases—we start looking into the internals, start resolving how that black box works—that this breaks down.
No computer program can predict its own output before actually computing it. Thus, any computable agent will necessarily have to treat some aspect of itself as a black box.

If the agent isn’t stupid, has a reasonably good model of itself, and has some sort of goal in the form of “do not kill yourself”, then it will avoid messing with the parts of itself that it doesn’t understand (or at least touch them only if it has established with substantial confidence that the modification will preserve functionality). It will also, obviously, avoid breaking the parts of itself that it does understand. Therefore it will not kill itself.
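The impossibility of full self-prediction is just the usual diagonal argument. A minimal sketch, where `predict_output` is a hypothetical oracle that is supposed to return a program’s output without running it (the name and setup are my own illustration, not anything from the comment above):

```python
def contrarian(predict_output):
    # Ask the hypothetical oracle what contrarian itself will return,
    # then return the opposite. Whatever answer the oracle gives is wrong,
    # so no such oracle can correctly predict this program's output
    # without effectively running it.
    predicted = predict_output(contrarian)
    return 0 if predicted == 1 else 1
```

This is only the standard diagonalization; the point it illustrates is that some part of the agent’s own computation has to remain a black box to the agent itself.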
When evaluating counterfactual scenarios, the hypothesis the agent considers is not “these signals magically appear in my output channel by some supernatural means”, but “these signals may appear in my output channel due to some complex process that I can’t predict in full detail before I finish the current computation”.
To avoid ‘messing with the parts of itself’, it needs to be able to tell whether an action does or does not mess with parts of itself. Is moving itself to another location messing with itself, or not? In a non-isotropic universe, turning around could kill you, just as excessive acceleration could kill you in our universe.
I don’t doubt that in principle you could hard-code into an AI some definition of what the “parts of itself” are and what constitutes messing with them, so that it can avoid messing with those parts without knowing what they do. The point is that this won’t scale, and will break down if the AI gets too clever.
As for self-preservation in AIXI-tl, there’s a curious anthropomorphization bias at play. Suppose that the reward was −0.999 and the lack of reward was −1. The math of AIXI would work the same, but the common-sense intuition switches from the mental image of a gluttonous hedonist that protects itself to that of a tortured being yearning for death. In actuality, it’s neither: the math of AIXI does not account for the destruction of the physical machinery in question one way or the other. It is neither a reward nor a lack of reward; it simply never happens in its model. Calling one value “reward” and the other “absence of reward” makes us wrongly assume that destruction of the machinery corresponds to the latter.
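To make “the math works the same” concrete: under a fixed-horizon expectimax reading of AIXI, any positive affine rescaling of the rewards leaves the chosen action unchanged. A quick check, writing the value of an action $a$ at time $k$ loosely as a single expectation (the rescaling parameters $\alpha$, $\beta$ and the baseline rewards are my own illustrative assumptions):

$$\arg\max_a \mathbb{E}\!\left[\sum_{t=k}^{m}(\alpha r_t+\beta)\,\middle|\,a\right]
= \arg\max_a\left(\alpha\,\mathbb{E}\!\left[\sum_{t=k}^{m}r_t\,\middle|\,a\right]+(m-k+1)\beta\right)
= \arg\max_a \mathbb{E}\!\left[\sum_{t=k}^{m}r_t\,\middle|\,a\right],\qquad \alpha>0.$$

With $\alpha=0.001$ and $\beta=-1$ (assuming the original rewards were $1$ for reward and $0$ for its absence), reward becomes $-0.999$ and its absence becomes $-1$; the argmax, and hence the behavior, is untouched — only our mental image of the agent changes.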