AIXI can’t represent ‘X is myself’. AIXI can only represent computer programs outputting strings of digits. Suppose we built an approximation of AIXI that includes itself in its hypothesis space, like AIXItl. Still, AIXItl won’t be able to predict its own destruction, because destruction is not equivalent to any string of digits.
Well, no.
If you start a reinforcement learning agent (AIXI, AIXItl or whatever) as a blank slate, and allow it to perform unsafe actions, then it can certainly destroy itself: it’s a baby playing with a loaded gun. That’s why convergence proofs in reinforcement learning papers often make ergodicity assumptions about the environment (it’s not like EY was the first one to have thought of this problem).
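For reference, one common form of that ergodicity assumption, paraphrased in my own words rather than quoted from any particular paper: every state of the environment remains reachable from every other state under some policy, so no action can put the agent somewhere it can never recover from.

```latex
% Ergodicity / communication assumption (paraphrased):
% every state is reachable again from every other state under some policy.
\forall s, s' \in \mathcal{S} \;\; \exists\, \pi,\; n < \infty :
    \Pr\!\left(S_n = s' \mid S_0 = s,\ \pi\right) > 0
```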
But if you give the agent a sufficiently accurate model of the world, or ban it from performing unsafe actions until it has built such a model from experience, then it will be able to infer that certain courses of action lead to world states where its ability to gain high rewards becomes permanently compromised, e.g. states where the agent is “dead”, even if it has never experienced these states firsthand (after all, that’s what induction is for).
Indeed, adult humans don’t have a full model of themselves and their environment, and many of them even believe that they have some sort of “uncomputable” mechanism (a supernatural soul), and yet they tend not to drop anvils on their heads.
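As a toy illustration of that inference step (my own sketch, not taken from any paper or comment here): a model-based agent that has never visited a “dead” state still plans around it, because its model marks that state as absorbing and reward-free.

```python
# Hypothetical 3-state world: "safe", "risky", and an absorbing "dead" state.
STATES = ["safe", "risky", "dead"]
ACTIONS = ["stay", "gamble"]

# MODEL[s][a] = list of (probability, next_state, reward)
MODEL = {
    "safe":  {"stay":   [(1.0, "safe", 1.0)],
              "gamble": [(0.5, "risky", 2.0), (0.5, "dead", 0.0)]},
    "risky": {"stay":   [(1.0, "safe", 1.0)],
              "gamble": [(0.2, "risky", 2.0), (0.8, "dead", 0.0)]},
    "dead":  {"stay":   [(1.0, "dead", 0.0)],   # absorbing: no reward, ever again
              "gamble": [(1.0, "dead", 0.0)]},
}

def value_iteration(gamma=0.95, sweeps=200):
    """Compute state values from the model alone; no experience of 'dead' is needed."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: max(sum(p * (r + gamma * v[s2]) for p, s2, r in MODEL[s][a])
                    for a in ACTIONS)
             for s in STATES}
    return v

def greedy_action(s, v, gamma=0.95):
    return max(ACTIONS, key=lambda a: sum(p * (r + gamma * v[s2])
                                          for p, s2, r in MODEL[s][a]))

values = value_iteration()
print(greedy_action("safe", values))  # -> "stay": the plan avoids the absorbing state
```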
then it will be able to infer that certain courses of action lead to world states where its ability to gain high rewards becomes permanently compromised
The problem is that it will be unable to identify all such actions.
In reality, the actions of AIXI (AIXI-tl or the like) are computed by some hardware, but in AIXI’s model the counterfactual future actions are inserted into the model from outside rather than computed within the model. Consequently, some of the physical hardware that is actually computing AIXI’s actions is not represented in the model as computing those actions.
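For reference, here is roughly Hutter’s expectimax expression for AIXI’s action choice, written from memory and simplified. The point above is visible in it: the future actions a_t…a_m are bound by the agent’s own max operators, while the environment programs q only map that action sequence to percepts, so nothing inside any q computes the agent’s actions.

```latex
a_t := \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m}
       \bigl[\, r_t + \cdots + r_m \,\bigr]
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```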
Now suppose that you see a circuit diagram where some highly important hardware (e.g. the transistors that drive CPU output pins) has an unnecessary, unpredictably noisy resistive load connected in parallel with it… you’ll remove that unnecessary, noisy load.
edit: and for AI safety, this is of course very good news, just as the tendency of hot matter to disperse is very good news when it comes to nuclear proliferation.
In reality, the actions of AIXI (AIXI-tl or the like) are computed by some hardware, but in AIXI’s model the counterfactual future actions are inserted into the model from outside rather than computed within the model. Consequently, some of the physical hardware that is actually computing AIXI’s actions is not represented in the model as computing those actions. Now suppose that you see a circuit diagram where some highly important hardware (e.g. the transistors that drive CPU output pins) has an unnecessary, unpredictably noisy resistive load connected in parallel with it… you’ll remove that unnecessary, noisy load.
It’s not obvious to me that a reinforcement learning agent with a sufficiently accurate model of the world would do that. Humans don’t. At most, a reinforcement learning agent capable of self-modification would tend to wirehead itself.
edit: and for AI safety, this is of course very good news, just as the tendency of hot matter to disperse is very good news when it comes to nuclear proliferation.
IIUC, nuclear proliferation is limited by the fact that enriched uranium and plutonium are hard to acquire. Once you have the fissile materials, making a nuclear bomb probably isn’t much more complicated than making a conventional modern warhead.
The fact that hot matter tends to disperse is relevant to the safety of nuclear reactors: they can’t explode like nuclear bombs because if they ever reach prompt supercriticality, they quickly destroy themselves before a significant amount of fuel undergoes fission.
I don’t think AI safety is a particularly pressing concern at the moment, mainly because I don’t buy the “intelligence explosion” narrative, which in fact neither EY nor MIRI was ever able to convincingly argue for.
It’s not obvious to me that a reinforcement learning agent with a sufficiently accurate model of the world would do that. Humans don’t.
Humans do all sorts of things, and those who do kill themselves aren’t around to talk to other people afterwards.
The problem is not with accuracy, or rather, not with low accuracy but with overly high accuracy. The issue is that in the world model we have to try out potential actions that we could take, and those actions need to be introduced into the model somehow. We can say that the actions are produced by this black box, the computer, which needs to be supplied with power, and so on. Then this box, as a whole, is of course protected from destruction. It is when accuracy increases, when we start looking into the internals and resolving how that black box works, that this breaks down.
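To make the “actions are introduced into the model” point concrete, here is a minimal planner sketch of my own (the toy WORLD table is made up): the candidate action is supplied by the planner’s loop, and nothing inside the world model predicts what the agent will do.

```python
# Toy deterministic world model (hypothetical): WORLD[state][action] = (next_state, reward)
WORLD = {
    "s0": {"left": ("s1", 0.0), "right": ("s2", 1.0)},
    "s1": {"left": ("s1", 0.0), "right": ("s0", 0.0)},
    "s2": {"left": ("s0", 0.0), "right": ("s2", 1.0)},
}

def best_return(state, horizon):
    """Best achievable return from `state` within `horizon` steps."""
    if horizon == 0:
        return 0.0
    # The action is injected by this max; the model only says what follows from it.
    return max(r + best_return(s2, horizon - 1)
               for s2, r in WORLD[state].values())

def choose_action(state, horizon):
    return max(WORLD[state],
               key=lambda a: WORLD[state][a][1]
                             + best_return(WORLD[state][a][0], horizon - 1))

print(choose_action("s0", horizon=5))  # -> "right"
```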
At most, a reinforcement learning agent capable of self-modification would tend to wirehead itself.
This is usually argued against by pointing at something like AIXI.
Once you have the fissile materials, making a nuclear bomb probably isn’t much more complicated than making a conventional modern warhead.
It’s still a big obstacle (and the simpler gun-type design requires considerably more fissile material). If some terrorists stole, say, two critical masses of plutonium, they would be unable to build a bomb.
I agree, though, that nuclear reactors are a much better analogy. An accidental intelligence explosion on extremely short timescales is nonsense even if intelligence explosion is a theoretical possibility, in part because systems not built under the assumption of an intelligence explosion would not implement the necessary self-protection, but would rely on simple solutions that only work as long as the system can’t think its way around them.
It is when accuracy increases, when we start looking into the internals and resolving how that black box works, that this breaks down.
No computer program can predict its own output before actually computing it. Thus, any computable agent will necessarily have to treat some aspect of itself as a black box. If the agent isn’t stupid, has a reasonably good model of itself, and has some sort of goal of the form “do not kill yourself”, then it will avoid messing with the parts of itself that it doesn’t understand (or at least touch them only if it has established with substantial confidence that the modification will preserve functionality). It will also avoid breaking the parts of itself that it does understand, obviously. Therefore it will not kill itself.
When evaluating counterfactual scenarios, the hypothesis the agent considers is not “these signals magically appear in my output channel by some supernatural means”, but “these signals may appear in my output channel due to some complex process that I can’t predict in full detail before I finish the current computation”.
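Here is a minimal toy sketch of that “only touch what you can verify” rule, purely illustrative; every name, number, and helper in it is made up.

```python
# Toy self-model: the parts the agent understands, plus a stand-in verifier that
# returns the agent's confidence that a modification preserves functionality.
UNDERSTOOD_PARTS = {"cooling_fan", "log_rotation"}

def verify_preserves_function(part, modification):
    # Placeholder for a real proof/testing step; here just a fixed, pessimistic guess.
    return 0.5

def safe_to_apply(modification, affected_parts, threshold=0.99):
    """Allow a self-modification only if every affected part is either understood
    or verified to keep working with high confidence."""
    for part in affected_parts:
        if part in UNDERSTOOD_PARTS:
            continue                                   # known part: normal planning handles it
        if verify_preserves_function(part, modification) < threshold:
            return False                               # opaque part, no strong guarantee: refuse
    return True

print(safe_to_apply("undervolt", {"cooling_fan"}))            # True
print(safe_to_apply("remove_noisy_load", {"mystery_block"}))  # False
```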
To avoid “messing with the parts of itself” it needs to be able to tell whether actions do or do not mess with parts of itself. Moving itself to another location: is that messing with itself or not? In a non-isotropic universe, turning around could kill you, just as excessive acceleration can kill you in our universe.
I don’t doubt that in principle you could hard-code into an AI some definition of what the “parts of itself” are and what constitutes messing with them, so that it can avoid messing with those parts without knowing what they do. The point is that this won’t scale, and will break down if the AI gets too clever.
As for self-preservation in AIXI-tl, there’s a curious anthropomorphization bias at play. Suppose that the reward were −0.999 and the lack of reward were −1. The math of AIXI works exactly the same, but the common-sense intuition switches from the mental image of a gluttonous hedonist that protects itself to a tortured being yearning for death. In actuality, it’s neither: the math of AIXI does not account for the destruction of the physical machinery one way or the other. That destruction is neither a reward nor a lack of reward; it simply never happens in its model. Calling one value “reward” and the other “absence of reward” makes us wrongly assume that the destruction of the machinery corresponds to the latter.
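A quick way to see why the math is unchanged (my own worked step, assuming a fixed horizon m and expected-sum maximization): relabelling the rewards by any positive affine map r ↦ αr + β, e.g. {0, 1} ↦ {−1, −0.999}, rescales every action sequence’s value by α > 0 and shifts it by the same constant, so the argmax is untouched.

```latex
\arg\max_{a_t} \, \mathbb{E}\!\left[\sum_{k=t}^{m} (\alpha r_k + \beta)\right]
  = \arg\max_{a_t} \left( \alpha\, \mathbb{E}\!\left[\sum_{k=t}^{m} r_k\right] + (m-t+1)\,\beta \right)
  = \arg\max_{a_t} \, \mathbb{E}\!\left[\sum_{k=t}^{m} r_k\right]
  \qquad (\alpha > 0)
```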