Thanks for the questions. I’ll discuss this in a lot more detail in my next few posts.
if the AIXI is embedded in the real world in some way (like, say, on a laptop)
AIXI isn’t computable. Assuming our universe is computable, that means you can’t have an AIXI laptop.
But even if our universe turns out to be non-computable in a way that makes AIXI physically possible, we’ll still face the problem that AIXI only hypothesizes computable causes for its sensory information. Solomonoff inductors can never entertain the possibility of the existence of Solomonoff inductors. Much less believe in Solomonoff inductors. Much less believe that they are Solomonoff inductors.
It would then make sense to hypothesize that this laptop is indeed itself.
AIXI can’t represent ‘X is myself’. AIXI can only represent computer programs outputting strings of digits. Suppose we built an approximation of AIXI that includes itself in its hypothesis space, like AIXItl. Still AIXItl won’t be able to predict its own destruction, because destruction is not equivalent to any string of digits. Solomonoff inductors are designed to predict the next digit in an infinite series; halting programs aren’t in the hypothesis space.
So if AIXItl realizes that there’s a laptop in its environment whose hardware changes correlate especially strongly with the binary string, still AIXItl won’t be able to even think about the possibility ‘my computation is the hardware, and can be destroyed’.
We can try to patch that fundamental delusion (e.g., by making it believe that it will go to Hell if it smashes itself), but the worry is that such a delusion won’t be able to be localized in an AGI; it will infect other parts of the AGI’s world-view and utility function in ways that may be hard to predict and control. A general intelligence that permanently believes in Hell and mind-body dualism may come to a lot of strange and unexpected beliefs.
Or, we could just start it up and tell it, “Hey, this laptop is you”.
You can’t do that with Solomonoff induction, so we’ll need some new formalism that permits this. If we give a prior like this to the AGI, we’ll also need to be sure it’s an updatable hypothesis, not an axiom.
Well, here is my take on how AIXI would handle these sorts of situations:
First, let’s assume it lives in a universe which at any time t is in a state S(t) that is computable in terms of t. Now, AIXI finds this function S(t) interesting because it can be used to predict its input bits. More precisely, AIXI generates some function f which locates the machine running it within the world-state and returns the input bits that machine receives, and adopts the model in which its input at time t is f(S(t)). This function f is AIXI’s phenomenological bridge; it emerges naturally from the formalism of AIXI. This does not take into account the way AIXI’s model makes its future inputs depend on its current outputs, which would make the model more complicated.
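To make the bridge idea concrete, here is a minimal toy sketch in Python; the world-state S, the bridge function f, and the register layout are all invented for illustration and are not part of the AIXI formalism itself:

```python
# Toy sketch of the "phenomenological bridge" described above. S, f, and the
# register layout are hypothetical; this only shows the shape of the model.

def S(t):
    """Hypothetical computable world-state at time t (here just two registers)."""
    return {"register_A": (3 * t + 1) % 256, "register_B": (7 * t) % 256}

def f(state):
    """Hypothetical bridge: locate the machine running the agent inside the
    world-state and read off the input bits that machine receives."""
    return format(state["register_A"], "08b")

def predicted_input(t):
    """The agent's model of its own input at time t: f(S(t))."""
    return f(S(t))

for t in range(3):
    print(t, predicted_input(t))
```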
Now suppose that AIXI considers an action with the result that at some time t’ the machine computing it no longer exists. Then AIXI would still be able to compute S(t’), but f(S(t’)) would no longer be well defined. What would AIXI do then? It would have to start using a different model for its inputs. Whether it will perform such an action depends on its predictions of its reward signal under this alternative. The exact result would be unpredictable, but one possible regularity would be that if it receives a very low reward signal, then by regression to the mean it would expect to do better in these alternative models and would be in favor of actions which lead to its host machine’s destruction.
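A rough numeric sketch of that regression-to-the-mean point, with entirely made-up numbers (this only shows the shape of the comparison, not anything AIXI literally computes):

```python
# Made-up numbers illustrating the regression-to-the-mean point above.
import random

# Model where the host machine survives: the current, very low reward persists.
reward_if_machine_kept = -0.9

# Alternative hypotheses about the reward stream once f(S(t')) is undefined:
# under a broad prior they cluster around the prior mean, not the current value.
random.seed(0)
alternative_rewards = [random.uniform(-1.0, 1.0) for _ in range(1000)]
reward_if_machine_destroyed = sum(alternative_rewards) / len(alternative_rewards)

print("keep machine:   ", reward_if_machine_kept)
print("destroy machine:", round(reward_if_machine_destroyed, 3))
# When the current reward sits far below the prior mean, the mixture over
# alternative models can look better, favoring the destructive action.
```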
However, it gets more complicated than that. While in our intuitive models AIXI’s input is no longer well defined when its host machine is destroyed, in its internal model the function f would probably be defined everywhere. For example, if its inputs are stored in a string of capacitors, its function f may be “the electric fields at points x0, …, xn”, which is defined even when the capacitors are destroyed or displaced. A more interesting example would be if its inputs are generated from perfect-fidelity measurements of the physical world. Then the most favored hypothesis for f may be that f(s) is the measurement of those observables, and AIXI’s actions would optimize the physical parameter corresponding to its reward circuit regardless of what it predicts would happen to its reward circuit.
It gets even more interesting. Suppose such an AIXI predicts that its input stream will be tampered with. What would it do? Here, the part of its model which depends on its own output, which I previously ignored, becomes crucial. It would be reasonable for it to reason as follows: when the inputs to its machine don’t match the physical parameters those inputs are supposed to measure, AIXI’s predictions of its future inputs no longer match the inputs the machine receives. Therefore the machine’s actions will no longer match AIXI’s intentions, but AIXI’s reward signal will still be at the mercy of this machine. This would generally be assigned a suboptimal utility and be avoided. However, AIXI’s model for its output circuit may be that it influences the physical state even after its host machine no longer implements it. In that case, it would not be reluctant to tamper with its input circuit.
Overall, AIXI’s actions eerily resemble the way humans behave.
These scenarios call for SMBC-like comic strip illustrations. Maybe ping Zach?
I affirm all of this response except the last sentence. I don’t think humans go wrong in quite the same way...
No?
Scenario 1:
would be in favor of actions which lead to its host machine’s destruction.
Soldiers do that when they volunteer to go on suicide missions.
Scenario 2:
actions would optimize the physical parameter corresponding to its reward circuit regardless of what it predicts would happen to its reward circuit.
That’s the reason people write wills.
Scenario 3:
AIXI’s model for its output circuit may be that it influences the physical state even after its host machine no longer implements it. In that case, it would not be reluctant to tamper with its input circuit.
That’s how addicts behave. Or even non-addicts when they choose to imbibe (and possibly drive afterward).
These are some broad analogies. But analogies are not the best approach to reasoning about something if we already know important details, which happen to be different.
The specific details of human thinking and acting are different from the specific details of AIXI functioning. Sometimes an analogous thing happens. Sometimes not. And the only way to know when the situation is analogous is when you already know it.
I agree that these analogies might be superficial; I simply noted that they exist in reply to Eliezer stating “I don’t think humans go wrong in quite the same way...”
The specific details of human thinking and acting are different from the specific details of AIXI functioning.
Do we really know the “specific details of human thinking and acting” to make this statement?
Do we really know the “specific details of human thinking and acting” to make this statement?
I believe we know quite enough to consider it pretty unlikely that the human brain stores an infinite number of binary descriptions of Turing machines along with their probabilities, initialized by Solomonoff induction at birth (or perhaps at conception) and later updated on evidence according to Bayes’ theorem.
Even if words like “infinity” or “incomputable” are not convincing enough (okay, perhaps the human brain runs the AIXI algorithm with some unimportant rounding), there are things like human-specific biases generated by evolutionary pressures—which is one of the main points of this whole website.
Seriously, the case is closed.
Even if words like “infinity” or “incomputable” are not convincing enough
Presumably any realizable version of AIXI, like AIXItl, would have to use a finite amount of computation, so no.
there are things like human-specific biases generated by evolutionary pressures
Right. However, some of those could be due to improper weighting of some of the models, or poor priors, etc. I am not sure that the case is as closed as you seem to imply.
AIXI can’t represent ‘X is myself’. AIXI can only represent computer programs outputting strings of digits. Suppose we built an approximation of AIXI that includes itself in its hypothesis space, like AIXItl. Still AIXItl won’t be able to predict its own destruction, because destruction is not equivalent to any string of digits.
Well, no.
If you start a reinforcement learning agent (AIXI, AIXItl or whatever) as a blank slate, and allow it to perform unsafe actions, then it can certainly destroy itself: it’s a baby playing with a loaded gun. That’s why convergence proofs in reinforcement learning papers often make ergodicity assumptions about the environment (it’s not like EY was the first one to have thought of this problem).
But if you give the agent a sufficiently accurate model of the world, or ban it from performing unsafe actions until it has built such a model from experience, then it will be able to infer that certain courses of action lead to world states where its ability to gain high rewards becomes permanently compromised, e.g. states where the agent is “dead”, even if it has never experienced these states firsthand (after all, that’s what induction is for).
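As a minimal sketch of that kind of inference (ordinary model-based planning, not AIXI; the states, actions, and rewards are invented for illustration):

```python
# A model-based agent with an accurate world model infers that an action leading
# to an absorbing "dead" state forfeits all future reward, without ever having
# visited that state. Plain value iteration over the model, for illustration only.

GAMMA = 0.9
STATES = ["alive", "dead"]
ACTIONS = ["work", "play_with_gun"]

def model(state, action):
    """The agent's model of the world: returns (next_state, reward)."""
    if state == "dead":
        return "dead", 0.0            # absorbing: no further reward, ever
    if action == "play_with_gun":
        return "dead", 0.0
    return "alive", 1.0               # the safe action earns a steady reward

V = {s: 0.0 for s in STATES}
for _ in range(200):                  # value iteration over the *model*
    V = {s: max(model(s, a)[1] + GAMMA * V[model(s, a)[0]] for a in ACTIONS)
         for s in STATES}

best = max(ACTIONS, key=lambda a: model("alive", a)[1] + GAMMA * V[model("alive", a)[0]])
print(V, "best action from 'alive':", best)   # the unsafe action is never chosen
```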
Indeed, adult humans don’t have a full model of themselves and their environment, and many of them even believe that they have “uncomputable” mechanisms of some sort (supernatural souls), and yet they tend not to drop anvils on their heads.
then it will be able to infer that certain courses of action lead to world states where its ability to gain high rewards becomes permanently compromised
The problem is that it will be unable to identify all such actions.
In reality, the actions of AIXI (AIXI-tl or the like) are computed by some hardware, but in AIXI’s model the counterfactual future actions are somehow inserted into the model rather than computed within it. Consequently, some of the physical hardware that computes AIXI’s actual actions is not represented in the model as computing those actions.
Now, suppose that you see a circuit diagram where there’s some highly important hardware (e.g. transistors that drive CPU output pins), and an unnecessary, unpredictably noisy resistive load connected in parallel with it… you’ll remove that unnecessary and noisy load.
edit: and for AI safety, this is of course very good news, just the same as the tendency of hot matter to disperse is very good news when it comes to nuclear proliferation.
In reality, the actions of AIXI (AIXI-tl or the like) are computed by some hardware, but in AIXI’s model the counterfactual future actions are somehow inserted into the model rather than computed within it. Consequently, some of the physical hardware that computes AIXI’s actual actions is not represented in the model as computing those actions. Now, suppose that you see a circuit diagram where there’s some highly important hardware (e.g. transistors that drive CPU output pins), and an unnecessary, unpredictably noisy resistive load connected in parallel with it… you’ll remove that unnecessary and noisy load.
It’s not obvious to me that a reinforcement learning agent with a sufficiently accurate model of the world would do that. Humans don’t. At most, a reinforcement learning agent capable of self-modification would tend to wirehead itself.
edit: and for AI safety, this is of course very good news, just the same as the tendency of hot matter to disperse is very good news when it comes to nuclear proliferation.
IIUC, nuclear proliferation is limited by the fact that enriched uranium and plutonium are hard to acquire. Once you have got the fissile materials, making a nuclear bomb probably isn’t much more complicated than making a conventional modern warhead.
The fact that hot matter tends to disperse is relevant to the safety of nuclear reactors: they can’t explode like nuclear bombs because if they ever reach prompt supercriticality, they quickly destroy themselves before a significant amount of fuel undergoes fission.
I don’t think AI safety is a particularly pressing concern at the moment mainly because I don’t buy the “intelligence explosion” narrative, which in fact neither EY nor MIRI were ever able to convincingly argue for.
It’s not obvious to me that a reinforcement learning agent with a sufficiently accurate model of the world would do that. Humans don’t.
Humans do all sorts of things and then those that kill themselves do not talk to other people afterwards.
The problem is not with accuracy, or rather, not with low accuracy but rather with the overly high accuracy. The issue is that in the world model we have to try potential actions that we could do, which we need to somehow introduce into the model. We can say that those actions are produced by this black box the computer which needs to be supplied power, and so on. Then this box, as a whole, is, of course, protected from destruction. It is when accuracy increases—we start looking into the internals, start resolving how that black box works—that this breaks down.
At most, a reinforcement learning agent capable of self-modification would tend to wirehead itself.
This is usually argued against by pointing at something like AIXI.
Once you have got the fissile materials, making a nuclear bomb probably isn’t much more complicated than making a conventional modern warhead.
It’s still a big obstacle (and the simpler gun type design requires considerably more fissile material). If some terrorists stole, say, 2 critical masses of plutonium, they would be unable to build a bomb.
Though I agree that nuclear reactors are a much better analogy. An accidental intelligence explosion on extremely short timescales is nonsense even if an intelligence explosion is a theoretical possibility, in part because systems not built under the assumption of an intelligence explosion would not implement the necessary self-protection, but would rely on simple solutions that only work as long as the system cannot think its way around them.
It is when accuracy increases—we start looking into the internals, start resolving how that black box works—that this breaks down.
No computer program can predict its own output before actually computing it. Thus, any computable agent will necessarily have to treat some aspect of itself as a black box. If the agent isn’t stupid, has a reasonably good model of itself, and has some sort of goal of the form “do not kill yourself”, then it will avoid messing with the parts of itself that it doesn’t understand (or at least touch them only if it has established with substantial confidence that the modification will preserve functionality). It will also avoid breaking the parts of itself that it understands, obviously. Therefore it will not kill itself.
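A hedged sketch of the first point, that an agent cannot carry a complete predictor of its own output: any candidate predictor it could consult can simply be defied, so the prediction cannot always be right (the functions below are purely illustrative):

```python
# Illustrative only: if an agent could consult a complete predictor of its own
# next output, it could defy it, so no such predictor can always be right.
# Some aspect of the agent therefore stays a black box to itself.

def candidate_predictor():
    """Any proposed predictor of the agent's next output bit."""
    return 0

def agent(predict_my_output):
    """An agent that asks the predictor what it will do, then does the opposite."""
    return 1 - predict_my_output()

print("predicted:", candidate_predictor(), "actual:", agent(candidate_predictor))
```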
When evaluating counterfactual scenarios, the hypothesis the agent considers is not “these signals magically appear in my output channel by some supernatural means”, but “these signals may appear in my output channel due to some complex process that I can’t predict in full detail before I finish the current computation”.
To avoid ‘messing with the parts of itself’ it needs to be able to tell whether actions do or do not mess with parts of itself. Moving oneself to another location: is that messing with itself, or is it not? In a non-isotropic universe, turning around could kill you, just as excessive accelerations could kill you in our universe.
I wouldn’t doubt that in principle you could hard-code into an AI some definition of what “parts of itself” are and what constitutes messing with them, so that it does not mess with those parts without knowing what they do; the point is that this won’t scale, and will break down if the AI gets too clever.
As for self-preservation in AIXI-tl, there’s a curious anthropomorphization bias at play. Suppose that the reward was −0.999 and the lack of reward was −1. The math of AIXI will work the same, but the common-sense intuition switches from the mental image of a gluttonous hedonist that protects itself to that of a tortured being yearning for death. In actuality, it’s neither: the math of AIXI does not account for the destruction of the physical machinery in question one way or the other. It is neither a reward nor a lack of reward; it simply never happens in its model. Calling one value “reward” and the other “absence of reward” makes us wrongfully assume that destruction of the machinery corresponds to the latter.
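A small check of that reframing point: relabelling the two outcomes (1 and 0 versus −0.999 and −1) is just a positive affine transform of the rewards, and it leaves the preferred action unchanged (the rewards below are made up):

```python
# Made-up rewards: transforming them positively (e.g. 1 -> -0.999, 0 -> -1)
# does not change which action the math prefers, only how we describe it.

rewards_per_action = {"press_lever": [1, 1, 0], "sit_still": [0, 0, 0]}

def best_action(transform):
    totals = {a: sum(transform(r) for r in rs) for a, rs in rewards_per_action.items()}
    return max(totals, key=totals.get)

print(best_action(lambda r: r))                 # "hedonist" framing: rewards 1 / 0
print(best_action(lambda r: 0.001 * r - 1.0))   # "tortured" framing: rewards -0.999 / -1
```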
AIXI isn’t computable. Assuming our universe is computable, that means you can’t have an AIXI laptop.
Well, I was referring to Eliezer’s hypercomputable laptop example. And we are assuming that somewhere in this machine there is a bit that causes ‘experiencing blue’. And I’m also not talking about the AIXI building a model of itself or its own behavior in any way.
Still AIXItl won’t be able to predict its own destruction, because destruction is not equivalent to any string of digits.
Destruction implies ‘no more reward after this point.’
Let me give a more thorough example. Let’s say our AIXI (or AIXItl) laptop has found the bit that corresponds to its reward. Whenever this bit is flipped, it experiences reward (there are good arguments that it will not incessantly flip this bit until it dies, but there’s nothing preventing it from realizing the bit’s existence). It is able to infer that if this bit is destroyed, that will not at all be conducive to maximizing its reward. Thus it makes it a priority to keep this bit intact, and by extension it makes it a priority to keep the laptop intact.
Nowhere in the above example have I included ‘self’. I suspect a similar process goes on with humans. If I had cancer in my leg, I would not object to my leg being cut off to save my life. If, however, one day, someone came over to me and showed me some new scientific literature conclusively demonstrating that the leg is in fact the seat of intelligence and consciousness, and that the brain just does passive stuff like regulate heartbeat, I’d have a very different attitude. I’d want to preserve my leg at all costs (imagine for a moment that I have not seen or heard of anyone with their leg amputated or their brain dysfunctional, so I have no prior knowledge about such things). Nowhere has introspection entered this picture!
A bit has two possible values, 0 and 1; it doesn’t have a “destroyed” value. And, of course, it’s not enough to prefer an intact laptop over one wholly pulverised into plasma (the latter severs the cable connecting the laptop to the internet, after all). There are necessarily some parts of the laptop which in reality are necessary to compute AIXI but which in AIXI’s model do not compute AIXI’s output actions (as potential actions are magic’d into the model rather than arising through processes in the model). Those parts are liable to be tampered with, especially as any changes in their behaviour would be grossly misunderstood (e.g. an increase in clock speed would be misunderstood as the world slowing down).
edit: And especially as the existence of those parts is detrimental to the operation of the protected part of the laptop (due to their power consumption, heat, the risk of a short-circuit failure taking out the protected part, etc.). Somewhat simplifying, in reality there are the CPU internals that compute AIXI, and there’s the bus controller that sends the actions onto the bus, to eventually act on the real world, and that reads the rewards. In AIXI’s model of itself, there’s useless hardware (deep internals whose output is substituted for with potential actions) that is connected to the same power supply as the critically important components (those that relay the actions and rewards), endangering their operation.
A bit has two possible values, 0 and 1; it doesn’t have a “destroyed” value.
A physical bit does. Remember that we are talking about an actual bit stored inside a memory location on the computer (say, a capacitor in a DRAM cell).
And, of course, it’s not enough to prefer an intact laptop over one wholly pulverised into plasma
Why not? Not receiving any future reward is such a huge negative utility that it would take a very large positive utility to carry out an action that would risk that occurring. Would you allow a surgeon to remove some section of your brain for $1,000,000 even if you knew that that section would not affect your reward pathways?
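A rough illustration of that utility comparison, with invented numbers and a discount factor standing in for “future reward”:

```python
# Invented numbers: even a modest chance of permanently losing the future reward
# stream can outweigh a large one-off gain.

GAMMA = 0.99
ongoing = sum(GAMMA ** t for t in range(10_000))   # discounted value of the stream (~100)

one_off_gain = 10.0
p_fatal = 0.2                                      # chance the risky action ends the stream

value_safe = ongoing
value_risky = one_off_gain + (1 - p_fatal) * ongoing

print(round(value_safe, 1), round(value_risky, 1))  # the safe option wins here
```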
Would you allow a surgeon to remove some section of your brain for $1,000,000 even if you knew that that section would not affect your reward pathways?
If I had brain cancer or a cerebral AVM or the like, I’d pay to have it removed. See my edit. The root issue is that in AIXI’s model, potential actions (the ones it iterates through) are not represented as the output of some hardware, but are forced onto the model. Consequently the hardware that actually outputs those actions in the real world is not represented as critical. And it is connected in parallel, on the same power supply, with the understood-to-be-critically-important hardware which relays the actions. It literally thinks it has got a brain parasite. Of course it won’t necessarily drop an anvil on the whole thing just because of experimenting—that’s patently stupid. It will surgically excise some parts with great caution.