So it discovers that destroying a particular building in NY made NY look plain black and made its effectors in NY stop doing anything. It infers from the available evidence that NY still exists and is behaving normally in other respects. It discovers similar buildings in other cities that have the same effect. At this point it can infer that destroying the magic building in a given city will make that city look black and make its effectors in that city stop moving.
But how does it come to care? How does it make the leap from “I will receive blank sensory input from this location” to “my goals are less likely to be fulfilled”? It might observe that its goals seem easier to achieve in cities where the magic building is still present, but it can’t accurately model agents as complex as itself, and it has no way to treat itself differently from any other “ally” that seems to be helping the same cause. Which… I can’t prove is irrational, but it certainly seems a bit odd.
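(For concreteness, here is the standard AIXI expectimax as I remember it from Hutter’s definition; the notation may be slightly off, but the structure is the point:)

$$a_k := \arg\max_{a_k} \sum_{o_k r_k} \max_{a_{k+1}} \sum_{o_{k+1} r_{k+1}} \cdots \max_{a_m} \sum_{o_m r_m} \big[r_k + \cdots + r_m\big] \sum_{q \,:\, U(q,\, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$

The only things the maximisation ever touches are the reward symbols $r_k, \ldots, r_m$ in the predicted percept stream up to the horizon $m$, so “the building I run in gets destroyed” can only enter through its predicted effect on those symbols, which is exactly where the “how does it care?” question bites.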
I originally thought the anvil problem was obviously correct once I’d seen it briefly described, but now I think (having read some of your other comments) that I might be confused in the same way as you. I suspect we are mistaken if we get hung up on the visceral or emotional identification with self, and that it is not essential; humans have it, but it should not matter to the presence or absence of anvil-type problems whether that feeling is present. (Possibly in the same way that UDT does not need to feel a ‘sense of self’ like we do in order to coordinate its instantiations?)
I am also wondering if the ‘sense of self and its preservation’ is being treated magically by my brain as something qualitatively different from the general class of things that cause systems to try to protect themselves because they ‘want’ to. (Does this phrase introduce backwards teleology?) It seems like the sense of self is possibly just extraneous to a system that ‘learns’ (though again we must be careful in using that word to avoid anthropomorphising) to output certain actions through reinforcement.
It should be irrelevant whether a process looks at something and manipulates it with motor outputs (based on what it’s learned through reinforcement?) ‘robotically’, or whether it manipulates that thing with motor outputs while making sounds about ‘me’ and its ‘rich emotional experience’ (or, heck, ‘qualia’). Maybe a witty way of saying this would be that ‘tabooing the sense of self should not affect decisions’? Obviously the set of things that make sounds about ‘me’ and the set of things that do not are distinct, but it doesn’t seem like there are any inherent differences between those two sets that are relevant to avoiding danger?
It also seems like an important observation that humans are ‘created in motion’, such that they have certain self-preserving instincts by virtue of the arrangement of the ‘physical stuff’/atoms/etc. making them up. Any algorithm must also be implemented on an actual substrate (even if it is not what we would usually think of when we hear ‘physical’), and the particular implementation (the direction in which it is created in motion, so to speak) will affect its behaviour and its subsequent evolution as a system.
Another possible line of insight is that animals, and even simpler systems (like a train that automatically throws its brakes at high speeds), do not seem to have (as strong) senses of self, yet they exhibit self-preservation. This is obvious (in retrospect?), but it makes me think that the key point of the anvil objection is that AIXI does not necessarily realise that self-preservation is valuable/does not necessarily become good at preserving itself, i.e. self-preservation is not a basic AIXI drive. But deducing the importance of self-preservation/managing to self-preserve seems so substrate-/location-dependent that this does not seem a legitimate criticism of AIXI in particular, but rather a general observation of a problem that occurs when instantiating an abstract algorithm in a world. (But now this sounds like an objection that I think RobbBB might already have addressed about ‘but any algorithm fails in a hostile environment’, so I really should actually read this post instead of buzzing around the comments!)
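(A toy sketch of the kind of self-preservation-without-a-self-model that the train example points at; everything here, names and threshold included, is invented purely for illustration:)

```python
# Toy illustration: a 'self-preserving' overspeed brake with no self-model.
# There is no representation of 'me' anywhere here, only a condition on a
# sensor reading and a hard-wired response, yet the system reliably acts in
# ways that avoid its own destruction. (Names and threshold are invented.)

MAX_SAFE_SPEED = 120.0  # km/h, arbitrary illustrative threshold


def brake_controller(speed_reading: float) -> str:
    """Return this tick's action, given nothing but a speed sensor reading."""
    if speed_reading > MAX_SAFE_SPEED:
        return "apply_emergency_brakes"
    return "do_nothing"


if __name__ == "__main__":
    for speed in (80.0, 119.9, 150.0):
        print(f"{speed} km/h -> {brake_controller(speed)}")
```

Nothing in there represents ‘me’; ‘exhibits self-preserving behaviour’ is just a fact about how the mechanism is wired into its environment.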
I’m starting to think this might be connected to my uneasiness with/confusion about some of the AI drives stuff.
Edit:
I should also note that the Cartesian presentation of AIXI’s functioning raised a flag with me. Maybe the anvil objection is unfair to AIXI: it criticises the abstract formalism of AIXI for not being self-preserving, but ‘exhibits self-preserving behaviour’ only matters (or is even only defined?) for actual instantiations of a decision algorithm, and criticising the formalism for not exhibiting this property (what would that even mean?) is a fully general argument against decision algorithms. (Okay, this is just me saying what I already said. But this seems like another useful way to think about it.)
More on created in motion etc.: I’m not sure if RobbBB is suggesting that AIXI falls short compared to humans specifically because we have bridging laws that allow self-preservation. But what if what one would call ‘humans using bridging laws’ is something like what being a self-aware, self-preserving, able-to-feel-anguish system feels like on the inside, in the same way that ‘pain’ might just be what being a conscious system with negative feedback systems feels like on the inside? RobbBB’s objection seems to be that AIXI fails to meet a reasonable standard (i.e. using bridging laws to necessarily value self-preservation). But if humans are meant to be his existence proof that this standard can be met, I’m not sure they actually meet it. And if they’re not supposed to be, then I wonder what something that did meet this standard would even look like.
It might observe that its goals seem easier to achieve in cities where the magic building is still present,
I think you just answered your own question. Indeed, if the agent found that destroying its instances does not lead to less of its goals being achieved, then even a “naturalized” reasoner should not particularly care about destroying itself entirely.
Now, you say the agent would treat instances of itself the same way it would treat an ally. There’s a difference: an ally is someone who behaves in ways that benefit it, while an instance is something whose actions correlate with its output signal. The fact that it has fine-grained control over instances of itself should lead it to treat itself differently from allies. But if the agent has an ally that completely reliably transmits true information to it and performs its requests, then yes, the agent should treat that ally the same way it treats parts of itself.
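(A toy sketch of that criterion, with invented data, just to make “actions correlate with its output signal” concrete; the 0.9 threshold and all names are made up:)

```python
# Toy illustration of the instance-vs-ally criterion: an 'instance' is a body
# whose observed actions track my own output signal; an 'ally' is an
# independent agent whose behaviour merely tends to help me. All data, names
# and the 0.9 threshold are invented for illustration.

def fraction_matching(my_outputs, observed_actions):
    """Fraction of timesteps on which the observed body did what I output."""
    matches = sum(1 for mine, theirs in zip(my_outputs, observed_actions) if mine == theirs)
    return matches / len(my_outputs)


my_outputs       = ["left", "left", "right", "wait", "right", "left"]
body_in_ny       = ["left", "left", "right", "wait", "right", "left"]   # tracks my output signal
helpful_stranger = ["left", "right", "right", "left", "wait", "left"]   # helps, but independently

for name, actions in [("body_in_ny", body_in_ny), ("helpful_stranger", helpful_stranger)]:
    score = fraction_matching(my_outputs, actions)
    label = "instance (fine-grained control)" if score > 0.9 else "ally at best"
    print(f"{name}: match rate {score:.2f} -> {label}")
```

On a test like this the body in NY comes out as something the agent steers directly, while the helpful stranger only comes out as an ally, however reliably benevolent.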
I think you just answered your own question. Indeed, if the agent found that destroying its instances does not lead to less of its goals being achieved, then even a “naturalized” reasoner should not particularly care about destroying itself entirely.
You can’t win, Vader. If you strike me down, I shall become more powerful than you can possibly imagine.