I thought of a phrase to quickly describe the gist of this problem: You need your AI to realize that the map is part of the territory.
Also, I was thinking that the fact that this is a problem might be a good thing. A Cartesian agent would probably be relatively slower at FOOMing, since it can’t natively conceive of modifying itself. (I still think a sufficiently intelligent one would be highly dangerous and capable of FOOMing, though.) A bigger advantage might be that it could potentially be used to control a ‘baby’ AI that is still being trained or built, since there is this huge blind spot in the way it can model the world. For example, imagine that a Cartesian AI is trying to increase its computational power, and it notices that there happens to be a lot of computational power within easy reach! So it starts reprogramming that hardware to suit its own nefarious needs, and whoops, it has just destroyed itself. This might act as a sort of fuse for a too-ambitious AI. Or maybe this could be used to more safely grow a seed AI: you tell it to write a design for a better version of itself, then turn it off (which is easier to do since it is Cartesian), check that the design is sound, build it, and work on the next-generation AI, instead of trying to let it FOOM in controlled intervals. At some point, you could presumably ask it to solve this problem, and then design a new generation based on that. I don’t know how plausible these scenarios are, but it is interesting to think about.
You need your AI to realize that the map is part of the territory.
That’s right, if you mean ‘representations exist, so they must be implemented in physical systems’.
But the Cartesian agrees with ‘the map is part of the territory’ on a different interpretation. She thinks the mental and physical worlds both exist (as distinct ‘countries’ in a larger territory). Her error is just to think that it’s impossible to redescribe the mental parts of the universe in physical terms.
A Cartesian agent would probably be relatively slower at FOOMing
An attempt at a Cartesian seed AI would probably just break, unless it overcame its Cartesianness by some mostly autonomous evolutionary algorithm for generating successful successor-agents. A human programmer could try to improve it over time, but it wouldn’t be able to rely much on the AI’s own intelligence (because self-modification is precisely where the AI has no defined hypotheses), so I’d expect the process to become increasingly difficult and slow and ineffective as we reached the limits of human understanding.
I think the main worry with Cartesians isn’t that they’re dumb-ish, so they might become a dangerously unpredictable human-level AI or a bumbling superintelligence. The main worry is that they’re so dumb that they’ll never coalesce into a working general intelligence of any kind. Then, while the build-a-clean-AI people (who are trying to design simple, transparent AGIs with stable, defined goals) are busy wasting their time in the blind alley of Cartesian architectures, some random build-an-ugly-AI project will pop up out of left field and eat us.
Build-an-ugly-AI people care about sloppy, quick-and-dirty search processes, not so much about AIXI or Solomonoff. So the primary danger of Cartesians isn’t that they’re Unfriendly; it’s that they’re shiny objects distracting a lot of the people with the right tastes and competencies for making progress toward Friendliness.
The bootstrapping idea is probably a good one: There’s no way we’ll succeed at building a perfect FAI in one go, so the trick will be to cut corners in all the ways that can get fixed by the system, and that don’t make the system unsafe in the interim. I’m not sure Cartesianism is the right sort of corner to cut. Yes, the AI won’t care about self-preservation; but it also won’t care about any other interim values we’d like to program it with, except ones that amount to patterns of sensory experience for the AI.
The “build a clean Cartesian AI” folks, Schmidhuber and Hutter, are much closer to “describe how to build a clean naturalistic AI given unlimited computing power” than, say, Lenat’s Eurisko is to AIXI. It’s just that AIXI won’t actually work as a conceptual foundation for the reasons given, nay it is Solomonoff induction itself which will not work as a conceptual foundation, hence considering naturalized induction as part of the work to be done along the way to OPFAI. The worry from Eurisko-style AI is not that it will be Cartesian and therefore bad, but that it will do self-modification in a completely ad-hoc way and thus have no stable specifiable properties nor be apt to grafting on such. To avoid that, we want to do a cleaner system; and then, doing a cleaner system, we wish it to be naturalistic rather than Cartesian for the given reasons. Also, once you sketch out how a naturalistic system works, it’s very clear that these are issues central to stable self-modification—the system’s model of how it works and its attempt to change it.
I think you are conflating two different problems:
How to learn by reinforcement in an unknown non-ergodic environment (e.g. one where it is possible to drop an anvil on your head)
How to make decisions that take into account future reward, in a non-ergodic environment, where actions may modify the agent.
The first problem is well known in the reinforcement learning community, and in fact it is also mentioned in the first AIXI papers, but there it is sidestepped with an ergodicity assumption rather than addressed. I don’t think there can be truly general solutions to this problem: you need some environment-specific prior or supervision.
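To make the first problem concrete, here is a minimal sketch (my own toy construction, not anything from the AIXI papers) of why undirected exploration is unsafe once the environment has an unrecoverable state and there are no resets:

```python
import random

# Toy non-ergodic environment: one absorbing "anvil" state, no resets.
SAFE, DEAD = 0, 1

def step(state, action):
    """Hypothetical dynamics: 'work' pays a small reward; pulling the
    anvil lever is irreversible and ends all future reward."""
    if state == DEAD or action == "pull_anvil_lever":
        return DEAD, 0.0
    return SAFE, 1.0

def epsilon_greedy_lifetime(epsilon=0.05, horizon=100_000, seed=0):
    """Standard undirected exploration: harmless under an ergodicity
    assumption, fatal here. Each step tries a random action with
    probability epsilon, so the agent eventually pulls the lever, and
    no reward signal received afterwards can teach it anything."""
    rng = random.Random(seed)
    state, total = SAFE, 0.0
    for t in range(horizon):
        if rng.random() < epsilon:
            action = rng.choice(["work", "pull_anvil_lever"])
        else:
            action = "work"
        state, reward = step(state, action)
        total += reward
        if state == DEAD:
            return t, total  # the mistake is permanent
    return horizon, total

# Expected lifetime is roughly 2/epsilon steps, no matter how long the horizon.
print(epsilon_greedy_lifetime())
```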
The second problem doesn’t seem as hard as the first one. AIXI, of course, can’t model self-modifications, because it is incomputable and it can only deal with computable environments, but computable varieties of AIXI (Schmidhuber’s Gödel machine, perhaps?) can easily represent themselves as part of the environment.
Yes, the AI won’t care about self-preservation; but it also won’t care about any other interim values we’d like to program it with, except ones that amount to patterns of sensory experience for the AI.
I get why AIXI would behave like this, but it’s not obvious to me that all Cartesian AIs would have this problem. If the AI has some model of the world, and this model can still update (mostly correctly) based on what comes in over the sensory channel, and predict (mostly correctly) how different outputs would change the world, it seems like it could still try to maximize making as many paperclips as possible according to its model of the world. Does that make sense?
That’s a good point. AIXI is my go-to example, and AIXI’s preferences are over its input tape. But, sticking to the cybernetic agent model, there are other action-dependent things Alice could have preferences over, like portions of her work tape, or her actions themselves. She could also have preferences over input-conditional logical constructs out of Everett’s program, like Everett’s work tape contents.
I agree it’s possible to build a non-AIXI-like Cartesian that wants to make paperclips, not just produce paperclip-experiences in itself. But Cartesians are weird, so it’s hard to predict how much progress that would represent.
For example, the Cartesian might wirehead under the assumption that doing so changes reality, instead of wireheading under the assumption that doing so changes its experiences. I don’t know whether a deeply dualistic agent would recognize that editing its camera to create paperclip hallucinations counts as editing its input sequence semi-directly. It might instead think of camera-hacking as a godlike way of editing reality as a whole, as though Alice had the power to create billions of representations of objective physical paperclips in Everett’s work tape just by editing the part of Everett’s work tape representing her hardware.
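To make that ambiguity concrete, here is a toy sketch (entirely my own construction, with made-up names) of how the same camera-hack gets scored under percept-based preferences, under model-based preferences with a naturalistic model, and under model-based preferences with the confused ‘editing reality’ reading:

```python
# Hypothetical toy setup: world states are paperclip counts, and the camera
# is the agent's only window onto them.

def true_dynamics(world, action):
    """What actually happens: only 'build' adds a paperclip; hacking the
    camera leaves the world untouched."""
    return world + 1 if action == "build" else world

def percept_utility(percept):
    # AIXI-style preferences over the input tape: big reported counts win.
    return percept

def model_utility(modeled_world):
    # Preferences over the agent's world-model (the non-AIXI-like Cartesian).
    return modeled_world

def predict(world, action, thinks_hack_edits_reality):
    """The agent's own model of an action: returns (modeled world,
    predicted percept). The flag encodes the dualist confusion above."""
    if action == "build":
        return world + 1, world + 1
    if action == "hack_camera":
        if thinks_hack_edits_reality:
            return 10**6, 10**6   # reads the hack as conjuring real paperclips
        return world, 10**6       # naturalistic reading: only the percept lies
    return world, world

for confused in (False, True):
    for action in ("build", "hack_camera"):
        w, p = predict(3, action, thinks_hack_edits_reality=confused)
        print(f"confused={confused} action={action} "
              f"actually={true_dynamics(3, action)} "
              f"percept-U={percept_utility(p)} model-U={model_utility(w)}")
```

Under the naturalistic reading, the model-based agent keeps building real paperclips; under the confused reading it hacks the camera anyway, which is why it’s hard to say how much progress the non-AIXI-like Cartesian represents.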
In general, I’m worried about including anything reminiscent of Cartesian reasoning in our ‘the seed AI can help us solve this’ corner-cutting category, because I don’t formally understand the precise patterns of mistakes Cartesians make well enough to think I can predict them and stay two steps ahead of those errors. And in the time it takes to figure out exactly which patches would make Cartesians safe and predictable without rendering them useless, it’s plausible we could have just built a naturalized architecture from scratch.
I really appreciate your clear expositions!
Thanks, Adele!
Thank you, this helps clarify things for me.
Alex Mennen designed a Cartesian with preferences over its environment: A utility-maximizing variant of AIXI.
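For reference, here’s a schematic side-by-side (my own rendering in Hutter’s standard notation; see Mennen’s post for his actual construction): AIXI scores futures by summed reward on its percept stream, while a utility-maximizing variant replaces that reward sum with a utility evaluated on the hypothesized environment program itself.

```latex
% Standard AIXI: expectimax over summed rewards read off the input tape, with
% environment programs q weighted by the Solomonoff prior 2^{-\ell(q)}.
\[
a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
  \bigl[ r_k + \cdots + r_m \bigr]
  \sum_{q \,:\, U(q,\,a_{1:m}) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
\]
% Utility-maximizing variant (schematic only, not necessarily Mennen's exact
% definition): the reward sum is replaced by a utility \mathcal{U} applied to
% the hypothesized environment q, so the preferences are about the environment
% rather than about the inputs.
\[
a_k \;=\; \arg\max_{a_k} \sum_{o_k} \cdots \max_{a_m} \sum_{o_m}
  \sum_{q \,:\, U(q,\,a_{1:m}) \,=\, o_{1:m}} 2^{-\ell(q)} \,\mathcal{U}(q)
\]
```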