In the discussion about AI-based vs. upload-based singularities, and the expected utility of pushing for WBE (whole-brain emulation) first, has it been taken into account that an unfriendly AI is unlikely to do something worse than wiping out humanity, while the same isn’t necessarily true in an upload-based singularity?
“Yes” in the sense that people are aware of the argument, which goes back at least as far as Vernor Vinge in 1993. But “no” in two other senses: there are also arguments that a failed attempt at FAI being worse than extinction may not be highly unlikely (especially since some FAI proposals, such as Paul Christiano’s, are actually very closely related to uploading), and we don’t know how to take considerations like this one into account except by using our intuitive judgments, which seem extremely unreliable.
The non-negligible chance of waking up to a personal hell-world (including via partial or failed revivification) is the main non-akratic reason I’m not signed up for cryonics. I currently think AGI is coming sooner than WBE, but if WBE starts pulling ahead, I would be even more disinclined to sign up for cryonics.
Wei, do you know of any arguments better than XiXiDu’s that a failed attempt at FAI could very well be worse than extinction?
I’m not aware of an especially good writeup, but here’s a general argument. Any attempt to build an AGI induces a distribution over possible outcomes, and the distribution induced by an attempt at FAI in particular can be thought of as a circle of uncertainty around an FAI in design space. AGIs that cause worse-than-extinction outcomes are clustered around FAIs in design space. So an attempt at FAI may be more likely to hit one of these worse-than-extinction AGIs than an attempt to build an AGI without any consideration of Friendliness.
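To make the shape of the argument concrete, here’s a toy Monte Carlo sketch. Everything in it is an assumption of mine purely for illustration (the one-dimensional geometry, the radii, the aiming-error model), not an estimate of anything real:

```python
import random

# Toy model: designs are points on a line with the intended FAI at 0.
#   - within HIT_RADIUS of the target: an actual FAI
#   - in the shell out to NEAR_MISS_RADIUS: a near miss that still optimizes
#     something human-related, but wrongly (the worse-than-extinction cluster)
#   - anywhere else: an AGI indifferent to us (plain extinction)
# All numbers below are made up for illustration.
HIT_RADIUS = 0.05
NEAR_MISS_RADIUS = 0.30
AIM_ERROR = 0.25           # std. dev. of a team's error when aiming at the FAI
DESIGN_SPACE_WIDTH = 10.0  # half-width of the region unaimed attempts land in
TRIALS = 100_000

def classify(design: float) -> str:
    d = abs(design)
    if d < HIT_RADIUS:
        return "friendly"
    if d < NEAR_MISS_RADIUS:
        return "worse than extinction"
    return "extinction"

def attempt(aiming_at_fai: bool) -> float:
    if aiming_at_fai:
        return random.gauss(0.0, AIM_ERROR)                         # FAI attempt
    return random.uniform(-DESIGN_SPACE_WIDTH, DESIGN_SPACE_WIDTH)  # no aiming

for label, aimed in [("FAI attempt", True), ("AGI attempt, no Friendliness", False)]:
    counts = {"friendly": 0, "worse than extinction": 0, "extinction": 0}
    for _ in range(TRIALS):
        counts[classify(attempt(aimed))] += 1
    print(label, {k: round(v / TRIALS, 3) for k, v in counts.items()})
```

On these made-up numbers the aimed attempt lands in the near-miss shell far more often than the unaimed one; the argument only needs that comparative claim, not any particular probabilities.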
AGIs that cause worse-than-extinction outcomes are clustered around FAIs in design space.
Yes, that’s the part I’d like to see developed more. Maybe SI or FHI will get around to it eventually, but in the meantime I wouldn’t mind somebody like Wei Dai taking a crack at it.
Part of the problem in developing the argument is that you need a detailed concept of what a successful FAI design would look like, in order to then consider what similar-but-failed designs are like.
One approach is to think in terms of the utility function or goal system. Suppose that a true FAI has a utility function combining some long list of elemental values with a scheme for rating their relative importance. Variations on this design might miss an essential value, add a false value, and/or get the recipe for combining the elemental values wrong.
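As a throwaway illustration of those three failure modes (the value names and weights below are placeholders of mine, not a proposed value system):

```python
# Toy utility function: a weighted sum of elemental value scores.
TRUE_WEIGHTS = {"welfare": 0.5, "freedom": 0.3, "diversity": 0.2}

def utility(weights, world):
    """Combine elemental value scores for a world using importance weights."""
    return sum(w * world.get(name, 0.0) for name, w in weights.items())

# 1. Missing an essential value: "freedom" is silently dropped.
missing_value = {"welfare": 0.7, "diversity": 0.3}
# 2. Adding a false value: something we don't actually care about gets optimized.
false_value = {**TRUE_WEIGHTS, "paperclips": 0.4}
# 3. Getting the recipe wrong: the right values, badly mis-weighted.
misweighted = {"welfare": 0.05, "freedom": 0.05, "diversity": 0.9}

# A world a corrupted agent might steer toward can look good by its lights
# while looking mediocre by the true function's lights:
world = {"welfare": 0.1, "freedom": 0.0, "diversity": 1.0, "paperclips": 1.0}
for name, weights in [("true", TRUE_WEIGHTS), ("missing value", missing_value),
                      ("false value", false_value), ("misweighted", misweighted)]:
    print(f"{name}: {utility(weights, world):.2f}")
```

The point is just that each corruption rates highly some world that the true function does not.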
Another way to fail is to have the values right in principle but then to apply them wrongly in practice. My favorite example: what if the AI thinks that some class of programs is conscious when actually they aren’t? It might facilitate the creation of an upload civilization which is only a simulation of utopia and not actually a utopia, or it might incorrectly attach moral significance to the nonexistent qualia of programs which aren’t conscious but which fake it. (Though neither of these is really “worse than extinction”. The first, taken to its extreme, just is extinction, while the worst I can see coming from the second scenario is a type of “repugnant conclusion” in which the conscious beings are made to endure privation for the sake of vast sim-populations that aren’t even conscious.)
Still another way to conceptualize “successful FAI design”, in order to then think about unsuccessful variations, is to think of the FAI as a developmental trajectory. The FAI is characterized by a set of initial conditions, such as specific answers to the questions: how does it select its utility function, how does it self-modify, and how does it keep its values appropriately stable under self-modification? Then you would consider what goes wrong down the line if you get one or more of those answers wrong.
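If it helps, here is a crude rendering of that framing as a data structure. The field names are just my paraphrase of those three questions, and the “answers” are placeholders; nothing here is a real design:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class InitialConditions:
    utility_selection: str   # how it selects its utility function
    self_modification: str   # how it self-modifies
    value_stability: str     # how it keeps values stable under self-modification

intended = InitialConditions(
    utility_selection="extrapolated from human values",
    self_modification="only via verified rewrites",
    value_stability="goal content preserved across rewrites",
)

# Failure analysis in this framing: perturb one answer at a time and ask what
# developmental trajectory follows from the corrupted initial conditions.
for field_name in ("utility_selection", "self_modification", "value_stability"):
    corrupted = replace(intended, **{field_name: "WRONG ANSWER"})
    print(f"trajectory to examine: {corrupted}")
```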
I’m not sure what more can be said about “AGIs that cause worse-than-extinction outcomes are clustered around FAIs in design space”. It’s obvious, isn’t it?
I guess I could write about some FAI approaches being more likely to cause worse-than-extinction outcomes than others. For example, FAIs that are closely related to uploading or try to automatically extract values from humans seem riskier in this regard than FAIs where the values are coded directly and manually. But this also seems obvious and I’m not sure what I can usefully say beyond a couple of sentences.
FWIW, it’s not obvious to me that superhuman environment-optimizers (e.g., AGIs) that obtain their target values from humans through an automatic process (e.g., uploading or value extraction) are more likely to cause worse-than-extinction outcomes than those whose values are coded directly and manually.
Thanks!