AGIs that cause worse-than-extinction outcomes are clustered around FAIs in design space.
Yes, that’s the part I’d like to see developed more. Maybe SI or FHI will get around to it eventually, but in the meantime I wouldn’t mind somebody like Wei Dai taking a crack at it.
Part of the problem in developing the argument is that you need a detailed concept of what a successful FAI design would look like, in order to then consider what similar-but-failed designs would be like.
One approach is to think in terms of the utility function or goal system. Suppose that a true FAI has a utility function combining some long list of elemental values with a scheme for rating their relative importance. Variations away from this miss an essential value, add a false value, and/or get the recipe for combining the elemental values wrong.
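To make that decomposition concrete, here is a toy sketch (my own illustration, not anyone’s actual proposal): treat the utility function as a weighted sum of elemental value functions, so that each failure class corresponds to a different perturbation of that sum. All the names and the aggregation scheme below are assumptions made purely for illustration.

    # Toy model: a "true" utility function as a weighted combination of
    # elemental value functions. Names and the weighted-sum scheme are
    # placeholders, chosen only to make the failure classes visible.

    def u_true(world, values, weights):
        """Weighted-sum aggregation of elemental values (one possible scheme)."""
        return sum(weights[name] * v(world) for name, v in values.items())

    def u_missing(world, values, weights, dropped):
        """Omits an essential elemental value entirely."""
        return sum(weights[n] * v(world) for n, v in values.items() if n != dropped)

    def u_spurious(world, values, weights, fake_value, fake_weight):
        """Adds a false value that the true utility function does not contain."""
        return u_true(world, values, weights) + fake_weight * fake_value(world)

    def u_misweighted(world, values, weights, bad_weights):
        """Keeps the right elemental values but combines them with the wrong weights."""
        return sum(bad_weights[n] * v(world) for n, v in values.items())

Even in this toy framing, each perturbed function shares most of its structure with u_true, which is the sense in which the failed designs sit close to the successful one in design space.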
Another way to fail is to have the values right in principle but then to apply them wrongly in practice. My favorite example: what if the AI thinks that some class of programs is conscious when in fact those programs aren’t? It might facilitate the creation of an upload civilization which is only a simulation of utopia and not actually a utopia. Or it might incorrectly attach moral significance to the nonexistent qualia of programs which aren’t conscious but which fake it. (Though neither of these is really “worse than extinction”. The first one, taken to its extreme, just is extinction, while the worst I can see coming from the second scenario is a type of “repugnant conclusion” in which the conscious beings are made to endure privation for the sake of vast sim-populations that aren’t even conscious.)
Still another way to conceptualize “successful FAI design”, in order to then think about unsuccessful variations, is to think of the FAI as a developmental trajectory. The FAI is characterized by a set of initial conditions, i.e. specific answers to questions such as: how does it select its utility function? how does it self-modify? how does it maintain appropriate stability of values under self-modification? You would then consider what goes wrong down the line if one or more of those answers is wrong.
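The same trick of perturbing a correct design can be applied to this framing as well. A rough sketch, with field names that are purely my own placeholders rather than any real architecture:

    from dataclasses import dataclass

    # Toy framing of the "developmental trajectory" view: the design is pinned
    # down by its initial conditions, i.e. its answers to a few key questions.

    @dataclass
    class InitialConditions:
        value_selection: str     # how the utility function gets chosen
        self_modification: str   # how the system rewrites itself
        value_stability: str     # how values stay stable under self-modification

    def failed_variants(correct: InitialConditions):
        """Enumerate trajectories that get exactly one initial condition wrong."""
        for field in ("value_selection", "self_modification", "value_stability"):
            broken = InitialConditions(**vars(correct))  # copy the correct design
            setattr(broken, field, "WRONG")              # perturb one answer
            yield field, broken

Each variant then stands for a trajectory that starts out looking like the successful one and diverges at the point where the wrong answer starts to matter.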
I’m not sure what more can be said about “AGIs that cause worse-than-extinction outcomes are clustered around FAIs in design space”. It’s obvious, isn’t it?
I guess I could write about some FAI approaches being more likely to cause worse-than-extinction outcomes than others. For example, FAI approaches that are closely related to uploading, or that try to automatically extract values from humans, seem riskier in this regard than approaches where the values are coded directly and manually. But this also seems obvious, and I’m not sure what I can usefully say beyond a couple of sentences.
FWIW, it’s not obvious to me that superhuman environment-optimizers (e.g. AGIs) that obtain their target values from humans via an automatic process (e.g. uploading or value extraction) are more likely to cause worse-than-extinction outcomes than those that obtain them via a manual process (e.g. direct coding).