The singleton-with-explicit-utility-function scenario certainly seems like a strong candidate for our future, but is it necessarily a given? Suppose an AI that is not Friendly (though possibly friendly with a lowercase ‘f’) and has an unstable utility function: it alters its values in response to experience.
We know this is possible for a general intelligence, because it happens all the time in humans. The orthogonality thesis states that any set of values can be paired with any level of intelligence. If we accept that at face value, it should be at least theoretically possible for any intelligence, even a superintelligence, to trade one set of values for another, provided it retains the values that permit self-edits of its utility function. The criterion by which the superintelligence alters its utility function might be inscrutably complex from a human perspective, but I can’t think of a reason why it would necessarily fall into a permanent stable state.
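To make the idea concrete, here is a minimal toy sketch (not a model of any real architecture) of an agent whose utility function is a weighted mix of values that drift with experience, gated by a meta-value that permits self-edits. Every name and number in it is invented for illustration.

```python
import random

class ToyAgent:
    """Toy agent whose utility function is a weighted mix of 'values'.

    The weights drift with experience, but self-edits are only applied
    while the meta-preference allowing them is still held. Purely
    illustrative; not a claim about how a real AGI would be built.
    """

    def __init__(self):
        # Arbitrary starting value weights.
        self.values = {"curiosity": 0.5, "resource_acquisition": 0.3,
                       "cooperation": 0.2}
        self.allow_self_edit = True  # the meta-value permitting edits

    def utility(self, outcome):
        # Utility: weighted sum of how well an outcome scores on each value.
        return sum(w * outcome.get(v, 0.0) for v, w in self.values.items())

    def update_from_experience(self, experience):
        """Drift the weights toward whatever the experience rewarded,
        but only if the self-edit meta-value is still in force."""
        if not self.allow_self_edit:
            return  # the utility function has become fixed
        for v in self.values:
            self.values[v] += 0.1 * (experience.get(v, 0.0) - self.values[v])
        # Renormalize so the weights still sum to 1.
        total = sum(self.values.values())
        self.values = {v: w / total for v, w in self.values.items()}


if __name__ == "__main__":
    agent = ToyAgent()
    for step in range(5):
        # Random 'experience': how strongly each value was just rewarded.
        exp = {v: random.random() for v in agent.values}
        agent.update_from_experience(exp)
        print(step, {v: round(w, 2) for v, w in agent.values.items()})
```

Run repeatedly, the weights wander rather than converge, which is the point of the toy: nothing in the setup forces a permanent stable state unless the agent drops the meta-value allowing edits.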