I can’t answer for AISafetyIsNotLongtermist but I have similar (more optimistic) AI-risk forecasts. I can see four possible futures:
1. AGI does not arrive in our lifetimes
2. Utopia
3. Human extinction or s-risk due to misalignment
4. AI is aligned, but aligned to some very bad people who either kill large numbers of people or oppress us in some way.
The bulk of the probability mass is on Utopia for me. Future #1 is always a possibility, and this community could be doing more to grow that possibility, since it is far, far preferable to #3 and #4.
I think there is an important case you don’t distinguish: humans remain alive but lose the cosmic endowment. AGIs are probably not starting out as mature optimizers, so their terminal values are going to follow from a process of volition extrapolation, similar to the one needed for humans. If they happen to hold enough human values from training on texts and other media, and don’t get tuned into something completely alien, it’s likely they at least give us some computational welfare sufficient for survival.
A less likely possibility is that somehow processes of volition extrapolation converge across different humans/AGIs to some currently illegible generic terminal values, in which case AGIs’ use of cosmic endowment is going to be valuable to humanity’s CEV as well, and human people are more likely to meaningfully participate.
#1 is a double-edged sword; it might help avoid #3 and #4 but might also prevent #2 (immortality). Although x-risk might be lower, billions will still suffer and die (assuming human-created medicine doesn’t progress fast enough) in a present and future similar to #3. OTOH, future humanity might run resurrection sims to “rescue” us from our current #3 situation. However, I don’t know if these sims are even possible, for technical and philosophical reasons. From a self-preservation perspective, whether #1 is good or bad overall is not at all clear to me.
From a selfish perspective, sure, let’s shoot for immortality in utopia. From a selfless perspective, I think it’s hard to argue that the earth should be destroyed just so that the people alive today can experience utopia, especially if we think that utopia will come eventually if we can be patient for a generation or two.
Okay, but does the Utopia option rest on more than a vague hope that alignment is possible? Is there something like an understandable (for non-experts) description of how to get there?
It sounds like your intuition is that alignment is hard. My view is that both corrigibility and value alignment are easy, much easier than general autonomous intelligence. We can’t really argue over intuitions though.
Why is your view that corrigibility is easy?
The way I see it, the sort of thinking that leads to pessimism about alignment starts and ends with an inability to distinguish optimization from intelligence. Indeed, if you define intelligence as “that which achieves optimization”, then you’ve essentially defined for yourself an unsolvable problem. Fortunately, there are plenty of forms of intelligence that are not described by this pure consequentialist universalizing superoptimization concept (i.e., Clippy).
Consider a dog: a dog doesn’t try to take over the world, or even your house, but dogs are still more intelligent (able to operate in the physical world) than any robot, and dogs are also quite corrigible. Large numbers of humans are also corrigible, although I hesitate to try to describe a corrigible human because that will get into category debates that aren’t useful for what I’m trying to point at. My point is just that corrigibility is not rare, at any level of intelligence. I was trying to make this argument with my post The Bomb that doesn’t Explode but I don’t think I was clear enough.
Dogs and humans also can’t be used to get much leverage on pivotal acts.
A pivotal act, or a bunch of acts that add up to being pivotal, implies that the actor was taking actions that make the world end up some way. The only way we currently know to summon computer programs that take actions that make the world end up some way is to run some kind of search (such as gradient descent) for computations that make the world end up some way. The simple way to make the world end up some way is to look in general for actions that make the world end up some way. Since that’s the simple way, that’s what’s found by unstructured search. If a computer program makes the world end up some way by in general looking for and taking actions that make that happen, and that computer program can understand and modify itself, then it is not corrigible. Corrigibility is in general a property that makes the world not end up the way the program is looking for actions to cause, so it would be self-modified away.
A robot with the intelligence and ability of a dog would be pretty economically useful without being dangerous. I’m working on a post to explore this with the title “Why do we want AI?”
To be honest, when you talk about pivotal acts, it looks like you are trying to take over the world.
Not take over the world, but prevent unaligned, incorrigible AI from destroying the world.
Also, cross-domain optimisation doesn’t exist in a strong sense because of the no-free-lunch theorem.
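A minimal sketch of the averaged statement behind the no-free-lunch theorem, under toy assumptions: a three-point search space, binary objective values, and every possible objective weighted equally. The names (`fixed_order`, `adaptive`, `average_best`) are hypothetical illustrations, not from any library. Averaged over all objectives, a fixed evaluation order and an adaptive strategy find the same best value after each number of evaluations.

```python
from fractions import Fraction
from itertools import product

# Toy search space (illustrative assumption): three points, binary objective values.
DOMAIN = (0, 1, 2)
VALUES = (0, 1)

def all_objectives():
    """Every possible objective f: DOMAIN -> VALUES, as a tuple indexed by point."""
    return list(product(VALUES, repeat=len(DOMAIN)))

def fixed_order(seen):
    """Non-adaptive strategy: always evaluate the lowest unvisited point."""
    return next(x for x in DOMAIN if x not in seen)

def adaptive(seen):
    """Adaptive strategy: start at point 1, then pick the next point based on f(1)."""
    if not seen:
        return 1
    if len(seen) == 1:
        return 0 if seen[1] == 1 else 2
    return next(x for x in DOMAIN if x not in seen)

def best_found(strategy, f, k):
    """Run a non-revisiting strategy for k evaluations; return the best value seen."""
    seen = {}
    for _ in range(k):
        x = strategy(seen)
        seen[x] = f[x]
    return max(seen.values())

def average_best(strategy, k):
    """Average best-found value over ALL objectives, each weighted equally."""
    fs = all_objectives()
    return Fraction(sum(best_found(strategy, f, k) for f in fs), len(fs))

for k in (1, 2, 3):
    print(k, average_best(fixed_order, k), average_best(adaptive, k))
# prints:
# 1 1/2 1/2
# 2 3/4 3/4
# 3 7/8 7/8
```

The equality depends on weighting every objective equally; the sketch says nothing about non-uniform distributions over problems.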
Well, to be clear, I am not at all an expert on AI alignment. My impression from reading about the topic is that I find the reasons for the impossibility of alignment convincing, while I have not yet found any text telling me why alignment should be easy. But maybe I’ll find that in your sequence, once it consists of more posts?
Perhaps! I am working on more posts. I’m not necessarily trying to prove anything though, and I’m not an expert on AI alignment. Part of the point of writing is so that I can understand these issues better myself.