That seems like extremely limited, human thinking. If we’re assuming a super powerful AGI, capable of wiping out humanity with high likelihood, it is also almost certainly capable of accomplishing its goals despite our theoretical attempts to stop it without needing to kill humans. The issue, then, is not fully aligning AGI goals with human goals, but ensuring it has “don’t wipe out humanity, don’t cause extreme negative impacts to humanity” somewhere in its utility function. Probably doesn’t even need to be weighted too strongly, if we’re talking about a truly powerful AGI. Chimpanzees presumably don’t want humans to rule the world—yet they have made no coherent effort to stop us from doing so, probably haven’t even realized we are doing so, and even if they did we could pretty easily ignore it.
“If something could get in the way (or even wants to get in the way, whether or not it is capable of trying) I need to wipe it out” is a sad, small mindset and I am entirely unconvinced that a significant portion of hypothetically likely AGIs would think this way. I think AGI will radically change the world, and maybe not for the better, but extinction seems like a hugely unlikely outcome.
Why would it “want” to keep humans around? How much do you care about whether or not you move dirt while you drive to work? If you don’t care about something at all, it won’t factor into your choice of actions.[1]
[1] I know I phrased this tautologically, but I think the idiom will be clear. If not, just press me on it more. I think this is the best way to get the message across or I wouldn’t have done it.
Some sort of general value for life, or a preference for decreased suffering of thinking beings, or the off chance we can do something to help (which I would argue is almost exactly the same low chance that we could do something to hurt it). I didn’t say there wasn’t an alignment problem, just that an AGI whose goals don’t perfectly align with those of humanity in general isn’t necessarily catastrophic. Utility functions tend to have a lot of things they want to maximize, with different weights. Ensuring one or more of the above ideas is present in an AGI is important.
I think that if we can reliably incorporate that into a machine’s utility function, we’d be most of the way to alignment, right?
I gather the problem is that we cannot reliably incorporate that, or anything else, into a machine’s utility function: if it can change its source code (which would be the easiest way for it to bootstrap itself to superintelligence), it can also change its utility function in unpredictable ways. (Not necessarily on purpose, but the utility function can take collateral damage from other optimizations.)
I’m glad you started this thread: to someone like me who doesn’t follow AI safety closely, the argument starts to feel like, “Assume the machine is out to get us, and has an unstoppable ‘I Win’ button...” It’s worth knowing why some people think those are reasonable assumptions, and why (or if) others disagree with them. It would be great if there was an “AI Doom FAQ” to cover the basics and get newbies and dilettantes up to speed.
I’d recommend https://www.lesswrong.com/posts/LTtNXM9shNM9AC2mp/superintelligence-faq as a good starting point for newcomers.
An excellent primer—thank you! I hope Scott revisits it someday, since it sounds like recent developments have narrowed the range of probable outcomes.
If humans are capable of building one AGI, they would certainly be capable of building a second one, which could have goals unaligned with the first one.
I assume that any unrestrained AGI would pretty much immediately exert enough control over the mechanisms through which an AGI might take power (say, the internet, nanotech, whatever else it thinks of) to ensure that no other AI could do so without its permission. I suppose it is plausible that humanity is capable of threatening an AGI through the creation of another, but that seems rather unlikely in practice. First-mover advantage is incalculable to an AGI.