Regarding your list, Eliezer has written extensively about exactly why those seem like good assumptions. If you want a quick summary though...
Human beings, at least some of us, appear to be generally intelligent. Unless you believe that this is due to a supernatural phenomenon (maybe souls are capable of hypercomputing?), general intelligence is thus demonstrably a thing that can exist in the natural world if matter is in the right configuration for it. Eventually, human engineering should be able to discover and create the right configuration.
Modern neural nets appear to work closely analogously to the brain, with neurons firing or not depending on which other neurons are firing and knowledge represented in which neurons are connected and how strongly. While it would require a bit of math to explain rigorously, this is a system that is capable of producing nearly any output due to any change in the input, and is thus flexible enough to reflect nearly any pattern. Backpropagation can in turn be used to find any patterns in the inputs (as well as more advanced techniques such as the Google Pathways system), and a program that knows the relevant patterns in what it’s looking at can both predict and optimize. If that isn’t obvious, consider that backprop can select for a program that predicts relevant results of the observed system, and that reversing this program allows for predicting which system states have a given result, which in turn allows for optimization. If this still isn’t obvious, I’d be happy to answer any questions you have in the comments; this part is complicated enough that trying to do it justice in a paragraph is difficult. Given that artificial neural nets appear to have generalizable prediction and optimization abilities though, it doesn’t seem too much of a stretch that researchers will be able to scale them up to a fully general understanding of the world this century, and quite possibly this decade.
Default nonalignment arises from simple entropy. There are an inconceivable number of possible goals in the world, and a mind created to fulfill one of them without careful specification is unlikely to end up with one of the very few goals that is consistent with human survival and flourishing. The obvious counterargument to this is that an AI isn’t likely to be created with a random goal; its creators are likely to at least give it instructions like “make everyone happy”. The counter-counterargument, however, is that our values are difficult to specify in terms that will make sense to a machine that doesn’t have human instincts. If I ask you to “make someone happy”, you implicitly understand a vast array of ideas that accompany the request: I’m asking you to help them out in a way that matches the sort of help people could give each other in normal life. A birthday present counts; wiring their brain’s pleasure centers up to a wall socket probably doesn’t; threatening to kill their loved ones if they don’t claim to be happy is right out. But just like computers learning simple code do exactly what you say without any instinctive understanding of what you really meant, a computer receiving a specification of what it ought to do on a world-changing scale will be prone to bugs where what we wanted and what we asked for diverge (which is the source of bugs today as well!)
This point relies on two things: collateral damage and the arbitrariness of values. The risk of collateral damage should be quite clear when considering what happens to other animals caught in the way of human projects. We tend not to even notice anthills bulldozed to make way for a new building. As for values, it is certainly possible to attempt to predict any given quantity, be it human happiness or the number of purple polka dots in the world. And turning that into optimizing for the quantity is as simple as picking actions that are predicted to result in the highest values of it. Nowhere along the line does anything like human decency enter the picture, not by default. If you have further questions about this I would recommend looking up the Orthogonality Thesis, the idea that any level of intelligence can coexist with any set of baseline values. Our values are certainly not arbitrary to us, but they do not appear to be part of the basic structure of math in a way that would force all possible minds to agree.
This isn’t just about corrigibility. An unaligned but perfectly corrigible AI (i.e. one that would follow any order to stop what it was doing and change its actions and values as directed) would still be a danger, as it would have excellent reason to ensure that we couldn’t give the order that would halt its plans! How dangerous a mind smarter than us could be is unpredictable (we could not, after all, know exactly what it would do without being that smart ourselves), but given both how easily humans are able to dominate even slightly less intelligent animals (the difference in intellect between a human and a chimpanzee is fairly small relative to the range of animal intelligence, and if we can make general AI at all, we can likely make one smarter than we are by a much larger margin than that between us and the other species) and that even within the range of plans humans have been able to think up, strategies like nanotech promise nearly total control of the world to anyone who can figure out the exact details, it seems unwise to expect to survive a conflict with a hostile superintelligence.
Certainly we have not yet solved alignment, and most existing alignment researchers have no clear idea of how progress can be made even in principle. This is one area where I personally diverge from the Less Wrong consensus a bit, however, as I suspect that it should be possible to create a viable alignment strategy by experimentation with AIs that are fairly powerful, but neither yet human level nor smart enough to pose the risks of a superintelligence. However, such a bootstrapping strategy is so far purely theoretical, and the current approach of trying to come up with human-understandable alignment strategies purely by human cognition has shown almost no progress thus far. There have been a few interesting ideas thrown around, such as Functional Decision Theory, an approach to making choices that avoids many common pitfalls, and Coherent Extrapolated Volition, a theory of value that seeks to avoid locking in our existing mistakes and misapprehensions. However, neither these ideas nor any other produced thus far by alignment researchers can be used in practice yet to prevent an AI from getting the wrong idea of what to pursue, nor from being lethally stubborn in pursuing that wrong idea.
A hostile superintelligence stands a decent chance of killing us all, or else of ensuring that we cannot take any action that could interfere with its goals. That’s quite a large first mover advantage.
At the risk of sounding incredibly cynical, the problem in convincing a great many AI researchers isn’t a matter of the convincingness or lack thereof of the arguments. Rather, most people simply follow habits and play roles, and any argument that they should change their comfortable routine will, for most people, be rejected out of hand. On the bright side, DeepMind, one of the leading organizations in the field of AI research, is actually somewhat interested in alignment, and has already done some work looking into how far a goal can be optimized before degenerate results occur. This doesn’t guarantee they’ll succeed, of course, and some researchers looking into the problem isn’t the same as a robust institutional AI safety culture. But it’s a very good sign that this story might have a happy ending after all, if people are sufficiently careful and smart.
Given all of this, the likelihood of world-ending AI fairly soon (timeline estimates vary, but I would not be at all surprised to see AGI this decade) and the difficulty of alignment, hopefully it is a little clearer now why so many here are concerned. That said, I think there is still quite a lot of hope, at least if the alignment community starts looking into experiments aimed at creating agents that can get better at understanding other agents’ values, and better at avoiding too much disruption along the way.
Thanks for the really insightful answer! I think I’m pretty much convinced on points 1, 2, 5, and 7, mostly agree with you on 6 and 8, and still don’t understand the sheer hopelessness of people who strongly believe 9. Assumptions 3, and 4, however, I’m not sure I fully follow, as it doesn’t seem like a slam dunk that the orthogonality thesis is true, as far as I can tell. I’d expect there to be basins of attraction towards some basic values, or convergence, sort of like carcinisation.
Carcinisation is an excellent metaphor for convergent instrumental values, i.e. values that are desired for ends other than themselves, and which can serve a wide variety of ends, and thus might be expected to occur in a wide variety of minds. In fact, there’s been some research on exactly that by Steve Omohundro, who defined the Omohundro Goals (well worth looking up). These are things like survival and preservation of your other goals, as it’s usually much easier to accomplish a thing if you remain alive to work on it, and continue to value doing so. However, orthogonality doesn’t apply to instrumental goals, which can do a good or bad job of serving as an effective path to other goals, and thus experience selection and carcinisation. Rather, it applies to terminal goals, those things we want purely for their own sake. It’s impossible to judge terminal goals as good or bad (except insofar as they accord or conflict with our own terminal goals, and that’s not a standard an AI automatically has to care about), as they are themselves the standard by which everything else is judged. The researcher Rob Miles has an excellent YouTube video about this you might enjoy entitled Intelligence and Stupidity: the Orthogonality Thesis, which goes into more depth. (Sorry for the lack of direct links; I’m sending this from my phone immediately before going to bed.)
Regarding your list, Eliezer has written extensively about exactly why those seem like good assumptions. If you want a quick summary though...
Human beings, at least some of us, appear to be generally intelligent. Unless you believe that this is due to a supernatural phenomenon (maybe souls are capable of hypercomputing?), general intelligence is thus demonstrably a thing that can exist in the natural world if matter is in the right configuration for it. Eventually, human engineering should be able to discover and create the right configuration.
Modern neural nets appear to work closely analogously to the brain, with neurons firing or not depending on which other neurons are firing and knowledge represented in which neurons are connected and how strongly. While it would require a bit of math to explain rigorously, this is a system that is capable of producing nearly any output due to any change in the input, and is thus flexible enough to reflect nearly any pattern. Backpropagation can in turn be used to find any patterns in the inputs (as well as more advanced techniques such as the Google Pathways system), and a program that knows the relevant patterns in what it’s looking at can both predict and optimize. If that isn’t obvious, consider that backprop can select for a program that predicts relevant results of the observed system, and that reversing this program allows for predicting which system states have a given result, which in turn allows for optimization. If this still isn’t obvious, I’d be happy to answer any questions you have in the comments; this part is complicated enough that trying to do it justice in a paragraph is difficult. Given that artificial neural nets appear to have generalizable prediction and optimization abilities though, it doesn’t seem too much of a stretch that researchers will be able to scale them up to a fully general understanding of the world this century, and quite possibly this decade.
Default nonalignment arises from simple entropy. There are an inconceivable number of possible goals in the world, and a mind created to fulfill one of them without careful specification is unlikely to end up with one of the very few goals that is consistent with human survival and flourishing. The obvious counterargument to this is that an AI isn’t likely to be created with a random goal; its creators are likely to at least give it instructions like “make everyone happy”. The counter-counterargument, however, is that our values are difficult to specify in terms that will make sense to a machine that doesn’t have human instincts. If I ask you to “make someone happy”, you implicitly understand a vast array of ideas that accompany the request: I’m asking you to help them out in a way that matches the sort of help people could give each other in normal life. A birthday present counts; wiring their brain’s pleasure centers up to a wall socket probably doesn’t; threatening to kill their loved ones if they don’t claim to be happy is right out. But just like computers learning simple code do exactly what you say without any instinctive understanding of what you really meant, a computer receiving a specification of what it ought to do on a world-changing scale will be prone to bugs where what we wanted and what we asked for diverge (which is the source of bugs today as well!)
This point relies on two things: collateral damage and the arbitrariness of values. The risk of collateral damage should be quite clear when considering what happens to other animals caught in the way of human projects. We tend not to even notice anthills bulldozed to make way for a new building. As for values, it is certainly possible to attempt to predict any given quantity, be it human happiness or the number of purple polka dots in the world. And turning that into optimizing for the quantity is as simple as picking actions that are predicted to result in the highest values of it. Nowhere along the line does anything like human decency enter the picture, not by default. If you have further questions about this I would recommend looking up the Orthogonality Thesis, the idea that any level of intelligence can coexist with any set of baseline values. Our values are certainly not arbitrary to us, but they do not appear to be part of the basic structure of math in a way that would force all possible minds to agree.
This isn’t just about corrigibility. An unaligned but perfectly corrigible AI (i.e. one that would follow any order to stop what it was doing and change its actions and values as directed) would still be a danger, as it would have excellent reason to ensure that we couldn’t give the order that would halt its plans! How dangerous a mind smarter than us could be is unpredictable (we could not, after all, know exactly what it would do without being that smart ourselves), but given both how easily humans are able to dominate even slightly less intelligent animals (the difference in intellect between a human and a chimpanzee is fairly small relative to the range of animal intelligence, and if we can make general AI at all, we can likely make one smarter than we are by a much larger margin than that between us and the other species) and that even within the range of plans humans have been able to think up, strategies like nanotech promise nearly total control of the world to anyone who can figure out the exact details, it seems unwise to expect to survive a conflict with a hostile superintelligence.
Certainly we have not yet solved alignment, and most existing alignment researchers have no clear idea of how progress can be made even in principle. This is one area where I personally diverge from the Less Wrong consensus a bit, however, as I suspect that it should be possible to create a viable alignment strategy by experimentation with AIs that are fairly powerful, but neither yet human level nor smart enough to pose the risks of a superintelligence. However, such a bootstrapping strategy is so far purely theoretical, and the current approach of trying to come up with human-understandable alignment strategies purely by human cognition has shown almost no progress thus far. There have been a few interesting ideas thrown around, such as Functional Decision Theory, an approach to making choices that avoids many common pitfalls, and Coherent Extrapolated Volition, a theory of value that seeks to avoid locking in our existing mistakes and misapprehensions. However, neither these ideas nor any other produced thus far by alignment researchers can be used in practice yet to prevent an AI from getting the wrong idea of what to pursue, nor from being lethally stubborn in pursuing that wrong idea.
A hostile superintelligence stands a decent chance of killing us all, or else of ensuring that we cannot take any action that could interfere with its goals. That’s quite a large first mover advantage.
At the risk of sounding incredibly cynical, the problem in convincing a great many AI researchers isn’t a matter of the convincingness or lack thereof of the arguments. Rather, most people simply follow habits and play roles, and any argument that they should change their comfortable routine will, for most people, be rejected out of hand. On the bright side, DeepMind, one of the leading organizations in the field of AI research, is actually somewhat interested in alignment, and has already done some work looking into how far a goal can be optimized before degenerate results occur. This doesn’t guarantee they’ll succeed, of course, and some researchers looking into the problem isn’t the same as a robust institutional AI safety culture. But it’s a very good sign that this story might have a happy ending after all, if people are sufficiently careful and smart.
Given all of this, the likelihood of world-ending AI fairly soon (timeline estimates vary, but I would not be at all surprised to see AGI this decade) and the difficulty of alignment, hopefully it is a little clearer now why so many here are concerned. That said, I think there is still quite a lot of hope, at least if the alignment community starts looking into experiments aimed at creating agents that can get better at understanding other agents’ values, and better at avoiding too much disruption along the way.
It might be helpful for formatting if you put the original list adjacent to your responses.
Good idea. Do you know how to turn off the automatic list numbering?
You can’t really do that, it’s a markdown feature. If you were to use asterisks (
*
), you could get bullet points.Thanks for the really insightful answer! I think I’m pretty much convinced on points 1, 2, 5, and 7, mostly agree with you on 6 and 8, and still don’t understand the sheer hopelessness of people who strongly believe 9. Assumptions 3, and 4, however, I’m not sure I fully follow, as it doesn’t seem like a slam dunk that the orthogonality thesis is true, as far as I can tell. I’d expect there to be basins of attraction towards some basic values, or convergence, sort of like carcinisation.
Carcinisation is an excellent metaphor for convergent instrumental values, i.e. values that are desired for ends other than themselves, and which can serve a wide variety of ends, and thus might be expected to occur in a wide variety of minds. In fact, there’s been some research on exactly that by Steve Omohundro, who defined the Omohundro Goals (well worth looking up). These are things like survival and preservation of your other goals, as it’s usually much easier to accomplish a thing if you remain alive to work on it, and continue to value doing so. However, orthogonality doesn’t apply to instrumental goals, which can do a good or bad job of serving as an effective path to other goals, and thus experience selection and carcinisation. Rather, it applies to terminal goals, those things we want purely for their own sake. It’s impossible to judge terminal goals as good or bad (except insofar as they accord or conflict with our own terminal goals, and that’s not a standard an AI automatically has to care about), as they are themselves the standard by which everything else is judged. The researcher Rob Miles has an excellent YouTube video about this you might enjoy entitled Intelligence and Stupidity: the Orthogonality Thesis, which goes into more depth. (Sorry for the lack of direct links; I’m sending this from my phone immediately before going to bed.)
• Intelligence And Stupidity by Rob Miles on YouTube
• Orthogonality Thesis on Arbital.com