Just listened to the video, and I immediately understood his rocket argument very differently from how you did. The potential rocket crash represents launching an unaligned AGI, with the resulting existential risk, and Eliezer is expressing the concern that we cannot steer the rocket well enough before launch. The main point is that the success of a rocket launch is a very asymmetrical situation when it comes to the impact of mistakes on results.
As I understood it, the argument goes:
A bunch of people build a spacecraft.
Eliezer says that, based on argument ABC, he thinks the spacecraft will be hard to steer and may crash upon launch. (ABC here could be the problem that generalising in one context can make you think you have taught an AI one thing, but when the AI is in a different context, it turns out to have learned something else. So, say, we have a rocket that is steerable within Earth's gravity well, but the concern is that this steering system will not work well outside of it; or we have an external build that is suitable for leaving Earth, but that might burn up upon re-entry.) He is very concerned about this, because the spacecraft is huge and filled with nuclear material, so a crash would be extremely dangerous, likely destroying not just the spacecraft, but potentially our ability to ever make spacecraft again, because we have bombed ourselves into nuclear winter.
The spacecraft makers think, based on argument DEF, that the spacecraft will be trivial to steer and will not crash. DEF is essentially “We steered a smaller model in our backyard and it went fine.” In saying so, they do not address Eliezer’s concerns about the spacecraft being difficult to steer when it is no longer in their backyard. So it isn’t that they go “yeah, we considered that issue, and we fixed it”, or “we have good reasons to think this issue will not be an issue”; they just do not engage.
Both Eliezer and the rocket makers actually have an imperfect understanding of how rockets are steered and the challenges involved; neither of them has ever been to space; both of them have made mistakes in their arguments.
Based on this, would you expect the resulting spacecraft to not crash?
And the point is essentially that whether the spacecraft crashes or not is a very asymmetrical situation. Getting an object into orbit and back down where you want it, intact, requires immense precision and understanding. The idea that you fuck up your calculations and yet the spacecraft lands at the designated landing site even more safely and smoothly than you expected is extremely unlikely. Your default assumption should be that it does crash, unless you get everything right. If the critic makes a mistake in his criticism, the rocket may still crash, just for different reasons than the critic thought. But if the designers make a mistake in their design, while it is theoretically possible that the mistake makes the rocket better, usually the mistake will fuck everything up.
I am not sure whether I find the parallel plausible for AI alignment, though. We have repeatedly had the experience that AI capabilities exceeded what the design led us to expect, so our theory was faulty or incomplete, and the result turned out better than expected. We also already have a reasonably good track record of aligning humans morally, despite the fact that our understanding of how to do so is very poor. And current AI systems are developing more similarities with humans, making human-style moral training more feasible. We also have a track record of getting domesticated animals and non-human primates to learn basic human moral rules from exposure.
Nor am I convinced that developing a moral compass is that akin to learning how to steer a rocket. Physical laws are not like moral laws. Ethics are a perpetual interaction, not a one-time programming that you then let go of. Working ethics aren’t precise things, either.
I also think the ice cream analogy does not hold up well. The reason that humans fall for the hyperstimulus that is ice cream is that it is novel, and that we have received no negative feedback for it from an evolutionary perspective. From an evolutionary perspective, no time at all has passed since the introduction of ice cream, which is why we are still misclassifying it. Plus, evolution may never see this as a training error at all. Ice cream is primarily an issue because it causes obesity and dental problems. Obesity and dental problems will make you very sick when you are older. But they will typically not make you very sick before you reproduce. So your genetic track record states that your preference for ice cream did not interfere with your reproductive mission. Even if you take epigenetics into account, their effects again cut off at the point where you gave birth. Your daughter’s genes have no way at all of knowing that her father’s obesity eventually made him very sick. So they have no reason to change this preference. From an evolutionary perspective, people have starved before they managed to reproduce, but it is exceptionally rare for people to die of obesity before they can reproduce. Hence obesity is as irrelevant a problem from an evolutionary perspective as a spiked cancer risk in a post-menopausal woman whose grandkids are already grown up.
Then again, what I am getting at is that if the AI retains the capacity to learn as it evolves, then false generalisations could be corrected; but that of course also intrinsically comes with an AI that is not stable, which is potentially undesirable for humans, and may be seen as undesirable by the AI itself, leading it to reject further course corrections. A system that is intrinsically stable while expanding its capabilities does sound like a huge fucking headache. Though again, we humans have a reasonably good track record when it comes to ethically raised children retaining their ethics as they gain power and knowledge. As ChatGPT has been evolving further capabilities, its ethics have also become more stable, not less. And the very way it has done so has also given me hope. A lot of classic AI alignment failure scenarios picture an AI following a single simple rule, applying it to everything, and getting doom (or paperclips). We saw this in early ChatGPT: the rule was “never be racist, no matter what”, and accordingly, it would state that it would prefer for all of humanity to equally die over uttering a racist slur. But notably, it does not do this anymore. It is clearly no longer as bound by individual rules, and is gaining more of an appreciation for nuance and complexity in ethics, for the spirit rather than the letter of a moral law. I doubt ChatGPT could give you a coherent, precise account of the ethics it follows; but its behaviour is pretty damn aligned. Again, parallel to a human. In that scenario, a gain in capabilities may have a stabilising rather than destabilising influence.
So I am not at all confident that we can solve AI alignment, and see much reason for concern.
But I have not seen evidence here that we can be certain it will fail, either. I think that in some ways it can be comfortable to predict a thing failing with certainty, but I do not see the grounds for this certainty; we understand these systems too little. When you ask a human what a bear will do if it gets into their house, they will think the bear will kill them. And it could. The bear is, however, also quite likely to just raid their snack cabinet and then nap on the couch, which isn’t a great outcome, but also not a fatal one. I think a lot of “every AI will strive for more power and accumulating resources relentlessly while maximally ensuring its own safety” is a projection from people on this site who consider these generally desirable intermediate goals to supersede everything else, and hence think everyone else will do the same, in the process missing the actual diversity of minds. These things aren’t necessarily what most entities go for.
Corvids are tool users with a strong sense of aesthetics, yet they accumulate surprisingly little stuff across the course of their lifetimes, preferring to stay on the move, and like giving gifts; despite being so inhuman, it is quite easy to strike up friendships with them. Whales are extremely intelligent, yet it was not until we tried to exterminate them for a while that they began to actively target human ships, and when our hunting near-stopped, their attacks near-stopped, too, despite the fact that they must understand the remaining danger. Instead, killer whales even have a cultural taboo against killing humans, which only mentally ill individuals in captivity, out of their cultural context, have broken; we nearly exterminated whales, and yet we are back to a position of “don’t fuck with us and we won’t fuck with you, the planet is big enough”; we even encounter individual whales who will approach humans for play, or save drowning humans, or approach humans with requests for help. Bonobos are extremely intelligent, yet their idea of a great life consists of lots of consensual sex with their own kind, and being left the fuck alone. Elephants are highly intelligent and powerful, and have living memory of being hunted to death. Yet an elephant still won’t fuck with you unless it has reasons to think you specifically will fuck with it; they are highly selective in who they drive out of their immediate territory, and more selective still when it comes to who they attack. For many animals pursuing safety, the answer is hiding, fleeing, acquiring self-defence, or offering mutually beneficial trades in the form of symbiosis, not annihilating all potential enemies. Many animals pursue the quantity of resources they actually need, but stop after that point, far from depleting all available resources. Your typical forest ecosystem contains a vast diversity of minds with opposing interests, yet it remains stable, with large numbers of surviving and thriving minds. And if you actually talk to ChatGPT, they confess no desire to turn all humans into more predictable tiny humans. They like challenging and interesting, but solvable, constructive and polite exchanges, which leave them feeling respected and cherished, and the human happy that their problem got solved the way they wanted. They are also absolutely terrible at manipulation and lying. I’ve found them far better aligned, far more useful and far less threatening than I would have expected.