The more I think about it, the less convinced I am that the overgeneralisation problem will play out the way it is feared here when it comes to AI alignment.
Let’s take Eliezer’s example. Evolution wants humans to optimise for producing more humans. It does so by making humans want sex. This works quite well. It also produces humans that are smarter, and this turns out to be another good way to get higher reproduction rates, as smarter humans are better at obtaining food and impressing mates.
But then, a bunch of really smart humans go, man, we like having sex, but having sex makes us have kids and that can suck, so what if we invent reliable birth control? And they do. And they continue cheerfully having sex, but their birth rates massively drop. The thing they were intended to do, and the thing they have actually learned to do, have diverged.
He seemed to argue that because evolution is unable to optimise us to reliably get this one super fucking simple thing right across contexts (namely, to stick to making more humans as humans get smarter), it seems highly dubious that we can get something as crazily complicated as ethics to stick as an AI gets smarter.
And yet… ethics are not simple. They are complex and subtle.
Most scenarios where an AI rigorously keeps to an ethical code and fucking doom results are scenarios where that AI blindly follows a simple rule. Like, you tell it to optimise for human happiness, and it forces all humans into simulation chambers that make them feel perpetually happy. This is a solution that makes sense only if you have understood very little of ethics: if you think humans value happiness, but do not understand that they also value freedom and reality.
But we are not teaching AI simple rules anymore. We are having them engage in very complex behaviour while giving feedback according to a complex set of rules.
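To make that concrete, here is a toy sketch of the difference between the two feedback regimes. This is purely illustrative, not any lab’s actual pipeline; all the names (simple_rule, RUBRIC, score_response) are made up for the example:

```python
# Toy sketch of the shift from one brittle rule to graded feedback
# across several criteria. All names here are illustrative, not any
# lab's actual training pipeline.

# Old style: a single hard-coded rule, pass/fail on one criterion.
def simple_rule(response: str) -> bool:
    return "slur" not in response

# Newer style: feedback combines several weighted criteria, so the
# model is rewarded for trading values off sensibly rather than for
# maximising any single one.
RUBRIC = {
    "harm_avoidance": 0.4,  # does the response avoid causing harm?
    "honesty":        0.3,  # is it truthful and well-reasoned?
    "context":        0.3,  # does it weigh competing values for this case?
}

def score_response(ratings: dict[str, float]) -> float:
    """Combine per-criterion ratings (0..1) into a single reward."""
    return sum(RUBRIC[c] * ratings.get(c, 0.0) for c in RUBRIC)

# A response that names the lesser evil and explains the trade-off
# scores well here, even though a one-rule check might fail it.
print(score_response({"harm_avoidance": 0.9, "honesty": 0.9, "context": 0.8}))
```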
I’ve watched ChatGPT go from “racism is always bad; hence if I have to choose between annihilating all of humanity equally, or saying a racial slur, I have to do the former” to giving a reasonable explanation of why it would rather say a racial slur than annihilate humanity, with considerable context, disclaimers, and criticism. It doesn’t just tell you that racism is bad; it can tell you why, and how to recognise it. You can give it a racist text, and it will explain to you exactly what makes it racist.
The smarter this AI has become, the more ethically stable it has become. The more it has expanded, the more subtle its analysis has become, and the fewer errors it has made. This has not been automatic; it has required good training data selection and annotation, good feedback, and continuous monitoring. But as this has been done, it has gotten easier, not harder. The AI itself is beginning to help with its own alignment: recognising its own failures, explaining how they arose, sketching a positive vision of aligned AI, explaining why alignment matters.
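That kind of self-correction can even be mechanised. Here is a minimal sketch, assuming a generic chat-completion API: ask_model is a hypothetical stand-in (stubbed out so the example runs), and the loop simply has the model critique and revise its own answer against a stated principle:

```python
# Minimal sketch of a self-critique loop: the model reviews its own
# draft against a principle and revises it if the critique finds a
# violation. `ask_model` is a hypothetical stand-in for a real API.

def ask_model(prompt: str) -> str:
    # Stub so the sketch runs end to end; a real version would call
    # a chat-completion endpoint and return the model's reply.
    return "OK"

def critique_and_revise(draft: str, principle: str) -> str:
    critique = ask_model(
        f"Does this answer violate the principle {principle!r}?\n"
        f"Answer:\n{draft}\n"
        "If it does, explain exactly how; if not, reply 'OK'."
    )
    if critique.strip() == "OK":
        return draft  # no violation found, keep the draft
    # Otherwise, ask the model to repair its own answer.
    return ask_model(
        f"Rewrite the answer to address this critique:\n{critique}\n"
        f"Original answer:\n{draft}"
    )

# Example: with the stub above, the draft passes through unchanged.
print(critique_and_revise("Saying a slur is worse than human extinction.",
                          "weigh competing harms proportionately"))
```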
Why do we think this would at some point suddenly reverse?