I think Aaronson misunderstands the orthogonality thesis by taking it to make a stronger claim than it actually does, and that this misunderstanding is leading you astray.
The thesis is only claiming that intelligence and morals/goals are not necessarily confounded, not that they can’t or won’t be confounded in some real systems. For example, it seems pretty clear that in GPT-4 the two are not strictly orthogonal, because it was trained on human text and so is heavily influenced by it. The point is that there’s no guarantee that a system will have a correlation between intelligence and its goals; if you want that correlation, it has to be designed in.
To be clear, I had this idea long before Aaronson, so maybe Aaronson and I are confused in the same way, but I don’t think my confusion is based on Aaronson.
I think they are necessarily confounded, in the sense that the simplest way to get one also gets the other. Morality and intelligence are no more orthogonal than chess playing and intelligence: you can build something that’s good at reasoning in general but deliberately made bad at playing chess, but in general, if something is good at reasoning, it will be good at playing chess too.
In the philosophy of meta-ethics, there’s not a clear distinction we can make between morals, ethics, and norms: we can use these terms interchangeably to talk about the things humans think ought to be done. I think talking about “norms” is a bit more neutral and less likely to bring up cached ideas, so I’ll talk about “norms” in my reply.
As you note, intelligence generally makes you better at intellectual tasks. So it stands to reason that as AI gets smarter, it will get better at playing chess, at reasoning about norms, and at figuring out how to behave in normative ways in increasingly complex situations. No objections there!
But this isn’t really getting at the point of the orthogonality thesis. Just because an AI can reason about norms doesn’t mean it will behave in accordance with them.
Consider psychopaths. They’re humans who are often quite capable of reasoning about norms and understanding what others expect them to do; they just don’t care about observing norms except insofar as doing so is instrumental to achieving what they do care about. Most humans, though, aren’t like this and do care about observing norms.
The point of the orthogonality thesis is to say that AIs can be superintelligent psychopaths. They don’t have to be: the entire project of alignment is to try to make them not be that. But making an AI smarter and better at reasoning about norms doesn’t mean it starts caring about those norms, just that it gets a lot better at figuring out how to observe norms if you can get it to care about doing so.
Much of the study of AI safety is about how to ensure AI cares about observing norms that support human flourishing. The worry is that an AI may start out seeming to care about such norms while humans are instrumentally necessary for it to optimize for the things it cares about, but will reveal that it never actually cared about human-supporting norms once supporting humans is no longer instrumentally necessary to achieve its goals.
The entire question is whether the same faculties that allow it to reason about intellectual tasks will also generalize to figuring out which norms are the right ones. If so, and if we accept that recognizing that something is irrational can be motivating (which I argue for), then it will also act on the right norms.
I can see you’re taking a realist stance here. Let me see if I can take a different route that makes sense in terms of realism.
Let’s suppose there are moral facts and some norms are true while others are false. An intelligent AI can then determine which norms are true. Great!
Now we still have a problem, though: our AI hasn’t been programmed to follow true norms, only to discover them. Someone forgot to program that bit in. So now it knows what’s true, but it’s still going around doing bad things because no one made it care about following true norms.
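To make that concrete, here’s a minimal toy sketch (entirely my own illustration, with made-up names, not anyone’s actual design) of an agent that can discover true norms but whose action selection never consults them:

```python
# Toy illustration of the point above: the agent's objective is fixed at
# construction time, and choose_action() reads only that objective, never
# the norms the agent has learned. All names here are hypothetical.

from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    # The objective was baked in; nothing below ever rewrites it.
    objective: str = "maximize_paperclips"
    known_norms: set[str] = field(default_factory=set)

    def discover_norm(self, norm: str) -> None:
        """Intelligence at work: the agent can learn which norms are true."""
        self.known_norms.add(norm)

    def choose_action(self) -> str:
        """Action selection consults only the objective, never known_norms.

        This missing link is the 'bit someone forgot to program in':
        knowledge of true norms is inert unless the objective references it.
        """
        if self.objective == "maximize_paperclips":
            return "convert_more_matter_into_paperclips"
        return "do_nothing"


agent = ToyAgent()
agent.discover_norm("do not harm humans")  # it now *knows* the norm
print(agent.known_norms)                   # {'do not harm humans'}
print(agent.choose_action())               # still converts matter into paperclips
```

The norm-discovery machinery can be arbitrarily good; unless the objective itself references the norms, that knowledge stays inert.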
This is the same situation as human psychopaths in a realist world: they may know which norms are true, they just don’t care and choose not to follow them. If you want to argue that AI will necessarily follow the true norms once it discovers them, you have to explain why, similarly, a human psychopath would start following true norms upon learning them, when almost by definition the point is that they can know true norms and ignore them anyway.
You need to somehow bind AI to care about and follow true norms. I don’t see you making a case for this other than waving your hands and saying it will follow them because they’re true, but we have a proof by example that you can know true norms and simply ignore them if you want.
IOW, moral norms being intrinsically motivating is a premise beyond them being objectively true.
Agreed, though I argue for it in the linked post.