I agree. But the point is, in order to do the thing that the CEO actually wants, the AI needs to understand goodness at least as well as the CEO. And this isn’t, like, maximal goodness for sure. But to hold up under superintelligent levels of optimization, it needs a significantly nuanced understanding of goodness.
I think there is some disagreement between AI camps about how difficult it is to get to the level-of-goodness the CEO’s judgment represents, when implemented in an AI system powerful enough to automate scientific research.
I think the “alignment == goodness” confusion is sort of a natural consequence of having the belief that “if you’ve built a powerful enough optimizer to automate scientific progress, your AI has to understand your conception of goodness to avoid having catastrophic consequences, and this requires making deep advances such that you’re already 90% of the way to ‘build an actual benevolent sovereign’.”
(I assume Mark and Ajeya have heard these arguments before, and we’re having this somewhat confused conversation because of a frame difference somewhere about how useful it is to think about goodness and what AI is for, or something)
“if you’ve built a powerful enough optimizer to automate scientific progress, your AI has to understand your conception of goodness to avoid having catastrophic consequences, and this requires making deep advances such that you’re already 90% of the way to ‘build an actual benevolent sovereign’.”
I think this is just not true? Consider an average human, who understands goodness well enough to do science without catastrophic consequences, but is not a benevolent sovereign. One reason why they’re not a sovereign is that they have high uncertainty about e.g. what they think is good, and avoid taking actions that violate deontological constraints or virtue ethics constraints or other “common sense morality.” AIs could just act similarly? Current AIs already seem like they basically know what types of things humans would think are bad or good, at least enough to know that when humans ask for coffee, they don’t mean “steal the coffee” or “do some complicated scheme that results in coffee”.
Separately, it seems like in order for your AI to act competently in the world it does have to have a pretty good understanding of “goodness”, e.g. to be able to understand why Google doesn’t do more spying on competitors, more insider trading, or other unethical but profitable things, etc. (Separately, the AI will also be able to write philosophy books that are better than current ethical philosophy books, etc.)
My general claim is that if the AI takes creative catastrophic actions to disempower humans, it’s going to know that the humans don’t like this, are going to resist in the ways that they can, etc. This is a fairly large part of “understanding goodness”, and enough (it seems to me) to avoid catastrophic outcomes, as long as the AI tries to do [its best guess at what the humans wanted it to do] and not [just optimize for the thing the humans said to do, which it knows is not what the humans wanted it to do].
Consider an average human, who understands goodness well enough to do science without catastrophic consequences, but is not a benevolent sovereign.
If “science” includes “building and testing AGIs” or “building and testing nukes” or “building and testing nanotech”, then I think the “average human” “doing science” is unaligned.
I have occasionally heard people debate whether “humans are aligned”. I find it a bit odd to think of it as having a yes/no answer. I think humans are good at modeling some environments and not others. High-pressure environments with superstimuli are harder than others (e.g. the historical example of “you can get intense amounts of status and money if you lead an army into war and succeed”, or the recent example of “you can get intense amounts of status and money if you lead a company to build superintelligent AGI”). Environments where lots of abstract conceptual knowledge is required to understand what’s happening (e.g. modern economies, science, etc.) can easily be ones where the human doesn’t understand the situation and makes terrible choices. I don’t think this is a minor issue, even for people with strong common-sense morality, and it applies to lots of hypothetical situations where humans could self-modify.
Also, relatedly in my head, I feel like I see this intuition relied on a bunch (emphasis mine):
Consider an average human, who understands goodness well enough to do science without catastrophic consequences, but is not a benevolent sovereign. One reason why they’re not a sovereign is that they have high uncertainty about e.g. what they think is good, and avoid taking actions that violate deontological constraints or virtue ethics constraints or other “common sense morality.”
There’s something pretty odd to me here about the way that it’s sometimes assumed that average people have uncertainty over their morality. If I think of people who feel very “morally righteous”, I don’t actually think of them as acting with much humility. The Christian father who beats his son for his minor sins, or the Muslim father who murders his daughter for being raped, are people who I think have strong moral stances. I’m not saying these are the central examples of people acting based on their moral stances, but I sometimes get the sense that some folks think that all agents that have opinions about morality are ‘reflective’, whereas I think many/most humans historically simply haven’t been, and thought they understood morality quite well. To give a concrete example of how this relates to alignment discussion, I can imagine that an ML system trained to “act morally” may start acting very aggressively in line with its morals and see those attempting to give it feedback as attempting to trick it into not acting morally.
I think I would trust these sorts of sentences to be locally valid if they said “consider an extremely reflective human of at least 130 IQ who often thinks through both consequentialist and deontological and virtue ethic lenses on their life” rather than “consider an average human”.
One reason why they’re not a sovereign is that they have high uncertainty about e.g. what they think is good, and avoid taking actions that violate deontological constraints or virtue ethics constraints or other “common sense morality.”
I mean, intrinsically caring about “common sense morality” sounds to me like you basically already have an AI you can approximately let run wild.
Superintelligences plus a moral foundation that reliably covers all the common sense morality seems like it’s really very close to an AI that would just like to go off and implement CEV. CEV looks really good according to most forms of common-sense morality I can think of. My current guess is almost all the complexity of our values lives in the “common sense” part of morality.