Consider an average human, who understands goodness enough to do science without catastrophic consequences, but is not a benevolent sovereign.
If “science” includes “building and testing AGIs” or “building and testing nukes” or “building and testing nanotech”, then I think the “average human” “doing science” is unaligned.
I have occasionally heard people debate whether “humans are aligned”. I find it a bit odd to treat this as a yes/no question. I think humans are good at modeling some environments and not others. High-pressure environments with superstimuli are harder than others (e.g. the historical example of “you can get intense amounts of status and money if you lead an army into war and succeed”, or the recent example of “you can get intense amounts of status and money if you lead a company to build superintelligent AGI”). Environments where lots of abstract conceptual knowledge is required to understand what’s happening (e.g. modern economies, science, etc.) can easily leave the human not understanding the situation and making terrible choices. I don’t think this is a minor issue, even for people with strong common-sense morality, and it applies to lots of hypothetical situations where humans could self-modify.
Also, relatedly in my head, I feel like I see this intuition being rested on a bunch (emphasis mine):
Consider an average human, who understands goodness enough to do science without catastrophic consequences, but is not a benevolent sovereign. One reason why they’re not a sovereign is because they have high uncertainty about e.g. what they think is good, and avoid taking actions that violate deontological constraints or virtue ethics constraints or other “common sense morality.”
There’s something pretty odd to me here about the way it’s sometimes assumed that average people have uncertainty over their morality. If I think of people who feel very “morally righteous”, I don’t actually think of them as acting with much humility. The Christian father who beats his son for his minor sins, or the Muslim father who murders his daughter for being raped, are people who I think have strong moral stances. I’m not saying these are the central examples of people acting on their moral stances, but I sometimes get the sense that some folks think all agents that have opinions about morality are ‘reflective’, whereas I think many/most humans historically simply haven’t been, and thought they understood morality quite well. To give a concrete example of how this relates to alignment discussion, I can imagine that an ML system trained to “act morally” may start acting very aggressively in line with its morals, and see those attempting to give it feedback as trying to trick it into not acting morally.
I think I would trust these sorts of sentences to be locally valid if they said “consider an extremely reflective human of at least 130 IQ who often thinks through consequentialist, deontological, and virtue-ethics lenses on their life” rather than “consider an average human”.