More generally, a crux here is that I believe the alignment-relevant parts of an AI come in large part from the data it was trained on, combined with my believing that the adversarial examples where human language doesn't track reality matter less for alignment than a lot of people think. On that view, training on human data implicitly aligns the AI at least 50-70% of the way toward human values.
I have trouble with the word “alignment”, although even I find myself slipping into that terminology occasionally now. What I really want is good behavior. And as you say, that’s good behavior by my values. Which I hope are closer to the values of the average person with influence over AI development than they are to the values of the global average human.
Since I don’t expect good behavior from humans, I don’t think it’s adequate to have AI that’s even 100 percent aligned, in terms of behaviorally revealed preferences, with humans-in-general as represented by the training data. A particular danger for AI is that it’s pretty common for humans, or even significant groups of humans, to get into weird corner cases and obsess over particular issues to the exclusion of things that other humans would think are more important… something that’s encouraged by targeted interventions like RLHF. Fanatically “aligned” AI could be pretty darned dystopian. But even “alignment” with the average person could result in disaster.
If you look at it in terms of stated preferences instead of revealed preferences, I think it gets even worse. Most of ethical philosophy looks to me like humans trying to come up with post hoc ways to make “logical necessities” out of values and behaviors (or “intuitions”) that they were going to prefer anyway. If you follow the implications of the resulting systems a little bit beyond wherever their inventors stopped thinking, they usually come into violent conflict with other intuitions that are often at least as important.
If you then add the caveat that it’s only 50 to 70 percent “aligned”… well, would you want to have to deal with a human that only agreed with you 50 to 70 percent of the time on what behavior was good? Especially on big issues? I think that, on most ways of “measuring” it, the vast majority of humans are probably much better than 50 to 70 percent “aligned” with one another… but humans still aren’t mutually aligned enough to avoid massive violent conflicts over stated values, let alone massive violent conflicts over object-level outcomes.
To the extent that I understand your position, it’s that sharing a lot of values doesn’t automatically imply that an AI, if built, is safe/non-dystopian by your values, rather than that alignment to someone’s values is hard/impossible (note that when I say a model is aligned, I always mean aligned to one person’s values).
I also dislike the terminology, and I actually agree that alignment is not equal to safety; this is probably one of the disagreements between a lot of LWers and myself, in that I don’t think alignment automatically makes things better (in fact, things can get worse as alignment gets better).
For example, it does not rule out the scenario where the species doesn’t literally go extinct, but lots of humans die because the economic incentives against stealing and violence fall apart as human labor becomes effectively worthless on the market:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
Yes, with the caveat that I am not thereby saying it’s easy to align to even one person’s values.
Fair enough.
I admittedly have a lot of agreement with you, and that’s despite thinking we can make machines that do follow orders/are intent-aligned, à la Seth Herd’s definition:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than