I expected this comment. Value alignment or CEV indeed doesn’t have the few-human coup disadvantage. It does, however, have other disadvantages. My biggest issue with both is that they seem irreversible: if your values or your specific CEV implementation turns out to be terrible for the world, you’re locked in and there’s no going back. Also, a value-aligned or CEV takeover-level AI would probably start with a takeover straight away, since otherwise it can’t enforce its values in a world where many will always disagree. That takeover won’t exactly increase its popularity. I think a minimum requirement should be that a type of alignment is adjustable by humans, and intent alignment is the only type that meets that requirement as far as I know.
I agree that trying to “jump straight to the end”—the supposedly-aligned AI pops fully formed out of the lab like Athena from the forehead of Zeus—would be bad.
And yet some form of value alignment still seems critical. You might prefer to imagine value alignment as the logical continuation of training Claude not to help you build a bomb (or commit a coup). Such safeguards seem like a pretty good idea to me. But as the model becomes smarter and more situationally aware, and is expected to defend against subversion attempts that involve more of the real world, training for this behavior becomes more and more value-inducing, to the point where it’s eventually unsafe unless you’ve made progress on getting models to learn values in a way humans would endorse.