The Coherent Extrapolated Volition of a human Individual (CEVI) is a completely different type of thing than the Coherent Extrapolated Volition of Humanity (CEVH). Both are mappings to an entity of the type that can be said to want things. But only CEVI is a mapping from an entity of that type (the original human). CEVH does not map from such an entity; it only maps to one. A group of billions of human individuals can only be seen as such an entity if one already has a specific way of resolving disagreements among individuals who disagree on how to resolve disagreements. Such a disagreement resolution rule is one necessary part of the definition of any CEVH mapping.
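To make the type distinction concrete, here is a minimal Python sketch. All names in it (WantingEntity, DisagreementResolutionRule, make_cevh, and so on) are hypothetical and purely illustrative; the point is only that a CEVI can be typed as a function from a wanting entity to a wanting entity, while a CEVH mapping is not even well defined until a disagreement-resolution rule is supplied as part of its definition.

```python
from typing import Callable, Protocol


class WantingEntity(Protocol):
    """An entity of the type that can be said to want things."""
    def preference(self, outcome: str) -> float: ...


# Hypothetical aliases, purely to make the type-level point.
Human = WantingEntity        # a single human already wants things
Humanity = frozenset         # billions of humans, as a bare collection

# CEVI: maps from a wanting entity to a wanting entity.
CEVI = Callable[[Human], WantingEntity]

# A bare collection of humans is not itself a wanting entity; it only
# becomes one once a disagreement-resolution rule is fixed. That rule is
# therefore a necessary part of the definition of any CEVH mapping.
DisagreementResolutionRule = Callable[[Humanity], WantingEntity]


def make_cevh(rule: DisagreementResolutionRule) -> Callable[[Humanity], WantingEntity]:
    """Without a specific `rule`, there is no well-defined CEVH at all."""
    def cevh(humans: Humanity) -> WantingEntity:
        return rule(humans)
    return cevh
```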
I like to state this as the issue that any version of CEV/group alignment that wants to aggregate the values of thousands of people or more requires implicitly resolving disagreements in values, which in turn requires value-laden choices; at that point, you are essentially doing value alignment to what you think is good, and the nominal society is just a society of you.
I basically agree with Seth Herd here: instruction following is both the most likely and the best alignment target for purposes of AI safety (at least assuming offense-defense balance issues aren't too severe).