those that rely on arbitrary AGIs detecting and [settling on as natural] the same features of the world that humans do, including values and qualities important to humanity
can you give examples of such strategies, and argue that they rely on this?
I’m in a weird situation here: I’m not entirely sure whether the community considers the Learning Theory Agenda to be the same alignment plan as The Plan (which is arguably not a plan at all, but he sure thinks about value learning!), or whether I can count things like the class of scalable oversight plans that take as read that “human values” are a specific natural object. Would you at least agree that those first two (or one???) rely on that?