The main case for optimism on human-human alignment under extreme optimization seems to be indirection: not that [what I want] and [what you want] happen to be sufficiently similar, but that there’s a [what you want] pointer within [what I want].
Value fragility doesn’t argue strongly against the pointer-based version. The tails don’t come apart when they’re tied together.
It’s not obvious that the values-on-reflection of an individual human would robustly maintain the necessary pointers (to other humans, to past selves, to alternative selves/others...), but it is at least plausible—if you pick the right human.
More generally, an argument along the lines of [the default outcome with AI doesn’t look too different from the default outcome without AI, for most people] suggests that we need to do better than the default, with or without AI. (I’m not particularly optimistic about human-human alignment without serious and principled efforts.)