I think this is worth a new top-level post. The discussion on your post Evaluating the historical value misspecification argument was a high-water mark for resolving the disagreement on alignment difficulty between old-school thinkers and newer prosaic alignment thinkers. But that discussion didn't get past the point you raise here: if we can identify human values, shouldn't that help (a lot) in making an AGI that pursues those values?
One key factor is whether the understanding of human values is available while the AGI is still dumb enough to remain in your control.
I tried to advance this line of discussion in my posts The (partial) fallacy of dumb superintelligence and Goals selected from learned knowledge: an alternative to RL alignment.