I think value extrapolation is more tractable than many assume, even for very powerful systems. This is because I expect AI systems to strongly prefer a small number of general explanations over many shallow explanations, and such general explanations of human values are more likely to extend to unusual situations than shallower ones.
One approach that seems really underexplored is to directly generate data on how human preferences extend to extreme situations or to very capable AIs. OpenAI was able to greatly improve the alignment of current language models by learning a reward model from text examples of language models following human instructions, ranked by how well each output followed the human's instruction. We should be able to generate a similar values dataset, but for AIs much stronger than current language models. See here for a more extended discussion.
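To make the reward-modeling step concrete, here is a minimal sketch of the pairwise preference setup used in RLHF-style training: a small model scores tokenized responses, and is trained so that the response humans ranked higher receives the higher score. The architecture, names, and data below are illustrative placeholders, not OpenAI's actual implementation; a values dataset of the kind proposed above would plug into the same loss, with the rankings coming from judgments about extreme situations rather than ordinary instruction-following.

```python
# Minimal sketch of reward-model training from ranked outputs (RLHF-style).
# All names (RewardModel, preference_loss, the toy data) are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a tokenized (instruction + response) sequence to a scalar reward."""
    def __init__(self, vocab_size=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids):
        # Mean-pool token embeddings, then project to a single reward score.
        h = self.embed(token_ids).mean(dim=1)
        return self.head(h).squeeze(-1)

def preference_loss(model, preferred_ids, rejected_ids):
    """Bradley-Terry style loss: the human-preferred response should score higher."""
    r_pref = model(preferred_ids)
    r_rej = model(rejected_ids)
    return -F.logsigmoid(r_pref - r_rej).mean()

# Usage: each training example is a pair of tokenized responses to the same
# instruction, where a human judged the first to follow the instruction better.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
preferred = torch.randint(0, 32000, (8, 64))   # placeholder token ids
rejected = torch.randint(0, 32000, (8, 64))
loss = preference_loss(model, preferred, rejected)
loss.backward()
optimizer.step()
```

The learned reward model is then used to score (or fine-tune) the policy; the only thing that changes for the proposed values dataset is what the human rankings are about.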