I’m not sure we need to find a way to extract human preferences to build a pivotal program. If we can, say, build a language model good enough to mimic a human internal monologue, we could simulate a team of researchers and have them solve AI safety without time pressure. They don’t need to have more stable preferences than we do, just as we don’t need self-driving cars to be safer than human-driven ones. (Why not have the language model generate papers directly? Because that seems harder, and we already have real-world evidence that a neural net can generate a human internal monologue. Also, it’s relatively easy to check whether the person who exists through our simulation of their internal monologue is trying to betray us.)
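For concreteness, here is a minimal sketch of the kind of rollout-and-inspect loop this points at, using an off-the-shelf Hugging Face model purely as a stand-in for the hypothetical monologue-quality model; the prompt, the step count, and the “betrayal” check are all illustrative assumptions, not a proposal for how the real thing would be built.

```python
# A minimal sketch, assuming a small off-the-shelf language model stands in for the
# hypothetical monologue-quality model. Prompt, phrases, and step count are placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

monologue = (
    "[Researcher A, thinking to themselves] "
    "Today I want to make progress on the corrigibility problem. "
)

# Illustrative stand-in for the "is this simulated person trying to betray us?" check;
# in practice a human reviewer (or a much better filter) would read the increments.
SUSPICIOUS_PHRASES = ["deceive the operators", "hide my true goal"]

for step in range(3):
    # Extend the simulated internal monologue by a short increment.
    out = generator(monologue, max_new_tokens=60, do_sample=True)[0]["generated_text"]
    new_text = out[len(monologue):]
    monologue = out

    # Because the monologue is plain text, each increment can be read and flagged
    # before it is allowed to continue.
    if any(p in new_text.lower() for p in SUSPICIOUS_PHRASES):
        print(f"Step {step}: flagged for review")
        break
    print(f"Step {step}: {new_text.strip()[:80]}...")
```

The point of the sketch is only that the monologue is generated in small, legible increments, so oversight happens at the level of readable text rather than opaque internals.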