The best solution I can think of to outer-aligning an AGI capable of doing STEM research is to build one that’s a value learner and an alignment researcher. Obviously for a value learner, doing alignment research is a convergent instrumental strategy: it wants to do whatever humans want, so it needs to better figure out what that is so it can do a better job. Then human values become an attractor.
However, to implement this strategy, you first need to build a value-learning AGI capable of doing STEM research (which obviously we don’t yet know how to do) that is initially sufficiently aligned to human values that it starts off inside the basin of attraction. I.e. it needs a passable first guess at human values for it to improve upon: one that’s sufficiently close that a) it doesn’t kill us all in the meantime while its understanding of our values is converging, b) it understands that we want things from it like honesty, corrigibility, willingness to shut down, fairness and so forth, and c) that we can’t give it a complete description of human values because we don’t fully understand them ourselves.
Your suggestion of using something like an LLM to encode a representation of human values is exactly the lines that I think we should be thinking on for that “initial starting value” for human values for a value learning AGI. Indeed, there are already researchers building ethical question testing sets for LLMs.
The best solution I can think of to outer-aligning an AGI capable of doing STEM research is to build one that’s a value learner and an alignment researcher. Obviously for a value learner, doing alignment research is a convergent instrumental strategy: it wants to do whatever humans want, so it needs to better figure out what that is so it can do a better job. Then human values become an attractor.
However, to implement this strategy, you first need to build a value-learning AGI capable of doing STEM research (which obviously we don’t yet know how to do) that is initially sufficiently aligned to human values that it starts off inside the basin of attraction. I.e. it needs a passable first guess at human values for it to improve upon: one that’s sufficiently close that a) it doesn’t kill us all in the meantime while its understanding of our values is converging, b) it understands that we want things from it like honesty, corrigibility, willingness to shut down, fairness and so forth, and c) that we can’t give it a complete description of human values because we don’t fully understand them ourselves.
Your suggestion of using something like an LLM to encode a representation of human values is exactly the lines that I think we should be thinking on for that “initial starting value” for human values for a value learning AGI. Indeed, there are already researchers building ethical question testing sets for LLMs.