Thanks for the answer! I feel uncertain whether that suggestion is an “alignment” paradigm/method though—either these formally specified goals don’t cover most of the things we care about, in which case this doesn’t seem that useful, or they do, in which case I’m pretty uncertain how we can formally specify them—that’s kind of the whole outer alignment problem. Also, there is still (weaker) pressure to produce outputs that look good to humans, if humans are searching over goals to find those that produce good outputs. I agree it’s further away, but that seems like it could also be a bad thing, if it makes it harder to pressure the models to actually do what we want in the first place.