I still don’t think you’ve proposed an alternative to “training a model with human feedback”. “maintaining some qualitative distance between the optimisation target for an AI model and the human “does this look good?” function” sounds nice, but how do we even do that? What else should we optimise the model for, or how should we make it aligned? If you think the solution is to use AI-assisted humans as overseers, that doesn’t seem meaningfully different from what Buck is saying. So even if he had actually written that he’s not aware of an alternative to “training a model with human/overseer feedback”, I don’t think you’ve refuted that point.
Briefly, the alternative optimisation target I would suggest is performance at achieving intelligible, formally specified goals within a purely predictive model/simulation of the real world.
Humans could then look at what happens in the simulations and say “gee, that doesn’t look good,” and specify better goals instead, and the policy won’t experience gradient pressure to make those evaluations systematically wrong.
This isn’t the place to make a case for the “competitiveness” or tractability of that kind of approach. What I want to claim here is just that it is an example of an alignment paradigm that does leverage machine learning (both to build a realistic model of the world and to optimise policies for acting within that model) but does not directly use human approval (or an opaque model thereof) as an optimisation target in the way that seems problematic about RLHF.
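To make the distinction concrete, here is a minimal toy sketch of what I mean by the two loops. Everything in it (the names WorldModel, formal_goal, human_review, the toy dynamics) is hypothetical and purely illustrative, not a proposal for how the real system would be built: the point is only that the policy’s optimisation signal comes from a formally specified goal evaluated inside a predictive model, while human judgment sits in an outer loop that can respecify the goal but never becomes the thing the policy is selected on.

```python
# Illustrative sketch only; all names and dynamics here are hypothetical.
# Inner loop: optimise a policy against a formally specified goal, evaluated
# purely inside a learned predictive model of the world.
# Outer loop: humans inspect simulated rollouts and may revise the goal
# specification, but their judgment never enters the policy's selection signal.

import random


class WorldModel:
    """Stand-in for a learned predictive model of the environment."""

    def simulate(self, policy, steps=20):
        state = 0.0
        trajectory = []
        for _ in range(steps):
            action = policy(state)
            state = state + action + random.gauss(0, 0.1)  # toy dynamics
            trajectory.append(state)
        return trajectory


def formal_goal(trajectory, target=5.0):
    """A formally specified objective: end the rollout near `target`."""
    return -abs(trajectory[-1] - target)


def optimise_policy(model, goal, candidates=200):
    """Inner loop: search over simple policies for the best score under `goal`,
    evaluated entirely inside the world model (no human judgment involved)."""
    best_gain, best_score = None, float("-inf")
    for _ in range(candidates):
        gain = random.uniform(-1.0, 1.0)
        policy = lambda s, g=gain: g * (1.0 if s < 10 else 0.0)
        score = goal(model.simulate(policy))
        if score > best_score:
            best_gain, best_score = gain, score
    return (lambda s, g=best_gain: g * (1.0 if s < 10 else 0.0)), best_score


def human_review(trajectory):
    """Outer loop: a human looks at the simulated outcome and decides whether
    the *goal specification* should change. This never feeds into the policy's
    optimisation signal directly."""
    return "looks fine" if trajectory[-1] < 8.0 else "that doesn't look good"


if __name__ == "__main__":
    model = WorldModel()
    target = 5.0
    for round_ in range(3):
        policy, score = optimise_policy(model, lambda t: formal_goal(t, target))
        verdict = human_review(model.simulate(policy))
        print(f"round {round_}: target={target}, score={score:.2f}, human says: {verdict}")
        if verdict != "looks fine":
            target -= 1.0  # humans respecify the goal, not a learned reward model
```

Again, nothing about the toy dynamics matters; what matters structurally is that `human_review` only influences which goal gets specified next, so there is no gradient pressure on the policy to make the human evaluations systematically wrong.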
Thanks for the answer! I’m uncertain whether that suggestion is really an “alignment” paradigm/method, though: either these formally specified goals don’t cover most of the things we care about, in which case this doesn’t seem that useful, or they do, in which case I’m pretty uncertain how we could formally specify them; that’s kind of the whole outer alignment problem. Also, there is still (weaker) pressure to produce outputs that look good to humans, if humans are searching over goals to find those that produce good-looking outputs. I agree that pressure is further removed, but that could also be a bad thing, if it makes it harder to get the models to actually do what we want in the first place.