I’m not really aware of any compelling alternatives to this class of plan: “training a model based on a reward signal” is basically all of machine learning.
I think the actual concern there is about human feedback, but you phrased the question as about overseer feedback, but then your answer (quoted) is about any reward signal at all.
Is next-token prediction already “training a model based on a reward signal”? A little bit: there’s a loss function! But is it effectively RL on next-token-prediction reward/feedback? Not really. Next-token prediction, in contrast to RL, only does one-step lookahead and doesn’t use a value network (only a policy network). It is qualitatively different from RL because it doesn’t do any backprop-through-time (which can induce emergent/convergent forward-looking “grooves” in trajectory-space that were not found in the [pre]training distribution).
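To make the one-step versus multi-step contrast concrete, here is a minimal sketch. It assumes plain REINFORCE as a stand-in for "RL" (no value network, and no backprop-through-time; it only illustrates how credit flows across time steps). All function names are hypothetical and chosen for illustration, not taken from any library.

```python
import math

def next_token_loss_terms(probs_of_targets):
    """Per-position cross-entropy terms: -log p(target_t | prefix_t).

    Each term depends only on the probability assigned to the token at
    that position; nothing that happens later in the sequence enters it.
    """
    return [-math.log(p) for p in probs_of_targets]

def returns_to_go(rewards, gamma=1.0):
    """Discounted sum of rewards from each step to the end of the episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def reinforce_loss_terms(probs_of_actions, rewards, gamma=1.0):
    """REINFORCE surrogate terms: -log pi(a_t | s_t) * G_t.

    The weight G_t on step t includes rewards that arrive after step t,
    so early actions are pushed towards whatever pays off later.
    """
    returns = returns_to_go(rewards, gamma)
    return [-math.log(p) * g for p, g in zip(probs_of_actions, returns)]
```

With `rewards = [0.0, 0.0, 1.0]`, `returns_to_go` gives `[1.0, 1.0, 1.0]`, so the first action's loss term is scaled by a reward that only arrives two steps later. The next-token terms take no reward argument at all, which is the sense in which that objective is "one-step" in this toy picture.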
Perhaps more importantly, maintaining some qualitative distance between the optimisation target for an AI model and the human “does this look good?” function seems valuable, for similar reasons to why one holds out some of the historical data while training a financial model, or why one separates a test distribution from a validation distribution. When we give the optimisation process unfettered access to the function that’s ultimately going to make decisions about how all-things-considered good the result is, the opportunities for unmitigated overfitting/Goodharting are greatly increased.
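The held-out-data analogy can be illustrated with a toy sketch. Everything here is an assumption made up for illustration: a one-dimensional space of candidate outputs, a hypothetical `true_quality` function, and two imperfect evaluators with different blind spots. It is not a model of any real training setup.

```python
def true_quality(x):
    # Hypothetical ground truth: the best candidate is x = 3.
    return -(x - 3) ** 2

def evaluator_a(x):
    # Proxy used as the optimisation target. Its (assumed) blind spot
    # over-rewards candidates with x >= 9.
    return true_quality(x) + (60 if x >= 9 else 0)

def evaluator_b(x):
    # Held-out proxy that is never optimised against. It has a different,
    # smaller (assumed) blind spot at x == 0.
    return true_quality(x) + (5 if x == 0 else 0)

def optimise(score, candidates):
    # Exhaustive search stands in for the optimisation process.
    return max(candidates, key=score)

candidates = range(0, 11)
picked = optimise(evaluator_a, candidates)

print("picked:", picked)
print("score under the optimised evaluator:", evaluator_a(picked))
print("score under the held-out evaluator:", evaluator_b(picked))
print("true quality:", true_quality(picked))
```

In this toy case the search lands on the optimised evaluator's blind spot (`picked` is 9, with a proxy score of 24 but a true quality of -36), and the held-out evaluator still reports the low score because it was never part of the optimisation target. If the search had been run against both evaluators at once, it would have had the chance to find blind spots in each.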
I still don’t think you’ve proposed an alternative to “training a model with human feedback”. “Maintaining some qualitative distance between the optimisation target for an AI model and the human ‘does this look good?’ function” sounds nice, but how do we even do that? What else should we optimise the model for, or how should we make it aligned? If you think the solution is to use AI-assisted humans as overseers, then that doesn’t seem to be a real difference from what Buck is saying. So even if he actually had written that he’s not aware of an alternative to “training a model with human/overseer feedback”, I don’t think you’ve refuted that point.
Briefly, the alternative optimisation target I would suggest is performance at achieving intelligible, formally specified goals within a purely predictive model/simulation of the real world.
Humans could then look at what happens in the simulations and say “gee, that doesn’t look good,” and specify better goals instead, and the policy won’t experience gradient pressure to make those evaluations systematically wrong.
This isn’t the place where I want to make a case for the “competitiveness” or tractability of that kind of approach. What I do want to claim here is that it is an example of an alignment paradigm that does leverage machine learning (both to make a realistic model of the world and to optimise policies for acting within that model) but does not directly use human approval (or an opaque model thereof) as an optimisation target, which is the feature of RLHF that seems problematic.
Thanks for the answer! I feel uncertain whether that suggestion is an “alignment” paradigm/method though—either these formally specified goals don’t cover most of the things we care about, in which case this doesn’t seem that useful, or they do, in which case I’m pretty uncertain how we can formally specify them—that’s kind of the whole outer alignment problem. Also, there is still (weaker) pressure to produce outputs that look good to humans, if humans are searching over goals to find those that produce good outputs. I agree it’s further away, but that seems like it could also be a bad thing, if it makes it harder to pressure the models to actually do what we want in the first place.
I think the actual concern there is about human feedback, but you phrased the question as about overseer feedback, but then your answer (quoted) is about any reward signal at all.
I think that some people actually have the concern I responded to there, rather than the concern you say that they might have instead.
I agree that I conflated overseer feedback with any reward signal at all; I wondered while writing the post whether this conflation would be a problem. I don’t think it affects the situation much, but it’s reasonable for you to ask me to justify that.