Have you read “Hidden Incentives for Auto-Induced Distributional Shift” (arXiv:2009.09153)? (It’s cited in Jan Leike’s “Why I’m optimistic about our alignment approach” on Substack):
> For example, when using a reward model trained from human feedback, we need to update it quickly enough on the new distribution. In particular, auto-induced distributional shift might change the distribution faster than the reward model is being updated.
I used to be less worried about this, but changed my mind after the success of parameter-efficient finetuning (e.g. LoRAs) convinced me that you could have models with short feedback loops between their outputs and inputs (as opposed to the current regime of large training runs, which are not economical to do often). I believe that training on AI-generated text is a potential pathway to eventual doom, but I haven’t yet modelled this in enough explicit detail to be confident about whether it is the first thing that kills us or whether some other effect gets there earlier.
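To make the feedback-loop concern concrete, here is a minimal toy sketch (my own illustration, not anyone’s actual training setup): a LoRA-style low-rank adapter on a single frozen linear layer is cheap to retrain, and each round’s training data includes the model’s own previous outputs, so the data distribution drifts with every cheap update. The layer sizes, rank, objective, and sliding-window mixing are all arbitrary choices for illustration.

```python
# Toy illustration of a short output->input feedback loop enabled by cheap
# low-rank (LoRA-style) updates. Not a real training pipeline.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # base model stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

torch.manual_seed(0)
model = LoRALinear(nn.Linear(16, 16))
opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-2)

data = torch.randn(64, 16)                    # stand-in for "human" data
for round_ in range(5):
    # Cheap adapter update: only A and B are trained, so rounds are fast.
    for _ in range(50):
        loss = ((model(data) - data) ** 2).mean()   # toy objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Feedback: the model's own outputs join the next round's training data,
    # so the distribution the adapter is trained on drifts each round.
    with torch.no_grad():
        generated = model(torch.randn(32, 16))
    data = torch.cat([data, generated])[-64:]       # sliding window of data
    print(f"round {round_}: loss={loss.item():.4f}, data mean={data.mean():.3f}")
```

The point of the toy is only that when updates are cheap enough to run often, the loop “model outputs → training data → updated model” closes on a much shorter timescale than with large, infrequent training runs.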
My early influences that led me toward thinking this are mostly related to dynamical mean-field theory, but I haven’t had time to develop this into a full argument.