The notion of value myopia you describe is different from the one I feel comes up most often. What I sometimes see people suggest is that, because the AI is trained to minimize prediction errors, it will output tokens that make future text more predictable, even if those tokens are not themselves the most likely. I think it is myopic with respect to minimizing prediction errors in that sense, but I agree with you that it is not myopic in the sense you describe.
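To make that distinction concrete, here is a toy sketch (my framing, not anything from the post): the standard next-token objective decomposes into independent per-position terms, so nothing rewards choosing tokens that make later text more predictable; the second loss is a hypothetical non-myopic variant with such a term bolted on.

```python
import torch
import torch.nn.functional as F

vocab, seq_len = 50, 8
logits = torch.randn(seq_len, vocab, requires_grad=True)  # model outputs, one row per position
targets = torch.randint(0, vocab, (seq_len,))             # ground-truth next tokens

# Standard (myopic) objective: a sum of independent per-position terms, so the
# gradient at position t depends only on how well position t predicts its own
# next token, not on anything later in the sequence.
myopic_loss = F.cross_entropy(logits, targets)

# Hypothetical non-myopic variant: add a term rewarding low entropy (high
# predictability) at later positions. Ordinary next-token training contains no
# such term; it is only written out here to show what the alternative would be.
probs = F.softmax(logits, dim=-1)
future_entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)[1:].mean()
non_myopic_loss = myopic_loss + 0.1 * future_entropy
```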
Have you read “Hidden Incentives for Auto-Induced Distributional Shift” (arXiv:2009.09153)? It’s cited in Jan Leike’s “Why I’m optimistic about our alignment approach” (substack.com):
> For example, when using a reward model trained from human feedback, we need to update it quickly enough on the new distribution. In particular, auto-induced distributional shift might change the distribution faster than the reward model is being updated.
I used to be less worried about this, but I changed my mind after the success of parameter-efficient finetuning (e.g. LoRAs) convinced me that you could have models with short feedback loops between their outputs and inputs, as opposed to the current regime of large training runs that are not economical to repeat often. I believe that training on AI-generated text is a potential pathway to eventual doom, but I haven’t yet modelled this in enough explicit detail to be confident about whether it is the first thing that kills us or whether some other effect gets there earlier.
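To gesture at the kind of loop I mean, here is a toy, self-contained sketch (all of the names are stand-ins I made up, not a real system): a deployed model’s own outputs are periodically folded back into cheap adapter updates, so its output distribution keeps shifting its own input distribution far faster than a full retraining cycle would.

```python
import random

def sample_user_prompt():
    # Stand-in for live user traffic.
    return random.choice(["tell me about X", "summarize Y", "write Z"])

def generate(adapter_state, prompt):
    # Stand-in for generation with a frozen base model plus a small adapter
    # (e.g. a LoRA); the adapter state biases what gets produced.
    return f"[adapter v{adapter_state}] response to: {prompt}"

def cheap_adapter_update(adapter_state, new_texts):
    # Stand-in for a parameter-efficient finetuning step on recent outputs.
    # The point is that this is cheap enough to run often, unlike a full
    # training run.
    return adapter_state + 1

def deployment_loop(steps=1000, update_every=100):
    adapter_state, buffer = 0, []
    for step in range(1, steps + 1):
        output = generate(adapter_state, sample_user_prompt())
        buffer.append(output)  # AI-generated text re-enters the training data
        if step % update_every == 0:
            adapter_state = cheap_adapter_update(adapter_state, buffer)
            buffer = []        # the output-to-input feedback loop closes here
    return adapter_state

deployment_loop()
```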
The early influences that led me to thinking this are mostly related to dynamical mean-field theory, but I haven’t had time to develop this into a full argument.