I notice I don’t have strong opinions on what effects RL will have in this context: whether it will change only specific surface-level capabilities, whether it will shift the desires or motivations behind the behavior, whether it’s better to think of these systems as having habits or shards that RL shifts (note that I don’t understand shard theory that well, so this may be a mischaracterization), or something else. This all seems very unclear to me right now.
Do either of you have particular evidence informing your views on this that I could update on? Specifically, I’m interested in this: assuming we train with RL based on human feedback on diverse tasks and apply currently known safety techniques like adversarial training, where does this process actually push the model: toward rule following, toward lying in wait to overthrow humanity, toward valuing its creators, or somewhere else? I currently would not be surprised if it led to “playing the training game” and lying in wait, and I would be slightly but not very surprised if it led to some safe heuristics like following rules and not harming humans. These beliefs rest mostly on intuition.