Hm, it seems to me that RL would be more like training away the desire to deceive, although I’m not sure either “ability” or “desire” is totally on target—I think something like “habit” or “policy” captures it better. The training might not be bulletproof (AI systems might have multiple goals and sometimes notice that deception would help accomplish much), but one doesn’t need 100% elimination of deception anyway, especially not when combined with effective checks and balances.
I notice I don’t have strong opinions on what effects RL will have in this context: whether it will change just surface level specific capabilities, whether it will shift desires/motivations behind the behavior, whether it’s better to think about these systems as having habits or shards (note I don’t actually understand shard theory that well and this may be a mischaracterization) and RL shifts these, or something else. This just seems very unclear to me right now.
Do either of you have particular evidence that informs your views on this that I can update on? Maybe specifically I’m interested in knowing: assuming we are training with RL based on human feedback on diverse tasks and doing currently known safety things like adversarial training, where does this process actually push the model: toward rule following, toward lying in wait to overthrow humanity, to value its creators, etc. I currently would not be surprised if it led to “playing the training game” and lying in wait, and I would be slightly but not very surprised if it led to some safe heuristics like following rules and not harming humans. I mostly have intuition behind these beliefs.
Hm, it seems to me that RL would be more like training away the desire to deceive, although I’m not sure either “ability” or “desire” is totally on target—I think something like “habit” or “policy” captures it better. The training might not be bulletproof (AI systems might have multiple goals and sometimes notice that deception would help accomplish much), but one doesn’t need 100% elimination of deception anyway, especially not when combined with effective checks and balances.
I notice I don’t have strong opinions on what effects RL will have in this context: whether it will change just surface level specific capabilities, whether it will shift desires/motivations behind the behavior, whether it’s better to think about these systems as having habits or shards (note I don’t actually understand shard theory that well and this may be a mischaracterization) and RL shifts these, or something else. This just seems very unclear to me right now.
Do either of you have particular evidence that informs your views on this that I can update on? Maybe specifically I’m interested in knowing: assuming we are training with RL based on human feedback on diverse tasks and doing currently known safety things like adversarial training, where does this process actually push the model: toward rule following, toward lying in wait to overthrow humanity, to value its creators, etc. I currently would not be surprised if it led to “playing the training game” and lying in wait, and I would be slightly but not very surprised if it led to some safe heuristics like following rules and not harming humans. I mostly have intuition behind these beliefs.