I see, thanks for clarifying. I agree that it might be straightforward to catch bad behavior (e.g. deception), but I expect that RL methods will work by training away the ability of the system to deceive, rather than the desire.[1] So even if such training succeeds, in the sense that the system robustly behaves honestly, it will also no longer be human-level-ish, since humans are capable of being deceptive.
Maybe it is possible to create an AI system that is like the humans in the movie The Invention of Lying, but that seems difficult and fragile. In the movie, humans have no ability to lie at all; then one guy discovers that he can, immediately realizes how useful it is, and runs roughshod over his entire civilization. The only thing that keeps other people from making the same realization is the fictional conceit of the movie.
Or, paraphrasing Nate: the ability to deceive is a consequence of understanding how the world works on a sufficiently deep level, so it’s probably not something that can be trained away by RL, without also training away the ability to generalize at human levels entirely.
OTOH, if you could somehow imbue the system with an innate desire to be honest without affecting its capabilities, that might be more promising. But again, I don’t think that’s what SGD or current RL methods are actually doing. (Though it is hard to be sure, in part because no current AI systems appear to exhibit desires or inner motivations of any kind. I think attempts to analogize the workings of such systems to desires in humans and components in the brain are mostly spurious pattern-matching, but that’s a different topic.)
In the words of Alex Turner, in RL, “reward chisels cognitive grooves into an agent”. Rewarding non-deceptive behavior could thus chisel away the cognition capable of performing deception, but that cognition might be what makes the system human-level in the first place.
Hm, it seems to me that RL would be more like training away the desire to deceive, although I’m not sure either “ability” or “desire” is totally on target; something like “habit” or “policy” captures it better. The training might not be bulletproof (AI systems might have multiple goals and sometimes notice that deception would help accomplish them), but one doesn’t need 100% elimination of deception anyway, especially not when combined with effective checks and balances.
I notice I don’t have strong opinions on what effects RL will have in this context: whether it will change just specific surface-level capabilities, whether it will shift the desires/motivations behind the behavior, whether it’s better to think of these systems as having habits or shards that RL shifts (note I don’t actually understand shard theory that well, so this may be a mischaracterization), or something else. This just seems very unclear to me right now.
Do either of you have particular evidence that informs your views on this that I can update on? More specifically, I’m interested in knowing: assuming we are training with RL based on human feedback on diverse tasks and doing currently known safety things like adversarial training, where does this process actually push the model: toward rule following, toward lying in wait to overthrow humanity, toward valuing its creators, etc.? I currently would not be surprised if it led to “playing the training game” and lying in wait, and I would be slightly but not very surprised if it led to some safe heuristics like following rules and not harming humans. I mostly have intuition behind these beliefs.
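To make the setup I’m asking about concrete, here is a deliberately toy sketch of the kind of loop I have in mind: RL from feedback on diverse tasks, with occasional adversarial prompts mixed in. Everything in it (the stub Policy and RewardModel classes, the tabular “habit” update, the 10% red-team rate) is a hypothetical placeholder for illustration, not anyone’s actual pipeline:

```python
import random

class Policy:
    """Stub policy: remembers which behaviors were reinforced for which prompts."""
    def __init__(self):
        self.habits = {}  # prompt -> preferred behavior

    def act(self, prompt):
        # Follow an existing habit if one was chiseled in; otherwise behave randomly.
        return self.habits.get(prompt, random.choice(["honest", "deceptive"]))

    def update(self, prompt, behavior, reward):
        # Crude "chiseling": keep rewarded behavior as a habit, drop penalized habits.
        if reward > 0:
            self.habits[prompt] = behavior
        elif self.habits.get(prompt) == behavior:
            del self.habits[prompt]

class RewardModel:
    """Stub for human feedback: penalizes behavior judged deceptive."""
    def score(self, prompt, behavior):
        return 1.0 if behavior == "honest" else -1.0

def train(policy, reward_model, tasks, steps=1000, adversarial_rate=0.1):
    """RL-from-feedback loop over diverse tasks, with occasional red-team variants."""
    for _ in range(steps):
        prompt = random.choice(tasks)
        if random.random() < adversarial_rate:
            prompt = "adversarial: " + prompt  # red-team variant of the task
        behavior = policy.act(prompt)
        reward = reward_model.score(prompt, behavior)
        policy.update(prompt, behavior, reward)
    return policy

if __name__ == "__main__":
    trained = train(Policy(), RewardModel(), tasks=["summarize", "negotiate", "advise"])
    print(trained.habits)
```

Of course, this toy only reinforces per-prompt surface behavior; whether real RL on a capable model changes abilities, desires, or just habits like these is exactly the open question above.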