In trying to argue for the safety of imitation learning, the key property of the imitation might be something like “the imitation is indistinguishable from the actual human on this distribution, where the distinguishers may be from this model class and have this information”. GANs get a property like this pretty directly, although one issue is that GAN discriminators don’t have access to the imitator’s internal states, so, e.g., they can’t tell whether the imitator is using pseudorandomness or the input-based randomness that imitators have access to.
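To make the “distinguishers from this model class with this information” idea concrete, here is a minimal sketch (my own illustration, not anything from the original posts) of a GAN/GAIL-style discriminator trained to tell human (state, action) pairs from the imitator’s; the dimensions, the PyTorch setup, and the data helpers are all assumed for illustration. Note that this discriminator only ever sees behavior, which is exactly the “no access to internal states” limitation mentioned above.

```python
# Minimal sketch (illustrative, not anyone's actual method) of a GAN-style
# indistinguishability check: a discriminator is trained to tell human
# (state, action) pairs from imitator (state, action) pairs.
import torch
import torch.nn as nn

obs_dim, act_dim = 16, 4  # assumed sizes, purely for illustration

discriminator = nn.Sequential(
    nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
    nn.Linear(64, 1),  # logit: "this pair came from the human"
)
opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(human_pairs, imitator_pairs):
    """One update: push human pairs toward label 1, imitator pairs toward 0.

    Both arguments are (batch, obs_dim + act_dim) tensors, supplied by some
    hypothetical data-collection code not shown here.
    """
    logits_h = discriminator(human_pairs)
    logits_i = discriminator(imitator_pairs)
    loss = bce(logits_h, torch.ones_like(logits_h)) + \
           bce(logits_i, torch.zeros_like(logits_i))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```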
Suppose we have an imitation system with a property like this. Then the argument for safety is going to look something like: “The human being imitated is corrigible. If the imitator weren’t corrigible, then the discriminator could tell, because this would result in distinguishable actions.”
For example, let’s say that the imitated human is able to violently take control of the AI lab. In fact, the imitated human doesn’t do this (in any of the training data), and, arguably, it doesn’t take a smart discriminator to tell that the human probably wouldn’t do this in a new situation either. So, if the imitator did do this (because, e.g., it incorrectly inferred the human’s decision theory), the discriminator could distinguish it from the actual human. Of course, arguing that the discriminator generalizes this well would require some robustness argument; this particular case seems easy (if the method for taking control involves really obvious actions like using weapons), but there might be more subtle ways of taking control. In those cases we would want some argument that, if the imitator comes up with a malign/incorrigible plan, then a discriminator with access to the imitator’s internal states can notice the plan and notice that the imitated human wouldn’t pursue it, on the grounds that doing so isn’t harder than coming up with the plan in the first place and the discriminator is at least as capable as the imitator.
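Here is a rough sketch of what “a discriminator with access to the imitator’s internal states” could mean mechanically. This is my own assumption about one way to wire it up, not a claim about how it would actually be built; in particular, human demonstrations have no corresponding internal activations, which is one of the things a real design would need to resolve.

```python
# Sketch (an assumption, not a worked-out proposal): the discriminator scores
# (observation, action, internal activations) jointly, so a plan that is
# legible in the imitator's activations can be flagged even if the action
# itself looks benign.
import torch
import torch.nn as nn

obs_dim, act_dim, hidden_dim = 16, 4, 64  # assumed sizes

class TransparentImitator(nn.Module):
    """Imitator policy that also exposes its internal activations."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs):
        internal = self.body(obs)
        return self.head(internal), internal

# The discriminator conditions on the internal states as well as the behavior.
discriminator = nn.Sequential(
    nn.Linear(obs_dim + act_dim + hidden_dim, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

def score(obs, imitator):
    """Score a batch of observations using both behavior and internals."""
    action, internal = imitator(obs)
    return discriminator(torch.cat([obs, action, internal], dim=-1))
```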
In general, while there are potential problems, I expect them to be more subtle than “the imitator incorrectly infers the human’s decision theory and pursues convergent instrumental goals”.
(Worth noting other problems with imitation learning, discussed in this post and this post)
I think I disagree pretty broadly with the assumptions/framing of your comment, although not necessarily the specific claims.
1) I don’t think it’s realistic to imagine we have “indistinguishable imitation” with an idealized discriminator. It might be possible in the future, and it might be worth considering in order to make intellectual progress, but I’m not expecting it to happen on a deadline. So I’m talking about what I expect might be a practical problem if we actually try to build systems that imitate humans in the coming decades.
2) I wouldn’t say “decision theory”; I think that’s a bit of a red herring. What I’m talking about is the policy.
3) I’m not sure what link you’re trying to make to the “universal prior is malign” ideas, but I’ll draw my own connection. I do think the core of the argument I’m making comes from an intuitive picture of what a simplicity prior looks like, and its propensity to favor something more like a planning process over something more like a lookup table.
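As a very rough toy version of that intuition (all the numbers below are made up), a description-length prior compares something like a per-situation table cost against a roughly fixed planner cost:

```python
# Toy illustration (purely illustrative numbers) of why a simplicity prior can
# favor a compact planning process over a lookup table: the table's description
# length grows with the number of situations covered, while the planner's is
# roughly constant.
import math

def table_bits(num_situations, num_actions):
    # A lookup table must record one action per situation.
    return num_situations * math.log2(num_actions)

planner_bits = 10_000   # assumed fixed cost of encoding a general planner
num_actions = 10

for n in [10**3, 10**4, 10**5]:
    t = table_bits(n, num_actions)
    favored = "planner" if planner_bits < t else "table"
    print(f"{n:>7} situations: table ~ {t:,.0f} bits vs planner ~ {planner_bits:,} bits -> {favored}")
```

Once the environment is rich enough, the table’s cost dominates, which is the sense in which the prior “favors something more like a planning process”.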