But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier, because you’re optimizing against an estimate of human approval, and that optimization looks like an adversarial process that could easily trigger safety problems both in the ground-truth human approval and in the estimation process.
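(To make the worry concrete, here is a minimal sketch of the kind of action selection being described. It is purely illustrative: names like `approval_model` and `candidate_actions` are hypothetical stand-ins, not anything from the post.)

```python
import random

# Hypothetical stand-in for a learned estimator of human approval of an action.
# A real estimator would only be reliable near its training distribution.
def approval_model(state, action):
    return -abs(action - state["target"]) + random.gauss(0, 0.1)

def approval_directed_act(state, candidate_actions):
    # Search for the action with the highest *estimated* approval.
    # The wider and harder this search, the more strongly it selects for
    # places where the estimate (or ground-truth approval itself) misfires;
    # that is the adversarial pressure described above.
    return max(candidate_actions, key=lambda a: approval_model(state, a))

state = {"target": 3}
print(approval_directed_act(state, range(-1000, 1000)))
```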
But if there are safety problems in approval, wouldn’t there also be safety problems in the human’s behavior, which imitation learning would copy?
Similarly, if there are safety problems in the estimation process, wouldn’t there also be safety problems in the prediction of what action a human would take?
From this comment it looks like you were thinking of an online version of narrow imitation learning. Might be good to clarify that in the post?
I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure. I’ll add a pointer to this discussion to the post.
But if there are safety problems in approval, wouldn’t there also be safety problems in the human’s behavior, which imitation learning would copy?
The human’s behavior could be safer because a human mind doesn’t optimize so hard that it moves outside the range of inputs where approval is safe, or because it has a “proposal generator” that only generates candidate actions that stay within that range with high probability.
Similarly, if there are safety problems in the estimation process, wouldn’t there also be safety problems in the prediction of what action a human would take?
Same here: if you just predict what action a human would take, you’re less likely to optimize so hard that you end up outside the range where the estimation process is safe.
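(For contrast, a similarly minimal and hypothetical sketch of the prediction/imitation side; `human_policy_model` stands in for a learned predictor of the human’s action.)

```python
import random

# Hypothetical stand-in for a learned predictor of what the human would do.
# Its probability mass is concentrated on human-like actions.
def human_policy_model(state):
    habitual = state["target"]
    return [habitual - 1, habitual, habitual + 1], [0.2, 0.6, 0.2]

def imitative_act(state):
    # No wide search against an approval estimate: just sample from the
    # predicted human behavior, which tends to stay inside the range of
    # inputs where approval and its estimate are well-behaved.
    actions, probs = human_policy_model(state)
    return random.choices(actions, weights=probs, k=1)[0]

print(imitative_act({"target": 3}))
```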
I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure.
Ok, I’d be interested to hear more once you’ve clarified your thoughts.