I meant that I may be able to sample pairs from some attack distribution without being able to harden my function against that distribution directly.
Suppose that I have a program $\tilde{f} \in [0,1]$ which implements my desired reward function, except that it has a bunch of vulnerabilities $\tilde{a}_i$ on which it mistakenly outputs 1 (when it really should output 0). Suppose further that I am able to sample vulnerabilities roughly as effectively as my AI.
Then I can sample vulnerabilities $\tilde{a}$ and provide the pairs $(\tilde{a}, -1)$ to train my reward function, along with a bunch of pairs $(a, \tilde{f}(a))$ for actions $a$ produced by the agent. This doesn't quite work as stated, but you could imagine learning the intended function $f$ despite having no direct access to it.
(This is very similar to adversarial training / red-teaming.)
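To make the data-construction step concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the 2-D toy actions, the flawed program `f_tilde` with a hand-coded vulnerable region, the sampler `sample_vulnerability` standing in for the attack distribution, and the linear least-squares reward model. It only demonstrates the idea above: agent actions labeled by $\tilde{f}$, plus sampled vulnerabilities labeled $-1$, used to fit a reward model that was never shown the true $f$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: actions are 2-D feature vectors. f_true is the
# intended reward f (never observed directly); f_tilde is the flawed program,
# which wrongly outputs 1 on a "vulnerable" region where f is really 0.
def f_true(a):
    return float(a[0] > 0)

def f_tilde(a):
    if a[1] > 2.0:          # the vulnerability: should be 0, outputs 1
        return 1.0
    return f_true(a)

def sample_vulnerability():
    # Assumes we can sample from the attack distribution roughly as
    # effectively as the agent: points landing in the vulnerable region.
    a = rng.normal(size=2)
    a[1] = 2.0 + abs(a[1])
    return a

def sample_agent_action():
    return rng.normal(size=2)

# Build the training set: agent actions labeled by f_tilde, plus sampled
# vulnerabilities labeled -1 (a strongly negative correction).
X, y = [], []
for _ in range(2000):
    a = sample_agent_action()
    X.append(a)
    y.append(f_tilde(a))
for _ in range(500):
    a = sample_vulnerability()
    X.append(a)
    y.append(-1.0)
X, y = np.array(X), np.array(y)

# Fit a simple learned reward model on the corrected data (least squares
# on a linear feature map, purely for illustration).
Phi = np.column_stack([X, np.ones(len(X))])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

def score(A):
    return np.column_stack([A, np.ones(len(A))]) @ w

# The learned model should score vulnerable actions well below benign ones,
# even though it was trained without access to f itself.
bad = np.array([sample_vulnerability() for _ in range(100)])
good = np.array([sample_agent_action() for _ in range(100)])
print("mean score on vulnerabilities:", score(bad).mean())
print("mean score on agent actions:  ", score(good).mean())
```

As the text notes, this doesn't quite work as stated; the sketch is only meant to show how the two label sources combine into one training set.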