I meant that I may be able to sample pairs from some attack distribution without being able to harden my function against that distribution directly.
Suppose that I have a program $\tilde{f} \in [0,1]$ which implements my desired reward function, except that it has a bunch of vulnerabilities $\tilde{a}_i$ on which it mistakenly outputs 1 (when it really should output 0). Suppose further that I am able to sample vulnerabilities roughly as effectively as my AI.
Then I can sample vulnerabilities $\tilde{a}$ and provide the pairs $(\tilde{a}, -1)$ to train my reward function, along with a bunch of pairs $(a, \tilde{f}(a))$ for actions $a$ produced by the agent. This doesn't quite work as stated, but you could imagine learning the intended function $f$ despite having no direct access to it.
(This is very similar to adversarial training / red-teaming.)
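To make the data-construction step concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the 2-D toy actions, the flawed program `f_tilde` with a hand-coded vulnerable region, the sampler `sample_vulnerability` standing in for the attack distribution, and the linear least-squares reward model. It only demonstrates the idea above: agent actions labeled by $\tilde{f}$, plus sampled vulnerabilities labeled $-1$, used to fit a reward model that was never shown the true $f$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: actions are 2-D feature vectors. f_true is the
# intended reward f (never observed directly); f_tilde is the flawed program,
# which wrongly outputs 1 on a "vulnerable" region where f is really 0.
def f_true(a):
    return float(a[0] > 0)

def f_tilde(a):
    if a[1] > 2.0:          # the vulnerability: should be 0, outputs 1
        return 1.0
    return f_true(a)

def sample_vulnerability():
    # Assumes we can sample from the attack distribution roughly as
    # effectively as the agent: points landing in the vulnerable region.
    a = rng.normal(size=2)
    a[1] = 2.0 + abs(a[1])
    return a

def sample_agent_action():
    return rng.normal(size=2)

# Build the training set: agent actions labeled by f_tilde, plus sampled
# vulnerabilities labeled -1 (a strongly negative correction).
X, y = [], []
for _ in range(2000):
    a = sample_agent_action()
    X.append(a)
    y.append(f_tilde(a))
for _ in range(500):
    a = sample_vulnerability()
    X.append(a)
    y.append(-1.0)
X, y = np.array(X), np.array(y)

# Fit a simple learned reward model on the corrected data (least squares
# on a linear feature map, purely for illustration).
Phi = np.column_stack([X, np.ones(len(X))])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

def score(A):
    return np.column_stack([A, np.ones(len(A))]) @ w

# The learned model should score vulnerable actions well below benign ones,
# even though it was trained without access to f itself.
bad = np.array([sample_vulnerability() for _ in range(100)])
good = np.array([sample_agent_action() for _ in range(100)])
print("mean score on vulnerabilities:", score(bad).mean())
print("mean score on agent actions:  ", score(good).mean())
```

As the text notes, this doesn't quite work as stated; the sketch is only meant to show how the two label sources combine into one training set.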