Finding the action a that optimizes a reward function r(a) is NP-hard for general (polynomial-time) r; the decision version, "is there an a with r(a) ≥ t?", is NP-complete. If the reward function r is itself able to use an oracle for NP, then the problem is complete for NP^NP, and so on up the polynomial hierarchy. The analogy is loose because you aren't really getting the optimal a at each step.
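To make the correspondence concrete, here is a minimal worked sketch (my formalization, not the original's): it assumes r is polynomial-time computable and t is a given threshold, and it spells out why each extra oracle level climbs one rung of the polynomial hierarchy.

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
% Decision version of reward optimization: given a poly-time reward
% function r and a threshold t, does some action a achieve r(a) >= t?
% Guessing a and evaluating r(a) puts this in NP:
\[
  \exists a \;\; r(a) \ge t \quad \in \quad \mathrm{NP}
\]
% Hardness: take r(a) = 1 iff assignment a satisfies a fixed CNF
% formula, with t = 1; the question is then exactly SAT, so the
% decision problem is NP-complete.

% If r may itself query an NP oracle, the same existential question
% sits one level higher:
\[
  \exists a \;\; r^{\mathrm{NP}}(a) \ge t
  \quad \in \quad \mathrm{NP}^{\mathrm{NP}} = \Sigma_2^p
\]
% Iterating the construction climbs the polynomial hierarchy:
\[
  \Sigma_1^p = \mathrm{NP}, \qquad
  \Sigma_{k+1}^p = \mathrm{NP}^{\Sigma_k^p}
\]
\end{document}
```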