debate doesn’t solve outer alignment.

JanB 15 Dec 2022 11:21 UTC
My model is that if there are alignment failures that leave us neither dead nor disempowered, we’ll just solve them eventually, the same way we solve everything else: through iteration, innovation, and regulation. So, from my perspective, if we’ve found a reward signal that leaves us alive and in charge, we’ve solved the important part of outer alignment. RLHF seems to provide such a reward signal (if you exclude wireheading issues).

JanB comments on Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.