If you think that wire-heading is what does us in, why not write a post about how RL, in general, is bad?
My standards for interesting outer alignment failure don’t require the AI to kill us all. I’m ambitious: by “outer alignment,” I mean I want the AI to actually be incentivized to do the best things. So to me, it seems totally reasonable to write a post invoking a failure mode that probably wouldn’t kill everyone instantly, but would merely lead to an AI that doesn’t do what we want.
My model is that if there are alignment failures that leave us neither dead nor disempowered, we’ll just solve them eventually, in much the same way we solve everything else: through iteration, innovation, and regulation. So, from my perspective, if we’ve found a reward signal that leaves us alive and in charge, we’ve solved the important part of outer alignment. RLHF seems to provide such a reward signal (if you exclude wire-heading issues).