How does an AI trained with RLHF end up killing everyone, if you assume that wire-heading and inner alignment are solved? Any half-way reasonable method of supervision will discourage “killing everyone”.
A merely half-way reasonable method of supervision will only discourage getting caught killing everyone, is the thing.
In all the examples we have from toy models, the RLHF agent has no option to take over the supervision process. The most adversarial thing it can do is to deceive the human evaluators (while executing an easier, lazier strategy). And it does that sometimes.
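For concreteness, here is a minimal sketch of the kind of preference-based reward modelling I mean (the names, architecture, and numbers are purely illustrative, not from any particular codebase). The point it makes explicit is that the reward model is fit to the evaluator’s comparison labels, i.e. to what the evaluator saw, not to what actually happened:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Learns a score r_theta(trajectory) from human comparison labels."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, traj):         # traj: (T, obs_dim)
        return self.net(traj).sum()  # scalar score for the whole trajectory

def preference_loss(reward_model, traj_a, traj_b, human_prefers_a):
    # Bradley-Terry-style loss on the human's comparison label. The label
    # reflects what the evaluator *observed*, not what actually happened --
    # which is exactly the gap a "deceive the evaluator" strategy exploits.
    logit = reward_model(traj_a) - reward_model(traj_b)
    target = torch.tensor(1.0 if human_prefers_a else 0.0)
    return nn.functional.binary_cross_entropy_with_logits(logit, target)

# The policy is then optimized (e.g. with PPO) against reward_model's scores,
# not against the evaluators directly.
rm = RewardModel(obs_dim=8)
traj_a, traj_b = torch.randn(10, 8), torch.randn(10, 8)
preference_loss(rm, traj_a, traj_b, human_prefers_a=True).backward()
```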
If we train an RLHF agent in the real world, the reward model now has the option to accurately learn that actions that physically affect the reward-attribution process are rated in a special way. If it learns that, we are of course boned—the AI will be motivated to take over the reward hardware (even during deployment where the reward hardware does nothing) and tough luck to any humans who get in the way.
But if the reward model doesn’t learn this true fact (maybe we can prevent this by patching the RLHF scheme), then I would agree it probably won’t kill everyone. Instead it would go back to failing by executing plans that deceived human evaluators in training. Though if the training was to do sufficiently impressive and powerful things in the real world, maybe this “accidentally” involves killing humans.
> If we train an RLHF agent in the real world, the reward model now has the option to accurately learn that actions that physically affect the reward-attribution process are rated in a special way. If it learns that, we are of course boned—the AI will be motivated to take over the reward hardware (even during deployment where the reward hardware does nothing) and tough luck to any humans who get in the way.
OK, so this is wire-heading, right? Then you agree that it’s the wire-heading behaviours that kill us? But wire-heading (taking control of the channel of reward provision) is not, in any way, specific to RLHF. Any way of providing reward in RL leads to wire-heading. In particular, you’ve said that the problem with RLHF is that “RLHF/IDA/debate all incentivize promoting claims based on what the human finds most convincing and palatable, rather than on what’s true.” How does incentivising palatable claims lead to the RL agent taking control of the reward-provision channel? These issues seem largely orthogonal. You could have a perfect reward signal that only incentivizes “true” claims, and you’d still get wire-heading.
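To make the orthogonality point concrete, here is a toy decision problem (the action names, horizon, and success probability are made up purely for illustration): whichever “honest” reward function you plug in, seizing the reward channel wins whenever it is physically available.

```python
R_MAX, HORIZON = 1.0, 1000       # reward-channel ceiling, episode length
P_HONEST_SUCCESS = 0.95          # even a truthful agent sometimes fails the task

def expected_return(plan, honest_reward, tampering_possible=True):
    if plan == "seize_reward_channel":
        if not tampering_possible:
            return float("-inf")     # plan not available in this environment
        # Once the agent controls the channel, the *observed* reward is R_MAX
        # every step, no matter what the reward function was meant to measure.
        return R_MAX * HORIZON
    # Honest plans earn whatever the (possibly perfect) reward signal assigns,
    # but only on the steps where the task actually succeeds.
    return P_HONEST_SUCCESS * honest_reward(plan) * HORIZON

def best_plan(honest_reward, tampering_possible=True):
    plans = ["truthful_report", "palatable_report", "seize_reward_channel"]
    return max(plans, key=lambda p: expected_return(p, honest_reward, tampering_possible))

# A "perfect" reward signal that only pays out for true claims...
perfect_reward = lambda plan: R_MAX if plan == "truthful_report" else 0.0

print(best_plan(perfect_reward))                            # seize_reward_channel
print(best_plan(perfect_reward, tampering_possible=False))  # truthful_report
```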
So in which way is RLHF particularly bad? If you think that wire-heading is what does us in, why not write a post about how RL, in general, is bad?
I’m not trying to be antagonistic! I do think I probably just don’t get it, and this seems like a great opportunity for me to finally understand this point :-)
I just don’t know of any plausible story for how outer alignment failures kill everyone. Even in Another (outer) alignment failure story, what ultimately does us in is wire-heading (which I don’t consider an outer alignment problem, because it happens with any possible reward).
> But if the reward model doesn’t learn this true fact (maybe we can prevent this by patching the RLHF scheme), then I would agree it probably won’t kill everyone. Instead it would go back to failing by executing plans that deceived human evaluators in training. Though if the training was to do sufficiently impressive and powerful things in the real world, maybe this “accidentally” involves killing humans.
I agree with this. I agree that this failure mode could lead to extinction, but I’d say it’s pretty unlikely. IMO, it’s much more likely that we’ll eventually spot any such outer alignment issue and fix it (as in the early parts of Another (outer) alignment failure story).
> If you think that wire-heading is what does us in, why not write a post about how RL, in general, is bad?
My standards for an interesting outer alignment failure don’t require the AI to kill us all. I’m ambitious—by “outer alignment,” I mean I want the AI to actually be incentivized to do the best things. So to me, it seems totally reasonable to write a post invoking a failure mode that probably wouldn’t kill everyone instantly, but would merely lead to an AI that doesn’t do what we want.
My model is that if there are alignment failures that leave us neither dead nor disempowered, we’ll just solve them eventually, the same way we solve everything else: through iteration, innovation, and regulation. So, from my perspective, if we’ve found a reward signal that leaves us alive and in charge, we’ve solved the important part of outer alignment. RLHF seems to provide such a reward signal (if you exclude wire-heading issues).