Here’s how I’d quickly summarize my problems with this scheme:
Oversight problems:
Overseer doesn’t know: In cases where your unaided humans don’t know whether the AI action is good or bad, they won’t be able to produce feedback which selects for AIs that do good things. This is unfortunate, because we wanted to be able to make AIs that do complicated things that have good outcomes.
Overseer is wrong: In cases where your unaided humans are actively wrong about whether the AI action is good or bad, their feedback will actively select for the AI to deceive the humans.
Catastrophe problems:
Even if the overseer’s feedback was perfect, a model whose strategy is to lie in wait until it has an opportunity to grab power will probably be able to successfully grab power.
This is the common presentation of issues with RLHF, and I find it so confusing.
The “oversight problems” are specific to RLHF; scalable oversight schemes attempt to address these problems.
The “catastrophe problem” is not specific to RLHF (you even say that). This problem may appear with any form of RL (that only rewards/penalises model outputs). In particular, scalable oversight schemes (if they only supervise model outputs) do not address this problem.
So why is RLHF more likely to lead to X-risk than, say, recursive reward modelling? The case for why the “oversight problems” should lead to X-risk is very rarely made. Maybe RLHF is particularly likely to lead to the “catastrophe problem” (compared to, say, RRM), but this case is also very rarely made.
This is the common presentation of issues with RLHF, and I find it so confusing.
The “oversight problems” are specific to RLHF; scalable oversight schemes attempt to address these problems.
The “catastrophe problem” is not specific to RLHF (you even say that). This problem may appear with any form of RL (that only rewards/penalises model outputs). In particular, scalable oversight schemes (if they only supervise model outputs) do not address this problem.
So why is RLHF more likely to lead to X-risk than, say, recursive reward modelling?
The case for why the “oversight problems” should lead to X-risk is very rarely made. Maybe RLHF is particularly likely to lead to the “catastrophe problem” (compared to, say, RRM), but this case is also very rarely made.