Reading this post a while after it was written: I’m not going to respond to the main claim (which seems quite likely to be true), but only to the specific arguments, which seem suspicious to me. Here are some points:
In my model of the standard debate setup with a human judge, the human can just use both answers in whatever way they want, independently of which one they select as the correct answer. The fact that one answer provides more useful information than the direct answer to “2+2=?” doesn’t imply a “direct” incentive for the human judge to select it as the correct answer. Upon introspection, I myself would probably say that “4” is the correct answer, while still being very interested in the other answer (the answer about AI risk). I don’t think you disagreed with this?
At a later point you say that the real reason why the judge would nevertheless select the QIA (question-ignoring answer) as the correct answer is that the judge wants to train the system to do useful things. You seem to say that a rational consequentialist would make this decision. Then, later still, you say that this is probably/plausibly (?) a bad thing: “Is this definitely undesirable? I’m not sure, but probably”. But if it really is a bad thing and we can know this, then surely a rational judge would also know this, and could simply decide not to do it? If you were the judge, would you select the QIA, despite it being “probably undesirable”?
Given that we are talking about optimal play, and that the human judge is in fact not rational/safe, the debater could manipulate the judge, so the previous argument doesn’t in fact imply that judges won’t select QIAs. The debater could deceive and manipulate the judge into (incorrectly) thinking that they should select the QIA, even if you/we currently believe that this would be bad. I agree this kind of deception would probably happen under optimal play (if that is indeed what you meant), but it relies on the judge being irrational or manipulable, not on some argument that “it is rational for a consequentialist judge to select answers with the highest information value”.
It seems to me that either we think there is no problem with selecting QIAs as answers, or we think that human judges will be irrational and manipulated; but I don’t see the justification in this post for saying “rational consequentialist judges will select QIAs AND this is probably bad”.
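To make the decision rule being debated here concrete: below is a minimal toy sketch, not from the post, with all field names and numbers invented for illustration. It models a purely consequentialist judge who scores each answer by its overall downstream value, including the value of the training signal it sends, rather than by whether it answers the question asked.

```python
# Toy model only: a judge who rewards the answer with the highest expected
# downstream value, rather than the answer that best answers the question.
# The fields and numbers below are illustrative assumptions, not from the post.

def consequentialist_judge(answers):
    """Select the answer whose reward sends the most valuable overall signal."""
    return max(answers, key=lambda a: a["correctness"] + a["information_value"])

answers = [
    # The direct answer to "2+2=?".
    {"text": "4", "correctness": 1.0, "information_value": 0.0},
    # A question-ignoring answer (QIA) carrying unrelated but high-value content.
    {"text": "Here is a crucial consideration about AI risk...",
     "correctness": 0.0, "information_value": 5.0},
]

print(consequentialist_judge(answers)["text"])  # under these numbers, the QIA wins
```

A judge behaving as described in the comment above would instead select purely on correctness and simply make separate use of the information in the QIA; the disagreement is over which of these two decision rules the debate setup actually incentivises.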
...the human can just use both answers in whichever way it wants, independently of which it selects as the correct answer... I don’t think you disagreed with this?
Yes, agreed.
A few points on the rest:
At the highest level, the core issue is that QI makes it quite a bit harder to identify misalignment. If aligned systems will sometimes not answer the question, non-answering isn’t necessarily strong evidence of misalignment. So “consequentialist judges will [sometimes correctly] select QIAs” is bad in the sense that it provides cover for “consequentialist judges will [sometimes incorrectly] select QIAs”.
I talk about consequentialists, but not rational consequentialists. I expect the kind of judge we’d pick to be highly rational relative to the average human—but that’s a low bar. I expect all humans to have exploitable inconsistencies, and that optimal play will exploit them. (similarly for groups of humans) So yes, this is only a problem where manipulation is possible—but since it is possible, we’ll have difficulty distinguishing [judge correctly selected a non-answer as the winner for principled reasons] from [judge was manipulated...].
It’s much less clear when such issues show up with sub-optimal play.
With “Is this definitely undesirable? I’m not sure, but probably.” I mean that it’s undesirable for the debate structure to have this property in general. The judge can know it is undesirable in general, but also think that in this specific case things are different—and, of course, the judge can be wrong about this.
Noting here that humans can’t make binding pre-commitments. (saying words doesn’t qualify)
It’s hard (/impossible?) to avoid this issue through oversight, since we just move from [QI exceptions that persuade the judge win] to [QI exceptions that persuade the judge and oversight system win].
“I talk about consequentialists, but not rational consequentialists”
Ok, this was not the impression I was getting.
Well, I’m sure I could have been clearer. (and it’s possible that I’m now characterising what I think, rather than what I wrote)
But getting that impression is pretty natural: in my argument, a large part of the problem does come from its sometimes being correct to pick the question-ignoring answer (‘correct’ meaning something like: [leads to the best consequences, according to our values]), or, put another way, from the fact that a correct decision algorithm would sometimes pick the question-ignoring answer.
I think I focus on this, since it’s the non-obvious part of the argument: it’s already clear that poor decisions / decision-algorithms may sometimes pick the question-ignoring answer.
Probably I should have emphasised more that unexpected behaviour when things are going right will make it harder to know when things are going wrong.