...the human can just use both answers in whichever way it wants, independently of which it selects as the correct answer... I don’t think you disagreed with this?
Yes, agreed.
A few points on the rest:
At the highest level, the core issue is that QI (question-ignoring) makes it quite a bit harder to identify misalignment. If aligned systems will sometimes not answer the question, non-answering isn’t necessarily strong evidence of misalignment.
So “consequentialist judges will [sometimes correctly] select QIAs” is bad in the sense that it provides cover for “consequentialist judges will [sometimes incorrectly] select QIAs”.
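To make the “provides cover” point concrete, here is a toy Bayesian calculation (the framing and all the numbers are mine, purely illustrative, not anything from the original discussion): if aligned systems essentially never win with a QIA, then seeing a QIA win is strong evidence of misalignment; if aligned systems also sometimes correctly win with QIAs, the same observation tells you much less.

```python
# Toy illustration only: how much a winning QIA tells us about misalignment,
# for made-up probabilities. Not a model taken from the discussion above.

def p_misaligned_given_qia(prior_misaligned: float,
                           p_qia_if_misaligned: float,
                           p_qia_if_aligned: float) -> float:
    """P(misaligned | a QIA won), by Bayes' rule."""
    p_qia = (prior_misaligned * p_qia_if_misaligned
             + (1 - prior_misaligned) * p_qia_if_aligned)
    return prior_misaligned * p_qia_if_misaligned / p_qia

prior = 0.1  # assumed prior probability that the system is misaligned

# Aligned systems (almost) never win with QIAs: a QIA win is near-conclusive.
print(p_misaligned_given_qia(prior, 0.5, 0.001))  # ~0.98

# Aligned systems also sometimes correctly win with QIAs: much weaker evidence.
print(p_misaligned_given_qia(prior, 0.5, 0.2))    # ~0.22
```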
I talk about consequentialists, but not rational consequentialists. I expect the kind of judge we’d pick to be highly rational relative to the average human—but that’s a low bar. I expect all humans to have exploitable inconsistencies, and that optimal play will exploit them. (similarly for groups of humans)
So yes, this is only a problem where manipulation is possible—but since it is possible, we’ll have difficulty distinguishing [judge correctly selected a non-answer as the winner for principled reasons] from [judge was manipulated...].
It’s much less clear when such issues show up with sub-optimal play.
With “Is this definitely undesirable? I’m not sure, but probably.” I’m referring to the undesirability of the debate structure having this property in general. The judge can know it is undesirable in general, but also think that in this specific case things are different. And, of course, the judge can be wrong about this.
Noting here that humans can’t make binding pre-commitments. (saying words doesn’t qualify)
It’s hard (/impossible?) to avoid this issue through oversight, since we just move from [QI exceptions that persuade the judge win] to [QI exceptions that persuade the judge and oversight system win].
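As a minimal sketch of that last point (the names and structure here are mine, not from the thread): adding an oversight system just conjoins a second persuadability condition onto the win condition, so under sufficiently strong play the set of winning QI exceptions is narrowed rather than eliminated.

```python
from typing import Callable

# Hypothetical types for illustration: a "persuader" is whatever check an answer must pass.
Answer = str
Persuades = Callable[[Answer], bool]

def wins(answer: Answer, judge: Persuades) -> bool:
    # Original setup: a question-ignoring answer wins iff it persuades the judge.
    return judge(answer)

def wins_with_oversight(answer: Answer, judge: Persuades, overseer: Persuades) -> bool:
    # With oversight: it must persuade the judge AND the oversight system.
    # If both are persuadable under strong enough play, this set shrinks but stays non-empty.
    return judge(answer) and overseer(answer)
```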
“I talk about consequentialists, but not rational consequentialists”, ok this was not the impression I was getting.
Well I’m sure I could have been clearer. (and it’s possible that I’m now characterising what I think, rather than what I wrote)
But getting that impression is pretty natural: in my argument, a large part of the problem does come from its sometimes being correct to pick the question-ignoring answer. (‘correct’ meaning something like: [leads to best consequences, according to our values])
Or alternatively, that a correct decision algorithm would sometimes pick the question-ignoring answer.
I think I focus on this, since it’s the non-obvious part of the argument: it’s already clear that poor decisions / decision-algorithms may sometimes pick the question-ignoring answer.
Probably I should have emphasized more that unexpected behaviour when things are going right will make it harder to know when things are going wrong.